Entity Resolution
Entity resolution is the process of turning a string like “Feynman” into a stable identifier like Q39246 — the Wikidata QID for Richard Feynman the physicist, not Roger Feynman the mathematician, not “Feynman, Alabama”. This page explains how Daneel performs that mapping when you click a node in the knowledge graph, what the trade-offs are, and why the implementation fans out across two APIs instead of picking one.
Why it is hard
Text is ambiguous in ways databases are not. A single surface form can refer to many entities:
- “Paris” — the French capital, the Greek mythological figure, the city in Texas, the Paris hotel in Las Vegas
- “Einstein” — Albert, but also a dozen other Einsteins on Wikidata
- “Apple” — the fruit, the company, the record label, the Manhattan neighborhood
A knowledge-graph node gives us two extra pieces of context the raw string does not carry: the ontology type (is this a person, a place, a company?) and the other entities that co-occur with it in the source documents. Good resolution uses both.
Daneel’s own NER extractor assigns an ontology label when it identifies an entity. Labels like person, city, organization, book, war come from the vault’s ontology (either a preset or a user-defined set). The fact-box resolver uses this label to narrow the search space on Wikidata.
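A minimal sketch of that label-to-class mapping, assuming a simple lookup table (the names and the set of entries here are illustrative, not the actual resolver code):

```typescript
// Hypothetical sketch: map vault ontology labels to Wikidata class QIDs.
// Labels with no entry fall through to the untyped search path.
const ONTOLOGY_TO_WIKIDATA_CLASS: Record<string, string> = {
  person: "Q5",            // human
  city: "Q515",            // city
  organization: "Q43229",  // organization
  book: "Q571",            // book
  war: "Q198",             // war
};

function wikidataClassFor(label: string): string | undefined {
  return ONTOLOGY_TO_WIKIDATA_CLASS[label.toLowerCase()];
}
```

Custom ontology labels that have no mapping simply return `undefined`, which is what pushes those entities onto the search-only path described below.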
Two signals, one QID
The fact-box panel resolves entities by querying two services in parallel:
OpenRefine Reconciliation API — wikidata-reconciliation.wmcloud.org/en/api
Purpose-built for “free-text string plus optional type filter, rank the candidates”. Accepts a Wikidata class QID (e.g., Q5 = human, Q515 = city) as the type parameter. When the entity’s ontology label maps to a known Wikidata class, the reconciliation service receives that class and returns only candidates of the matching type — “Feynman” as a person, not “Feynman” as a place. Scores come back on a [0, 100] scale.
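Reconciliation requests are keyed batches of queries, per the OpenRefine reconciliation protocol. A sketch of building one request body, assuming a single-query batch (the helper name is ours; the query shape follows the spec):

```typescript
// Sketch of an OpenRefine reconciliation request body. The { query, type,
// limit } shape follows the reconciliation spec; buildReconPayload is a
// hypothetical helper, not Daneel's actual code.
interface ReconQuery {
  query: string;
  type?: string;  // Wikidata class QID, e.g. "Q5" for human
  limit?: number;
}

function buildReconPayload(text: string, classQid?: string): string {
  const q: ReconQuery = { query: text, limit: 5 };
  if (classQid) q.type = classQid;
  // The API expects a keyed batch: { "q0": {...}, "q1": {...}, ... }
  return new URLSearchParams({ queries: JSON.stringify({ q0: q }) }).toString();
}
```

When the ontology label has no class mapping, the `type` field is simply omitted and the service ranks candidates on text match alone.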
Wikidata Search API — wbsearchentities
A general-purpose label and alias matcher. No type filter. Returns the top candidates by text similarity. Useful as a fallback when the ontology label doesn’t map to a clean Wikidata class, and as a coverage net when reconciliation misses.
Both calls fire in parallel. When they return, the candidates are merged by QID, scores are reconciled, and the list is sorted.
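The merge step can be sketched as follows, assuming a simplified candidate shape (field and function names are illustrative):

```typescript
// Sketch of the merge-by-QID step. Candidates present in both sources keep
// the reconciliation entry; search-only candidates are appended; the result
// is sorted by score, descending. Shapes and names are assumptions.
interface Candidate {
  qid: string;
  label: string;
  score: number;      // normalized to [0, 1]
  fromRecon: boolean;
}

function mergeByQid(recon: Candidate[], search: Candidate[]): Candidate[] {
  const byQid = new Map<string, Candidate>();
  for (const c of recon) byQid.set(c.qid, c);              // recon wins on overlap
  for (const c of search) if (!byQid.has(c.qid)) byQid.set(c.qid, c);
  return [...byQid.values()].sort((a, b) => b.score - a.score);
}
```

The two HTTP calls themselves would sit behind something like `Promise.allSettled`, so one source failing or timing out does not sink the other.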
Scoring and the auto-select threshold
Candidates that appear in both sources keep reconciliation’s authoritative score. Candidates that appear only in the general search get a synthetic score, highest for the top hit (0.6 by default, with a boost to 0.7 when reconciliation returned nothing at all), decaying for lower ranks.
If the top candidate’s final score crosses the auto-select threshold (0.85 by default), the panel commits to that QID without asking. This cuts a click out of the flow for unambiguous names like “Albert Einstein” or “Napoleon Bonaparte”.
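Taken together, those defaults can be sketched like this. The constants come straight from the text; the per-rank decay rate is an assumption (the text says only that scores decay), and the function names are ours:

```typescript
// Sketch of the synthetic-score and auto-select rules described above.
// The 0.6 / 0.7 / 0.85 constants mirror the text; DECAY is an assumed rate.
const TOP_SEARCH_SCORE = 0.6;          // top search-only hit
const TOP_SEARCH_SCORE_BOOSTED = 0.7;  // when reconciliation returned nothing
const DECAY = 0.1;                     // assumed decay per rank
const AUTO_SELECT_THRESHOLD = 0.85;

function syntheticScore(rank: number, reconEmpty: boolean): number {
  const top = reconEmpty ? TOP_SEARCH_SCORE_BOOSTED : TOP_SEARCH_SCORE;
  return Math.max(0, top - rank * DECAY); // rank 0 is the top search hit
}

function shouldAutoSelect(topScore: number): boolean {
  return topScore >= AUTO_SELECT_THRESHOLD;
}
```

Note that a search-only candidate can never auto-select: its ceiling (0.7) sits below the threshold (0.85), so only reconciliation-backed matches skip the picker.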
When no candidate crosses the threshold, the panel shows a disambiguation picker instead. This is the common case for short last names, common given names, and anything polysemous. The user sees the top 5 candidates with their QIDs, confidence scores, and one-line descriptions, and picks the right one.
The picker’s choice is cached for 30 days and keyed on the surface text and ontology type. Future clicks on the same entity in any vault resolve instantly.
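A minimal sketch of that cache, assuming an in-memory map (the real store, key format, and helper names may differ; only the 30-day TTL and the text-plus-type key come from the text):

```typescript
// Sketch of a 30-day resolution cache keyed on surface text + ontology type.
// The storage backend and key format here are assumptions.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

interface CacheEntry { qid: string; storedAt: number; }

const resolutionCache = new Map<string, CacheEntry>();

function cacheKey(surfaceText: string, ontologyType: string): string {
  return `${surfaceText.toLowerCase()}|${ontologyType}`;
}

function cacheQid(text: string, type: string, qid: string, now = Date.now()): void {
  resolutionCache.set(cacheKey(text, type), { qid, storedAt: now });
}

function getCachedQid(text: string, type: string, now = Date.now()): string | undefined {
  const entry = resolutionCache.get(cacheKey(text, type));
  if (!entry || now - entry.storedAt > THIRTY_DAYS_MS) return undefined; // miss or expired
  return entry.qid;
}
```

Because the key carries no vault identifier, a pick made in one vault serves every vault, which is exactly the cross-vault behavior the text describes.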
Why both sources instead of one
It is tempting to pick one API and be done. We tried this in planning, and neither works alone:
- Reconciliation alone misses entities whose Wikidata class is not well covered, or whose ontology label doesn’t map to a clean class. Users with custom ontologies like research_paper or compound would resolve nothing.
- Search alone has no type awareness. “Paris” asked for a city returns a mix of the city, the myth, and several American towns, with no way to bias toward the geographic sense.
The merge handles both weaknesses. Reconciliation narrows; search widens. Combined, coverage is broader than either source alone, and the type filter still gives reconciliation-aware queries the precision they need.
What happens after resolution
Once a QID is committed (either auto-select or user pick), the panel fetches the full entity payload with wbgetentities. The response is simplified with wikibase-sdk’s simplifyClaims helper into a flat structure keyed by property ID, with qualifiers and references kept where useful.
Statement values come back as either a QID (for item-typed claims) or a primitive (for strings, dates, quantities, URLs). QID-valued statements need labels, so the panel collects every referenced QID and property ID and fetches human-readable labels in a single batch via wbformatentities. This batches up to 50 IDs per request, so a typical entity with 15 to 30 referenced QIDs resolves in one follow-up call.
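The batching itself is plain chunking. A sketch, assuming the 50-ID limit from the text (helper names are ours; the fetch is omitted):

```typescript
// Sketch of batching referenced IDs for wbformatentities, which accepts up
// to 50 IDs per request. Only the chunking is shown; the HTTP call is omitted.
const WBFORMAT_BATCH_LIMIT = 50;

function chunkIds(ids: string[], size = WBFORMAT_BATCH_LIMIT): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Each batch becomes one API call:
//   action=wbformatentities&ids=Q1|Q2|...&format=json
function batchParam(batch: string[]): string {
  return batch.join("|");
}
```

An entity referencing 15 to 30 QIDs fits in a single batch; only unusually dense entities need a second call.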
The resulting label map is shared across every fact box in the same session. Seeing “Princeton University” once caches it for every future entity that references Q21578.
Why this is a good fit for knowledge graphs
A knowledge graph built from your documents captures what is in your documents. A knowledge graph built from Wikidata captures what is true about the wider world. The fact box bridges the two: it takes your graph’s nodes, connects each one to its canonical Wikidata identity, and surfaces the structured facts that Wikidata knows but your documents never mentioned.
This matters most when your documents assume background knowledge. A research paper citing “Bell” expects you to know it means John Stewart Bell, the physicist behind Bell’s theorem. A historical document mentioning “Agincourt” expects you to know roughly when and where. The fact box fills in that background on demand, without you having to leave the graph view.
It is also the foundation for the upcoming factual-edges layer. Once every node in your vault has a resolved QID, a single SPARQL query can pull real relationships from Wikidata — “X was educated at Y”, “A founded B”, “C is located in D” — and render them as a distinct edge type in the 3D graph, alongside the co-occurrence edges you already see. That work uses the same resolution pipeline described here; the QIDs cached today become the input to the graph augmentation tomorrow.
Trade-offs we accepted
- English only in v1. Reconciliation runs against the English endpoint, entity fetching pulls English labels, the label cache is language-unaware. Multilingual resolution is a separate, bigger piece of work.
- No semantic fallback by default. We researched using wd-vectordb, Wikimedia’s hybrid vector-keyword search, as a third signal. The service is promising but slow and fuzzy compared to reconciliation on the entity types we care most about. It may land as an optional backstop later.
- The user is the tiebreaker. When the two-source merge leaves multiple candidates in a dead heat, we show a picker rather than guess. The cached pick becomes our durable ground truth.
Related reading
- How to Use the Wikidata Fact Box — the task-oriented guide to the feature.
- Knowledge Graphs — how the graph itself is built from your documents.
- Privacy Model — how external lookups fit into the data-residency picture.