Graph Analytics

A knowledge graph by itself is just a picture: nodes for entities, edges for connections. Daneel’s analytics layer turns that picture into a structured set of insights — which entities matter most, which ones bridge separate ideas, what topical clusters exist, and whether the graph is healthy.

This page explains what each insight means, why it’s useful, and what trade-offs come with the approach. To use these features in the extension, see How to Explore Your Knowledge Graph.

Visualizing 5,000 entities in 3D space is impressive but not directly useful. You can rotate it, zoom it, watch the physics simulation settle, and still walk away with no sense of what’s important. The graph needs to be summarized before it can be understood.

The analytics layer answers four questions a curious user would naturally ask:

  1. What are the most important entities? → Key Entities (importance ranking)
  2. What topics exist in this corpus? → Topics (community detection)
  3. What connects different topics? → Bridges (bridge detection)
  4. How is X related to Y? → Path Finder (shortest path search)

A fifth question — is my graph healthy? — is answered by the structural diagnostics that flag fragmentation and possible duplicate entities.

All these questions have well-known answers from the field of graph theory. Daneel’s analytics layer wraps the graphology library to compute them.

The “Key Entities” ranking uses PageRank, the same algorithm Google originally used to rank web pages. The intuition is recursive: an entity is important if it’s connected to other important entities.

A famous person mentioned only twice in your corpus, but in passages with many other important entities, can outrank a generic term mentioned hundreds of times in passing. PageRank rewards being part of the conversation, not being repeated.

This is different from raw mention counts. If your vault contains physics papers and “the” is mentioned 10,000 times, mention count is useless. PageRank looks at the network structure instead.

Trade-off: PageRank assumes the graph is connected enough for the recursion to flow. On heavily fragmented graphs, isolated clusters get inflated scores within their own bubble. Daneel’s “Graph Health” card warns you when this is happening.
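The recursion converges to a fixed point that can be computed by power iteration. Here is a minimal sketch on a toy graph — Daneel delegates the real computation to graphology, and the entities and damping factor here are purely illustrative:

```javascript
// Toy undirected graph as an adjacency list (hypothetical entities).
const adj = {
  einstein: ["relativity", "eddington"],
  relativity: ["einstein", "eddington"],
  eddington: ["einstein", "relativity", "royalsociety"],
  royalsociety: ["eddington"],
};

// Power iteration for PageRank with damping factor d.
function pagerank(adj, d = 0.85, iterations = 50) {
  const nodes = Object.keys(adj);
  const n = nodes.length;
  let rank = Object.fromEntries(nodes.map((v) => [v, 1 / n]));
  for (let i = 0; i < iterations; i++) {
    // Every node keeps a baseline (1 - d) / n, plus damped shares
    // received from its neighbors.
    const next = Object.fromEntries(nodes.map((v) => [v, (1 - d) / n]));
    for (const v of nodes) {
      const share = rank[v] / adj[v].length; // spread v's rank evenly
      for (const w of adj[v]) next[w] += d * share;
    }
    rank = next;
  }
  return rank;
}

const scores = pagerank(adj);
// eddington sits between everything, so it outranks the leaf node.
console.log(scores.eddington > scores.royalsociety); // true
```

Note how `royalsociety` scores low despite being adjacent to an important node: rank flows along connections, so a node's score depends on the whole network, not just its immediate neighbors.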

Topics: groups of things that travel together
Topics come from Louvain community detection, an algorithm that partitions a graph into clusters by maximizing internal connections relative to external connections. Two entities end up in the same community if they appear together more often than chance would predict.

In a knowledge graph from NER extraction, communities tend to correspond to:

  • Research domains in academic corpora (a “machine learning” cluster, a “molecular biology” cluster)
  • Geographic regions in news or travel content
  • Historical periods in history documents
  • Product lines in technical documentation
  • Topical themes in mixed-domain corpora

Each community gets an auto-generated label by combining its top two entities (e.g., “Einstein & Relativity”). The label is heuristic — it’s just a hint, not a semantic summary.
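The labeling rule can be sketched in a few lines (the member names and mention counts below are made up for illustration):

```javascript
// Hypothetical community members with mention counts.
const community = [
  { name: "Relativity", mentions: 41 },
  { name: "Einstein", mentions: 87 },
  { name: "Eddington", mentions: 12 },
];

// Label = the community's top two entities by mentions, joined with " & ".
function labelCommunity(members) {
  const top = [...members].sort((a, b) => b.mentions - a.mentions).slice(0, 2);
  return top.map((m) => m.name).join(" & ");
}

console.log(labelCommunity(community)); // "Einstein & Relativity"
```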

Trade-off: Community detection has no notion of meaning. It only looks at connection patterns. Two completely unrelated topics that happen to share a few documents can end up bundled into one community. The clustering is also non-deterministic — running it twice can produce slightly different communities.

Bridges come from betweenness centrality, which measures how often an entity sits on the shortest path between other pairs of entities. High betweenness means: “if you removed this entity, the graph would fragment.”

These are the connectors: entities that link otherwise separate communities. In a research corpus, a bridge might be an interdisciplinary scholar whose work spans two fields. In a business corpus, it might be a parent company that owns brands in different sectors.

Bridges are often the most interesting entities in a knowledge graph because they reveal non-obvious connections. The most-mentioned entity is usually the obvious one — but the bridge between Topic A and Topic B is something you wouldn’t have spotted by reading documents one at a time.

Trade-off: Betweenness is the most computationally expensive analytic — it scales as O(n × m). On graphs with thousands of nodes it’s still fast (a few hundred milliseconds), but it’s the reason analytics are computed once and cached, not on every interaction.
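To make the definition concrete, here is a brute-force betweenness computation on a toy two-cluster graph. Production implementations, graphology included, use Brandes' algorithm to reach the O(n × m) bound mentioned above; this sketch is cubic and only meant to show what the number counts:

```javascript
// Two clusters joined by a single bridge node "b".
const adj = {
  a1: ["a2", "b"], a2: ["a1", "b"],
  b: ["a1", "a2", "c1", "c2"],
  c1: ["c2", "b"], c2: ["c1", "b"],
};

// BFS from s: shortest-path distances and path counts (sigma) to every node.
function bfs(adj, s) {
  const dist = { [s]: 0 };
  const sigma = { [s]: 1 };
  const queue = [s];
  while (queue.length) {
    const v = queue.shift();
    for (const w of adj[v]) {
      if (!(w in dist)) { dist[w] = dist[v] + 1; sigma[w] = 0; queue.push(w); }
      if (dist[w] === dist[v] + 1) sigma[w] += sigma[v];
    }
  }
  return { dist, sigma };
}

// For every ordered (s, t) pair, credit each intermediate node v with the
// fraction of shortest s-t paths that pass through it. Ordered pairs double
// every score on an undirected graph, but the ranking is unchanged.
function betweenness(adj) {
  const nodes = Object.keys(adj);
  const info = Object.fromEntries(nodes.map((s) => [s, bfs(adj, s)]));
  const score = Object.fromEntries(nodes.map((v) => [v, 0]));
  for (const s of nodes) for (const t of nodes) {
    if (s === t) continue;
    for (const v of nodes) {
      if (v === s || v === t) continue;
      const { dist, sigma } = info[s];
      if (dist[v] + info[v].dist[t] === dist[t]) {
        score[v] += (sigma[v] * info[v].sigma[t]) / sigma[t];
      }
    }
  }
  return score;
}

const scores = betweenness(adj);
console.log(scores.b > scores.a1); // true: the bridge dominates
```

Every path between the `a` cluster and the `c` cluster must pass through `b`, so removing `b` would split the graph in two — exactly the "connector" property the Bridges panel surfaces.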

The Path Finder answers the most human question of all: “given these two entities, is there any connection between them, and what’s the shortest one?”

Daneel uses Dijkstra’s shortest path algorithm with one important twist: edges with stronger co-occurrence (more shared documents) count as shorter distances. Mathematically, the distance is 1 / weight. So a path through highly co-occurring entities is preferred over a path through weak ones, even if the weak path has fewer hops.

The result is a chain like:

Einstein → General Relativity → Eddington → Royal Society

Each step in the chain comes with the chunk IDs that established the link, so you can trace the path back to the source documents.

If multiple paths of equal total distance exist, Daneel returns up to 5 alternatives. This reveals different “stories” connecting the same two entities — Einstein might connect to the Nobel Prize through one collaborator and to the same prize through a different one.

Trade-off: Path search is constrained by the graph itself. If two entities are in different connected components (different islands), there’s no path. The Graph Health card tells you when fragmentation is severe enough to make this likely.
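The inverse-weight trick can be sketched with a textbook Dijkstra on a toy graph. The entities and weights below are made up, and Daneel computes this via graphology rather than a hand-rolled loop; the point is only to show why a strong two-hop route beats a weak direct edge:

```javascript
// Edges with co-occurrence weights (number of shared documents).
const edges = [
  ["einstein", "relativity", 12],
  ["relativity", "eddington", 8],
  ["einstein", "eddington", 1], // direct, but weak
];

// Build an adjacency list with distance = 1 / weight: strong links are "short".
const adj = {};
for (const [a, b, w] of edges) {
  (adj[a] ??= []).push([b, 1 / w]);
  (adj[b] ??= []).push([a, 1 / w]);
}

// Textbook Dijkstra (linear scan instead of a heap; fine at this scale).
function shortestPath(adj, source, target) {
  const dist = { [source]: 0 };
  const prev = {};
  const visited = new Set();
  while (true) {
    let v = null;
    for (const u in dist) {
      if (!visited.has(u) && (v === null || dist[u] < dist[v])) v = u;
    }
    if (v === null) return null; // target unreachable (different island)
    if (v === target) break;
    visited.add(v);
    for (const [w, d] of adj[v] ?? []) {
      if (!(w in dist) || dist[v] + d < dist[w]) {
        dist[w] = dist[v] + d;
        prev[w] = v;
      }
    }
  }
  const path = [target];
  while (path[0] !== source) path.unshift(prev[path[0]]);
  return path;
}

// Direct edge costs 1/1 = 1; the route via relativity costs
// 1/12 + 1/8 ≈ 0.21, so the strongly co-occurring path wins
// despite having more hops.
console.log(shortestPath(adj, "einstein", "eddington"));
// → ["einstein", "relativity", "eddington"]
```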

The Graph Health card combines three structural measures into a single traffic-light status:

  • Connected components — how many disjoint islands the graph contains. A healthy graph has one big island and maybe a few small satellites.
  • Largest component percent — what fraction of entities are in the biggest island. A healthy graph has 70%+ in the main cluster.
  • Possible duplicates — entity pairs that look textually similar but weren’t merged. Detected via the same heuristics the entity resolver uses (substring containment, edit distance, reversed word order).
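The first two measures reduce to a flood fill over the adjacency list. A minimal illustration (not Daneel's implementation):

```javascript
// Find disjoint islands by flood-fill over an adjacency list.
function connectedComponents(adj) {
  const seen = new Set();
  const components = [];
  for (const start of Object.keys(adj)) {
    if (seen.has(start)) continue;
    const stack = [start];
    const component = [];
    seen.add(start);
    while (stack.length) {
      const v = stack.pop();
      component.push(v);
      for (const w of adj[v]) {
        if (!seen.has(w)) { seen.add(w); stack.push(w); }
      }
    }
    components.push(component);
  }
  return components;
}

// Health summary: island count and share of nodes in the biggest island.
function graphHealth(adj) {
  const comps = connectedComponents(adj);
  const total = Object.keys(adj).length;
  const largest = Math.max(...comps.map((c) => c.length));
  return { components: comps.length, largestPercent: (100 * largest) / total };
}

// One main cluster plus a stray satellite.
const adj = {
  a: ["b", "c"], b: ["a", "c"], c: ["a", "b"],
  d: ["e"], e: ["d"],
};
console.log(graphHealth(adj)); // { components: 2, largestPercent: 60 }
```

This toy graph would already trip the health warning: only 60% of its entities sit in the main cluster, well below the 70% threshold above.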

A heavily fragmented graph (many small components) usually means one of two things:

  1. Entity resolution failed — the same entity exists under multiple slightly different names, splitting what should be one connected blob into many small ones. The “Possible duplicates” list flags these cases.
  2. Your corpus is genuinely disjoint — you imported documents about completely unrelated topics. The graph correctly reflects that.

The duplicates list is actionable: each entry shows the pair, the type, and the reason they look similar. You can use this to spot recurring entity-resolution failures and adjust your ontology or model accordingly.
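The three similarity heuristics named above can be sketched as follows. The edit-distance cutoff of 2 is an assumption for illustration, not Daneel's documented threshold:

```javascript
// Classic dynamic-programming (Levenshtein) edit distance.
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Return a reason string if the pair looks like a duplicate, else null.
function duplicateReason(a, b) {
  const x = a.toLowerCase(), y = b.toLowerCase();
  if (x.includes(y) || y.includes(x)) return "substring containment";
  if (x.split(/\s+/).reverse().join(" ") === y) return "reversed word order";
  if (editDistance(x, y) <= 2) return "small edit distance"; // assumed cutoff
  return null;
}

console.log(duplicateReason("Albert Einstein", "Einstein"));        // "substring containment"
console.log(duplicateReason("Einstein Albert", "Albert Einstein")); // "reversed word order"
console.log(duplicateReason("Eddington", "Edington"));              // "small edit distance"
console.log(duplicateReason("Einstein", "Bohr"));                   // null
```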

Visual encoding: sizing and coloring modes
The 3D visualization separates what’s measured (analytics) from how it’s drawn (sizing and coloring). You can swap either dimension without recomputing the underlying metrics.

Node size can represent:

  • Mentions (default) — how often the entity appears, log-scaled
  • Importance — PageRank score
  • Bridges — betweenness centrality
  • Connectivity — degree (raw connection count)

The size uses a square-root curve over a wide range so the natural power-law shape of centrality metrics stays visible — a handful of dramatic outliers stay dramatic instead of being flattened into uniform blobs.
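A plausible version of that mapping, assuming illustrative radius bounds of 2 and 20 pixels (the actual constants aren't documented here):

```javascript
// Map a metric value in [0, maxValue] to a pixel radius.
// sqrt lifts the mid-range so small nodes stay visible, while the
// ordering (and the dominance of outliers) is preserved.
function nodeRadius(value, maxValue, minR = 2, maxR = 20) {
  const t = Math.sqrt(value / maxValue); // normalized 0..1
  return minR + t * (maxR - minR);
}

// Mention counts are heavy-tailed, so log-scale them before sizing.
const mentions = [1, 10, 100, 1000];
const logged = mentions.map((m) => Math.log10(1 + m));
const maxLog = Math.max(...logged);
console.log(logged.map((v) => nodeRadius(v, maxLog).toFixed(1)));
```

A linear map over the raw counts would render the first three nodes as near-identical dots next to one giant sphere; the log + sqrt combination spreads them across the radius range while the 1000-mention outlier still reads as the biggest.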

Node color can represent:

  • Type (default) — the entity’s ontology type (person, location, organization, etc.)
  • Topic — which Louvain community it belongs to

Switching color modes also switches the bottom-left legend. In topic-color mode, the legend lists communities and lets you click one to filter the view.

The decoupling matters because the same graph tells different stories depending on what you ask. “Color by type” reveals the ontological mix. “Color by topic” reveals thematic clusters. “Size by importance” reveals influential entities. “Size by bridges” reveals structural connectors. Combine them however you like.

The graph tells you what’s in your corpus. Wikipedia tells you what the world knows about each entity. Daneel bridges the two: clicking any node in the 3D view triggers a Wikipedia search alongside the focus action.

The lookup uses Wikipedia’s prefix-search API to find pages whose titles match the entity name, returns up to 10 candidates with thumbnails and short descriptions, and lets you either:

  • Open an article in the document viewer (Daneel fetches the page, converts the HTML to markdown, and renders it inline)
  • Open the article on wikipedia.org in a new tab

The first option is useful because the article appears alongside the graph — you can read about an entity without losing your place in the visualization.

Results are cached in chrome.storage.local for 7 days, so repeat lookups don’t hit the Wikipedia API again. Empty results (no matching articles) are cached for 1 day to avoid repeatedly retrying broken queries.
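The two-tier TTL logic can be sketched with a plain Map standing in for chrome.storage.local; the `put`/`get` helpers are hypothetical names, not the extension's API:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// TTL rule from above: hits live 7 days, confirmed-empty results 1 day.
function ttlFor(results) {
  return results.length > 0 ? 7 * DAY_MS : 1 * DAY_MS;
}

const cache = new Map();

function put(query, results, now = Date.now()) {
  cache.set(query, { results, expiresAt: now + ttlFor(results) });
}

function get(query, now = Date.now()) {
  const entry = cache.get(query);
  if (!entry || now > entry.expiresAt) return null; // miss or expired
  return entry.results;
}

put("Einstein", [{ title: "Albert Einstein" }], 0);
put("zzzgibberish", [], 0);
console.log(get("Einstein", 2 * DAY_MS) !== null);     // true: still cached
console.log(get("zzzgibberish", 2 * DAY_MS) === null); // true: 1-day TTL expired
```

The shorter TTL on empty results is the interesting choice: a query that found nothing yesterday may find something after the next corpus import, so "no results" is allowed to go stale much sooner than a real hit.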

Trade-off: Wikipedia is general-purpose. If your corpus uses domain-specific terminology that Wikipedia doesn’t cover, the lookup may return empty or irrelevant results. The list of candidates helps with the opposite problem — when an entity name maps to multiple real-world things (e.g., five different “John Tate”s) and you need to pick the right one from context.

The analytics layer is an interpretation layer between raw extraction (NER + entity resolution) and human exploration (the 3D viz, panels, and Wikipedia). Each piece serves a different question, and the cross-references between them are the point — clicking a topic filters the graph, clicking an entity opens its neighborhood and Wikipedia, clicking a path highlights the connection chain.

You don’t need to know which algorithm is running underneath. The point is that “Importance”, “Topics”, “Bridges”, “Path”, and “Health” map to questions you’d ask about any large network of things, and Daneel answers them with well-understood graph theory operating entirely in your browser.