
RAG (Retrieval-Augmented Generation) is the technique Daneel uses to answer questions about websites and documents. Instead of sending an entire site to the AI, Daneel finds the most relevant passages first, then asks the AI to answer based on those passages.

This is what powers [Site Search](/guides/first-site-index/) and [Document Vault](/guides/first-vault/).

## The pipeline

RAG in Daneel has four stages:

### 1. Content acquisition

**For sites:** Daneel discovers sitemaps, crawls pages breadth-first up to a configurable depth and page count, and extracts text using a three-strategy pipeline: Readability.js for articles, CSS cascade + Turndown for structured pages, and a plain-text fallback for everything else.
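
Conceptually, the fallback chain looks something like the sketch below. The length check and selector list are illustrative assumptions, not Daneel's actual heuristics.

```typescript
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

function extractText(doc: Document): string {
  // Strategy 1: Readability.js for article-like pages.
  const article = new Readability(doc.cloneNode(true) as Document).parse();
  if (article?.textContent && article.textContent.length > 200) {
    return article.textContent;
  }

  // Strategy 2: grab structured regions via CSS selectors, convert the HTML to Markdown.
  const main = doc.querySelector("main, article, #content");
  if (main) {
    return new TurndownService().turndown(main.innerHTML);
  }

  // Strategy 3: plain-text fallback.
  return doc.body?.innerText ?? "";
}
```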

**For vaults:** You import files directly. Daneel converts them to text using format-specific converters (EdgeParse for PDFs, Mammoth for DOCX, Turndown for HTML).
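
A rough sketch of the format dispatch, with a hypothetical `pdfToText` wrapper standing in for the EdgeParse call:

```typescript
import mammoth from "mammoth";
import TurndownService from "turndown";

// Hypothetical wrapper around the EdgeParse PDF converter.
declare function pdfToText(buffer: ArrayBuffer): Promise<string>;

async function fileToText(file: File): Promise<string> {
  if (file.name.endsWith(".pdf")) {
    return pdfToText(await file.arrayBuffer());
  }
  if (file.name.endsWith(".docx")) {
    const { value } = await mammoth.extractRawText({ arrayBuffer: await file.arrayBuffer() });
    return value;
  }
  if (file.type === "text/html") {
    return new TurndownService().turndown(await file.text());
  }
  return file.text(); // plain text and everything else
}
```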

**For pages:** The content script extracts the current page's text in real time. YouTube videos get special treatment — Daneel fetches the transcript via the InnerTube API.
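
In rough terms the content script branches like this; `fetchTranscriptViaInnerTube` is a hypothetical helper, and the actual InnerTube request is not shown.

```typescript
// Hypothetical helper; the real transcript fetch goes through the InnerTube API.
declare function fetchTranscriptViaInnerTube(videoId: string): Promise<string>;

async function extractCurrentPage(): Promise<string> {
  const url = new URL(location.href);
  if (url.hostname.endsWith("youtube.com") && url.pathname === "/watch") {
    const videoId = url.searchParams.get("v");
    if (videoId) return fetchTranscriptViaInnerTube(videoId);
  }
  return document.body.innerText; // regular pages: extract the visible text in real time
}
```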

### 2. Chunking

Raw text is split into overlapping chunks. Daneel uses recursive chunking (via Chonkie) with these defaults:

- **Chunk size:** 512 tokens
- **Overlap:** 64 tokens

Overlap ensures that important context near chunk boundaries isn't lost. You can adjust chunk size in [Settings > Indexes](/reference/settings/#indexes).

A single page produces up to 200 chunks (configurable up to 2,000). This cap prevents a single massive page from dominating the index.
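
For illustration, here is a simplified sliding-window version of the size, overlap, and cap behaviour. Chonkie's recursive chunker additionally splits on semantic boundaries (paragraphs, then sentences) before falling back to a token window, which this flat version does not.

```typescript
function chunkTokens(
  tokens: string[],
  chunkSize = 512,
  overlap = 64,
  maxChunks = 200,
): string[][] {
  const step = chunkSize - overlap; // each chunk adds 448 new tokens
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length && chunks.length < maxChunks; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
  }
  return chunks;
}
```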

### 3. Embedding

Each chunk is converted to a vector (a list of numbers) by an embedding model. Daneel runs this locally, using the BGE Small EN v1.5 model on WebGPU:

- **384 dimensions** per vector
- **fp16 quantization** for GPU performance
- **Batch size of 32** to prevent GPU memory issues

The embedding model runs entirely in your browser. Even when you use Claude or Azure for the LLM, embeddings are always local.
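
For reference, batched local embedding with transformers.js looks roughly like this. Whether Daneel uses this exact library is an assumption; the model, fp16 quantization, and batch size mirror the defaults above.

```typescript
import { pipeline } from "@huggingface/transformers";

const BATCH_SIZE = 32;

// Load BGE Small EN v1.5 on WebGPU with fp16 weights.
const embedder = await pipeline("feature-extraction", "Xenova/bge-small-en-v1.5", {
  device: "webgpu",
  dtype: "fp16",
});

async function embedChunks(chunks: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  // Embed 32 chunks at a time so a large page doesn't exhaust GPU memory.
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const output = await embedder(batch, { pooling: "mean", normalize: true });
    vectors.push(...(output.tolist() as number[][])); // each vector has 384 dimensions
  }
  return vectors;
}
```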

Vectors are stored in IndexedDB, partitioned by domain (for sites) or vault ID (for documents).
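
The layout below is illustrative only: a single object store of chunk records with an index on the partition key (domain or vault ID), so retrieval can scan one site or vault at a time.

```typescript
interface ChunkRecord {
  partition: string;    // e.g. "docs.example.com" or a vault ID
  url: string;          // source page or file
  text: string;         // the chunk's text
  vector: Float32Array; // 384-dimensional embedding
}

function openVectorStore(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("daneel-vectors", 1); // hypothetical database name
    req.onupgradeneeded = () => {
      const store = req.result.createObjectStore("chunks", { autoIncrement: true });
      store.createIndex("byPartition", "partition");
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}
```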

### 4. Retrieval + generation

When you ask a question:

1. Your question is embedded using the same model.
2. Cosine similarity finds the top-k most similar chunks (default: 50 candidates, narrowed to 15 source URLs).
3. A keyword boost (15% weight) supplements semantic similarity — this helps with exact term matches that embedding models sometimes miss.
4. Chunks scoring below a minimum threshold (0.6) are filtered out.
5. The top chunks are assembled into a prompt with source URLs.
6. The prompt + your question go to the active LLM provider for answer generation.

The AI sees the relevant context and source links, then generates a grounded response. You see the answer with clickable source references.
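
Here is a sketch of the scoring steps above (cosine similarity, keyword boost, and threshold filtering), assuming the partition's vectors are already in memory. The exact keyword formula and the 0.85/0.15 blend are assumptions; the 15% weight, 0.6 threshold, and 50-candidate default match the values above.

```typescript
interface IndexedChunk { text: string; url: string; vector: Float32Array; }
interface ScoredChunk { text: string; url: string; score: number; }

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rankChunks(
  queryVector: Float32Array,
  queryTerms: string[],
  chunks: IndexedChunk[],
  topK = 50,
  minScore = 0.6,
): ScoredChunk[] {
  return chunks
    .map((c) => {
      const semantic = cosine(queryVector, c.vector);
      // Keyword boost: fraction of query terms that appear verbatim in the chunk.
      const hits = queryTerms.filter((t) => c.text.toLowerCase().includes(t.toLowerCase()));
      const keyword = queryTerms.length ? hits.length / queryTerms.length : 0;
      return { text: c.text, url: c.url, score: 0.85 * semantic + 0.15 * keyword };
    })
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```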

## Why local embedding matters

Because embedding runs locally, your documents and site content are never sent to a cloud service for indexing. The only time content reaches a cloud provider is during step 4, when the assembled prompt (selected chunks, not the full corpus) goes to the LLM. And even that step is optional — use WebGPU or Ollama to keep everything local.

For more on data flow, see [Privacy Model](/concepts/privacy/).

## GPU-accelerated search

For large indexes (50,000+ chunks), Daneel uses GPU-accelerated cosine similarity search. This keeps search times under 5 ms even at scale, where a sequential CPU search would take noticeably longer.
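
To give a feel for what GPU scoring involves, here is a minimal WebGPU compute pass that computes one dot product per chunk (equivalent to cosine similarity when vectors are normalized at index time). Buffer names, workgroup size, and the shader itself are illustrative, not Daneel's actual implementation; types assume `@webgpu/types`.

```typescript
const WGSL = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> query  : array<f32>;       // 384 floats
  @group(0) @binding(1) var<storage, read> chunks : array<f32>;       // N * 384 floats
  @group(0) @binding(2) var<storage, read_write> scores : array<f32>; // N floats

  const DIM : u32 = 384u;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&scores)) { return; }
    var dot = 0.0;
    for (var d = 0u; d < DIM; d = d + 1u) {
      dot = dot + query[d] * chunks[i * DIM + d];
    }
    scores[i] = dot;
  }
`;

export async function gpuScore(query: Float32Array, flatChunks: Float32Array): Promise<Float32Array> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");
  const device = await adapter.requestDevice();
  const n = flatChunks.length / 384;

  // Upload the query and the flattened chunk vectors as storage buffers.
  const upload = (data: Float32Array) => {
    const buf = device.createBuffer({
      size: data.byteLength,
      usage: GPUBufferUsage.STORAGE,
      mappedAtCreation: true,
    });
    new Float32Array(buf.getMappedRange()).set(data);
    buf.unmap();
    return buf;
  };
  const scoreBuf = device.createBuffer({ size: n * 4, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC });
  const readBuf = device.createBuffer({ size: n * 4, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: WGSL }), entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: upload(query) } },
      { binding: 1, resource: { buffer: upload(flatChunks) } },
      { binding: 2, resource: { buffer: scoreBuf } },
    ],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(n / 64)); // one thread per chunk
  pass.end();
  encoder.copyBufferToBuffer(scoreBuf, 0, readBuf, 0, n * 4);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const scores = new Float32Array(readBuf.getMappedRange().slice(0));
  readBuf.unmap();
  return scores;
}
```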

## Trade-offs

**Chunk size** is a balance between context and precision. Larger chunks provide more context per result but may include irrelevant text. Smaller chunks are more precise but may miss surrounding context. The 512-token default works well for most content.

**Top-k** controls how many chunks the AI sees. More chunks give the AI more information but increase prompt size (and cost, for token-billed providers). The default of 15 source URLs strikes a balance.

**Embedding model** quality affects retrieval accuracy. BGE Small is compact and fast but less accurate than larger models. This is a deliberate trade-off for a browser extension where model size matters.
