How to Index a Site
Daneel can index an entire website so you can search and ask questions about its content. There are two discovery methods: Sitemap, which reads the site’s sitemap.xml, and Web Crawl, which discovers pages by following links. This guide covers when to use each and how to configure them.
Prerequisites
- Daneel installed and an AI provider configured (any provider works for embedding)
- Navigate to the site you want to index
Open the Site panel
Click the Daneel icon on any page, then open the Site tab (magnifying glass icon). Daneel automatically checks for sitemaps when the panel opens.
Choose a discovery method
After the sitemap check completes, you’ll see two options:
Sitemap
Best for sites that maintain a sitemap.xml. Daneel discovers sitemaps automatically from robots.txt and standard locations (/sitemap.xml, /sitemap_index.xml). It also checks path-level candidates based on your current URL.
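As a rough sketch of that discovery order (the function name and candidate logic here are illustrative, not Daneel’s actual code), the lookup might work like this:

```ts
// Illustrative sketch of sitemap discovery; not Daneel's actual implementation.
async function discoverSitemaps(pageUrl: string): Promise<string[]> {
  const { origin, pathname } = new URL(pageUrl);
  const candidates = new Set<string>();

  // 1. robots.txt may list sitemaps explicitly via "Sitemap:" directives.
  try {
    const robots = await fetch(`${origin}/robots.txt`).then((r) => r.text());
    for (const line of robots.split("\n")) {
      const match = line.match(/^sitemap:\s*(\S+)/i);
      if (match) candidates.add(match[1]);
    }
  } catch {
    // robots.txt missing or unreachable; fall through to standard locations.
  }

  // 2. Standard root-level locations.
  candidates.add(`${origin}/sitemap.xml`);
  candidates.add(`${origin}/sitemap_index.xml`);

  // 3. Path-level candidates based on the current URL,
  //    e.g. example.com/docs/page -> example.com/docs/sitemap.xml
  const segments = pathname.split("/").filter(Boolean);
  for (let i = segments.length - 1; i > 0; i--) {
    candidates.add(`${origin}/${segments.slice(0, i).join("/")}/sitemap.xml`);
  }

  // Keep only candidates that actually respond with success.
  const found: string[] = [];
  for (const url of candidates) {
    const res = await fetch(url, { method: "HEAD" }).catch(() => null);
    if (res?.ok) found.push(url);
  }
  return found;
}
```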
When sitemaps are found:
- Review the discovered sitemaps in the checklist. Each entry shows the URL and estimated page count.
- Uncheck any sitemaps you don’t want to include.
- Set Max pages and Depth, then click Crawl.
Web Crawl
Best for sites without a sitemap, or when the sitemap is incomplete. The crawler starts from your current page and discovers content by following every link it finds in the HTML, breadth-first.
When no sitemap is found, Daneel automatically selects Web Crawl. You can also switch to it manually when sitemaps exist but don’t cover the full site.
- Select the Web Crawl card.
- Optionally set a path prefix to limit the crawl scope (see below).
- Set Max pages and Depth, then click Crawl.
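A minimal sketch of the breadth-first traversal described above (the link extraction is naive and this is not Daneel’s actual crawler):

```ts
// Minimal BFS crawl: fetch each frontier page, collect unseen same-origin
// links, and advance one depth level at a time until a cap is hit.
async function crawlBfs(startUrl: string, maxPages: number, maxDepth: number) {
  const origin = new URL(startUrl).origin;
  const visited = new Set<string>([startUrl]);
  let frontier = [startUrl];
  const pages: { url: string; html: string }[] = [];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (pages.length >= maxPages) return pages; // hard page cap
      const html = await fetch(url).then((r) => r.text());
      pages.push({ url, html });

      // Queue unseen same-origin links for the next BFS level.
      for (const m of html.matchAll(/href="([^"]+)"/g)) {
        try {
          const link = new URL(m[1], url);
          if (link.origin === origin && !visited.has(link.href)) {
            visited.add(link.href);
            next.push(link.href);
          }
        } catch {
          // Ignore malformed hrefs.
        }
      }
    }
    frontier = next;
  }
  return pages;
}
```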
Use the path prefix filter
When Web Crawl is selected, a Path prefix field appears. Daneel infers a prefix from your current URL. For example, if you’re on example.com/docs/getting-started, the prefix is set to /docs.
The crawler only follows links whose path starts with this prefix. This keeps the crawl focused on a section of the site instead of indexing everything.
- Edit the prefix to narrow or widen the scope
- Click the x button to clear it entirely and crawl the whole site
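In code terms, the filter amounts to a prefix check on each link’s path. A sketch (these helper names are hypothetical, not Daneel’s API):

```ts
// Infer a path prefix from the current URL, e.g. /docs/getting-started -> /docs.
function inferPrefix(pageUrl: string): string {
  const segments = new URL(pageUrl).pathname.split("/").filter(Boolean);
  return segments.length > 1 ? `/${segments[0]}` : "";
}

// Follow a link only if its path starts with the prefix.
function shouldFollow(link: string, prefix: string): boolean {
  return new URL(link).pathname.startsWith(prefix);
}

// inferPrefix("https://example.com/docs/getting-started") === "/docs"
// shouldFollow("https://example.com/docs/api", "/docs")   === true
// shouldFollow("https://example.com/blog/post", "/docs")  === false
```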
Crawl settings
| Setting | Default | Description |
|---|---|---|
| Max pages | 50 | Maximum pages to fetch in this crawl session |
| Depth | 3 | For sitemap: nesting depth of sitemap indexes. For web crawl: BFS hops from the starting page |
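Thought of as an options object, the settings might take a shape like the following (names and types are illustrative, not Daneel’s actual API):

```ts
// Hypothetical shape of the crawl options.
interface CrawlOptions {
  maxPages: number;    // hard cap on pages fetched this session (default 50)
  depth: number;       // sitemap index nesting, or BFS hops for web crawl (default 3)
  pathPrefix?: string; // web crawl only: restrict followed links to this prefix
}

const defaults: CrawlOptions = { maxPages: 50, depth: 3 };
```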
What happens during a crawl
Once you click Crawl, the task runs in the background:
- Discovery finds page URLs (from sitemap or by following links)
- Extraction converts each page’s HTML to clean Markdown using Readability
- Chunking splits the Markdown into overlapping segments
- Embedding converts each chunk to a vector using the active embedding model
- Storage saves vectors to IndexedDB, partitioned by domain
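The chunking step is the easiest to picture. A minimal sketch of fixed-size chunking with overlap, using placeholder sizes since Daneel’s actual values aren’t documented here:

```ts
// Split text into overlapping chunks; the sizes below are placeholders.
function chunkMarkdown(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
// Each chunk shares `overlap` characters with its neighbor, so a sentence
// that straddles a boundary still appears intact in at least one chunk.
```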
A progress bar shows crawl and embedding progress. You can close the panel or navigate away; the task continues. See Monitor Background Tasks for details.
Cancel a crawl
Click Cancel next to the progress bar, or go to Settings > Tasks and stop the task from there.
After indexing
Once the crawl completes, the Site panel switches to search view. Type a question to search across all indexed pages. Results are ranked by semantic similarity and include source links.
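Semantic-similarity ranking typically means comparing the embedded query against each stored chunk vector, most commonly with cosine similarity. An illustrative sketch (not Daneel’s actual search code):

```ts
// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks against an embedded query vector, best match first.
function rank(query: number[], chunks: { url: string; vector: number[] }[]) {
  return chunks
    .map((c) => ({ url: c.url, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score);
}
```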
To re-index, clear, or manage stored data, see Manage Site Indexes.
Web Crawl safety guards
The web crawler includes several protections against runaway crawls:
- Same-origin only: links to other domains are discovered but not followed
- Query normalization: pagination parameters (`page`, `offset`, `cursor`, etc.) are stripped, so `/results?page=1` and `/results?page=2` are treated as the same URL (see the sketch after this list)
- Path depth cap: URLs with more than 10 path segments are skipped
- Queue limit: the internal queue is capped at 3x the max pages setting
- Retry with backoff: server errors (5xx) are retried up to twice with exponential backoff; client errors (4xx) are skipped immediately
- Optional robots.txt: when enabled, the crawler respects `Disallow` rules under `User-agent: *`
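The query-normalization guard boils down to deleting known pagination parameters before deduplicating URLs. A sketch, where the exact parameter list is an assumption:

```ts
// Assumed list of pagination parameters to strip; not Daneel's actual set.
const PAGINATION_PARAMS = ["page", "offset", "cursor"];

function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  for (const param of PAGINATION_PARAMS) url.searchParams.delete(param);
  url.hash = ""; // fragments never change the fetched document
  return url.href;
}

// normalizeUrl("https://example.com/results?page=1")
//   === normalizeUrl("https://example.com/results?page=2")
```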
Next steps
- Learn how RAG works under the hood
- Build a Document Vault for local files
- Monitor Background Tasks to track crawl progress