How to Index a Site
Daneel can index an entire website so you can search and ask questions about its content. There are two discovery methods: Sitemap, which reads the site’s sitemap.xml, and Web Crawl, which discovers pages by following links. This guide covers when to use each and how to configure them.
Prerequisites
- Daneel installed and an AI provider configured (any provider works for embedding)
- Navigate to the site you want to index
Open the Site panel
Click the Daneel icon on any page, then open the Site tab (magnifying glass icon). Daneel automatically checks for sitemaps when the panel opens.
Choose a discovery method
After the sitemap check completes, you’ll see two options:
Sitemap
Best for sites that maintain a sitemap.xml. Daneel discovers sitemaps automatically from robots.txt and standard locations (/sitemap.xml, /sitemap_index.xml). It also checks path-level candidates based on your current URL.
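As a rough sketch of that discovery order (the function name and candidate logic here are illustrative, not Daneel’s actual code), the lookup might work like this:

```ts
// Illustrative sketch of sitemap discovery; not Daneel's actual implementation.
async function discoverSitemaps(pageUrl: string): Promise<string[]> {
  const { origin, pathname } = new URL(pageUrl);
  const candidates = new Set<string>();

  // 1. robots.txt may list sitemaps explicitly via "Sitemap:" directives.
  try {
    const robots = await fetch(`${origin}/robots.txt`).then((r) => r.text());
    for (const line of robots.split("\n")) {
      const match = line.match(/^sitemap:\s*(\S+)/i);
      if (match) candidates.add(match[1]);
    }
  } catch {
    // robots.txt missing or unreachable; fall through to standard locations.
  }

  // 2. Standard root-level locations.
  candidates.add(`${origin}/sitemap.xml`);
  candidates.add(`${origin}/sitemap_index.xml`);

  // 3. Path-level candidates based on the current URL,
  //    e.g. example.com/docs/page -> example.com/docs/sitemap.xml
  const segments = pathname.split("/").filter(Boolean);
  for (let i = segments.length - 1; i > 0; i--) {
    candidates.add(`${origin}/${segments.slice(0, i).join("/")}/sitemap.xml`);
  }

  // Keep only candidates that actually respond with success.
  const found: string[] = [];
  for (const url of candidates) {
    const res = await fetch(url, { method: "HEAD" }).catch(() => null);
    if (res?.ok) found.push(url);
  }
  return found;
}
```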
When sitemaps are found:
- Review the discovered sitemaps in the checklist. Each entry shows the URL and estimated page count.
- Uncheck any sitemaps you don’t want to include.
- Set Max pages and Depth, then click Crawl.
Web Crawl
Best for sites without a sitemap, or when the sitemap is incomplete. The crawler starts from your current page and discovers content by following every link it finds in the HTML, breadth-first.
When no sitemap is found, Daneel automatically selects Web Crawl. You can also switch to it manually when sitemaps exist but don’t cover the full site.
- Select the Web Crawl card.
- Optionally set a path prefix to limit the crawl scope (see below).
- Set Max pages and Depth, then click Crawl.
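A minimal sketch of the breadth-first traversal described above (the link extraction is naive and this is not Daneel’s actual crawler):

```ts
// Minimal BFS crawl: fetch each frontier page, collect unseen same-origin
// links, and advance one depth level at a time until a cap is hit.
async function crawlBfs(startUrl: string, maxPages: number, maxDepth: number) {
  const origin = new URL(startUrl).origin;
  const visited = new Set<string>([startUrl]);
  let frontier = [startUrl];
  const pages: { url: string; html: string }[] = [];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (pages.length >= maxPages) return pages; // hard page cap
      const html = await fetch(url).then((r) => r.text());
      pages.push({ url, html });

      // Queue unseen same-origin links for the next BFS level.
      for (const m of html.matchAll(/href="([^"]+)"/g)) {
        try {
          const link = new URL(m[1], url);
          if (link.origin === origin && !visited.has(link.href)) {
            visited.add(link.href);
            next.push(link.href);
          }
        } catch {
          // Ignore malformed hrefs.
        }
      }
    }
    frontier = next;
  }
  return pages;
}
```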
Use the path prefix filter
When Web Crawl is selected, a Path prefix field appears. Daneel infers a prefix from your current URL. For example, if you’re on example.com/docs/getting-started, the prefix is set to /docs.
The crawler only follows links whose path starts with this prefix. This keeps the crawl focused on a section of the site instead of indexing everything.
- Edit the prefix to narrow or widen the scope
- Click the x button to clear it entirely and crawl the whole site
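In code terms, the filter amounts to a prefix check on each link’s path. A sketch (these helper names are hypothetical, not Daneel’s API):

```ts
// Infer a path prefix from the current URL, e.g. /docs/getting-started -> /docs.
function inferPrefix(pageUrl: string): string {
  const segments = new URL(pageUrl).pathname.split("/").filter(Boolean);
  return segments.length > 1 ? `/${segments[0]}` : "";
}

// Follow a link only if its path starts with the prefix.
function shouldFollow(link: string, prefix: string): boolean {
  return new URL(link).pathname.startsWith(prefix);
}

// inferPrefix("https://example.com/docs/getting-started") === "/docs"
// shouldFollow("https://example.com/docs/api", "/docs")   === true
// shouldFollow("https://example.com/blog/post", "/docs")  === false
```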
Crawl settings
| Setting | Default | Description |
|---|---|---|
| Max pages | 50 | Maximum pages to fetch in this crawl session |
| Depth | 3 | For sitemap: nesting depth of sitemap indexes. For web crawl: BFS hops from the starting page |
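Thought of as an options object, the settings might take a shape like the following (names and types are illustrative, not Daneel’s actual API):

```ts
// Hypothetical shape of the crawl options.
interface CrawlOptions {
  maxPages: number;    // hard cap on pages fetched this session (default 50)
  depth: number;       // sitemap index nesting, or BFS hops for web crawl (default 3)
  pathPrefix?: string; // web crawl only: restrict followed links to this prefix
}

const defaults: CrawlOptions = { maxPages: 50, depth: 3 };
```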
What happens during a crawl
Once you click Crawl, the task runs in the background:
- Discovery finds page URLs (from sitemap or by following links)
- Extraction converts each page’s HTML to clean Markdown using Readability
- Chunking splits the Markdown into overlapping segments
- Embedding converts each chunk to a vector using the active embedding model
- Storage saves vectors to IndexedDB, partitioned by domain
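The chunking step is the easiest to picture. A minimal sketch of fixed-size chunking with overlap, using placeholder sizes since Daneel’s actual values aren’t documented here:

```ts
// Split text into overlapping chunks; the sizes below are placeholders.
function chunkMarkdown(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
// Each chunk shares `overlap` characters with its neighbor, so a sentence
// that straddles a boundary still appears intact in at least one chunk.
```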
A progress bar shows crawl and embedding progress. You can close the panel or navigate away; the task continues. See Monitor Background Tasks for details.
Cancel a crawl
Click Cancel next to the progress bar, or go to Settings > Tasks and stop the task from there.
After indexing
Once the crawl completes, the Site panel switches to search view. Type a question to search across all indexed pages. Results are ranked by semantic similarity and include source links.
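Semantic-similarity ranking typically means comparing the embedded query against each stored chunk vector, most commonly with cosine similarity. An illustrative sketch (not Daneel’s actual search code):

```ts
// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks against an embedded query vector, best match first.
function rank(query: number[], chunks: { url: string; vector: number[] }[]) {
  return chunks
    .map((c) => ({ url: c.url, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score);
}
```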
To re-index, clear, or manage stored data, see Manage Site Indexes.
Web Crawl safety guards
The web crawler includes several protections against runaway crawls:
- Same-origin only: links to other domains are discovered but not followed
- Query normalization: pagination parameters (`page`, `offset`, `cursor`, etc.) are stripped, so `/results?page=1` and `/results?page=2` are treated as the same URL (see the sketch after this list)
- Path depth cap: URLs with more than 10 path segments are skipped
- Queue limit: the internal queue is capped at 3x the max pages setting
- Retry with backoff: server errors (5xx) are retried up to twice with exponential backoff; client errors (4xx) are skipped immediately
- Optional robots.txt: when enabled, the crawler respects `Disallow` rules under `User-agent: *`
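The query-normalization guard boils down to deleting known pagination parameters before deduplicating URLs. A sketch, where the exact parameter list is an assumption:

```ts
// Assumed list of pagination parameters to strip; not Daneel's actual set.
const PAGINATION_PARAMS = ["page", "offset", "cursor"];

function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  for (const param of PAGINATION_PARAMS) url.searchParams.delete(param);
  url.hash = ""; // fragments never change the fetched document
  return url.href;
}

// normalizeUrl("https://example.com/results?page=1")
//   === normalizeUrl("https://example.com/results?page=2")
```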
Next steps
- Learn how RAG works under the hood
- Build a Document Vault for local files
- Monitor Background Tasks to track crawl progress