
Daneel can index an entire website so you can search and ask questions about its content. There are two discovery methods: **Sitemap**, which reads the site's `sitemap.xml`, and **Web Crawl**, which discovers pages by following links. This guide covers when to use each and how to configure them.

## Prerequisites

- Daneel installed and an AI provider configured (any provider works for embedding)
- A browser tab open on the site you want to index

## Open the Site panel

Click the Daneel icon on any page, then open the **Site** tab (magnifying glass icon). Daneel automatically checks for sitemaps when the panel opens.

## Choose a discovery method

After the sitemap check completes, you'll see two options:

### Sitemap

Best for sites that maintain a `sitemap.xml`. Daneel discovers sitemaps automatically from `robots.txt` and standard locations (`/sitemap.xml`, `/sitemap_index.xml`). It also checks path-level candidates based on your current URL.
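
Conceptually, the candidate list can be derived from your current URL. The sketch below is illustrative only; the function name and the exact path-level rule are assumptions, not Daneel's internals:

```ts
// Illustrative: derive candidate sitemap URLs for the current page.
// (Sitemaps listed in robots.txt would be added in a separate step.)
function sitemapCandidates(pageUrl: string): string[] {
  const url = new URL(pageUrl);
  const candidates = new Set<string>([
    `${url.origin}/sitemap.xml`,
    `${url.origin}/sitemap_index.xml`,
  ]);

  // Path-level candidates, e.g. /docs/getting-started -> /docs/sitemap.xml
  let prefix = "";
  for (const segment of url.pathname.split("/").filter(Boolean)) {
    prefix += `/${segment}`;
    candidates.add(`${url.origin}${prefix}/sitemap.xml`);
  }
  return [...candidates];
}

sitemapCandidates("https://example.com/docs/getting-started");
// -> [".../sitemap.xml", ".../sitemap_index.xml",
//     ".../docs/sitemap.xml", ".../docs/getting-started/sitemap.xml"]
```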

When sitemaps are found:

1. Review the discovered sitemaps in the checklist. Each entry shows the URL and estimated page count.
2. Uncheck any sitemaps you don't want to include.
3. Set **Max pages** and **Depth**, then click **Crawl**.

### Web Crawl

Best for sites without a sitemap, or when the sitemap is incomplete. The crawler starts from your current page and discovers content by following every link it finds in the HTML, breadth-first.

When no sitemap is found, Daneel automatically selects Web Crawl. You can also switch to it manually when sitemaps exist but don't cover the full site.

1. Select the **Web Crawl** card.
2. Optionally set a **path prefix** to limit the crawl scope (see below).
3. Set **Max pages** and **Depth**, then click **Crawl**.
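
For intuition, breadth-first discovery amounts to a queue of pages bounded by the **Max pages** and **Depth** settings. The sketch below is illustrative only; `fetchLinks` is a hypothetical helper, not part of Daneel:

```ts
// Illustrative breadth-first discovery, bounded by Max pages and Depth.
// `fetchLinks` is a hypothetical helper returning the links found on a page.
async function crawl(
  start: string,
  maxPages: number,
  maxDepth: number,
  fetchLinks: (url: string) => Promise<string[]>,
): Promise<string[]> {
  const visited = new Set([start]);
  const queue = [{ url: start, depth: 0 }];
  const pages: string[] = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const { url, depth } = queue.shift()!;
    pages.push(url);
    if (depth >= maxDepth) continue; // stop expanding past the depth limit
    for (const link of await fetchLinks(url)) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return pages;
}
```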

## Use the path prefix filter

When Web Crawl is selected, a **Path prefix** field appears. Daneel infers a prefix from your current URL. For example, if you're on `example.com/docs/getting-started`, the prefix is set to `/docs`.

The crawler only follows links whose path starts with this prefix. This keeps the crawl focused on a section of the site instead of indexing everything.

- Edit the prefix to narrow or widen the scope
- Click the **x** button to clear it entirely and crawl the whole site
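
To make the behavior concrete, here is an illustrative sketch of prefix inference and matching. The heuristics shown are assumptions; Daneel's actual logic may differ:

```ts
// Illustrative prefix inference and matching; the real heuristic may differ.
function inferPrefix(pageUrl: string): string {
  const first = new URL(pageUrl).pathname.split("/").filter(Boolean)[0];
  return first ? `/${first}` : "";
}

function matchesPrefix(link: string, prefix: string): boolean {
  // An empty prefix means "crawl the whole site".
  return prefix === "" || new URL(link).pathname.startsWith(prefix);
}

const prefix = inferPrefix("https://example.com/docs/getting-started"); // "/docs"
matchesPrefix("https://example.com/docs/api/search", prefix); // true  -> followed
matchesPrefix("https://example.com/blog/hello", prefix);      // false -> skipped
```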

## Crawl settings

| Setting | Default | Description |
|---------|---------|-------------|
| Max pages | 50 | Maximum pages to fetch in this crawl session |
| Depth | 3 | For sitemap: nesting depth of sitemap indexes. For web crawl: BFS hops from the starting page |

## What happens during a crawl

Once you click **Crawl**, the task runs in the background:

1. **Discovery** finds page URLs (from sitemap or by following links)
2. **Extraction** converts each page's HTML to clean Markdown using Readability
3. **Chunking** splits the Markdown into overlapping segments
4. **Embedding** converts each chunk to a vector using the active embedding model
5. **Storage** saves vectors to IndexedDB, partitioned by domain

A progress bar shows crawl and embedding progress. You can close the panel or navigate away; the task continues. See [Monitor Background Tasks](/how-to/background-tasks/) for details.
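
To illustrate steps 3 and 4 of the pipeline, here is a minimal sketch of overlapping chunking. The sizes and the `embed`/`store` helpers in the comment are illustrative assumptions, not Daneel's actual parameters or API:

```ts
// Illustrative fixed-size chunking with overlap. The sizes are example values,
// not Daneel's real parameters.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Each chunk is then embedded and stored with its source URL, roughly:
//   for (const chunk of chunkText(markdown)) {
//     const vector = await embed(chunk);           // hypothetical embedding call
//     await store({ domain, url, chunk, vector }); // hypothetical IndexedDB write
//   }
```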

## Cancel a crawl

Click **Cancel** next to the progress bar, or go to **Settings > Tasks** and stop the task from there.

## After indexing

Once the crawl completes, the Site panel switches to search view. Type a question to search across all indexed pages. Results are ranked by semantic similarity and include source links.
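
Ranking by semantic similarity typically means comparing the query's embedding against each stored chunk vector, for example with cosine similarity. The sketch below is a generic illustration, not Daneel's exact scoring code:

```ts
// Generic cosine-similarity ranking over stored chunks.
type Chunk = { url: string; text: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function rank(queryVector: number[], chunks: Chunk[], topK = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(queryVector, y.vector) - cosine(queryVector, x.vector))
    .slice(0, topK);
}
```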

To re-index, clear, or manage stored data, see [Manage Site Indexes](/how-to/manage-indexes/).

## Web Crawl safety guards

The web crawler includes several protections against runaway crawls:

- **Same-origin only**: links to other domains are discovered but not followed
- **Query normalization**: pagination parameters (`page`, `offset`, `cursor`, etc.) are stripped, so `/results?page=1` and `/results?page=2` are treated as the same URL, as sketched below
- **Path depth cap**: URLs with more than 10 path segments are skipped
- **Queue limit**: the internal queue is capped at 3x the max pages setting
- **Retry with backoff**: server errors (5xx) are retried up to twice with exponential backoff; client errors (4xx) are skipped immediately
- **Optional robots.txt**: when enabled, the crawler respects `User-agent: *` Disallow rules
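
For illustration, the first three guards can be expressed as a single normalization step. The sketch below is an approximation, and the parameter list shows only the examples named above; Daneel's actual list and limits may differ:

```ts
// Illustrative normalization applying the same-origin, query-normalization,
// and path-depth guards.
const PAGINATION_PARAMS = new Set(["page", "offset", "cursor"]);

function normalizeForQueue(link: string, origin: string): string | null {
  const url = new URL(link, origin);
  if (url.origin !== origin) return null;                              // same-origin only
  if (url.pathname.split("/").filter(Boolean).length > 10) return null; // path depth cap
  for (const param of [...url.searchParams.keys()]) {
    if (PAGINATION_PARAMS.has(param)) url.searchParams.delete(param);
  }
  return url.toString();
}

normalizeForQueue("/results?page=1", "https://example.com");        // same as ?page=2
normalizeForQueue("https://other.com/page", "https://example.com"); // null -> not followed
```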

## Next steps

- Learn [how RAG works](/concepts/rag/) under the hood
- [Build a Document Vault](/guides/first-vault/) for local files
- [Monitor Background Tasks](/how-to/background-tasks/) to track crawl progress
