
To get started with document import, see [Build a Document Vault](/guides/first-vault/).

## Supported formats

| Format | Extensions | Conversion method | Notes |
|--------|-----------|-------------------|-------|
| PDF | `.pdf` | EdgeParse WASM | Structured Markdown extraction (no OCR for scanned PDFs) |
| Microsoft Word | `.docx` | Mammoth (DOCX → HTML → text) | Modern `.docx` only, not legacy `.doc` |
| Plain text | `.txt` | Direct read | UTF-8 assumed |
| HTML | `.html`, `.htm` | Turndown (HTML → Markdown) | Strips scripts, styles, and navigation |
| PowerPoint | `.pptx` | Text extraction from slides | Slide text only, no speaker notes or images |
| Excel | `.xls`, `.xlsx` | Cell text extraction | Text content from cells, not formulas |
| Markdown | `.md` | Direct read | Preserved as-is |
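
As a rough illustration of how the extensions in this table could map to formats, here is a minimal TypeScript sketch. The names `DocumentFormat`, `EXTENSION_TO_FORMAT`, and `detectFormat` are hypothetical and not part of Daneel's API. Case-insensitive matching is also an assumption.

```typescript
// Hypothetical format labels, one per row of the table above.
type DocumentFormat =
  | "pdf" | "docx" | "text" | "html" | "pptx" | "excel" | "markdown";

// Extension-to-format lookup built from the table (illustrative only).
const EXTENSION_TO_FORMAT: Record<string, DocumentFormat> = {
  ".pdf": "pdf",
  ".docx": "docx",
  ".txt": "text",
  ".html": "html",
  ".htm": "html",
  ".pptx": "pptx",
  ".xls": "excel",
  ".xlsx": "excel",
  ".md": "markdown",
};

// Returns the format for a filename, or null when the extension is unsupported.
// Assumption: matching ignores case, so "REPORT.PDF" is treated like "report.pdf".
function detectFormat(filename: string): DocumentFormat | null {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return null; // no extension at all
  const ext = filename.slice(dot).toLowerCase();
  return EXTENSION_TO_FORMAT[ext] ?? null;
}
```

With this sketch, `detectFormat("legacy.doc")` returns `null`, which matches the note that only modern `.docx` files are supported.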

## Conversion pipeline

1. **Format detection** — Daneel infers the format from the file extension.
2. **Conversion** — The `CompositeConverter` delegates to the converter for the detected format (for example, `PdfConverter`, `DocxConverter`, or `HtmlDocumentConverter`).
3. **Text output** — All formats are converted to plain text or Markdown.
4. **Chunking** — The text is split into overlapping chunks (default: 512 tokens with a 64-token overlap).
5. **Embedding** — Each chunk is embedded using the active embedding model.
6. **Deduplication** — A SHA-256 hash of the content prevents duplicate imports.
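
The chunking step can be sketched as a sliding window. This is a minimal illustration, not Daneel's implementation: `chunkTokens` is a hypothetical helper, and the usage example splits on whitespace as a stand-in for the real tokenizer, which depends on the embedding model.

```typescript
// Splits a token sequence into overlapping windows.
// Defaults mirror the documented values: 512 tokens per chunk, 64 tokens of overlap.
function chunkTokens(
  tokens: string[],
  chunkSize: number = 512,
  overlap: number = 64,
): string[][] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in the range [0, chunkSize)");
  }

  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[][] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the input,
    // so no trailing chunk consists only of already-covered overlap.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Usage: whitespace splitting is an assumption standing in for real tokenization.
const sampleTokens = "one two three four five six seven eight nine ten".split(/\s+/);
const sampleChunks = chunkTokens(sampleTokens, 4, 1);
```

With the defaults, each window advances by 448 tokens (512 minus 64), so consecutive chunks share 64 tokens of context.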

## Size limits

| Limit | Free plan | Paid plan |
|-------|-----------|-----------|
| Max file size | 1 MB | 10 MB |
| Max converted characters | 50,000 | 500,000 |
| Max chunks per document | 100 | 1,000 |
| Max documents per vault | 5 | 50 |
| Max vaults | 1 | Unlimited |
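
To make the table concrete, the limits could be expressed as a configuration object with a pre-import check. This is a sketch under stated assumptions: `PLAN_LIMITS`, `ImportCandidate`, and `checkImport` are hypothetical names, 1 MB is taken to be 1,048,576 bytes, and each limit is treated as inclusive. The source does not specify either of the last two points.

```typescript
type Plan = "free" | "paid";

interface PlanLimits {
  maxFileBytes: number;
  maxConvertedChars: number;
  maxChunksPerDocument: number;
  maxDocumentsPerVault: number;
  maxVaults: number; // Infinity models "Unlimited"
}

// Assumption: 1 MB = 1024 * 1024 bytes.
const MB = 1024 * 1024;

// Values copied from the size limits table.
const PLAN_LIMITS: Record<Plan, PlanLimits> = {
  free: {
    maxFileBytes: 1 * MB,
    maxConvertedChars: 50000,
    maxChunksPerDocument: 100,
    maxDocumentsPerVault: 5,
    maxVaults: 1,
  },
  paid: {
    maxFileBytes: 10 * MB,
    maxConvertedChars: 500000,
    maxChunksPerDocument: 1000,
    maxDocumentsPerVault: 50,
    maxVaults: Infinity,
  },
};

// Hypothetical description of a document about to be imported.
interface ImportCandidate {
  fileBytes: number;
  convertedChars: number;
  chunkCount: number;
  documentsAlreadyInVault: number;
}

// Returns the names of every limit the candidate would exceed (empty when it fits).
function checkImport(plan: Plan, candidate: ImportCandidate): string[] {
  const limits = PLAN_LIMITS[plan];
  const violations: string[] = [];
  if (candidate.fileBytes > limits.maxFileBytes) violations.push("file size");
  if (candidate.convertedChars > limits.maxConvertedChars) violations.push("converted characters");
  if (candidate.chunkCount > limits.maxChunksPerDocument) violations.push("chunks per document");
  // The vault is full when it already holds the maximum number of documents.
  if (candidate.documentsAlreadyInVault >= limits.maxDocumentsPerVault) {
    violations.push("documents per vault");
  }
  return violations;
}
```

Returning a list of violations rather than a boolean lets a caller report every exceeded limit at once.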

## Page content extraction

For web pages (Page Chat and Site Search modes), Daneel tries three extraction strategies in order, falling back to the next when one fails:

1. **Readability.js** — Mozilla's reader-mode extractor. Best for articles and blog posts.
2. **CSS cascade + Turndown** — Selects the main content area via CSS heuristics, then converts it to Markdown with Turndown. Used when Readability fails.
3. **Plain-text fallback** — Strips all HTML and returns raw text. Last resort.
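
The fallback order above can be sketched as a loop over strategies that returns the first non-empty result. The code below is illustrative only. `ExtractionStrategy`, `extractWithFallback`, and `stripHtml` are hypothetical names, the first two strategies would wrap Readability.js and Turndown in practice, and the regex-based `stripHtml` is a deliberately naive stand-in for the plain-text fallback.

```typescript
interface ExtractionStrategy {
  name: string;
  // Returns extracted content, or null when the strategy cannot handle the page.
  extract: (html: string) => string | null;
}

interface ExtractionResult {
  strategy: string;
  content: string;
}

// Tries each strategy in order; a thrown error or empty result counts as failure.
function extractWithFallback(
  html: string,
  strategies: ExtractionStrategy[],
): ExtractionResult | null {
  for (const strategy of strategies) {
    try {
      const content = strategy.extract(html);
      if (content !== null && content.trim().length > 0) {
        return { strategy: strategy.name, content: content.trim() };
      }
    } catch {
      // Fall through to the next strategy.
    }
  }
  return null;
}

// Naive last-resort extractor: drops script/style blocks, then all remaining tags.
// A regex is not a real HTML parser; this is only a sketch.
function stripHtml(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Usage with stub strategies standing in for the first two steps.
const demoStrategies: ExtractionStrategy[] = [
  { name: "readability", extract: () => null }, // stub: pretends Readability failed
  { name: "css-turndown", extract: () => { throw new Error("no main content"); } },
  { name: "plain-text", extract: stripHtml },
];
const demoResult = extractWithFallback(
  "<p>Hello <b>world</b></p><script>x()</script>",
  demoStrategies,
);
```

Keeping each strategy behind the same interface means the order can be changed or a strategy swapped without touching the loop.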

YouTube pages use a separate [transcript extraction pipeline](/how-to/youtube-chat/).
