Supported File Formats
To get started with document import, see Build a Document Vault.
Supported formats
| Format | Extensions | Conversion method | Notes |
|---|---|---|---|
| PDF | .pdf | EdgeParse WASM | Structured Markdown extraction (no OCR for scanned PDFs) |
| Microsoft Word | .docx | Mammoth (DOCX → HTML → text) | Modern .docx only, not legacy .doc |
| Plain text | .txt | Direct read | UTF-8 assumed |
| HTML | .html, .htm | Turndown (HTML → Markdown) | Strips scripts, styles, and navigation |
| PowerPoint | .pptx | Text extraction from slides | Slide text only, no speaker notes or images |
| Excel | .xls, .xlsx | Cell text extraction | Text content from cells, not formulas |
| Markdown | .md | Direct read | Preserved as-is |
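
As a rough illustration of how the table above could drive extension-based detection, the sketch below maps a file name to a format label. The names `Format`, `EXTENSION_TO_FORMAT`, and `detectFormat` are hypothetical and are not taken from Daneel's code. Case-insensitive matching is also an assumption.

```typescript
// Hypothetical lookup built from the table above; not Daneel's actual code.
type Format = "pdf" | "docx" | "text" | "html" | "pptx" | "excel" | "markdown";

const EXTENSION_TO_FORMAT: Record<string, Format> = {
  ".pdf": "pdf",
  ".docx": "docx",
  ".txt": "text",
  ".html": "html",
  ".htm": "html",
  ".pptx": "pptx",
  ".xls": "excel",
  ".xlsx": "excel",
  ".md": "markdown",
};

// Returns the format for a file name, or null when the extension is unsupported.
// Assumption: extensions are matched case-insensitively.
function detectFormat(fileName: string): Format | null {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return null; // no extension, or a dotfile such as ".gitignore"
  const ext = fileName.slice(dot).toLowerCase();
  return EXTENSION_TO_FORMAT[ext] ?? null;
}
```

Under these assumptions, `detectFormat("legacy.doc")` returns `null`, which matches the note that only modern .docx files are supported.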
Conversion pipeline
- Format detection: Daneel infers the format from the file extension.
- Conversion: The CompositeConverter delegates to the appropriate converter (PdfConverter, DocxConverter, or HtmlDocumentConverter).
- Text output: All formats are converted to plain text or Markdown.
- Chunking: The text is split into overlapping chunks (default: 512 tokens with a 64-token overlap).
- Embedding: Each chunk is embedded using the active embedding model.
- Deduplication: A SHA-256 content hash prevents duplicate imports.
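
The chunking and deduplication steps can be sketched as follows. This is a minimal illustration, not Daneel's implementation. The names `chunkTokens`, `contentHash`, and `ImportRegistry` are hypothetical, and the sketch assumes the caller supplies an already tokenized array, since the tokenizer is not documented here. It also assumes the hash is computed over the converted text, and it uses Node's `node:crypto` module for brevity.

```typescript
import { createHash } from "node:crypto";

// Split a token sequence into overlapping windows.
// Defaults mirror the documented values (512 tokens, 64-token overlap).
function chunkTokens(tokens: string[], size = 512, overlap = 64): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap; // each window starts this far after the previous one
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

// Hex-encoded SHA-256 digest of the text.
// Assumption: the hash covers the converted text, encoded as UTF-8.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hypothetical in-memory registry that rejects content it has already seen.
class ImportRegistry {
  private seen = new Set<string>();

  // Returns true if the content is new and was recorded, false for a duplicate.
  register(text: string): boolean {
    const hash = contentHash(text);
    if (this.seen.has(hash)) return false;
    this.seen.add(hash);
    return true;
  }
}
```

With the defaults, consecutive chunks start 448 tokens apart, so the last 64 tokens of one chunk are repeated at the start of the next. In a browser, the same digest could be computed asynchronously with `crypto.subtle.digest("SHA-256", data)`.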
Size limits
| Limit | Free plan | Paid plan |
|---|---|---|
| Max file size | 1 MB | 10 MB |
| Max converted characters | 50,000 | 500,000 |
| Max chunks per document | 100 | 1,000 |
| Max documents per vault | 5 | 50 |
| Max vaults | 1 | Unlimited |
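
One way to picture these limits is as a per-plan configuration checked before import. The sketch below is illustrative only: `PlanLimits`, `LIMITS`, and `checkDocument` are hypothetical names, 1 MB is assumed to mean 1,048,576 bytes, and values exactly at a limit are assumed to be allowed. How Daneel actually responds to an oversized document (rejecting or truncating it) is not covered here.

```typescript
type Plan = "free" | "paid";

interface PlanLimits {
  maxFileBytes: number;
  maxConvertedChars: number;
  maxChunksPerDocument: number;
  maxDocumentsPerVault: number;
  maxVaults: number; // Infinity stands in for "Unlimited"
}

const MB = 1024 * 1024; // assumption: binary megabytes

// Values copied from the table above.
const LIMITS: Record<Plan, PlanLimits> = {
  free: {
    maxFileBytes: 1 * MB,
    maxConvertedChars: 50_000,
    maxChunksPerDocument: 100,
    maxDocumentsPerVault: 5,
    maxVaults: 1,
  },
  paid: {
    maxFileBytes: 10 * MB,
    maxConvertedChars: 500_000,
    maxChunksPerDocument: 1_000,
    maxDocumentsPerVault: 50,
    maxVaults: Infinity,
  },
};

interface DocumentStats {
  fileBytes: number;
  convertedChars: number;
  chunkCount: number;
}

// Returns a list of human-readable violations; an empty list means the document fits.
function checkDocument(plan: Plan, doc: DocumentStats): string[] {
  const limits = LIMITS[plan];
  const violations: string[] = [];
  if (doc.fileBytes > limits.maxFileBytes) violations.push("file size");
  if (doc.convertedChars > limits.maxConvertedChars) violations.push("converted characters");
  if (doc.chunkCount > limits.maxChunksPerDocument) violations.push("chunks per document");
  return violations;
}
```

For example, a 2 MB file with 10,000 converted characters and 10 chunks would produce one violation on the free plan and none on the paid plan.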
Page content extraction
For web pages (Page Chat and Site Search modes), Daneel uses a three-strategy extraction pipeline:
- Readability.js — Mozilla’s reader-mode extractor. Best for articles and blog posts.
- CSS cascade + Turndown — Selects the main content area via CSS heuristics, converts to Markdown. Used when Readability fails.
- Plain-text fallback — Strips all HTML and returns raw text. Last resort.
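
The fallback order above can be sketched as a chain that tries each strategy in turn and returns the first usable result. This is a simplified illustration: `Extractor`, `extractWithFallback`, and `stripTags` are hypothetical names, and the real Readability.js and Turndown calls are left out. The sketch assumes a strategy "fails" when it throws, returns `null`, or returns only whitespace.

```typescript
// A strategy takes raw HTML and returns extracted text, or null on failure.
type Extractor = (html: string) => string | null;

// Try each strategy in order and return the first non-empty result.
// Assumption: throwing, returning null, or returning blank text all count as failure.
function extractWithFallback(html: string, strategies: Extractor[]): string | null {
  for (const strategy of strategies) {
    try {
      const result = strategy(html);
      if (result !== null && result.trim().length > 0) return result;
    } catch {
      // Treat an exception as a failed strategy and move on to the next one.
    }
  }
  return null;
}

// Naive stand-in for the plain-text fallback: drops script and style blocks,
// removes the remaining tags, and collapses whitespace.
const stripTags: Extractor = (html) =>
  html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
```

In this shape, the Readability-based and CSS-based extractors would be passed first and second, with `stripTags` (or an equivalent) as the last entry in the array.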
YouTube pages use a separate transcript extraction pipeline.