Supported File Formats

To get started with document import, see Build a Document Vault.

Supported formats

Format	Extensions	Conversion method	Notes
PDF	`.pdf`	EdgeParse WASM	Structured Markdown extraction (no OCR for scanned PDFs)
Microsoft Word	`.docx`	Mammoth (DOCX → HTML → text)	Modern `.docx` only, not legacy `.doc`
Plain text	`.txt`	Direct read	UTF-8 assumed
HTML	`.html`, `.htm`	Turndown (HTML → Markdown)	Strips scripts, styles, and navigation
PowerPoint	`.pptx`	Text extraction from slides	Slide text only, no speaker notes or images
Excel	`.xls`, `.xlsx`	Cell text extraction	Text content from cells, not formulas
Markdown	`.md`	Direct read	Preserved as-is

Each imported file passes through the following steps:

Format detection — Daneel infers the format from the file extension.
Conversion — The CompositeConverter delegates to the appropriate converter (PdfConverter, DocxConverter, or HtmlDocumentConverter).
Text output — All formats are converted to plain text or Markdown.
Chunking — The text is split into overlapping chunks (default: 512 tokens per chunk, with a 64-token overlap between consecutive chunks).
Embedding — Each chunk is embedded using the active embedding model.
Deduplication — SHA-256 content hash prevents duplicate imports.
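
The detection, chunking, and deduplication steps above can be sketched roughly as follows. This is an illustrative sketch, not Daneel's actual code: the names `detectFormat`, `chunkTokens`, `contentHash`, and `ImportRegistry` are hypothetical, the tokens are assumed to be already produced by some tokenizer, the hash is assumed to be taken over the converted text, and Node's `node:crypto` module stands in for whatever hashing API the real implementation uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical extension-to-format map, mirroring the table above.
const FORMAT_BY_EXTENSION = new Map<string, string>([
  ["pdf", "pdf"], ["docx", "docx"], ["txt", "text"],
  ["html", "html"], ["htm", "html"], ["pptx", "pptx"],
  ["xls", "excel"], ["xlsx", "excel"], ["md", "markdown"],
]);

// Infer the format from the file extension (case-insensitive).
// Returns undefined for unsupported extensions such as legacy .doc.
function detectFormat(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  return FORMAT_BY_EXTENSION.get(filename.slice(dot + 1).toLowerCase());
}

// Split a token sequence into overlapping chunks.
// Each chunk starts (size - overlap) tokens after the previous one.
function chunkTokens(tokens: string[], size = 512, overlap = 64): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[][] = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a chunk has reached the end of the input.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// SHA-256 hex digest of the text (assumption: hashing the converted text).
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hypothetical in-memory registry that rejects content it has seen before.
class ImportRegistry {
  private seen = new Set<string>();

  // Returns true if the content is new and was recorded, false if duplicate.
  register(text: string): boolean {
    const hash = contentHash(text);
    if (this.seen.has(hash)) return false;
    this.seen.add(hash);
    return true;
  }
}
```

With the default settings, each chunk starts 448 tokens after the previous one, so the last 64 tokens of a chunk reappear at the start of the next. A real implementation would persist the hashes rather than keep them in memory.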

For web pages (Page Chat and Site Search modes), Daneel tries three extraction strategies in order, falling back to the next when one fails:

Readability.js — Mozilla’s reader-mode extractor. Best for articles and blog posts.
CSS cascade + Turndown — Selects the main content area via CSS heuristics, then converts it to Markdown. Used when Readability fails.
Plain-text fallback — Strips all HTML and returns raw text. Last resort.
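
The fallback order can be sketched as a simple strategy chain. This is a minimal illustration under assumptions, not Daneel's implementation: the `Extractor` type and `extractWithFallback` function are hypothetical, the first two strategies are stubs that simulate failure instead of calling Readability.js or Turndown, and the plain-text fallback is a crude regex-based stand-in.

```typescript
// Hypothetical shape for one extraction strategy.
type Extractor = {
  name: string;
  // Returns extracted text, or null when the strategy cannot handle the page.
  run: (html: string) => string | null;
};

interface ExtractionResult {
  strategy: string;
  text: string;
}

// Try each strategy in order and return the first non-empty result.
// A strategy that throws is treated the same as one that returns null.
function extractWithFallback(
  html: string,
  strategies: Extractor[],
): ExtractionResult | null {
  for (const strategy of strategies) {
    let text: string | null;
    try {
      text = strategy.run(html);
    } catch {
      text = null;
    }
    if (text !== null && text.trim().length > 0) {
      return { strategy: strategy.name, text: text.trim() };
    }
  }
  return null;
}

// Crude stand-in for the last-resort strategy: drop script/style blocks,
// strip the remaining tags, and collapse whitespace.
const plainTextFallback: Extractor = {
  name: "plain-text",
  run: (html) =>
    html
      .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim(),
};

// Stubs that simulate the first two strategies failing on a given page.
const readabilityStub: Extractor = { name: "readability", run: () => null };
const cssCascadeStub: Extractor = {
  name: "css-cascade",
  run: () => {
    throw new Error("no main content area found");
  },
};

const page =
  "<html><body><script>track()</script><p>Hello <b>world</b></p></body></html>";
const result = extractWithFallback(page, [
  readabilityStub,
  cssCascadeStub,
  plainTextFallback,
]);
console.log(result);
```

Keeping each strategy behind the same small interface means the order can be changed, or a strategy added, without touching the loop itself.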