
Supported File Formats

To get started with document import, see Build a Document Vault.

| Format | Extensions | Conversion method | Notes |
| --- | --- | --- | --- |
| PDF | .pdf | EdgeParse WASM | Structured Markdown extraction (no OCR for scanned PDFs) |
| Microsoft Word | .docx | Mammoth (DOCX → HTML → text) | Modern .docx only, not legacy .doc |
| Plain text | .txt | Direct read | UTF-8 assumed |
| HTML | .html, .htm | Turndown (HTML → Markdown) | Strips scripts, styles, and navigation |
| PowerPoint | .pptx | Text extraction from slides | Slide text only, no speaker notes or images |
| Excel | .xls, .xlsx | Cell text extraction | Text content from cells, not formulas |
| Markdown | .md | Direct read | Preserved as-is |

  1. Format detection — Daneel infers the format from the file extension.
  2. Conversion — The CompositeConverter delegates to the appropriate converter (PdfConverter, DocxConverter, or HtmlDocumentConverter).
  3. Text output — All formats are converted to plain text or Markdown.
  4. Chunking — The text is split into overlapping chunks (default: 512 tokens, 64 token overlap).
  5. Embedding — Each chunk is embedded using the active embedding model.
  6. Deduplication — SHA-256 content hash prevents duplicate imports.
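
Steps 1 and 4 above can be sketched roughly as follows. This is an illustration only, not Daneel's actual code: `detectFormat`, `chunkText`, and the `EXTENSION_TO_FORMAT` map are hypothetical names, and tokens are approximated by whitespace-separated words because the tokenizer is not specified here. Only the extensions and the 512/64 defaults are taken from this page.

```typescript
// Hypothetical sketch: names and structure are assumptions, not Daneel's API.
type Format = "pdf" | "docx" | "text" | "html" | "pptx" | "excel" | "markdown";

// Extensions taken from the supported-formats table.
const EXTENSION_TO_FORMAT: Record<string, Format> = {
  ".pdf": "pdf",
  ".docx": "docx",
  ".txt": "text",
  ".html": "html",
  ".htm": "html",
  ".pptx": "pptx",
  ".xls": "excel",
  ".xlsx": "excel",
  ".md": "markdown",
};

// Step 1: infer the format from the file extension (case-insensitive).
// Returns null for unsupported or missing extensions, e.g. legacy ".doc".
function detectFormat(filename: string): Format | null {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return null;
  const ext = filename.slice(dot).toLowerCase();
  return EXTENSION_TO_FORMAT[ext] ?? null;
}

// Step 4: split text into overlapping chunks.
// ASSUMPTION: a "token" is a whitespace-separated word; a real tokenizer
// would produce different boundaries and counts.
function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // each chunk starts this far after the last
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once the current chunk already reaches the end of the text.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

With the defaults, each chunk after the first begins 448 tokens after the previous one, so consecutive chunks share 64 tokens.
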

| Limit | Free plan | Paid plan |
| --- | --- | --- |
| Max file size | 1 MB | 10 MB |
| Max converted characters | 50,000 | 500,000 |
| Max chunks per document | 100 | 1,000 |
| Max documents per vault | 5 | 50 |
| Max vaults | 1 | Unlimited |
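
As a rough illustration of how these limits might be checked before an import, the sketch below encodes the table as a lookup and reports which limits a document would exceed. The names `LIMITS`, `PlanLimits`, and `checkImport` are hypothetical, and treating 1 MB as 1,048,576 bytes is an assumption; the page does not say whether limits are decimal or binary.

```typescript
// Hypothetical sketch: names are assumptions, values come from the limits table.
type Plan = "free" | "paid";

interface PlanLimits {
  maxFileBytes: number;
  maxConvertedChars: number;
  maxChunksPerDocument: number;
  maxDocumentsPerVault: number;
  maxVaults: number; // Infinity stands for "Unlimited"
}

const MB = 1024 * 1024; // ASSUMPTION: binary megabytes

const LIMITS: Record<Plan, PlanLimits> = {
  free: {
    maxFileBytes: 1 * MB,
    maxConvertedChars: 50_000,
    maxChunksPerDocument: 100,
    maxDocumentsPerVault: 5,
    maxVaults: 1,
  },
  paid: {
    maxFileBytes: 10 * MB,
    maxConvertedChars: 500_000,
    maxChunksPerDocument: 1_000,
    maxDocumentsPerVault: 50,
    maxVaults: Infinity,
  },
};

interface ImportCandidate {
  fileBytes: number;
  convertedChars: number;
  chunkCount: number;
}

// Returns a list of human-readable violations; an empty list means the
// document fits within the plan's limits.
function checkImport(
  plan: Plan,
  doc: ImportCandidate,
  documentsAlreadyInVault: number,
): string[] {
  const limits = LIMITS[plan];
  const violations: string[] = [];
  if (doc.fileBytes > limits.maxFileBytes) violations.push("file too large");
  if (doc.convertedChars > limits.maxConvertedChars) {
    violations.push("too many converted characters");
  }
  if (doc.chunkCount > limits.maxChunksPerDocument) violations.push("too many chunks");
  if (documentsAlreadyInVault >= limits.maxDocumentsPerVault) violations.push("vault is full");
  return violations;
}
```

For example, a 2 MB file would be reported as too large on the free plan but would pass the same check on the paid plan.
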

For web pages (Page Chat and Site Search modes), Daneel uses a three-strategy extraction pipeline:

  1. Readability.js — Mozilla’s reader-mode extractor. Best for articles and blog posts.
  2. CSS cascade + Turndown — Selects the main content area via CSS heuristics, converts to Markdown. Used when Readability fails.
  3. Plain-text fallback — Strips all HTML and returns raw text. Last resort.
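
The fallback order above could be expressed as a simple strategy chain. The sketch below is a minimal illustration under stated assumptions: `Extractor`, `extractContent`, and `stripTags` are hypothetical names, the first two strategies are passed in as plain functions rather than calling Readability.js or Turndown directly, and the regex-based tag stripping is a crude stand-in for whatever the plain-text fallback really does.

```typescript
// Hypothetical sketch: not Daneel's actual pipeline.
// A strategy returns extracted text, or null when it cannot handle the page.
type Extractor = (html: string) => string | null;

// Last-resort fallback: drop script/style blocks, strip remaining tags,
// and collapse whitespace. Regex stripping is an approximation only.
function stripTags(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Try each strategy in order; the first non-empty result wins.
// A strategy that throws is treated the same as one that returns null.
function extractContent(html: string, strategies: Extractor[]): string {
  for (const strategy of strategies) {
    try {
      const result = strategy(html);
      if (result !== null && result.trim().length > 0) return result;
    } catch {
      // Fall through to the next strategy.
    }
  }
  return stripTags(html);
}
```

In use, the strategies array would hold a Readability-based extractor first and a CSS-heuristics-plus-Markdown extractor second, so the plain-text path runs only when both fail.
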

YouTube pages use a separate transcript extraction pipeline.