AI Providers

To get started with a provider, see Connect a Cloud Provider. For a conceptual comparison, see The Provider Spectrum.

Daneel supports five LLM backends: WebGPU, Ollama, Gemini Nano, Claude, and Azure OpenAI. All implement the same interface, so switching providers changes the AI brain without affecting the rest of the experience.

WebGPU

Runs AI models directly on your GPU using WebGPU and ONNX Runtime. No server, no API key, no internet after the first model download.

| Property | Value |
| --- | --- |
| Data residency | On-device |
| Internet required | No (after model cache) |
| Tool calling | Experimental (prompt-based XML tags) |
| Streaming | Yes |
| Thinking/reasoning | Yes (model-dependent) |
| Cost | Free |

Available models:

Models are auto-selected based on your GPU capabilities. The catalog includes 20+ models from Liquid AI, Microsoft, HuggingFace, PrismML, DeepSeek, Zhipu, Alibaba, Meta, Google, and IBM, ranging from 350M to 3B+ parameters. Open Settings > AI Models to browse the full catalog with hardware compatibility scores.

Default model: Granite 4.0 Micro 3B (q4f16).

Quantization formats: Models ship in various quantization levels. Most use q4 (4-bit), which balances quality and size. Some models also offer q4f16 (4-bit with fp16 compute, requires shader-f16), q8 (8-bit), q2 (2-bit), and q1 (1-bit). The 1-bit and 2-bit formats are new, enabled by @huggingface/transformers 4.1.0.

Notable: Bonsai 1.7B from PrismML is available in both q4 (1.1 GB) and q1 (291 MB). The q1 variant is the lightest thinking-capable model in the catalog, designed for low-end GPUs and fast cold starts.

Configuration: Settings > WebGPU. Model selection, quantization level, and context window are auto-configured based on GPU detection.
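
As a sketch of what the loader does under the hood, running one of these models with @huggingface/transformers in the browser looks roughly like this (the model ID and generation options are illustrative, not Daneel's actual code):

```typescript
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Load a quantized model onto the GPU via WebGPU.
// dtype selects the quantization level (q4, q4f16, q8, ...).
const generator = await pipeline(
  "text-generation",
  "onnx-community/granite-4.0-micro-ONNX-web", // illustrative model ID
  { device: "webgpu", dtype: "q4f16" },
);

// Stream tokens to the UI as they are produced.
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (text) => console.log(text),
});

await generator(
  [{ role: "user", content: "Summarize this page in one sentence." }],
  { max_new_tokens: 128, streamer },
);
```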

Ollama

Connects to a local Ollama server via its OpenAI-compatible API.

| Property | Value |
| --- | --- |
| Data residency | Local network |
| Internet required | No (LAN only) |
| Tool calling | Yes (OpenAI function format) |
| Streaming | Yes |
| Thinking/reasoning | Yes (think-block stripping) |
| Cost | Free (self-hosted) |

Configuration: Settings > Ollama.

| Setting | Default | Description |
| --- | --- | --- |
| Base URL | http://localhost:11434 | Ollama server address |
| Model | | Selected from detected installed models |
| Availability timeout | 3,000 ms | How long to wait for the server probe |
| Model list timeout | 5,000 ms | How long to wait for model enumeration |

Daneel automatically probes the Ollama server when the settings panel is opened. Model management (pull, delete) is available in the same panel.
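
A minimal sketch of such a probe, using Ollama's native /api/tags endpoint and the timeouts listed above (Daneel's internal implementation may differ):

```typescript
const BASE_URL = "http://localhost:11434";

// Probe the server, then enumerate installed models.
async function probeOllama(): Promise<string[]> {
  // Availability probe: any response within 3 s means the server is up.
  await fetch(BASE_URL, { signal: AbortSignal.timeout(3_000) });

  // Model enumeration via the native /api/tags endpoint.
  const res = await fetch(`${BASE_URL}/api/tags`, {
    signal: AbortSignal.timeout(5_000),
  });
  const { models } = await res.json();
  return models.map((m: { name: string }) => m.name);
}
```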

Gemini Nano

Uses Chrome’s built-in Gemini Nano model via the Chrome AI API.

| Property | Value |
| --- | --- |
| Data residency | On-device |
| Internet required | No |
| Tool calling | Experimental (prompt-based XML tags) |
| Streaming | Yes |
| Thinking/reasoning | No |
| Cost | Free |

Configuration: Settings > Gemini Nano (language selection only). Availability is auto-detected; Chrome 120+ with the Gemini Nano flag enabled is required.
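
Availability detection might look roughly like this. Chrome's built-in Prompt API has changed shape across releases, so treat the names below as illustrative of one recent variant rather than a stable contract:

```typescript
// Provided by Chrome when the built-in model is present.
declare const LanguageModel: {
  availability(): Promise<"unavailable" | "downloadable" | "downloading" | "available">;
};

async function detectGeminiNano(): Promise<boolean> {
  // Feature-detect first: the global simply does not exist elsewhere.
  if (typeof LanguageModel === "undefined") return false;
  return (await LanguageModel.availability()) === "available";
}
```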

Claude

Connects to Anthropic’s Claude models via the Anthropic API.

| Property | Value |
| --- | --- |
| Data residency | Third-party cloud (Anthropic) |
| Internet required | Yes |
| Tool calling | Yes (native tool_use blocks) |
| Streaming | Yes (SSE) |
| Thinking/reasoning | Yes |
| Cost | Per-token (see below) |

Available models:

| Model | Input cost | Output cost | Context |
| --- | --- | --- | --- |
| Claude Opus 4.7 | $5 / 1M tokens | $25 / 1M tokens | 200K |
| Claude Opus 4.6 | $5 / 1M tokens | $25 / 1M tokens | 200K |
| Claude Sonnet 4.6 | $3 / 1M tokens | $15 / 1M tokens | 200K |
| Claude Haiku 4.5 | $1 / 1M tokens | $5 / 1M tokens | 200K |

Cost annotations appear next to each response in the chat panel.

Configuration: Settings > Claude. API key is encrypted with AES-256-GCM and stored locally. The key never leaves your browser unencrypted.
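
A minimal sketch of that kind of encryption with the WebCrypto API (key generation and storage details here are illustrative, not Daneel's exact scheme):

```typescript
// A 256-bit AES-GCM key, generated once and kept locally (e.g. IndexedDB).
const key = await crypto.subtle.generateKey(
  { name: "AES-GCM", length: 256 },
  false, // non-extractable: raw key material is never exposed to JS
  ["encrypt", "decrypt"],
);

async function encryptApiKey(apiKey: string) {
  const iv = crypto.getRandomValues(new Uint8Array(12)); // 96-bit GCM nonce
  const ciphertext = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv },
    key,
    new TextEncoder().encode(apiKey),
  );
  return { iv, ciphertext }; // store both; the IV is needed to decrypt
}
```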

Azure OpenAI

Connects to Azure OpenAI Service deployments.

| Property | Value |
| --- | --- |
| Data residency | Your Azure tenant |
| Internet required | Yes |
| Tool calling | Yes (OpenAI function format) |
| Streaming | Yes |
| Thinking/reasoning | Model-dependent |
| Cost | Per your Azure pricing |

Authentication: API Key or Entra ID (OAuth2). See How to Set Up Azure OpenAI for configuration steps.
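
For orientation, a raw chat request against an Azure OpenAI deployment with API-key auth looks roughly like this (resource name, deployment name, and API version are placeholders for your own configuration):

```typescript
const apiKey = "<your-azure-openai-key>"; // placeholder
const endpoint =
  "https://<resource>.openai.azure.com/openai/deployments/<deployment>" +
  "/chat/completions?api-version=2024-06-01";

const res = await fetch(endpoint, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "api-key": apiKey, // Entra ID instead: "Authorization: Bearer <token>"
  },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Hello" }],
    stream: true, // tokens arrive as server-sent events
  }),
});
```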

Embeddings

Daneel uses a local embedding model for all vector operations (site indexing, vault search, knowledge graph).

| Model | Dimensions | Context | Backend | Size |
| --- | --- | --- | --- | --- |
| BGE Small EN v1.5 (default) | 384 | 512 tokens | WebGPU fp16 | ~23 MB |
| Granite Embedding | 384 | 1,024 tokens | WebGPU q8 | |
| MiniLM-L6-v2 | 384 | 256 tokens | WebGPU q8 | |

Embeddings always run locally, regardless of your LLM provider choice. Requests are batched at a maximum of 32 chunks to avoid GPU memory issues.
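
A sketch of this batched flow with @huggingface/transformers; the model ID below is the public checkpoint matching the default BGE Small EN v1.5, though Daneel's internal pipeline may differ:

```typescript
import { pipeline } from "@huggingface/transformers";

// Local embedding model on WebGPU, matching the default above.
const embed = await pipeline(
  "feature-extraction",
  "Xenova/bge-small-en-v1.5", // assumed public checkpoint
  { device: "webgpu", dtype: "fp16" },
);

const BATCH_SIZE = 32; // cap batches to avoid GPU memory spikes

async function embedChunks(chunks: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const out = await embed(batch, { pooling: "mean", normalize: true });
    vectors.push(...(out.tolist() as number[][]));
  }
  return vectors;
}
```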

Vector Stores

| Implementation | Persistence | Use case |
| --- | --- | --- |
| IndexedDBVectorStore | Persistent (survives browser restart) | Production: site indexes, vaults |
| GPUCosineSearch | In-memory (GPU-accelerated) | <5 ms search over 50k+ chunks |
| InMemoryVectorStore | Ephemeral | Testing only |
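
All three stores rank chunks by cosine similarity. A CPU reference version of that search is sketched below; GPUCosineSearch presumably performs the same arithmetic in a shader:

```typescript
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the indices of the k chunks most similar to the query vector.
function topK(query: Float32Array, chunks: Float32Array[], k = 5): number[] {
  return chunks
    .map((v, i) => [cosine(query, v), i] as const)
    .sort((x, y) => y[0] - x[0])
    .slice(0, k)
    .map(([, i]) => i);
}
```

Because the embedding pipeline normalizes its vectors, cosine similarity reduces to a plain dot product, which is what lets a GPU-accelerated variant run the search as essentially one matrix multiply.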

Tool Calling

| Provider | Strategy | Reliability | Notes |
| --- | --- | --- | --- |
| Claude | Native tool_use blocks | High | Best MCP experience |
| Ollama | OpenAI function format | High | Depends on model |
| Azure OpenAI | OpenAI function format | High | Depends on deployment |
| WebGPU | Prompt-based XML tags | Low | Small models struggle with tool format |
| Gemini Nano | Prompt-based XML tags | Low | 3B model often misformats calls |
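
For the two prompt-based providers, tool calls have to be parsed out of plain model text. An illustrative parser, assuming a tool_call JSON-in-XML convention (the exact tags Daneel uses may differ):

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Extract <tool_call>{"name": ..., "arguments": {...}}</tool_call> spans
// from model output. Small models often misformat these, hence the Low
// reliability rating above.
function parseToolCalls(output: string): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const match of output.matchAll(/<tool_call>([\s\S]*?)<\/tool_call>/g)) {
    try {
      calls.push(JSON.parse(match[1]) as ToolCall);
    } catch {
      // Skip malformed calls rather than failing the whole response.
    }
  }
  return calls;
}
```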