Speech in Daneel

Speech is one of the few features where the same user action maps to three genuinely different trade-offs. Reading a reply aloud can stay entirely on your device, or it can stream text to a cloud service for better prosody, or it can land somewhere in between with OS-provided voices. Daneel exposes that choice honestly rather than picking for you.

This page explains the spectrum, why Kokoro gets to be the privacy-first option, and a few engineering decisions that shape how speech actually feels in use.

Every text-to-speech provider in Daneel carries a privacy profile describing where your text goes. Two fields matter: leavesProcess (does the text cross the browser sandbox?) and leavesMachine (does it leave your device?).

| Provider | Leaves process | Leaves machine | Observer |
| --- | --- | --- | --- |
| Kokoro 82M | no | no | none |
| System voices (local) | yes | no | browser vendor |
| System voices (Google cloud) | yes | yes | browser vendor |
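In code, profiles like these might be modeled as a small record per provider. This is a sketch only: the two field names follow the prose, while the `observer` field and the provider keys are illustrative.

```typescript
// Sketch of a per-provider privacy profile (leavesProcess / leavesMachine
// are the two fields named in the prose; the rest is illustrative).
interface TtsPrivacyProfile {
  leavesProcess: boolean;  // does the text cross the browser sandbox?
  leavesMachine: boolean;  // does it leave your device?
  observer: string | null; // who outside Daneel can see the text, if anyone
}

const profiles: Record<string, TtsPrivacyProfile> = {
  "kokoro-82m":          { leavesProcess: false, leavesMachine: false, observer: null },
  "system-local":        { leavesProcess: true,  leavesMachine: false, observer: "browser vendor" },
  "system-google-cloud": { leavesProcess: true,  leavesMachine: true,  observer: "browser vendor" },
};
```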

Kokoro is the only option where no component outside Daneel ever sees your text. System voices go through the browser’s Speech Synthesis engine, which is part of the OS but not Daneel. Google cloud voices genuinely leave your machine for richer prosody. The three tiers are intentional, and they are visible in the settings panel as privacy pills.

For the parallel spectrum applied to speech recognition, see the provider table in Speech Reference. The short version: the browser recognizer streams to Google today; the coming Moonshine provider will close that gap with a local model.

Kokoro’s ONNX weights ship in several quantizations: fp32 (~326 MB), fp16 (~165 MB), q8 (~80 MB), and a few others. The instinct is to pick the smallest to minimize download. For WebGPU, the instinct is wrong.

Quantized ONNX models on WebGPU rely on dequantization ops that, as of this writing, have no WebGPU kernel. ONNX Runtime quietly assigns those ops to the CPU, which bounces tensors across the CPU/GPU boundary on every forward pass. The result is synthesis latency measured in tens of seconds for what should take one.

Daneel uses fp32 on WebGPU because pure fp32 inference has no such ops and runs entirely on the GPU. The download is larger, but it is a one-time cost, and the runtime is 3 to 5 times faster. The kokoro-js library’s README recommends the same combination. When a smaller quantization is worth using, it is worth using for WASM (CPU) execution, not for WebGPU.
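The rule reduces to a table with one row per backend. A minimal sketch, assuming a helper of this shape (the function and type names are illustrative; kokoro-js itself takes `dtype` and `device` options at model load time):

```typescript
// Hypothetical helper: pick a Kokoro quantization per execution backend.
type Device = "webgpu" | "wasm";
type Dtype = "fp32" | "q8";

function pickDtype(device: Device): Dtype {
  // WebGPU: quantized models contain dequantize ops with no WebGPU kernel,
  // forcing CPU<->GPU tensor bounces on every forward pass; fp32 stays on GPU.
  // WASM (CPU): a smaller quantization like q8 carries no such penalty.
  return device === "webgpu" ? "fp32" : "q8";
}
```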

If your hardware does not support WebGPU, Daneel falls back to the System voices provider. Kokoro is not a viable option on WASM-only devices today.

A user-visible detail that only exists because of a specific architecture: audio keeps playing when you switch tabs.

The AudioContext that plays Kokoro’s PCM is owned by Daneel’s background host page, not by the tab you are browsing. When Kokoro produces a chunk of audio, the PCM is posted from the worker to the host page. The host page’s AudioContext schedules the buffer on the WebAudio timeline at a specific absolute time, computed as max(currentTime, end_of_previous_chunk). The hardware audio driver executes the schedule; reordering is physically impossible once a chunk is committed.

One consequence is that synthesis can run ahead of playback without introducing gaps. Daneel pipelines chunk N+1’s synthesis while chunk N plays, with a backpressure limit of one chunk ahead. Another consequence is that navigating away from the tab where you started playback does not interrupt the audio. The AudioContext survives the tab switch.
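The scheduling rule itself is one line of arithmetic. A pure sketch (hypothetical helper, not Daneel's code): each chunk starts at `max(currentTime, end_of_previous_chunk)`, so pipelined synthesis plays back gaplessly, and a chunk that arrives late never overlaps the one before it.

```typescript
// Compute the absolute WebAudio start time for the next PCM chunk.
function scheduleChunk(
  currentTime: number, // AudioContext.currentTime at enqueue
  prevEnd: number,     // scheduled end time of the previous chunk
  duration: number     // duration of the new chunk, in seconds
): { start: number; end: number } {
  const start = Math.max(currentTime, prevEnd);
  return { start, end: start + duration };
}
```

If chunk N is scheduled to end at t = 3.2 and chunk N+1's synthesis finishes early at t = 1.9, the new chunk is committed at 3.2 and playback is seamless; if synthesis lags until t = 3.5, the chunk starts at 3.5 instead of overlapping.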

When you click Play on message A and, partway through, click Play on message B, you expect A to stop cleanly and B to start. No overlap, no lingering last sentence of A.

The implementation is mechanical rather than clever. Every new TTS request at the host first aborts every prior in-flight request and calls reset() on the playback queue. Only then does it register itself. The “latest wins” semantics are enforced at the host boundary, not negotiated between widget and host over a cancel round-trip.

An earlier iteration of this feature relied on the widget sending a tts-cancel before the new tts-synthesize, with host-side handling that assumed they arrived in order. They did not always, which produced an audible bug where message A finished reading while B was already halfway through. Moving the preemption to unconditional host-side abort removed the race.
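The host-side shape can be sketched with standard `AbortController` plumbing; the class and method names here are illustrative, not Daneel's actual API:

```typescript
// "Latest wins" enforced at the host boundary: every new request
// unconditionally aborts all in-flight requests before registering itself,
// so no widget-side cancel round-trip can race the new synthesize call.
class TtsHost {
  private inflight = new Set<AbortController>();

  synthesize(_text: string): AbortSignal {
    for (const ctl of this.inflight) ctl.abort(); // preempt prior requests
    this.inflight.clear();
    // Real code would also reset() the playback queue at this point.
    const ctl = new AbortController();
    this.inflight.add(ctl);
    return ctl.signal;
  }
}
```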

Kokoro’s style vector, which controls prosody, is offset into a voice tensor using the token length of the input. Very short inputs (a bare title, an eight-word heading) pick a low-offset region of the tensor where the prosody is unstable and can mangle or repeat the phrase. Slightly longer inputs pick the stable middle region.

The chunker has a boring job: take a markdown message, strip formatting, split on paragraph and sentence boundaries, and emit chunks in a bounded character range. One non-obvious detail is that when the first chunk is tiny (a heading preceding a paragraph), it gets forward-merged into the next chunk even if the combined length exceeds the soft target, up to a hard ceiling. This protects Kokoro from the unstable-prosody region.
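The forward-merge rule might look like the following sketch; the thresholds are assumptions for illustration, since the real soft target and hard ceiling are not documented here.

```typescript
// Hypothetical thresholds: real values belong to the chunker's config.
const MIN_FIRST = 30;     // below this, a first chunk risks unstable prosody
const HARD_CEILING = 600; // a merged chunk may exceed the soft target, not this

// Forward-merge a tiny leading chunk (e.g. a bare heading) into the next one.
function mergeTinyFirstChunk(chunks: string[]): string[] {
  if (chunks.length < 2) return chunks;
  const merged = chunks[0] + " " + chunks[1];
  if (chunks[0].length < MIN_FIRST && merged.length <= HARD_CEILING) {
    return [merged, ...chunks.slice(2)];
  }
  return chunks;
}
```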

Markdown never reaches Kokoro. Headings become plain text, code fences become (Code block.), math becomes (Math formula.), Mermaid diagrams become (Diagram.). Kokoro sees only speakable prose.
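A minimal sketch of that flattening, handling fences and headings line by line (the real pipeline also converts math, which is omitted here for brevity; all names are illustrative):

```typescript
// Built with repeat() to avoid a literal fence inside this example.
const FENCE = "`".repeat(3);

// Flatten markdown into speakable prose: fenced code becomes "(Code block.)",
// mermaid fences become "(Diagram.)", headings lose their hash markers.
function toSpeakable(md: string): string {
  const out: string[] = [];
  let inFence = false;
  for (const line of md.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) {
      if (!inFence) {
        const lang = line.trim().slice(3).trim();
        out.push(lang === "mermaid" ? "(Diagram.)" : "(Code block.)");
      }
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;       // drop fence contents entirely
    out.push(line.replace(/^#+\s*/, "")); // heading -> plain text
  }
  return out.join("\n");
}
```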

The microphone button in the composer does not prompt the browser for permission until you click it. That is a deliberate decision: Daneel does not want to ask for hardware access at install time, because the user has not yet asked for speech recognition. When you click the mic, Chrome’s permission prompt appears, and it is scoped to the extension origin, so you grant it once and are never asked again.

The transcript does not auto-send. It lands in the composer input so you can see what was heard, correct anything, and send deliberately. In practice this catches dictation errors and sidesteps the awkwardness of accidental sends.

When Offline Mode is active, the mic button disables itself with a tooltip. The reason is that the default recognizer streams audio to Google; the network gate denies it on principle. When the local Moonshine option ships, the mic will stay enabled under Offline Mode with Moonshine selected, because that provider’s privacy profile says the audio never leaves the machine.
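The gate reduces to one predicate over the recognizer's privacy profile. A sketch under assumed names:

```typescript
// Mic availability under Offline Mode: enabled only when the selected
// recognizer's audio never leaves the machine (field name follows the
// leavesMachine convention from the TTS privacy profiles).
interface SttPrivacyProfile {
  leavesMachine: boolean;
}

function micEnabled(offlineMode: boolean, provider: SttPrivacyProfile): boolean {
  return !offlineMode || !provider.leavesMachine;
}
```

Under this rule the browser recognizer (audio streams to Google) is disabled offline, while a local provider like Moonshine stays enabled.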

Kokoro closed the locality gap for text-to-speech. Moonshine is the same story for the other direction. Once the provider class, worker, and microphone capture pipeline are in place, selecting Moonshine in Settings > Speech > Speech recognition will give you a dictation experience that is indistinguishable from the cloud version, without any audio leaving your device.

The speech catalog is already set up to receive it. The UI path is already there. The remaining piece is the runtime.