
To get started with speech, see [How to Read Messages Aloud and Dictate Questions](/how-to/speech/). For the design rationale behind multiple providers, see [Speech in Daneel](/concepts/speech/).

## Text-to-speech providers

Daneel supports three TTS providers. All implement the same interface, so switching between them is a single click in **Settings > Speech**.
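A minimal sketch of what that shared interface might look like (the interface name, method names, and the stub are illustrative assumptions, not Daneel's actual API):

```typescript
// Hypothetical shape of the shared TTS provider interface. The `id`
// field corresponds to the "Provider id" rows in the tables below.
interface TTSProvider {
  readonly id: string;
  listVoices(): Promise<string[]>;
  speak(text: string, voiceId: string, rate: number): Promise<void>;
  cancel(): void;
}

// A stub implementation showing the shape a concrete provider fills in.
class StubProvider implements TTSProvider {
  readonly id = "web-speech";
  async listVoices(): Promise<string[]> {
    return ["Samantha"];
  }
  async speak(): Promise<void> {
    // A real provider would start playback here; no-op in this sketch.
  }
  cancel(): void {
    // A real provider would stop playback here; no-op in this sketch.
  }
}
```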

### System voices (default)

Uses the browser's built-in Speech Synthesis API. The voice catalog is whatever your operating system provides.

| Property | Value |
|---|---|
| Provider id | `web-speech` |
| Data residency | On-device (mostly) |
| Download size | 0 MB |
| Internet required | No (except for optional cloud voices) |
| Languages | All voices your OS provides |
| Streaming start | Instant |
| Cancellation latency | ~100 ms |

Chrome exposes a subset of voices named `Google <language>` that stream text to Google servers for higher-quality prosody. Daneel filters these by default. Flip **Settings > Speech > Advanced > Allow Google cloud voices** to expose them. They are clearly marked `(cloud)` in the voice list.
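The filtering described above can be sketched as a pure function. The `VoiceInfo` shape and the `visibleVoices` name are illustrative; the real provider works with `SpeechSynthesisVoice` objects, whose `localService` flag is `false` for Chrome's cloud voices:

```typescript
// Minimal stand-in for the fields of SpeechSynthesisVoice we care about.
interface VoiceInfo {
  name: string;
  localService: boolean; // false for Chrome's "Google <language>" voices
}

// Drop remote voices unless the advanced setting is on, and label the
// ones that remain with the "(cloud)" suffix shown in the voice list.
function visibleVoices(voices: VoiceInfo[], allowCloud: boolean): string[] {
  return voices
    .filter((v) => v.localService || allowCloud)
    .map((v) => (v.localService ? v.name : `${v.name} (cloud)`));
}
```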

### Kokoro 82M (local)

A neural TTS model running entirely in your browser on WebGPU. 82 million parameters, 54 voices, eight languages.

| Property | Value |
|---|---|
| Provider id | `kokoro` |
| Data residency | On-device |
| Download size | ~326 MB (one-time) |
| Internet required | First download only |
| Languages | en-US, en-GB, es, fr, it, hi, ja, zh |
| Quantization (dtype) | fp32 (Xenova reference config for WebGPU) |
| Sample rate | 24 kHz mono |
| Cache location | Browser Cache API (`transformers-cache` + `kokoro-voices`) |

The 54-voice list is split by locale and gender. High-quality voices are marked with emoji in kokoro-js's voice table (Heart ❤️, Bella 🔥, Nicole 🎧, Emma 🚺, George 🚹).

Voice style files are fetched on first use per voice and cached separately under the `kokoro-voices` Cache API bucket.
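The fetch-then-cache flow can be sketched with the Cache API abstracted behind a `Map` (all names here are illustrative; the real provider uses the browser's `kokoro-voices` Cache API bucket rather than an in-memory map):

```typescript
// Fetch a voice style file on first use, then serve it from cache.
// `cache` stands in for the `kokoro-voices` bucket; `fetchVoice`
// stands in for the network fetch of the style file.
async function getVoiceStyle(
  id: string,
  cache: Map<string, Float32Array>,
  fetchVoice: (id: string) => Promise<Float32Array>,
): Promise<Float32Array> {
  const hit = cache.get(id);
  if (hit) return hit; // cached after first use; no network round-trip
  const style = await fetchVoice(id);
  cache.set(id, style);
  return style;
}
```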

:::note
Kokoro uses fp32 on WebGPU, not a smaller quantization, because quantized variants force dequantization ops onto the CPU, slowing synthesis by a factor of 3 to 5. The trade-off is a larger download for a much faster runtime.
:::

### Moonshine (coming soon)

Placeholder provider. Catalog entries exist in the provider picker, but the card remains disabled. Moonshine is a speech-recognition model (see the Speech-to-text providers section below); when its provider class ships, it will add local speech recognition with the same privacy guarantees as Kokoro.

## Speech-to-text providers

### Browser speech recognition (default)

Uses Chrome's built-in `SpeechRecognition` API. Audio streams to Google servers for transcription.

| Property | Value |
|---|---|
| Provider id | `web-speech` |
| Data residency | Third-party cloud (Google) |
| Download size | 0 MB |
| Internet required | Yes |
| Languages | Any BCP-47 tag supported by Chrome |
| Offline Mode behavior | Blocked; the mic button is disabled with a tooltip |

Set the recognition language in **Settings > Speech > Speech recognition > Language**. The default is `en-US`.

### Moonshine Base / Tiny (coming soon)

Two sizes of a local English speech recognition model. Catalog entries exist, provider classes pending.

| Variant | Download | Use case |
|---|---|---|
| Moonshine Base | ~120 MB | Best accuracy |
| Moonshine Tiny | ~55 MB | Low-end devices |

## Settings reference

All speech controls live under **Settings > Speech**, split into two sections.

### Text-to-speech section

| Control | Values | Default |
|---|---|---|
| Enabled | on / off | on |
| Provider | System voices / Kokoro 82M | System voices |
| Voice | provider-specific list | provider default |
| Speed | 0.5× to 2.0× | 1.0× |
| Auto-read responses | on / off | off |
| Allow Google cloud voices | on / off | off (advanced) |

The voice picker updates based on the active provider. Kokoro's picker is populated after the model is cached; before that, the card shows a Download button instead of the picker.

### Speech recognition section

| Control | Values | Default |
|---|---|---|
| Enabled | on / off | on |
| Provider | Browser speech recognition / Moonshine Base / Moonshine Tiny | Browser speech recognition |
| Recognition language | BCP-47 tag | `en-US` |

## Keyboard shortcut

`Alt+Space` toggles dictation from anywhere on the page. The shortcut is registered via the `toggle-stt` Chrome extension command and can be reassigned at `chrome://extensions/shortcuts`.
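In a Chrome extension manifest, registering such a command might look like the following sketch (the `description` string is an assumption; the command name and default key come from the text above):

```json
{
  "commands": {
    "toggle-stt": {
      "suggested_key": { "default": "Alt+Space" },
      "description": "Toggle dictation"
    }
  }
}
```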

## UI affordances

- **Play button** — appears in the hover action row on every assistant message, between Copy and Delete. Flips to **Stop** when that message is playing.
- **Mic button** — appears in the chat composer next to Send. Four states: idle (grey), requesting-permission (amber, pulsing), listening (red, pulsing), transcribing (amber, static).
- **Test button** — next to the voice picker in Settings. Plays a short sample of the currently selected voice at the current rate.
- **Cloud badge** — the `(cloud)` suffix on voice list entries indicates a voice that streams text to a remote service. Visible only when Allow Google cloud voices is enabled.
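The mic button's four states form a small state machine. A minimal sketch, assuming illustrative event names (`click`, `granted`, `denied`, `stop`, `result`) that are not Daneel's actual internals:

```typescript
type MicState = "idle" | "requesting-permission" | "listening" | "transcribing";

// Advance the mic button through its lifecycle; unknown events leave
// the state unchanged.
function nextMicState(state: MicState, event: string): MicState {
  switch (state) {
    case "idle":
      return event === "click" ? "requesting-permission" : state;
    case "requesting-permission":
      if (event === "granted") return "listening";
      if (event === "denied") return "idle";
      return state;
    case "listening":
      return event === "stop" ? "transcribing" : state;
    case "transcribing":
      return event === "result" ? "idle" : state;
  }
}
```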

## Privacy profiles

Each provider carries a `PrivacyProfile` consulted by the [Offline Mode](/how-to/offline/) network gate.

| Provider | leavesProcess | leavesMachine | dataObservers |
|---|---|---|---|
| System voices (local) | true | false | browser-vendor |
| System voices (Google cloud) | true | true | browser-vendor |
| Kokoro 82M | false | false | none |
| Web Speech STT | true | true | browser-vendor |
| Moonshine (planned) | false | false | none |

When `leavesMachine: true` and Offline Mode is effective, the network gate blocks the call, and the relevant UI affordance (mic button, cloud-voice playback) is disabled with an explanatory tooltip.
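The gate's decision reduces to a predicate over the profile. A minimal sketch: the `isBlocked` name and the two-field profile are assumptions (the real `PrivacyProfile` also carries `dataObservers`):

```typescript
// The fields of PrivacyProfile that the network gate consults.
interface PrivacyProfile {
  leavesProcess: boolean;
  leavesMachine: boolean;
}

// Block a provider call when its data would leave the machine while
// Offline Mode is effective.
function isBlocked(profile: PrivacyProfile, offlineModeEffective: boolean): boolean {
  return offlineModeEffective && profile.leavesMachine;
}
```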

## What Daneel never touches

- **Raw audio waveforms are not persisted.** Neither the PCM produced by Kokoro nor the audio captured by the mic is written to storage. Everything lives in memory for the duration of the playback or recording.
- **Transcripts are not saved outside the chat message.** When dictation completes, the text lands in the composer. If you do not send the message, nothing is stored.
- **No telemetry includes speech content.** The analytics catalog explicitly forbids logging transcripts, voice IDs the user typed, or error messages; only enums, booleans, durations, and character counts are emitted.
