ModelDType

ModelDType = "fp32" | "fp16" | "q8" | "q4" | "q4f16" | "q2" | "q2f16" | "q1" | "q1f16"

Defined in: LLMProvider.ts:19

ONNX quantization precision / dtype supported by LLM and embedding providers.

Controls the trade-off between model quality and resource usage:

  • fp32 — full precision, largest memory footprint
  • fp16 — half precision, native WebGPU compute format
  • q8 — 8-bit quantization, good balance of quality and size
  • q4 — 4-bit quantization, smallest footprint (~600 MB for a 1.2B model)
  • q4f16 — 4-bit with fp16 compute (requires shader-f16 GPU feature)
  • q2 — 2-bit quantization (requires transformers.js >= 4.1.0)
  • q2f16 — 2-bit with fp16 compute (requires transformers.js >= 4.1.0)
  • q1 — 1-bit quantization, ultra-compact (requires transformers.js >= 4.1.0)
  • q1f16 — 1-bit with fp16 compute (requires transformers.js >= 4.1.0)

Defined here so both models.config.ts and WorkerMessages.ts can import from core without creating a circular dependency.
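As a sketch, the union can be consumed like this; `pickDType` is a hypothetical helper (not part of the library) illustrating the shader-f16 constraint noted above:

```typescript
// The union as defined in LLMProvider.ts.
type ModelDType =
  | "fp32" | "fp16"
  | "q8"
  | "q4" | "q4f16"
  | "q2" | "q2f16"
  | "q1" | "q1f16";

// Hypothetical helper: prefer an f16 compute variant only when the GPU
// exposes the shader-f16 feature. q8 has no f16 variant in this union.
function pickDType(hasShaderF16: boolean, quantBits: 1 | 2 | 4 | 8): ModelDType {
  if (quantBits === 8) return "q8";
  return hasShaderF16
    ? (`q${quantBits}f16` as ModelDType)
    : (`q${quantBits}` as ModelDType);
}
```

For example, `pickDType(true, 4)` yields `"q4f16"`, while on a device without shader-f16 the same request falls back to `"q4"`.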