ModelDType

ModelDType = "fp32" | "fp16" | "q8" | "q4" | "q4f16" | "q2" | "q2f16" | "q1" | "q1f16"

Defined in: LLMProvider.ts:19

ONNX quantization precision / dtype supported by LLM and embedding providers.

Controls the trade-off between model quality and resource usage:

  • fp32 — full precision, largest memory footprint
  • fp16 — half precision, native WebGPU compute format
  • q8 — 8-bit quantization, good balance of quality and size
  • q4 — 4-bit quantization, smallest footprint (~600 MB for a 1.2B model)
  • q4f16 — 4-bit with fp16 compute (requires shader-f16 GPU feature)
  • q2 — 2-bit quantization (requires transformers.js >= 4.1.0)
  • q2f16 — 2-bit with fp16 compute (requires transformers.js >= 4.1.0)
  • q1 — 1-bit quantization, ultra-compact (requires transformers.js >= 4.1.0)
  • q1f16 — 1-bit with fp16 compute (requires transformers.js >= 4.1.0)

Defined here so both models.config.ts and WorkerMessages.ts can import from core without creating a circular dependency.
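As a sketch, the union can be consumed like this; `pickDType` is a hypothetical helper (not part of the library) illustrating the shader-f16 constraint noted above:

```typescript
// The union as defined in LLMProvider.ts.
type ModelDType =
  | "fp32" | "fp16"
  | "q8"
  | "q4" | "q4f16"
  | "q2" | "q2f16"
  | "q1" | "q1f16";

// Hypothetical helper: prefer an f16 compute variant only when the GPU
// exposes the shader-f16 feature. q8 has no f16 variant in this union.
function pickDType(hasShaderF16: boolean, quantBits: 1 | 2 | 4 | 8): ModelDType {
  if (quantBits === 8) return "q8";
  return hasShaderF16
    ? (`q${quantBits}f16` as ModelDType)
    : (`q${quantBits}` as ModelDType);
}
```

For example, `pickDType(true, 4)` yields `"q4f16"`, while on a device without shader-f16 the same request falls back to `"q4"`.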