ModelDType
ModelDType = "fp32" | "fp16" | "q8" | "q4" | "q4f16" | "q2" | "q2f16" | "q1" | "q1f16"
Defined in: LLMProvider.ts:19
ONNX quantization precision / dtype supported by LLM and embedding providers.
Controls the trade-off between model quality and resource usage:
- `fp32`: full precision, largest memory footprint
- `fp16`: half precision, native WebGPU compute format
- `q8`: 8-bit quantization, good balance of quality and size
- `q4`: 4-bit quantization, smallest footprint (~600 MB for a 1.2B model)
- `q4f16`: 4-bit with fp16 compute (requires the `shader-f16` GPU feature)
- `q2`: 2-bit quantization (requires transformers.js >= 4.1.0)
- `q2f16`: 2-bit with fp16 compute (requires transformers.js >= 4.1.0)
- `q1`: 1-bit quantization, ultra-compact (requires transformers.js >= 4.1.0)
- `q1f16`: 1-bit with fp16 compute (requires transformers.js >= 4.1.0)
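As a minimal sketch of how a caller might pick between these values, the snippet below chooses a `ModelDType` from WebGPU feature detection. It assumes the type is imported from `LLMProvider.ts`, that WebGPU typings (e.g. `@webgpu/types`) are available, and that `pickDType` is a hypothetical helper, not part of this library:

```ts
import type { ModelDType } from "./LLMProvider";

// Hypothetical helper: choose a dtype based on what the current browser supports.
async function pickDType(): Promise<ModelDType> {
  // navigator.gpu is only defined in WebGPU-capable browsers.
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) {
    // No GPU adapter: fall back to the smallest CPU-friendly footprint.
    return "q4";
  }
  // q4f16 needs fp16 compute, which the "shader-f16" feature advertises.
  return adapter.features.has("shader-f16") ? "q4f16" : "q4";
}
```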
Remarks
Defined here so both models.config.ts and WorkerMessages.ts can import
from core without creating a circular dependency.
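A rough sketch of the dependency layout the remark describes: both files import `ModelDType` from core, so neither has to import the other. The import paths, model id, and message shape below are illustrative assumptions, not the actual definitions:

```ts
// models.config.ts (illustrative)
import type { ModelDType } from "./core/LLMProvider";

export const MODELS = {
  "example-1.2b": { dtype: "q4" satisfies ModelDType },
} as const;

// WorkerMessages.ts (illustrative)
import type { ModelDType } from "./core/LLMProvider";

export interface LoadModelMessage {
  type: "load-model";
  modelId: string;
  dtype: ModelDType;
}
```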