Quantization

Quantization is a technique to:

  • Compress large models by reducing numeric precision (e.g., float32 → int8).
  • Make them faster and lighter on memory, so they can run even on a CPU or a mobile device.

Tools like llama.cpp, ggml, and mlc-llm quantize models so they can run on Apple Silicon (e.g., M1), a Raspberry Pi, or Android.
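To make the idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Python: each float weight is divided by a single scale factor and rounded to an integer in [-127, 127]. This is an illustration of the general technique, not the actual scheme used by any of the tools above (which use more elaborate block-wise formats).

```python
def quantize_int8(weights):
    """Map float weights to int8 values with one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0  # guard against all-zero input
    q = [round(w / scale) for w in weights]    # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Approximately recover the original floats."""
    return [x * scale for x in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each int8 value needs 1 byte instead of float32's 4, a 4x size reduction, at the cost of a small rounding error per weight (at most half the scale factor).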