A high-end 70B-parameter model stored at 16-bit precision needs roughly 140GB of RAM for its weights alone. Quantization is the compression technique that lets us run these massive models on consumer hardware (like your laptop) with only a small loss in output quality.
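The arithmetic behind that figure is simple: memory is roughly parameter count times bytes per parameter. A quick sketch in Python (weight memory only; it ignores activations, KV cache, and runtime overhead):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory: parameters x bits, converted to decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")

# 70B model at 16-bit: ~140 GB
# 70B model at 8-bit:  ~70 GB
# 70B model at 4-bit:  ~35 GB
```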
Models are typically trained in 16-bit or 32-bit floating point. Quantization maps these weights down to 8-bit or even 4-bit values, cutting the file size by 75% or more. It's like converting a lossless WAV file to a high-quality MP3.
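To make the rounding concrete, here is a minimal sketch of symmetric 4-bit quantization using NumPy. Real schemes (GPTQ, AWQ, GGUF's k-quants) quantize per-block with calibrated scales; this toy version uses a single scale for the whole tensor:

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 4):
    """Map float weights to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = np.abs(weights).max() / qmax  # one scale per tensor (toy choice)
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, scale = quantize_symmetric(w, bits=4)
w_hat = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())  # small but nonzero
```

The rounding error is bounded by half the scale per weight, which is exactly the "slight quality loss" the Q&A below refers to.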
Q: "Does quantization make a model 'Dumber'?"
Architect Answer: "Technically yes. Quantization causes a slight increase in perplexity (a measure of how poorly the model predicts text). However, for most tasks a 4-bit quantized model is virtually indistinguishable from the full-precision version; it is typically below 3-bit that a model begins to lose coherence or hallucinate noticeably more. For any architect, trading a small drop in accuracy for a 4x reduction in RAM is a massive win."