TL;DR

Model quantization is a technique that converts large language model weights from high precision (e.g., FP32, FP16) to low precision (e.g., INT8, INT4), significantly reducing model size and inference memory requirements while maintaining good model performance. This guide covers core quantization principles, mainstream quantization methods (GPTQ, AWQ, GGUF), the difference between post-training quantization and quantization-aware training, and provides practical code examples using llama.cpp and bitsandbytes.

Introduction

With the flourishing development of open-source large language models like LLaMA, Qwen, and Mistral, more and more developers want to deploy these models locally or on edge devices. However, a 7B parameter model loaded in FP16 precision requires about 14GB of VRAM, which is a huge challenge for consumer-grade hardware.

Model quantization is the key technology to solve this problem. Through quantization, you can:

  • Reduce the VRAM requirement of a 7B model from 14GB to under 4GB
  • Run large models on consumer GPUs or even CPUs
  • Accelerate inference speed and reduce deployment costs
  • Deploy AI capabilities on mobile and edge devices

In this guide, you will learn:

  • What model quantization is and why it is needed
  • Differences between INT8, INT4, FP16, BF16 quantization types
  • Comparison of Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT)
  • Detailed explanation of GPTQ, AWQ, GGUF quantization methods
  • Practical code for quantization using llama.cpp and bitsandbytes
  • Best practices for deploying quantized models

What is Model Quantization

Basic Concepts of Quantization

Model quantization is a model compression technique that reduces model size and computational requirements by lowering the numerical precision of model weights and activations. Simply put, it uses fewer bits to represent values that originally required more bits.

mermaid
flowchart LR
    A["Original Model FP32/FP16"] --> B["Quantization Process"]
    B --> C["Quantized Model INT8/INT4"]
    C --> D{"Deployment Target"}
    D -->|Server| E["High-throughput Inference"]
    D -->|Consumer GPU| F["Local Deployment"]
    D -->|Edge Device| G["Lightweight Operation"]

Why Quantization is Needed

Challenge         | Original Model Issue                       | Quantization Solution
------------------|--------------------------------------------|----------------------------------------
Memory Usage      | 7B model in FP16 needs 14GB                | INT4-quantized needs only ~4GB
Inference Speed   | Large-model inference has high latency     | Reduced computation after quantization
Deployment Cost   | Requires expensive GPU servers             | Runs on consumer hardware
Power Consumption | High-precision computation is power-hungry | Low precision is more energy-efficient
Bandwidth         | Slow model transfer and loading            | Smaller model size, faster loading

Quantization Trade-offs

Quantization is essentially a trade-off between precision and efficiency. Reducing precision brings some accuracy loss, but with proper quantization strategies, this loss can be controlled within acceptable ranges.

code
┌─────────────────────────────────────────────────────────┐
│           Quantization Precision vs Efficiency          │
├─────────────────────────────────────────────────────────┤
│  Precision │ Bits │ Model Size │ Speed │ Accuracy Loss │
├─────────────────────────────────────────────────────────┤
│  FP32      │  32  │   100%    │ Base  │    None       │
│  FP16      │  16  │    50%    │  ~2x  │  Minimal      │
│  BF16      │  16  │    50%    │  ~2x  │  Minimal      │
│  INT8      │   8  │    25%    │  ~3x  │   Small       │
│  INT4      │   4  │   12.5%   │  ~4x  │  Moderate     │
│  INT2      │   2  │   6.25%   │  ~5x  │   Large       │
└─────────────────────────────────────────────────────────┘

Quantization Types Explained

Floating Point Precision: FP32, FP16, BF16

FP32 (Single Precision Float): Standard 32-bit floating point number, provides highest precision but takes the most space.

FP16 (Half Precision Float): 16-bit floating point number, slightly reduced precision but halves model size, currently the mainstream format for LLM training and inference.

BF16 (Brain Float 16): A 16-bit format introduced by Google Brain that keeps the full 8-bit exponent range of FP32 at the cost of mantissa precision, making it more resistant to overflow and underflow during deep-learning training.

python
import torch

fp32_tensor = torch.randn(1000, 1000, dtype=torch.float32)
fp16_tensor = fp32_tensor.to(torch.float16)
bf16_tensor = fp32_tensor.to(torch.bfloat16)

print(f"FP32 Memory: {fp32_tensor.element_size() * fp32_tensor.numel() / 1024:.2f} KB")
print(f"FP16 Memory: {fp16_tensor.element_size() * fp16_tensor.numel() / 1024:.2f} KB")
print(f"BF16 Memory: {bf16_tensor.element_size() * bf16_tensor.numel() / 1024:.2f} KB")

Integer Precision: INT8, INT4

INT8 Quantization: Maps weights to the integer range -128 to 127, reducing model size to 1/4 of FP32 (1/2 of FP16). It is the most commonly used quantization precision in production environments.

INT4 Quantization: Maps weights to the integer range -8 to 7, reducing model size to 1/8 of FP32, suitable for extremely memory-constrained scenarios.

mermaid
flowchart TB
    subgraph FP16["FP16 Weights"]
        F["1.234, -0.567, 2.891, ..."]
    end
    subgraph QUANT["Quantization Process"]
        S["Calculate scale factor"]
        Z["Calculate zero_point"]
        Q["Quantize formula"]
    end
    subgraph INT8["INT8 Weights"]
        I["15, -7, 36, ..."]
    end
    F --> S
    S --> Z
    Z --> Q
    Q --> I

Quantization Formulas

Symmetric quantization maps zero to zero; the scale is typically derived from the largest absolute value (e.g., scale = max(|x|) / 127 for INT8):

code
q = clamp(round(x / scale), q_min, q_max)
x_dequant = q * scale

Asymmetric quantization adds a zero_point offset so the full integer range can cover a skewed value distribution:

code
q = clamp(round(x / scale) + zero_point, q_min, q_max)
x_dequant = (q - zero_point) * scale
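
The symmetric round trip can be sketched in a few lines of NumPy (the helper names and the random weights are illustrative):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    # The scale maps the largest absolute value onto the integer range
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(x).max() / qmax
    # int8 storage assumes bits <= 8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_symmetric(weights)
recovered = dequantize_symmetric(q, scale)

# Round-trip error is bounded by half a quantization step (scale / 2)
max_err = np.abs(weights - recovered).max()
print(f"scale = {scale:.5f}, max round-trip error = {max_err:.5f}")
```

The larger the largest outlier weight, the coarser the grid for everything else, which is exactly why outlier handling dominates modern LLM quantization methods.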

Post-Training Quantization vs Quantization-Aware Training

Post-Training Quantization (PTQ)

Post-Training Quantization directly quantizes weights after model training is complete, without requiring retraining.

mermaid
flowchart LR
    A[Pre-trained Model] --> B[Calibration Dataset]
    B --> C[Analyze Weight Distribution]
    C --> D[Calculate Quantization Parameters]
    D --> E[Quantize Weights]
    E --> F[Quantized Model]

Advantages:

  • Simple and fast, no training required
  • Does not need original training data
  • Suitable for most scenarios

Disadvantages:

  • Low-bit quantization (e.g., INT4) may have significant accuracy loss
  • May not work well for certain model architectures
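
At its core, PTQ calibration just estimates value ranges from a small sample. A toy sketch (the `calibrate_scale` helper and the percentile clipping are illustrative choices, not a specific library API):

```python
import numpy as np

def calibrate_scale(sample_batches, bits=8, percentile=99.9):
    # Use a high percentile of |values| rather than the absolute max,
    # so rare outliers do not stretch the quantization range
    qmax = 2 ** (bits - 1) - 1
    values = np.abs(np.concatenate([b.ravel() for b in sample_batches]))
    return np.percentile(values, percentile) / qmax

# Stand-ins for activations captured on a small calibration set
batches = [np.random.randn(4, 128).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(batches)
print(f"calibrated INT8 scale: {scale:.5f}")
```

Real PTQ toolchains run a few hundred calibration samples through the model and record statistics like this per layer or per channel.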

Quantization-Aware Training (QAT)

Quantization-Aware Training simulates quantization effects during training, allowing the model to learn to adapt to accuracy loss from quantization.

Advantages:

  • Smaller accuracy loss after quantization
  • Suitable for low-bit quantization

Disadvantages:

  • Requires retraining
  • High computational cost
  • Needs training data
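
The core mechanism behind QAT is "fake quantization": the forward pass rounds values as INT8 would, while the backward pass uses a straight-through estimator so gradients can cross the non-differentiable rounding. A minimal PyTorch sketch (the `FakeQuantSTE` class is illustrative):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    # Forward: simulate INT8 rounding. Backward: straight-through
    # estimator, i.e. pretend the rounding was the identity function.
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None

x = torch.randn(4, requires_grad=True)
scale = x.abs().max().detach() / 127
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()
print(x.grad)  # gradients flow through the rounding unchanged
```

Training with such fake-quant nodes inserted lets the weights settle into values that survive the later real quantization.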

PTQ vs QAT Comparison

Dimension         | PTQ                   | QAT
------------------|-----------------------|--------------------
Training Required | No                    | Yes
Time Cost         | Minutes               | Hours/Days
Data Required     | Small calibration set | Full training data
INT8 Accuracy     | Excellent             | Excellent
INT4 Accuracy     | Good                  | Excellent
Use Case          | Quick deployment      | Maximum accuracy

Mainstream Quantization Methods Explained

GPTQ Quantization

GPTQ (short for "Accurate Post-Training Quantization for Generative Pre-trained Transformers") is a post-training quantization method based on second-order information, designed specifically for large language models.

Core Principles:

  • Layer-by-layer quantization, minimizing quantization error
  • Uses Hessian matrix approximation to guide quantization
  • Supports INT4/INT3/INT2 low-bit quantization
python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    group_size=128,
    desc_act=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("./llama-2-7b-gptq")

AWQ Quantization

AWQ (Activation-aware Weight Quantization) is an activation-aware weight quantization method that reduces accuracy loss by protecting important weights.

Core Principles:

  • Observes activation distribution to identify important weights
  • Uses higher precision or special handling for important weights
  • Achieves better balance between accuracy and efficiency
python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "./llama-2-7b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

GGUF Format

GGUF (GPT-Generated Unified Format) is the quantization format used by the llama.cpp project, optimized for CPU inference.

Features:

  • Supports multiple quantization levels (Q2_K to Q8_0)
  • Highly optimized for CPU inference
  • Supports memory mapping for fast loading
  • Good cross-platform compatibility

Common GGUF Quantization Levels:

Quantization Type | Bits | Model Size (7B) | Quality
------------------|------|-----------------|----------------
Q2_K              | 2.5  | ~2.5GB          | Lower
Q3_K_M            | 3.5  | ~3.3GB          | Medium
Q4_K_M            | 4.5  | ~4.1GB          | Good
Q5_K_M            | 5.5  | ~4.8GB          | Excellent
Q6_K              | 6.5  | ~5.5GB          | Near Original
Q8_0              | 8    | ~7.2GB          | Nearly Lossless
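
These sizes follow almost directly from bits-per-weight: file size ≈ parameter count × bits / 8, plus some metadata and a few tensors kept at higher precision. A quick back-of-the-envelope check (the `gguf_size_gb` helper is illustrative):

```python
def gguf_size_gb(n_params, bits_per_weight):
    # Rough file-size estimate; ignores metadata and the embedding /
    # output tensors that GGUF often stores at higher precision
    return n_params * bits_per_weight / 8 / 1e9

for name, bpw in [("Q2_K", 2.5), ("Q4_K_M", 4.5), ("Q8_0", 8.0)]:
    print(f"{name}: ~{gguf_size_gb(7e9, bpw):.1f} GB")
```

The estimates come out slightly below the measured file sizes above, with the gap explained by the non-quantized tensors.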

Quantization Method Comparison

mermaid
flowchart TD
    A[Choose Quantization Method] --> B{Deployment Environment?}
    B -->|GPU Server| C{Speed or Accuracy Priority?}
    B -->|Consumer GPU| D["AWQ/GPTQ INT4"]
    B -->|CPU/Edge Device| E[GGUF Format]
    C -->|Speed Priority| F[AWQ]
    C -->|Accuracy Priority| G[GPTQ]
    D --> H[Local Deployment Solution]
    E --> I[llama.cpp Deployment]
    F --> J["vLLM/TGI Deployment"]
    G --> J

llama.cpp Quantization in Practice

Installing llama.cpp

bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# CPU build
make -j

# Optional: build with CUDA support (the old LLAMA_CUBLAS flag was renamed to GGML_CUDA)
make GGML_CUDA=1 -j

Converting and Quantizing Models

bash
# Convert the Hugging Face checkpoint to an FP16 GGUF file
python convert_hf_to_gguf.py /path/to/llama-2-7b --outfile llama-2-7b-f16.gguf --outtype f16

# Quantize the FP16 GGUF down to Q4_K_M
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m

Inference with Quantized Models

bash
./llama-cli -m llama-2-7b-q4_k_m.gguf \
    -p "What is machine learning?" \
    -n 256 \
    --temp 0.7 \
    --top-p 0.9

Using Python Bindings

python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=35,
)

output = llm(
    "Q: What is model quantization?\nA:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["Q:", "\n\n"],
)

print(output["choices"][0]["text"])

bitsandbytes Quantization in Practice

Installing bitsandbytes

bash
pip install bitsandbytes transformers accelerate

8-bit Quantization Loading

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Model Memory Usage: {model_8bit.get_memory_footprint() / 1024**3:.2f} GB")

4-bit Quantization Loading (NF4)

python
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

print(f"Model Memory Usage: {model_4bit.get_memory_footprint() / 1024**3:.2f} GB")

Quantized Model Inference

python
def generate_text(model, tokenizer, prompt, max_length=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

prompt = "Explain what model quantization is:"
result = generate_text(model_4bit, tokenizer, prompt)
print(result)

Impact of Quantization on Model Performance

Accuracy Impact Assessment

python
import numpy as np
import torch

def evaluate_quantized_model(model, tokenizer, dataset, num_samples=100):
    # Compute mean perplexity directly with the in-memory model; the
    # `evaluate` library's perplexity metric expects a Hub model name,
    # so it cannot score an already-quantized model object.
    model.eval()
    nlls = []
    for text in dataset["text"][:num_samples]:
        if not text.strip():
            continue
        inputs = tokenizer(text, return_tensors="pt", truncation=True,
                           max_length=512).to(model.device)
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        nlls.append(loss.item())
    return float(np.exp(np.mean(nlls)))

def compare_models(original_model, quantized_model, tokenizer, test_prompts):
    results = []
    
    for prompt in test_prompts:
        original_output = generate_text(original_model, tokenizer, prompt)
        quantized_output = generate_text(quantized_model, tokenizer, prompt)
        
        results.append({
            "prompt": prompt,
            "original": original_output,
            "quantized": quantized_output,
        })
    
    return results

Performance Benchmarks

Model Configuration    | Memory Usage | Inference Speed (tokens/s) | Perplexity
-----------------------|--------------|----------------------------|-----------
LLaMA-7B FP16          | 14GB         | 25                         | 5.68
LLaMA-7B INT8          | 7GB          | 35                         | 5.72
LLaMA-7B INT4 (GPTQ)   | 4GB          | 45                         | 5.85
LLaMA-7B INT4 (AWQ)    | 4GB          | 50                         | 5.79
LLaMA-7B Q4_K_M (GGUF) | 4GB          | 40 (CPU)                   | 5.82

Deploying Quantized Models

Deploying with vLLM

python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Deploying GGUF Models with Ollama

bash
ollama create mymodel -f Modelfile

ollama run mymodel "What is model quantization?"

Modelfile example:

code
FROM ./llama-2-7b-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 2048

SYSTEM You are a helpful AI assistant.

Deployment Architecture Selection

mermaid
flowchart TD
    A[Quantized Model Deployment] --> B{Concurrency Requirements}
    B -->|High Concurrency| C["vLLM/TGI"]
    B -->|Low Concurrency| D{Hardware Environment}
    D -->|GPU| E["Transformers + bitsandbytes"]
    D -->|CPU| F["llama.cpp / Ollama"]
    C --> G[Production API Service]
    E --> H["Development/Testing Environment"]
    F --> I["Edge Deployment/Local Apps"]

FAQ

How much accuracy is lost through quantization?

Accuracy loss depends on the quantization method and bit precision. INT8 quantization typically has minimal accuracy loss (<1%), while INT4 quantization has about 2-5% loss. Using advanced methods like GPTQ and AWQ can keep INT4 accuracy loss within 3%. For most application scenarios, this loss is acceptable.

Which quantization method is best?

There is no absolute best method; it depends on your needs. If you prioritize inference speed, AWQ is a good choice; if you prioritize accuracy, GPTQ performs better; if you need to run on CPU, GGUF format is the best choice. It's recommended to test and compare based on your actual scenario.

Can quantized models be fine-tuned?

Yes, but with limitations. QLoRA technology allows parameter-efficient fine-tuning on quantized models. The combination of bitsandbytes 4-bit quantized models with LoRA is currently the most popular low-resource fine-tuning solution, enabling fine-tuning of 7B or even 13B models on a single consumer GPU.

Should I choose INT4 or INT8?

If you have sufficient memory, prefer INT8 for its smaller accuracy loss. If memory is tight or you need to run on consumer GPUs, INT4 is the better choice. For models of 13B parameters and above on consumer hardware, INT4 is almost mandatory; otherwise the memory requirements are simply too high.

Will inference speed always improve after quantization?

Not necessarily. After quantization, model size decreases and memory bandwidth requirements are reduced, but additional dequantization computation is needed. On GPUs, INT8/INT4 quantization usually improves speed; on CPUs, GGUF format quantized models show more significant speed improvements. Actual results need to be tested based on hardware and model.
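
The simplest way to settle this for your own hardware is to time decoding directly. A sketch that works with any transformers-style model, such as the ones loaded earlier (`tokens_per_second` is a hypothetical helper):

```python
import time

def tokens_per_second(model, tokenizer, prompt, n_tokens=128):
    # Decode throughput = generated tokens / wall-clock seconds.
    # On GPU, do a warm-up generation first before timing for real.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    return generated / elapsed
```

Run it against the FP16 and quantized variants with the same prompt and token budget to get a like-for-like comparison.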

How to evaluate quantized model quality?

Common evaluation methods include: perplexity testing, downstream task benchmarks (such as MMLU, HellaSwag), and manual evaluation of output quality. It's recommended to prepare a test set related to your actual application scenario and compare output differences before and after quantization.

Summary

Model quantization is the key technology for deploying large language models in resource-constrained environments. Through this guide, you have learned:

  1. Quantization Principles: Reduce model size and computational requirements by lowering numerical precision
  2. Quantization Types: FP16, BF16, INT8, INT4 each have their suitable scenarios
  3. PTQ vs QAT: Post-training quantization is quick and convenient, quantization-aware training has higher accuracy
  4. Mainstream Methods: GPTQ for accuracy, AWQ for speed, GGUF for CPU deployment
  5. Practical Code: Complete usage examples for llama.cpp and bitsandbytes
  6. Deployment Solutions: Choose appropriate deployment architecture based on concurrency requirements and hardware environment

By mastering model quantization technology, you can deploy powerful AI capabilities on limited hardware resources, truly bringing large language models to everyone.