TL;DR

Model quantization is a technique that converts large language model weights from high precision (e.g., FP32, FP16) to low precision (e.g., INT8, INT4), significantly reducing model size and inference memory requirements while maintaining good model performance. This guide covers core quantization principles, mainstream quantization methods (GPTQ, AWQ, GGUF), the difference between post-training quantization and quantization-aware training, and provides practical code examples using llama.cpp and bitsandbytes.

Introduction

With the flourishing development of open-source large language models like LLaMA, Qwen, and Mistral, more and more developers want to deploy these models locally or on edge devices. However, a 7B parameter model loaded in FP16 precision requires about 14GB of VRAM, which is a huge challenge for consumer-grade hardware.

Model quantization is the key technology to solve this problem. Through quantization, you can:

  • Reduce the VRAM requirement of a 7B model from 14GB to under 4GB
  • Run large models on consumer GPUs or even CPUs
  • Accelerate inference speed and reduce deployment costs
  • Deploy AI capabilities on mobile and edge devices

In this guide, you will learn:

  • What model quantization is and why it is needed
  • Differences between INT8, INT4, FP16, BF16 quantization types
  • Comparison of Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT)
  • Detailed explanation of GPTQ, AWQ, GGUF quantization methods
  • Practical code for quantization using llama.cpp and bitsandbytes
  • Best practices for deploying quantized models

What is Model Quantization

Basic Concepts of Quantization

Model quantization is a model compression technique that reduces model size and computational requirements by lowering the numerical precision of model weights and activations. Simply put, it uses fewer bits to represent values that originally required more bits.

mermaid
flowchart LR
    A["Original Model FP32/FP16"] --> B["Quantization Process"]
    B --> C["Quantized Model INT8/INT4"]
    C --> D{"Deployment Target"}
    D -->|Server| E["High-throughput Inference"]
    D -->|Consumer GPU| F["Local Deployment"]
    D -->|Edge Device| G["Lightweight Operation"]

Why Quantization is Needed

Challenge         | Original Model Issue                       | Quantization Solution
------------------|--------------------------------------------|----------------------------------------
Memory Usage      | 7B model in FP16 needs 14GB                | INT4-quantized needs only ~4GB
Inference Speed   | Large-model inference has high latency     | Reduced computation after quantization
Deployment Cost   | Requires expensive GPU servers             | Runs on consumer hardware
Power Consumption | High-precision computation is power-hungry | Low precision is more energy-efficient
Bandwidth         | Slow model transfer and loading            | Smaller model size, faster loading

Quantization Trade-offs

Quantization is essentially a trade-off between precision and efficiency. Reducing precision brings some accuracy loss, but with proper quantization strategies, this loss can be controlled within acceptable ranges.

code
┌─────────────────────────────────────────────────────────┐
│           Quantization Precision vs Efficiency          │
├─────────────────────────────────────────────────────────┤
│  Precision │ Bits │ Model Size │ Speed │ Accuracy Loss │
├─────────────────────────────────────────────────────────┤
│  FP32      │  32  │   100%    │ Base  │    None       │
│  FP16      │  16  │    50%    │  ~2x  │  Minimal      │
│  BF16      │  16  │    50%    │  ~2x  │  Minimal      │
│  INT8      │   8  │    25%    │  ~3x  │   Small       │
│  INT4      │   4  │   12.5%   │  ~4x  │  Moderate     │
│  INT2      │   2  │   6.25%   │  ~5x  │   Large       │
└─────────────────────────────────────────────────────────┘

Quantization Types Explained

Floating Point Precision: FP32, FP16, BF16

FP32 (Single Precision Float): Standard 32-bit floating point number, provides highest precision but takes the most space.

FP16 (Half Precision Float): 16-bit floating point number, slightly reduced precision but halves model size, currently the mainstream format for LLM training and inference.

BF16 (Brain Float 16): A 16-bit format introduced by Google Brain that keeps the full 8-bit exponent range of FP32 at the cost of mantissa precision, making it more resistant to overflow and underflow during deep-learning training.

python
import torch

fp32_tensor = torch.randn(1000, 1000, dtype=torch.float32)
fp16_tensor = fp32_tensor.to(torch.float16)
bf16_tensor = fp32_tensor.to(torch.bfloat16)

print(f"FP32 Memory: {fp32_tensor.element_size() * fp32_tensor.numel() / 1024:.2f} KB")
print(f"FP16 Memory: {fp16_tensor.element_size() * fp16_tensor.numel() / 1024:.2f} KB")
print(f"BF16 Memory: {bf16_tensor.element_size() * bf16_tensor.numel() / 1024:.2f} KB")

Integer Precision: INT8, INT4

INT8 Quantization: Maps weights to the integer range -128 to 127, reducing model size to 1/4 of FP32 (1/2 of FP16). It is the most commonly used quantization precision in production environments.

INT4 Quantization: Maps weights to the integer range -8 to 7, reducing model size to 1/8 of FP32, suitable for extremely memory-constrained scenarios.

mermaid
flowchart TB
    subgraph FP16["FP16 Weights"]
        F["1.234, -0.567, 2.891, ..."]
    end
    subgraph QUANT["Quantization Process"]
        S["Calculate scale factor"]
        Z["Calculate zero_point"]
        Q["Quantize formula"]
    end
    subgraph INT8["INT8 Weights"]
        I["15, -7, 36, ..."]
    end
    F --> S
    S --> Z
    Z --> Q
    Q --> I

Quantization Formulas

Symmetric quantization maps zero to zero; the scale is typically derived from the largest absolute value (e.g., scale = max(|x|) / 127 for INT8):

code
q = clamp(round(x / scale), q_min, q_max)
x_dequant = q * scale

Asymmetric quantization adds a zero_point offset so the full integer range can cover a skewed value distribution:

code
q = clamp(round(x / scale) + zero_point, q_min, q_max)
x_dequant = (q - zero_point) * scale
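
The symmetric round trip can be sketched in a few lines of NumPy (the helper names and the random weights are illustrative):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    # The scale maps the largest absolute value onto the integer range
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(x).max() / qmax
    # int8 storage assumes bits <= 8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_symmetric(weights)
recovered = dequantize_symmetric(q, scale)

# Round-trip error is bounded by half a quantization step (scale / 2)
max_err = np.abs(weights - recovered).max()
print(f"scale = {scale:.5f}, max round-trip error = {max_err:.5f}")
```

The larger the largest outlier weight, the coarser the grid for everything else, which is exactly why outlier handling dominates modern LLM quantization methods.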

Post-Training Quantization vs Quantization-Aware Training

Post-Training Quantization (PTQ)

Post-Training Quantization directly quantizes weights after model training is complete, without requiring retraining.

mermaid
flowchart LR
    A[Pre-trained Model] --> B[Calibration Dataset]
    B --> C[Analyze Weight Distribution]
    C --> D[Calculate Quantization Parameters]
    D --> E[Quantize Weights]
    E --> F[Quantized Model]

Advantages:

  • Simple and fast, no training required
  • Does not need original training data
  • Suitable for most scenarios

Disadvantages:

  • Low-bit quantization (e.g., INT4) may have significant accuracy loss
  • May not work well for certain model architectures
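
At its core, PTQ calibration just estimates value ranges from a small sample. A toy sketch (the `calibrate_scale` helper and the percentile clipping are illustrative choices, not a specific library API):

```python
import numpy as np

def calibrate_scale(sample_batches, bits=8, percentile=99.9):
    # Use a high percentile of |values| rather than the absolute max,
    # so rare outliers do not stretch the quantization range
    qmax = 2 ** (bits - 1) - 1
    values = np.abs(np.concatenate([b.ravel() for b in sample_batches]))
    return np.percentile(values, percentile) / qmax

# Stand-ins for activations captured on a small calibration set
batches = [np.random.randn(4, 128).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(batches)
print(f"calibrated INT8 scale: {scale:.5f}")
```

Real PTQ toolchains run a few hundred calibration samples through the model and record statistics like this per layer or per channel.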

Quantization-Aware Training (QAT)

Quantization-Aware Training simulates quantization effects during training, allowing the model to learn to adapt to accuracy loss from quantization.

Advantages:

  • Smaller accuracy loss after quantization
  • Suitable for low-bit quantization

Disadvantages:

  • Requires retraining
  • High computational cost
  • Needs training data
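
The core mechanism behind QAT is "fake quantization": the forward pass rounds values as INT8 would, while the backward pass uses a straight-through estimator so gradients can cross the non-differentiable rounding. A minimal PyTorch sketch (the `FakeQuantSTE` class is illustrative):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    # Forward: simulate INT8 rounding. Backward: straight-through
    # estimator, i.e. pretend the rounding was the identity function.
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None

x = torch.randn(4, requires_grad=True)
scale = x.abs().max().detach() / 127
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()
print(x.grad)  # gradients flow through the rounding unchanged
```

Training with such fake-quant nodes inserted lets the weights settle into values that survive the later real quantization.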

PTQ vs QAT Comparison

Dimension         | PTQ                   | QAT
------------------|-----------------------|--------------------
Training Required | No                    | Yes
Time Cost         | Minutes               | Hours/Days
Data Required     | Small calibration set | Full training data
INT8 Accuracy     | Excellent             | Excellent
INT4 Accuracy     | Good                  | Excellent
Use Case          | Quick deployment      | Maximum accuracy

Mainstream Quantization Methods Explained

GPTQ Quantization

GPTQ (short for "Accurate Post-Training Quantization for Generative Pre-trained Transformers") is a post-training quantization method based on second-order information, designed specifically for large language models.

Core Principles:

  • Layer-by-layer quantization, minimizing quantization error
  • Uses Hessian matrix approximation to guide quantization
  • Supports INT4/INT3/INT2 low-bit quantization
python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    group_size=128,
    desc_act=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("./llama-2-7b-gptq")

AWQ Quantization

AWQ (Activation-aware Weight Quantization) is an activation-aware weight quantization method that reduces accuracy loss by protecting important weights.

Core Principles:

  • Observes activation distribution to identify important weights
  • Uses higher precision or special handling for important weights
  • Achieves better balance between accuracy and efficiency
python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "./llama-2-7b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

GGUF Format

GGUF (GPT-Generated Unified Format) is the quantization format used by the llama.cpp project, optimized for CPU inference.

Features:

  • Supports multiple quantization levels (Q2_K to Q8_0)
  • Highly optimized for CPU inference
  • Supports memory mapping for fast loading
  • Good cross-platform compatibility

Common GGUF Quantization Levels:

Quantization Type | Bits | Model Size (7B) | Quality
------------------|------|-----------------|----------------
Q2_K              | 2.5  | ~2.5GB          | Lower
Q3_K_M            | 3.5  | ~3.3GB          | Medium
Q4_K_M            | 4.5  | ~4.1GB          | Good
Q5_K_M            | 5.5  | ~4.8GB          | Excellent
Q6_K              | 6.5  | ~5.5GB          | Near Original
Q8_0              | 8    | ~7.2GB          | Nearly Lossless
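
These sizes follow almost directly from bits-per-weight: file size ≈ parameter count × bits / 8, plus some metadata and a few tensors kept at higher precision. A quick back-of-the-envelope check (the `gguf_size_gb` helper is illustrative):

```python
def gguf_size_gb(n_params, bits_per_weight):
    # Rough file-size estimate; ignores metadata and the embedding /
    # output tensors that GGUF often stores at higher precision
    return n_params * bits_per_weight / 8 / 1e9

for name, bpw in [("Q2_K", 2.5), ("Q4_K_M", 4.5), ("Q8_0", 8.0)]:
    print(f"{name}: ~{gguf_size_gb(7e9, bpw):.1f} GB")
```

The estimates come out slightly below the measured file sizes above, with the gap explained by the non-quantized tensors.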

Quantization Method Comparison

mermaid
flowchart TD
    A[Choose Quantization Method] --> B{Deployment Environment?}
    B -->|GPU Server| C{Speed or Accuracy Priority?}
    B -->|Consumer GPU| D["AWQ/GPTQ INT4"]
    B -->|CPU/Edge Device| E[GGUF Format]
    C -->|Speed Priority| F[AWQ]
    C -->|Accuracy Priority| G[GPTQ]
    D --> H[Local Deployment Solution]
    E --> I[llama.cpp Deployment]
    F --> J["vLLM/TGI Deployment"]
    G --> J

llama.cpp Quantization in Practice

Installing llama.cpp

bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# CPU build
make -j

# Optional: build with CUDA support (the old LLAMA_CUBLAS flag was renamed to GGML_CUDA)
make GGML_CUDA=1 -j

Converting and Quantizing Models

bash
# Convert the Hugging Face checkpoint to an FP16 GGUF file
python convert_hf_to_gguf.py /path/to/llama-2-7b --outfile llama-2-7b-f16.gguf --outtype f16

# Quantize the FP16 GGUF down to Q4_K_M
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf q4_k_m

Inference with Quantized Models

bash
./llama-cli -m llama-2-7b-q4_k_m.gguf \
    -p "What is machine learning?" \
    -n 256 \
    --temp 0.7 \
    --top-p 0.9

Using Python Bindings

python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=35,
)

output = llm(
    "Q: What is model quantization?\nA:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["Q:", "\n\n"],
)

print(output["choices"][0]["text"])

bitsandbytes Quantization in Practice

Installing bitsandbytes

bash
pip install bitsandbytes transformers accelerate

8-bit Quantization Loading

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Model Memory Usage: {model_8bit.get_memory_footprint() / 1024**3:.2f} GB")

4-bit Quantization Loading (NF4)

python
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

print(f"Model Memory Usage: {model_4bit.get_memory_footprint() / 1024**3:.2f} GB")

Quantized Model Inference

python
def generate_text(model, tokenizer, prompt, max_length=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

prompt = "Explain what model quantization is:"
result = generate_text(model_4bit, tokenizer, prompt)
print(result)

Impact of Quantization on Model Performance

Accuracy Impact Assessment

python
import numpy as np
import torch

def evaluate_quantized_model(model, tokenizer, dataset, num_samples=100):
    # Compute mean perplexity directly with the in-memory model; the
    # `evaluate` library's perplexity metric expects a Hub model name,
    # so it cannot score an already-quantized model object.
    model.eval()
    nlls = []
    for text in dataset["text"][:num_samples]:
        if not text.strip():
            continue
        inputs = tokenizer(text, return_tensors="pt", truncation=True,
                           max_length=512).to(model.device)
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        nlls.append(loss.item())
    return float(np.exp(np.mean(nlls)))

def compare_models(original_model, quantized_model, tokenizer, test_prompts):
    results = []
    
    for prompt in test_prompts:
        original_output = generate_text(original_model, tokenizer, prompt)
        quantized_output = generate_text(quantized_model, tokenizer, prompt)
        
        results.append({
            "prompt": prompt,
            "original": original_output,
            "quantized": quantized_output,
        })
    
    return results

Performance Benchmarks

Model Configuration    | Memory Usage | Inference Speed (tokens/s) | Perplexity
-----------------------|--------------|----------------------------|-----------
LLaMA-7B FP16          | 14GB         | 25                         | 5.68
LLaMA-7B INT8          | 7GB          | 35                         | 5.72
LLaMA-7B INT4 (GPTQ)   | 4GB          | 45                         | 5.85
LLaMA-7B INT4 (AWQ)    | 4GB          | 50                         | 5.79
LLaMA-7B Q4_K_M (GGUF) | 4GB          | 40 (CPU)                   | 5.82

Deploying Quantized Models

Deploying with vLLM

python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Deploying GGUF Models with Ollama

bash
ollama create mymodel -f Modelfile

ollama run mymodel "What is model quantization?"

Modelfile example:

code
FROM ./llama-2-7b-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 2048

SYSTEM You are a helpful AI assistant.

Deployment Architecture Selection

mermaid
flowchart TD
    A[Quantized Model Deployment] --> B{Concurrency Requirements}
    B -->|High Concurrency| C["vLLM/TGI"]
    B -->|Low Concurrency| D{Hardware Environment}
    D -->|GPU| E["Transformers + bitsandbytes"]
    D -->|CPU| F["llama.cpp / Ollama"]
    C --> G[Production API Service]
    E --> H["Development/Testing Environment"]
    F --> I["Edge Deployment/Local Apps"]

FAQ

How much accuracy is lost through quantization?

Accuracy loss depends on the quantization method and bit precision. INT8 quantization typically has minimal accuracy loss (<1%), while INT4 quantization has about 2-5% loss. Using advanced methods like GPTQ and AWQ can keep INT4 accuracy loss within 3%. For most application scenarios, this loss is acceptable.

Which quantization method is best?

There is no absolute best method; it depends on your needs. If you prioritize inference speed, AWQ is a good choice; if you prioritize accuracy, GPTQ performs better; if you need to run on CPU, GGUF format is the best choice. It's recommended to test and compare based on your actual scenario.

Can quantized models be fine-tuned?

Yes, but with limitations. QLoRA technology allows parameter-efficient fine-tuning on quantized models. The combination of bitsandbytes 4-bit quantized models with LoRA is currently the most popular low-resource fine-tuning solution, enabling fine-tuning of 7B or even 13B models on a single consumer GPU.

Should I choose INT4 or INT8?

If you have sufficient memory, prefer INT8 for its smaller accuracy loss. If memory is tight or you need to run on consumer GPUs, INT4 is the better choice. For models of 13B parameters and above on consumer hardware, INT4 is almost mandatory; otherwise the memory requirements are simply too high.

Will inference speed always improve after quantization?

Not necessarily. After quantization, model size decreases and memory bandwidth requirements are reduced, but additional dequantization computation is needed. On GPUs, INT8/INT4 quantization usually improves speed; on CPUs, GGUF format quantized models show more significant speed improvements. Actual results need to be tested based on hardware and model.
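
The simplest way to settle this for your own hardware is to time decoding directly. A sketch that works with any transformers-style model, such as the ones loaded earlier (`tokens_per_second` is a hypothetical helper):

```python
import time

def tokens_per_second(model, tokenizer, prompt, n_tokens=128):
    # Decode throughput = generated tokens / wall-clock seconds.
    # On GPU, do a warm-up generation first before timing for real.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    return generated / elapsed
```

Run it against the FP16 and quantized variants with the same prompt and token budget to get a like-for-like comparison.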

How to evaluate quantized model quality?

Common evaluation methods include: perplexity testing, downstream task benchmarks (such as MMLU, HellaSwag), and manual evaluation of output quality. It's recommended to prepare a test set related to your actual application scenario and compare output differences before and after quantization.

Summary

Model quantization is the key technology for deploying large language models in resource-constrained environments. Through this guide, you have learned:

  1. Quantization Principles: Reduce model size and computational requirements by lowering numerical precision
  2. Quantization Types: FP16, BF16, INT8, INT4 each have their suitable scenarios
  3. PTQ vs QAT: Post-training quantization is quick and convenient, quantization-aware training has higher accuracy
  4. Mainstream Methods: GPTQ for accuracy, AWQ for speed, GGUF for CPU deployment
  5. Practical Code: Complete usage examples for llama.cpp and bitsandbytes
  6. Deployment Solutions: Choose appropriate deployment architecture based on concurrency requirements and hardware environment

By mastering model quantization technology, you can deploy powerful AI capabilities on limited hardware resources, truly bringing large language models to everyone.