What is AWQ?
AWQ (Activation-aware Weight Quantization) is a weight-only quantization method that identifies and preserves salient weights by analyzing activation distributions rather than weight magnitudes alone. It achieves state-of-the-art accuracy at INT4 precision, enabling efficient deployment of large language models.
Quick Facts
| Attribute | Value |
|---|---|
| Full Name | Activation-aware Weight Quantization |
| Created | 2023 by Ji Lin et al. (MIT) |
How It Works
AWQ addresses a key insight in model quantization: not all weights are equally important. By observing the activation patterns of a small calibration dataset, AWQ identifies which weight channels are most critical for maintaining model quality. Instead of protecting these salient weights by keeping them in higher precision (which would create mixed-precision overhead), AWQ applies per-channel scaling that mathematically reduces the quantization error for important weights. This elegant approach achieves better accuracy than naive round-to-nearest quantization and competing methods like GPTQ, while being hardware-friendly and requiring no backpropagation or complex optimization during quantization.
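The scaling trick rests on a simple algebraic identity: multiplying a weight's input channel by a factor s while dividing the corresponding activation by s leaves the layer output unchanged, so AWQ is free to choose s to shrink rounding error for salient channels. A minimal NumPy sketch of that identity (shapes and the mean-absolute-activation scale rule are illustrative, not the paper's exact search):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # calibration activations (tokens x channels)
w = rng.normal(size=(8, 8))    # linear layer weights (in_channels x out_channels)
s = np.abs(x).mean(axis=0)     # per-input-channel scale from activation magnitudes

# Scaling weight rows up and activations down is a mathematical no-op,
# which lets AWQ pick s so that salient rows round with less relative error.
assert np.allclose(x @ w, (x / s) @ (w * s[:, None]))
print("identity holds")
```

In a real network the division by s is folded into the preceding operation (e.g. a LayerNorm or prior linear layer), so no extra runtime cost is incurred.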
Key Characteristics
- Weight-only quantization preserving activation-aware salient channels
- Per-channel scaling to minimize quantization error for important weights
- No backpropagation or retraining required during quantization
- Hardware-friendly INT4 format compatible with GPU kernels
- Requires only a small calibration dataset for activation analysis
- Achieves superior accuracy compared to round-to-nearest and GPTQ at INT4
Common Use Cases
- Deploying large language models on consumer-grade GPUs
- Reducing inference cost in production LLM serving
- Edge deployment of AI models with limited memory
- Enabling faster inference throughput with INT4 GPU kernels
- Compressing open-source LLMs for local and offline usage
Example
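A self-contained sketch of the effect, using round-to-nearest INT4 quantization on a toy layer where one input channel carries much larger activations. The shapes, the seed, and the fixed 0.5 scaling exponent are illustrative assumptions; the real AWQ implementation searches over the exponent per layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest quantization, one scale per output column."""
    q_max = 2 ** (n_bits - 1) - 1                        # 7 for INT4
    scale = np.abs(w).max(axis=0, keepdims=True) / q_max
    return np.round(w / scale).clip(-q_max, q_max) * scale

# Toy calibration set: input channel 3 has much larger activations (salient).
x = rng.normal(size=(128, 16))
x[:, 3] *= 20.0
w = rng.normal(size=(16, 16))

# Naive INT4: quantize the weights directly.
err_naive = np.linalg.norm(x @ w - x @ quantize_rtn(w))

# AWQ-style: scale weight rows by s = mean|activation|^0.5, quantize,
# then fold 1/s into the activations (a mathematically equivalent network).
s = np.abs(x).mean(axis=0) ** 0.5
err_awq = np.linalg.norm(x @ w - (x / s) @ quantize_rtn(w * s[:, None]))

print(f"naive RTN error: {err_naive:.2f}, AWQ-style error: {err_awq:.2f}")
```

With the salient channel protected by scaling, the AWQ-style variant shows a noticeably smaller output error than naive round-to-nearest on this toy layer.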
Frequently Asked Questions
What is AWQ and how does it work?
AWQ (Activation-aware Weight Quantization) is a weight-only quantization technique that preserves model accuracy at INT4 precision. It works by analyzing activation distributions from a small calibration dataset to identify which weight channels are most important. Rather than keeping salient weights at higher precision, AWQ applies per-channel scaling factors that mathematically reduce quantization error for critical weights, achieving better accuracy without mixed-precision overhead.
How does AWQ compare to GPTQ?
Both AWQ and GPTQ are popular INT4 weight quantization methods, but they differ in approach. GPTQ uses an approximate second-order method (based on Optimal Brain Surgeon) that iteratively adjusts weight rounding to minimize layer-wise reconstruction error. AWQ instead protects salient weights via activation-aware scaling chosen by a simple search. In practice, AWQ often achieves slightly better perplexity scores, is faster to quantize (no iterative per-weight updates), and produces more hardware-friendly quantized formats.
What performance improvement does AWQ provide?
AWQ typically reduces model size by approximately 3-4x (from FP16 to INT4) while maintaining near-original accuracy. Inference speed improvements depend on the hardware and serving framework, but AWQ INT4 models commonly achieve 2-3x faster throughput compared to FP16 on GPUs with INT4 kernel support. Memory savings enable running larger models on the same hardware or serving more concurrent users.
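The headline compression ratio is simple arithmetic. A rough sketch for a hypothetical 7B-parameter model, ignoring the small per-group scale and zero-point overhead that brings the practical ratio closer to the 3-4x quoted above:

```python
params = 7e9                        # hypothetical 7B-parameter model
fp16_gb = params * 2 / 1e9          # 16 bits = 2 bytes per weight -> 14.0 GB
int4_gb = params * 0.5 / 1e9        # 4 bits = 0.5 bytes per weight -> 3.5 GB
print(fp16_gb, int4_gb, fp16_gb / int4_gb)  # 14.0 3.5 4.0
```

At roughly 3.5 GB of weights, a 7B model fits comfortably in the VRAM of a consumer GPU that could not hold the 14 GB FP16 original.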
What calibration data does AWQ require?
AWQ requires only a small calibration dataset to analyze activation distributions and identify salient weight channels. Typically, a few hundred samples from a general-purpose text corpus (like a subset of C4 or Pile) are sufficient. The calibration process is fast and does not involve any gradient computation or backpropagation, making AWQ significantly quicker to apply than methods requiring optimization.
Which frameworks support AWQ models?
AWQ models are widely supported across the LLM inference ecosystem. Major frameworks include vLLM (high-throughput serving), TensorRT-LLM (NVIDIA optimized inference), Hugging Face Transformers (via AutoAWQ integration), llama.cpp (CPU and edge deployment), and text-generation-inference (TGI). Pre-quantized AWQ models for popular LLMs are readily available on Hugging Face Hub.