What is KV Cache?
KV Cache (Key-Value Cache) is an optimization technique used in Transformer-based model inference that stores previously computed Key and Value matrices from the attention mechanism, eliminating redundant calculations during autoregressive token generation and dramatically improving inference speed.
Quick Facts
| Full Name | Key-Value Cache |
|---|---|
| Created | Builds on the Transformer attention mechanism (Vaswani et al., 2017); KV caching was adopted later as a standard inference-time optimization |
How It Works
KV Cache is a fundamental optimization in modern large language model inference. During autoregressive generation, each new token must attend to all previous tokens through the self-attention mechanism. Without caching, every generation step would recompute the Key and Value projections for the entire sequence so far, so the per-step cost grows quadratically with sequence length. KV Cache avoids this by storing the Key and Value matrices of all previously processed tokens, so each step computes projections only for the new token and attends against the cache. This reduces the per-step cost from O(n²) to O(n), enabling practical deployment of LLMs at scale.
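The per-step flow described above can be sketched in plain Python. This is a toy single-head decoding loop, not a real model: the identity "projections" and 2-dimensional embeddings are assumptions made purely to keep the sketch short. The point is that each step appends exactly one new Key/Value pair to the cache instead of recomputing all of them.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attend(q, k_cache, v_cache):
    """One decoding step of single-head attention over the cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) for j in range(d)]

# Toy generation loop: at each step we compute K/V only for the NEW token
# and append them to the cache; earlier projections are never recomputed.
k_cache, v_cache = [], []
for token_embedding in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    # In a real model these would be learned W_q/W_k/W_v projections
    # (assumption: identity projections, to keep the example self-contained).
    q = k = v = token_embedding
    k_cache.append(k)
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)
```

After the loop the cache holds one Key and one Value vector per generated position, which is exactly the linear memory growth discussed below.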
Key Characteristics
- Caches Key and Value matrices from all previous tokens in attention layers
- Eliminates redundant computation during autoregressive generation
- Memory usage grows linearly with sequence length and model depth
- Requires careful memory management for long-context scenarios
- Compatible with various quantization techniques to reduce cache size
- Supports advanced variants like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Common Use Cases
- Accelerating LLM text generation in production inference servers
- Reducing latency in real-time conversational AI applications
- Enabling efficient batch inference with continuous batching
- Long-context processing with memory-efficient KV cache compression
- Serving multiple concurrent users with shared prefix caching
Example
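A minimal way to see the savings is to count Key/Value projection computations with and without a cache. This is a back-of-the-envelope sketch, not a benchmark; it assumes one projection per token per step.

```python
def projections_without_cache(seq_len):
    # Each step re-projects K/V for ALL tokens generated so far:
    # 1 + 2 + ... + n projections in total.
    return sum(range(1, seq_len + 1))

def projections_with_cache(seq_len):
    # Each step projects K/V only for the newest token: n projections in total.
    return seq_len

n = 1024
savings = projections_without_cache(n) / projections_with_cache(n)
# the ratio grows linearly with n: caching saves roughly (n + 1) / 2 x work
```

For a 1024-token generation, the uncached loop performs about 512x more Key/Value projection work than the cached one.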
Frequently Asked Questions
What is KV Cache in Transformer models?
KV Cache (Key-Value Cache) is an inference optimization that stores the Key and Value matrices computed during previous generation steps. In autoregressive text generation, instead of recomputing attention keys and values for all previous tokens at each step, the cached values are reused. This avoids redundant computation and significantly speeds up token generation.
How does KV Cache affect memory usage?
KV Cache memory grows linearly with sequence length, number of layers, and the number of attention heads. For large models with long contexts, KV cache can consume significant GPU memory. For example, a 70B parameter model with 128K context length may require tens of gigabytes just for the KV cache. Techniques like GQA, MQA, and KV cache quantization help reduce this memory footprint.
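The linear growth above can be made concrete with a simple size formula: 2 (for Keys and Values) x layers x KV heads x head dimension x sequence length x bytes per element. The configuration numbers below are an assumption loosely modeled on a 70B-class model with GQA; substitute your model's actual values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch=1):
    """Estimate KV cache size: 2x accounts for Keys AND Values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Assumed 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128,
# fp16 (2 bytes), 128K context.
gib = kv_cache_bytes(80, 8, 128, 131072) / 2**30
# roughly 40 GiB for a single sequence at full context length
```

Note that this is per sequence: serving many concurrent long-context users multiplies the footprint by the batch size, which is why the compression techniques below matter in practice.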
What are Multi-Query Attention and Grouped-Query Attention?
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) are architectural modifications that reduce KV cache size. MQA shares a single Key-Value head across all query heads, while GQA groups multiple query heads to share fewer Key-Value heads. These techniques significantly reduce KV cache memory usage with minimal impact on model quality, and are widely adopted in modern LLMs like Llama 2 and Mistral.
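Since cache size scales with the number of KV heads, the reduction from MQA/GQA is easy to quantify. A small sketch, assuming 32 query heads and a GQA group size of 4 (both illustrative numbers, not tied to any particular model):

```python
def kv_heads(num_query_heads, mode, group_size=4):
    """Distinct K/V heads whose caches must be stored per layer.

    Assumes group_size divides the query-head count evenly.
    """
    if mode == "mha":
        return num_query_heads            # one KV head per query head
    if mode == "gqa":
        return num_query_heads // group_size
    if mode == "mqa":
        return 1                          # a single KV head shared by all queries
    raise ValueError(mode)

# Cache memory shrinks in direct proportion to the KV-head count:
sizes = {m: kv_heads(32, m) for m in ("mha", "gqa", "mqa")}
```

With these numbers, GQA cuts the cache 4x and MQA cuts it 32x relative to full multi-head attention, while the number of query heads (and thus most of the model's expressiveness) is unchanged.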
Can KV Cache be quantized to save memory?
Yes, KV cache quantization is an active area of optimization. Techniques like INT8 or INT4 quantization of cached Key and Value tensors can reduce memory usage by 2-4x with minimal quality degradation. Frameworks like vLLM and TensorRT-LLM support KV cache quantization, making it practical to serve longer contexts and more concurrent users on the same hardware.
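The core idea can be illustrated with symmetric per-tensor INT8 quantization of a small KV tensor. This is a simplified sketch: production systems like vLLM typically quantize per channel or per token with fused kernels, which this pure-Python version does not attempt to model.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (a sketch)."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid zero scale
    q = [round(v / scale) for v in values]            # int8 codes in [-127, 127]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

kv = [0.12, -0.9, 0.33, 0.05]          # a few cached values (made-up numbers)
q, scale = quantize_int8(kv)
restored = dequantize_int8(q, scale)
# fp16 -> int8 halves the cache footprint; int4 would quarter it,
# trading additional precision for memory.
```

The maximum reconstruction error here is half the scale step, which is why well-behaved Key/Value distributions survive 8-bit storage with little quality loss.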
What is prefix caching and how does it relate to KV Cache?
Prefix caching is a technique where the KV cache for a common prompt prefix is computed once and shared across multiple requests. This is especially useful for applications where many users share the same system prompt or context. By avoiding redundant computation of the shared prefix, prefix caching significantly improves throughput and reduces time-to-first-token in production serving scenarios.
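A toy version of this lookup can be sketched with a dictionary keyed by a hash of the prefix. The `PrefixCache` class and the string stand-in for the KV tensors are assumptions for illustration; real servers key on token IDs and store actual GPU tensors (often in paged blocks).

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: maps a prompt prefix to its (mock) KV cache."""

    def __init__(self):
        self.store = {}
        self.prefills = 0

    def _key(self, prefix):
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get_kv(self, prefix):
        key = self._key(prefix)
        if key not in self.store:
            self.prefills += 1  # the expensive prefill runs only once per prefix
            self.store[key] = f"kv-for-{len(prefix)}-chars"  # stand-in for tensors
        return self.store[key]

cache = PrefixCache()
system_prompt = "You are a helpful assistant."
for _ in range(3):              # three requests sharing the same system prompt
    cache.get_kv(system_prompt)
# only one prefill was performed despite three requests
```

Each subsequent request skips the shared prefix entirely and begins decoding from the cached state, which is where the time-to-first-token improvement comes from.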