TL;DR
Large Language Model (LLM) inference is the process of generating text from a prompt. It consists of two main phases: the parallel Prefill phase (processing the prompt) and the sequential Decode phase (generating tokens one by one). Techniques like KV Cache are crucial for optimizing memory and speeding up the decoding process.
📋 Table of Contents
- What is LLM Inference?
- How LLM Inference Works: The Two Phases
- Understanding KV Cache
- Key Performance Metrics (TTFT & TPOT)
- LLM Inference in Practice
- Best Practices for Optimizing Inference
- FAQ
- Summary
✨ Key Takeaways
- Autoregressive Generation: LLMs generate text one token at a time, using previously generated tokens as context for the next one.
- Two-Phase Execution: Inference is split into a compute-bound Prefill phase and a memory-bound Decode phase.
- KV Cache is Essential: Storing past Key and Value matrices reduces computational overhead but increases VRAM usage.
- Metrics Matter: Optimizing TTFT improves user perception, while TPOT dictates the reading speed.
💡 Quick Tool: Token Counter — Quickly calculate how many tokens your prompt consumes to estimate inference costs and latency.
What is LLM Inference?
LLM inference is the operational phase where a trained Large Language Model receives an input text (prompt) and predicts the most likely continuation. Unlike the training phase, which updates model weights using massive datasets, inference uses frozen weights to perform forward passes.
Modern LLMs like GPT-4, Llama 3, and DeepSeek operate autoregressively. This means they predict the next token based on all previous tokens, append the new token to the sequence, and repeat the process until a stop condition (like an <EOS> token) is met.
📝 Glossary: Token — The fundamental unit of data processed by an LLM, which can be a word, a part of a word, or a single character.
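The idea of subword tokenization can be sketched with a toy greedy longest-match tokenizer. The vocabulary below is invented purely for illustration; real LLM tokenizers (BPE, SentencePiece) learn their vocabularies from data:

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is
# invented for illustration; real tokenizers learn theirs from data.
VOCAB = {"infer": 0, "ence": 1, "in": 2, "fer": 3,
         "e": 4, "n": 5, "c": 6, "r": 7, "f": 8, "i": 9}

def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

print(tokenize("inference", VOCAB))  # ['infer', 'ence']
```

This is why token counts rarely match word counts: "inference" is one word but two tokens here.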
How LLM Inference Works: The Two Phases
The inference process of a Transformer-based LLM is divided into two distinct phases: Prefill and Decode.
1. The Prefill Phase (Prompt Processing)
When you send a prompt to an LLM, it first enters the Prefill phase. In this stage, the model processes the entire input sequence in parallel.
Because the input tokens are already known, the model can utilize highly optimized matrix multiplications (GEMM) across the GPUs. The goals of the Prefill phase are to:
- Compute the attention scores for the input prompt.
- Populate the initial KV Cache.
- Generate the very first output token.
This phase is typically compute-bound, meaning it is limited by the raw processing power (FLOPs) of the GPU.
2. The Decode Phase (Token Generation)
Once the first token is generated, the model enters the Decode phase. Here, it generates tokens one by one.
For each new token, the model needs to attend to all previously generated tokens. Because this process is sequential, the GPU cannot parallelize it across the sequence dimension as it did in the Prefill phase. Instead, it relies on matrix-vector multiplications (GEMV).
The Decode phase is memory-bandwidth bound. The GPU spends most of its time moving model weights and KV Cache data from HBM (VRAM) to the compute cores.
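The compute-bound vs. memory-bound distinction can be made concrete with a back-of-envelope arithmetic-intensity estimate for a single FP16 weight matrix of shape (d, d). The dimensions below are illustrative assumptions, not taken from any specific model:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights
# read from HBM) for one FP16 weight matrix of shape (d, d).
# Numbers are illustrative assumptions, not from a real model.
def arithmetic_intensity(num_tokens, d, bytes_per_weight=2):
    flops = 2 * num_tokens * d * d           # one multiply-add per weight per token
    weight_bytes = bytes_per_weight * d * d  # weights read once from HBM
    return flops / weight_bytes

d = 4096
print(arithmetic_intensity(2048, d))  # Prefill (2048 tokens): ~2048 FLOPs/byte
print(arithmetic_intensity(1, d))     # Decode (1 token): ~1 FLOP/byte
```

With thousands of tokens processed at once, Prefill does thousands of FLOPs per byte of weights loaded (compute-bound); Decode does roughly one (memory-bound), which is why Decode speed tracks memory bandwidth rather than FLOPs.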
Understanding KV Cache
To generate the next token, the attention mechanism needs the Key (K) and Value (V) representations of all past tokens. Recalculating these representations from scratch for every new token would be incredibly inefficient and scale quadratically with sequence length ($O(N^2)$).
KV Cache solves this by storing the K and V tensors of past tokens in GPU memory.
- Without KV Cache: Every generation step recalculates $K$ and $V$ for the entire history.
- With KV Cache: The model only computes $K$ and $V$ for the newest token, appends them to the cache, and performs attention using the cached history.
While KV Cache drastically reduces computation, it consumes a massive amount of VRAM. For a 100k-token context, the KV Cache alone can take tens of gigabytes, making memory management (e.g., PagedAttention in vLLM) critical for high-throughput inference servers.
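The KV Cache footprint is easy to estimate from the model configuration. The config below is an assumption (a Llama-3-8B-like setup: 32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16); substitute your model's real values:

```python
# Estimate KV Cache size per sequence. The defaults are an assumed
# Llama-3-8B-like config (32 layers, 8 KV heads via GQA, head_dim
# 128, FP16); plug in your model's actual values.
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Factor of 2 for the separate Key and Value tensors.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

gib = kv_cache_bytes(100_000) / 2**30
print(f"KV Cache for a 100k-token context: {gib:.1f} GiB per sequence")
```

Note this is per sequence: serving a batch of concurrent long-context requests multiplies the figure by the batch size, which is exactly the memory pressure PagedAttention is designed to manage.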
Key Performance Metrics (TTFT & TPOT)
When evaluating LLM inference engines, two metrics are paramount:
| Metric | Full Name | Definition | User Impact |
|---|---|---|---|
| TTFT | Time To First Token | The latency between sending the request and receiving the first generated token. | Determines the perceived "responsiveness" of the AI. High TTFT feels laggy. |
| TPOT | Time Per Output Token | The average time it takes to generate each subsequent token during the Decode phase. | Determines the reading speed. If TPOT is 50ms, the model generates 20 tokens/sec. |
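These two metrics combine into a simple end-to-end latency model: the first output token arrives at TTFT, and each of the remaining tokens adds one TPOT.

```python
# Simple latency model: total time ~= TTFT + (N - 1) * TPOT,
# since the first of the N output tokens arrives at TTFT.
def total_latency_s(ttft_s, tpot_s, num_output_tokens):
    return ttft_s + (num_output_tokens - 1) * tpot_s

def tokens_per_second(tpot_s):
    return 1.0 / tpot_s

# Example from the table: a TPOT of 50 ms means 20 tokens/sec.
print(tokens_per_second(0.050))
# 200 output tokens with TTFT = 0.5 s: 0.5 + 199 * 0.05 ~= 10.45 s total
print(total_latency_s(0.5, 0.050, 200))
```

The model makes the trade-off explicit: shaving TTFT helps short answers feel snappy, while TPOT dominates total latency for long generations.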
LLM Inference in Practice
Here is a simplified example of how autoregressive inference works under the hood, written in Python against the Hugging Face `transformers` API:
```python
import torch

def generate_text(model, tokenizer, prompt, max_tokens=50):
    # Tokenize the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # KV Cache initialization
    past_key_values = None
    generated_tokens = []

    for _ in range(max_tokens):
        # Forward pass (the first iteration is the Prefill phase)
        with torch.no_grad():
            outputs = model(
                input_ids=input_ids,
                past_key_values=past_key_values,
                use_cache=True,
            )

        # Pick the next token (argmax = greedy decoding)
        next_token_logits = outputs.logits[:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1)
        generated_tokens.append(next_token.item())

        # Update the KV Cache for the next iteration
        past_key_values = outputs.past_key_values

        # Thanks to the cache, the new input is just the single generated token
        input_ids = next_token.unsqueeze(0)

        # Stop condition
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated_tokens)
```
🔧 Try it now: Use our free JSON Formatter to parse and visualize the API responses returned by LLM inference engines.
Best Practices for Optimizing Inference
- Use PagedAttention: Frameworks like vLLM use PagedAttention to manage KV Cache memory dynamically, eliminating memory fragmentation and increasing batch sizes by up to 5x.
- Apply Model Quantization: Reduce weights from FP16 to INT8 or INT4 (e.g., using AWQ or GPTQ) to lower VRAM requirements and increase memory bandwidth speed during the Decode phase.
- Enable Continuous Batching: Instead of waiting for all requests in a batch to finish, continuously inject new requests into the batch as soon as others complete.
- Optimize Prompt Length: Because the Prefill phase scales quadratically with prompt length, keeping context concise reduces TTFT and initial compute costs.
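The quantization arithmetic behind the second practice is straightforward: weight VRAM scales linearly with bits per weight. The 8B-parameter model size below is an assumed example:

```python
# Rough VRAM needed just for model weights at different precisions.
# The 8B-parameter model size is an assumed example; KV Cache and
# activations come on top of this.
def weight_vram_gib(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 2**30

params = 8e9
for bits, name in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{name}: {weight_vram_gib(params, bits):.1f} GiB")
```

Going from FP16 to INT4 cuts weight memory roughly 4x, which both fits larger models on a given GPU and reduces the bytes streamed from HBM per decode step.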
⚠️ Common Mistakes:
- Ignoring KV Cache Memory → Fix: Always calculate maximum KV Cache size before deploying. Out-of-memory (OOM) errors during generation are usually caused by an overflowing KV Cache, not model weights.
- Using the naive Hugging Face `pipeline` for production → Fix: Use dedicated inference servers like vLLM, TGI, or TensorRT-LLM, which implement continuous batching and PagedAttention.
FAQ
Q1: Why does generating text get slower as the output gets longer?
As the generated sequence grows, the KV Cache size increases. During the Decode phase, the GPU must read the entire KV Cache from memory for every single token. This memory bandwidth bottleneck causes the generation speed (TPOT) to degrade slightly as context length increases.
Q2: What is Continuous Batching (In-flight Batching)?
In traditional static batching, the GPU waits for the longest sequence in a batch to finish before starting a new batch. Continuous batching dynamically adds new requests to the batch and removes finished ones at the token level, drastically improving GPU utilization.
Q3: How do Speculative Decoding and KV Cache relate?
Speculative Decoding uses a smaller "draft" model to guess multiple future tokens, and the main LLM verifies them in a single parallel pass. This shifts the workload from memory-bound decoding to compute-bound verification, effectively speeding up inference without changing the KV Cache structure.
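The accept/reject logic of speculative decoding can be sketched with toy deterministic "models" over integer tokens. Both functions below are invented stand-ins (a real system runs the target model's verification as a single parallel forward pass, and also emits a bonus token when every guess is accepted):

```python
# Toy sketch of speculative decoding with deterministic stand-in
# "models" over integer tokens, purely for illustration.
def target_next(seq):
    # The large model's next-token rule: increment mod 100.
    return (seq[-1] + 1) % 100

def draft_next(seq):
    # A cheaper draft model that guesses wrong on multiples of 5.
    nxt = (seq[-1] + 1) % 100
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_step(seq, k=4):
    # 1. Draft model guesses k tokens sequentially (cheap).
    draft = list(seq)
    for _ in range(k):
        draft.append(draft_next(draft))
    guesses = draft[len(seq):]
    # 2. Target model verifies the guesses (one parallel pass in a
    #    real system): accept until the first mismatch, then emit
    #    the target's own token at that position.
    accepted = []
    for g in guesses:
        t = target_next(list(seq) + accepted)
        accepted.append(t)
        if t != g:
            break
    return seq + accepted

print(speculative_step([1]))  # [1, 2, 3, 4, 5]
```

Here one verification step yields four tokens (three accepted guesses plus the target's correction) instead of one, which is the source of the speedup.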
Summary
LLM inference is a complex dance between compute and memory. By understanding the distinction between the parallel Prefill phase and the sequential Decode phase, and leveraging KV Cache optimization techniques, developers can drastically reduce latency and serving costs.
👉 Explore QubitTool Developer Tools — Enhance your AI development workflow with our suite of free utilities.