TL;DR
Large Language Model (LLM) inference is the process of generating text from a prompt. It consists of two main phases: the parallel Prefill phase (processing the prompt) and the sequential Decode phase (generating tokens one by one). Techniques like KV Cache are crucial for optimizing memory and speeding up the decoding process.
📋 Table of Contents
- What is LLM Inference?
- How LLM Inference Works: The Two Phases
- Understanding KV Cache
- Key Performance Metrics (TTFT & TPOT)
- LLM Inference in Practice
- Best Practices for Optimizing Inference
- FAQ
- Summary
✨ Key Takeaways
- Autoregressive Generation: LLMs generate text one token at a time, using previously generated tokens as context for the next one.
- Two-Phase Execution: Inference is split into a compute-bound Prefill phase and a memory-bound Decode phase.
- KV Cache is Essential: Storing past Key and Value matrices reduces computational overhead but increases VRAM usage.
- Metrics Matter: Optimizing TTFT improves user perception, while TPOT dictates the reading speed.
💡 Quick Tool: Token Counter — Quickly calculate how many tokens your prompt consumes to estimate inference costs and latency.
What is LLM Inference?
LLM inference is the operational phase where a trained Large Language Model receives an input text (prompt) and predicts the most likely continuation. Unlike the training phase, which updates model weights using massive datasets, inference uses frozen weights to perform forward passes.
Modern LLMs like GPT-4, Llama 3, and DeepSeek operate autoregressively. This means they predict the next token based on all previous tokens, append the new token to the sequence, and repeat the process until a stop condition (like an <EOS> token) is met.
📝 Glossary: Token — The fundamental unit of data processed by an LLM, which can be a word, a part of a word, or a single character.
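The idea of subword tokenization can be sketched with a toy greedy longest-match tokenizer. The vocabulary below is invented purely for illustration; real LLM tokenizers (BPE, SentencePiece) learn their vocabularies from data:

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is
# invented for illustration; real tokenizers learn theirs from data.
VOCAB = {"infer": 0, "ence": 1, "in": 2, "fer": 3,
         "e": 4, "n": 5, "c": 6, "r": 7, "f": 8, "i": 9}

def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

print(tokenize("inference", VOCAB))  # ['infer', 'ence']
```

This is why token counts rarely match word counts: "inference" is one word but two tokens here.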
How LLM Inference Works: The Two Phases
The inference process of a Transformer-based LLM is divided into two distinct phases: Prefill and Decode.
1. The Prefill Phase (Prompt Processing)
When you send a prompt to an LLM, it first enters the Prefill phase. In this stage, the model processes the entire input sequence in parallel.
Because the input tokens are already known, the model can utilize highly optimized matrix multiplications (GEMM) across the GPUs. The goals of the Prefill phase are to:
- Compute the attention scores for the input prompt.
- Populate the initial KV Cache.
- Generate the very first output token.
This phase is typically compute-bound, meaning it is limited by the raw processing power (FLOPs) of the GPU.
2. The Decode Phase (Token Generation)
Once the first token is generated, the model enters the Decode phase. Here, it generates tokens one by one.
For each new token, the model needs to attend to all previously generated tokens. Because this process is sequential, the GPU cannot parallelize it across the sequence dimension as it did in the Prefill phase. Instead, it relies on matrix-vector multiplications (GEMV).
The Decode phase is memory-bandwidth bound. The GPU spends most of its time moving model weights and KV Cache data from HBM (VRAM) to the compute cores.
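The compute-bound vs. memory-bound distinction can be made concrete with a back-of-envelope arithmetic-intensity estimate for a single FP16 weight matrix of shape (d, d). The dimensions below are illustrative assumptions, not taken from any specific model:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights
# read from HBM) for one FP16 weight matrix of shape (d, d).
# Numbers are illustrative assumptions, not from a real model.
def arithmetic_intensity(num_tokens, d, bytes_per_weight=2):
    flops = 2 * num_tokens * d * d           # one multiply-add per weight per token
    weight_bytes = bytes_per_weight * d * d  # weights read once from HBM
    return flops / weight_bytes

d = 4096
print(arithmetic_intensity(2048, d))  # Prefill (2048 tokens): ~2048 FLOPs/byte
print(arithmetic_intensity(1, d))     # Decode (1 token): ~1 FLOP/byte
```

With thousands of tokens processed at once, Prefill does thousands of FLOPs per byte of weights loaded (compute-bound); Decode does roughly one (memory-bound), which is why Decode speed tracks memory bandwidth rather than FLOPs.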
Understanding KV Cache
To generate the next token, the attention mechanism needs the Key (K) and Value (V) representations of all past tokens. Recalculating these representations from scratch for every new token would be incredibly inefficient and scale quadratically with sequence length ($O(N^2)$).
KV Cache solves this by storing the K and V tensors of past tokens in GPU memory.
- Without KV Cache: Every generation step recalculates $K$ and $V$ for the entire history.
- With KV Cache: The model only computes $K$ and $V$ for the newest token, appends them to the cache, and performs attention using the cached history.
While KV Cache drastically reduces computation, it consumes a massive amount of VRAM. For a 100k-token context, the KV Cache alone can take tens of gigabytes, making memory management (e.g., PagedAttention in vLLM) critical for high-throughput inference servers.
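The KV Cache footprint is easy to estimate from the model configuration. The config below is an assumption (a Llama-3-8B-like setup: 32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16); substitute your model's real values:

```python
# Estimate KV Cache size per sequence. The defaults are an assumed
# Llama-3-8B-like config (32 layers, 8 KV heads via GQA, head_dim
# 128, FP16); plug in your model's actual values.
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Factor of 2 for the separate Key and Value tensors.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

gib = kv_cache_bytes(100_000) / 2**30
print(f"KV Cache for a 100k-token context: {gib:.1f} GiB per sequence")
```

Note this is per sequence: serving a batch of concurrent long-context requests multiplies the figure by the batch size, which is exactly the memory pressure PagedAttention is designed to manage.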
Key Performance Metrics (TTFT & TPOT)
When evaluating LLM inference engines, two metrics are paramount:
| Metric | Full Name | Definition | User Impact |
|---|---|---|---|
| TTFT | Time To First Token | The latency between sending the request and receiving the first generated token. | Determines the perceived "responsiveness" of the AI. High TTFT feels laggy. |
| TPOT | Time Per Output Token | The average time it takes to generate each subsequent token during the Decode phase. | Determines the reading speed. If TPOT is 50ms, the model generates 20 tokens/sec. |
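These two metrics combine into a simple end-to-end latency model: the first output token arrives at TTFT, and each of the remaining tokens adds one TPOT.

```python
# Simple latency model: total time ~= TTFT + (N - 1) * TPOT,
# since the first of the N output tokens arrives at TTFT.
def total_latency_s(ttft_s, tpot_s, num_output_tokens):
    return ttft_s + (num_output_tokens - 1) * tpot_s

def tokens_per_second(tpot_s):
    return 1.0 / tpot_s

# Example from the table: a TPOT of 50 ms means 20 tokens/sec.
print(tokens_per_second(0.050))
# 200 output tokens with TTFT = 0.5 s: 0.5 + 199 * 0.05 ~= 10.45 s total
print(total_latency_s(0.5, 0.050, 200))
```

The model makes the trade-off explicit: shaving TTFT helps short answers feel snappy, while TPOT dominates total latency for long generations.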
LLM Inference in Practice
Here is a simplified example of how autoregressive inference works under the hood, written in Python against the Hugging Face `transformers` API:
```python
import torch

def generate_text(model, tokenizer, prompt, max_tokens=50):
    # Tokenize the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # KV Cache initialization
    past_key_values = None
    generated_tokens = []

    for _ in range(max_tokens):
        # Forward pass (the first iteration is the Prefill phase)
        with torch.no_grad():
            outputs = model(
                input_ids=input_ids,
                past_key_values=past_key_values,
                use_cache=True,
            )

        # Pick the next token (argmax = greedy decoding)
        next_token_logits = outputs.logits[:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1)
        generated_tokens.append(next_token.item())

        # Update the KV Cache for the next iteration
        past_key_values = outputs.past_key_values

        # Thanks to the cache, the new input is just the single generated token
        input_ids = next_token.unsqueeze(0)

        # Stop condition
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated_tokens)
```
🔧 Try it now: Use our free JSON Formatter to parse and visualize the API responses returned by LLM inference engines.
Best Practices for Optimizing Inference
- Use PagedAttention: Frameworks like vLLM use PagedAttention to manage KV Cache memory dynamically, eliminating memory fragmentation and increasing batch sizes by up to 5x.
- Apply Model Quantization: Reduce weights from FP16 to INT8 or INT4 (e.g., using AWQ or GPTQ) to lower VRAM requirements and increase memory bandwidth speed during the Decode phase.
- Enable Continuous Batching: Instead of waiting for all requests in a batch to finish, continuously inject new requests into the batch as soon as others complete.
- Optimize Prompt Length: Because the Prefill phase scales quadratically with prompt length, keeping context concise reduces TTFT and initial compute costs.
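The quantization arithmetic behind the second practice is straightforward: weight VRAM scales linearly with bits per weight. The 8B-parameter model size below is an assumed example:

```python
# Rough VRAM needed just for model weights at different precisions.
# The 8B-parameter model size is an assumed example; KV Cache and
# activations come on top of this.
def weight_vram_gib(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 2**30

params = 8e9
for bits, name in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{name}: {weight_vram_gib(params, bits):.1f} GiB")
```

Going from FP16 to INT4 cuts weight memory roughly 4x, which both fits larger models on a given GPU and reduces the bytes streamed from HBM per decode step.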
⚠️ Common Mistakes:
- Ignoring KV Cache Memory → Fix: Always calculate maximum KV Cache size before deploying. Out-of-memory (OOM) errors during generation are usually caused by an overflowing KV Cache, not model weights.
- Using the naive Hugging Face `pipeline` for production → Fix: Use dedicated inference servers like vLLM, TGI, or TensorRT-LLM, which implement continuous batching and PagedAttention.
FAQ
Q1: Why does generating text get slower as the output gets longer?
As the generated sequence grows, the KV Cache size increases. During the Decode phase, the GPU must read the entire KV Cache from memory for every single token. This memory bandwidth bottleneck causes the generation speed (TPOT) to degrade slightly as context length increases.
Q2: What is Continuous Batching (In-flight Batching)?
In traditional static batching, the GPU waits for the longest sequence in a batch to finish before starting a new batch. Continuous batching dynamically adds new requests to the batch and removes finished ones at the token level, drastically improving GPU utilization.
Q3: How do Speculative Decoding and KV Cache relate?
Speculative Decoding uses a smaller "draft" model to guess multiple future tokens, and the main LLM verifies them in a single parallel pass. This shifts the workload from memory-bound decoding to compute-bound verification, effectively speeding up inference without changing the KV Cache structure.
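The accept/reject logic of speculative decoding can be sketched with toy deterministic "models" over integer tokens. Both functions below are invented stand-ins (a real system runs the target model's verification as a single parallel forward pass, and also emits a bonus token when every guess is accepted):

```python
# Toy sketch of speculative decoding with deterministic stand-in
# "models" over integer tokens, purely for illustration.
def target_next(seq):
    # The large model's next-token rule: increment mod 100.
    return (seq[-1] + 1) % 100

def draft_next(seq):
    # A cheaper draft model that guesses wrong on multiples of 5.
    nxt = (seq[-1] + 1) % 100
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_step(seq, k=4):
    # 1. Draft model guesses k tokens sequentially (cheap).
    draft = list(seq)
    for _ in range(k):
        draft.append(draft_next(draft))
    guesses = draft[len(seq):]
    # 2. Target model verifies the guesses (one parallel pass in a
    #    real system): accept until the first mismatch, then emit
    #    the target's own token at that position.
    accepted = []
    for g in guesses:
        t = target_next(list(seq) + accepted)
        accepted.append(t)
        if t != g:
            break
    return seq + accepted

print(speculative_step([1]))  # [1, 2, 3, 4, 5]
```

Here one verification step yields four tokens (three accepted guesses plus the target's correction) instead of one, which is the source of the speedup.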
Summary
LLM inference is a complex dance between compute and memory. By understanding the distinction between the parallel Prefill phase and the sequential Decode phase, and leveraging KV Cache optimization techniques, developers can drastically reduce latency and serving costs.
👉 Explore QubitTool Developer Tools — Enhance your AI development workflow with our suite of free utilities.