What is Decode Phase?

Decode Phase is the LLM inference stage that generates output one token at a time using the KV cache created during prefill.

How It Works

The decode phase begins after prefill and continues until the request hits a stop condition (such as an end-of-sequence token or a stop string), reaches its output length limit, or has its stream terminated, for example when the client disconnects. Unlike prefill, which processes all prompt tokens in parallel, decoding is sequential within each request because every new token depends on the tokens generated before it. This makes decode performance central to tokens per second, streaming responsiveness, GPU memory pressure, and serving throughput. Techniques such as continuous batching, speculative decoding, efficient KV cache management, and optimized attention kernels all target decode bottlenecks.
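
The control flow described above can be sketched in a few lines. This is a toy, not a real model: `toy_forward`, `EOS`, and the flat-list cache are hypothetical stand-ins chosen so the example runs with only the standard library. What it illustrates is the shape of the loop: one prefill pass over the whole prompt builds the cache, then each decode step feeds in only the newest token, appends one cache entry, and checks the stop conditions.

```python
EOS = 0  # hypothetical end-of-sequence token id


def toy_forward(new_tokens, kv_cache):
    """Stand-in for a model forward pass (not a real Transformer).

    Appends one cache entry per new token, then derives the next token
    from the whole cache. A real model would run attention over cached
    key/value tensors instead of this arithmetic.
    """
    for tok in new_tokens:
        kv_cache.append(tok)
    return (sum(kv_cache) * 7 + len(kv_cache)) % 10


def generate(prompt, max_new_tokens):
    kv_cache = []
    # Prefill: the whole prompt goes through in one pass and yields the first token.
    next_tok = toy_forward(prompt, kv_cache)
    output = [next_tok]
    # Decode: one token per step; each step needs the token from the step before.
    while next_tok != EOS and len(output) < max_new_tokens:
        next_tok = toy_forward([next_tok], kv_cache)
        output.append(next_tok)
    return output, len(kv_cache)


tokens, cache_len = generate(prompt=[3, 1, 4], max_new_tokens=8)
print("generated:", tokens)
print("cache entries:", cache_len)
```

Note that the cache grows by exactly one entry per decode step, which is why long outputs and many concurrent requests translate directly into memory pressure.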

Key Characteristics

Generates one output token at a time for each active request
Reuses KV cache from prefill and appends new cache entries
Often determines streaming speed and tokens per second
Memory-bandwidth-bound in many serving workloads because each step rereads the model weights and the cached keys and values while doing little compute per token
Optimized by batching, speculative decoding, and efficient cache management
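
A back-of-the-envelope calculation makes the cache-related memory pressure concrete. The sketch below assumes the common attention layout in which every layer stores one key vector and one value vector per KV head for every token. The model dimensions, context length, and request count are hypothetical values picked for illustration, not figures from any specific deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(per_token, tokens_per_request, concurrent_requests):
    # Total cache if every request holds its full context at the same time.
    return per_token * tokens_per_request * concurrent_requests


# Hypothetical configuration: 32 layers, 32 KV heads of size 128, 16-bit values.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_elem=2
)
total = kv_cache_bytes(per_token, tokens_per_request=4096, concurrent_requests=16)

print(per_token / 2**20, "MiB per token")   # 0.5 MiB per token
print(total / 2**30, "GiB for all requests")  # 32.0 GiB for all requests
```

Under these assumptions the cache alone needs 32 GiB, before counting model weights. Reducing the number of KV heads (as grouped-query attention does) or the bytes per element shrinks the figure proportionally.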

Common Use Cases

Measuring output generation speed after the first token appears
Debugging slow streaming responses from an LLM service
Sizing GPU memory for long outputs and many concurrent users
Evaluating speculative decoding and continuous batching improvements
Separating prompt-processing latency from generation latency
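
When evaluating continuous batching, a small simulation helps show where the gain comes from. The sketch below counts decode steps only and makes simplifying assumptions: prefill cost is ignored, every step costs the same regardless of batch contents, and the request lengths are invented. The function names are hypothetical and do not correspond to any serving framework's API.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    # Each fixed batch runs until its longest request has finished.
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    # A finished request frees its slot; a waiting request takes it on the next step.
    waiting = deque(output_lens)
    active = []  # remaining output tokens for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step produces one token for every active request
        active = [r - 1 for r in active if r > 1]  # drop requests that just finished
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lens = [10, 200, 15, 20, 180, 12, 25, 30]
print("static:", static_batching_steps(lens, batch_size=4))
print("continuous:", continuous_batching_steps(lens, batch_size=4))
```

With mixed output lengths, static batching leaves slots idle while the longest request in each batch finishes, whereas continuous batching backfills them.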

Example

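
A minimal sketch of separating prompt-processing latency from generation latency, assuming you have recorded the time the request was sent and the arrival time of each streamed token. The function name `split_latency` and the timestamps are hypothetical; in practice the timestamps would typically come from calls to `time.perf_counter()` as each chunk arrives.

```python
def split_latency(request_start, token_times):
    """Split a streamed response into TTFT and decode-phase speed.

    request_start: time the request was sent, in seconds
    token_times:   arrival time of each output token, in seconds, in order
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1            # tokens after the first one
    decode_time = token_times[-1] - token_times[0]  # time spent after the first token
    if decode_time > 0:
        tokens_per_second = decode_tokens / decode_time
    else:
        tokens_per_second = float("nan")            # a single token has no decode rate
    return ttft, tokens_per_second


# Hypothetical trace: first token after 0.40 s, then one token every 25 ms.
start = 0.0
arrivals = [0.40 + 0.025 * i for i in range(101)]

ttft, tps = split_latency(start, arrivals)
print(f"TTFT: {ttft:.2f} s, decode speed: {tps:.1f} tokens/s")
```

Measuring the decode rate from the first token onward keeps queueing and prefill time out of the tokens-per-second figure.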

Frequently Asked Questions

Why is decoding slower than prefill per token?

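Decoding is sequential for each request because every generated token depends on previous tokens, which limits parallelism. Each step also has to read the model weights and the KV cache to produce a single token per request, so the hardware spends most of its time moving data rather than computing.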

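Does the decode phase affect TTFT?

Only marginally. Time to first token (TTFT) is dominated by queueing and prefill, because the first output token is normally sampled from the output of the prefill pass. Decode speed mainly determines inter-token latency, streaming speed, and total response time.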

What improves decode performance?

Continuous batching, efficient KV cache memory management, optimized attention kernels, and speculative decoding can all help.
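
Of these, speculative decoding is the least intuitive, so a toy sketch may help. It shows only the greedy variant: a cheap draft model proposes `k` tokens, the target model checks them, and the longest matching prefix is accepted, so the output is identical to plain greedy decoding with the target. Both "models" here are hypothetical deterministic functions, and the per-token verification loop stands in for what a real system would do in one batched forward pass. Sampling-based speculative decoding uses a probabilistic acceptance rule that is not shown.

```python
def target_next(seq):
    # Hypothetical "large" model: greedy next token as a deterministic function.
    return (sum(seq) * 3 + len(seq)) % 11


def draft_next(seq):
    # Hypothetical "small" model: agrees with the target except when len(seq) % 4 == 0.
    tok = target_next(seq)
    return (tok + 1) % 11 if len(seq) % 4 == 0 else tok


def greedy_decode(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n, k=4):
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n:
        # 1. Draft k tokens cheaply, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Verify the draft against the target (one batched pass in a real system).
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong token, then stop
                break
            accepted.append(tok)
        else:
            # All k drafts matched, so the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):][:n], target_passes


prompt = [5, 2]
spec_out, passes = speculative_decode(prompt, n=12)
assert spec_out == greedy_decode(prompt, n=12)  # same tokens as plain greedy decoding
print(spec_out, "target passes:", passes)
```

The benefit comes from the target model running fewer sequential passes than the number of tokens produced. How much fewer depends on how often the draft agrees with the target.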

Why does output length matter for decoding?

Each additional output token requires another decode step, so longer answers increase total generation latency.
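
As a rough rule of thumb, total latency is the time to first token plus one inter-token interval for every further token. The sketch below applies that estimate with hypothetical numbers. It assumes a constant inter-token latency, which is a simplification: in practice each step tends to get slower as the KV cache grows, and batching with other requests changes the per-step time.

```python
def estimate_total_latency(ttft_s, inter_token_latency_s, output_tokens):
    # The first token arrives at TTFT; each further token adds one decode step.
    return ttft_s + max(output_tokens - 1, 0) * inter_token_latency_s


# Hypothetical service: 0.4 s to first token, 25 ms between tokens.
for n in (50, 500, 2000):
    total = estimate_total_latency(0.4, 0.025, n)
    print(f"{n} output tokens -> about {total:.1f} s")
```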

Related Terms

Prefill

Prefill is the LLM inference phase that processes the full input prompt in parallel and produces the initial key-value cache before token-by-token decoding begins.

Tokens per Second

Tokens per Second is a throughput metric that measures how many output tokens an LLM generates per second during the decode phase.

KV Cache

KV Cache (Key-Value Cache) is an optimization used in Transformer-based model inference that stores the Key and Value tensors already computed by the attention layers for earlier tokens. Each autoregressive step then computes keys and values only for the newest token instead of recomputing them for the whole sequence, which greatly reduces the compute needed per generated token.