What is Decode Phase?

The decode phase is the stage of LLM inference that generates output one token at a time, reusing and extending the KV cache created during prefill.

How It Works

The decode phase begins after prefill and continues until the model emits a stop token or stop sequence, reaches the output length limit, or the stream is terminated. Unlike prefill, which processes the prompt tokens in parallel, decoding is sequential within each request because every new token depends on the tokens generated before it. Each step also appends to the KV cache, so decode performance drives tokens per second, streaming responsiveness, GPU memory pressure, and serving throughput. Techniques such as continuous batching, speculative decoding, efficient KV cache management, and optimized attention kernels all target decode bottlenecks.
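
The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `toy_forward`, `KVCache`, and the `EOS` id are hypothetical stand-ins, and the "model" is simple arithmetic. The sketch assumes the first output token is sampled from the prefill pass, as many inference engines do.

```python
from dataclasses import dataclass, field

EOS = 0  # hypothetical end-of-sequence token id


@dataclass
class KVCache:
    """Stand-in for per-layer key/value tensors: one entry per processed token."""
    entries: list = field(default_factory=list)


def toy_forward(token_ids, cache):
    """Hypothetical stand-in for a transformer forward pass.

    Appends one cache entry per input token, then derives the next token id
    from the whole cache, mimicking attention over all earlier tokens.
    """
    for t in token_ids:
        cache.entries.append(t)  # a real model would store K and V tensors here
    return sum(cache.entries) % 5  # toy replacement for "logits -> argmax"


def generate(prompt_ids, max_new_tokens):
    """Greedy generation; assumes max_new_tokens >= 1."""
    cache = KVCache()
    # Prefill: the whole prompt is processed in one call.
    next_id = toy_forward(prompt_ids, cache)
    output = []
    forward_calls = 1
    while True:
        output.append(next_id)
        # Stop conditions: end-of-sequence token or output length limit.
        if next_id == EOS or len(output) >= max_new_tokens:
            break
        # Decode: one call per new token, feeding only the newest token.
        next_id = toy_forward([next_id], cache)
        forward_calls += 1
    return output, forward_calls, len(cache.entries)


tokens, calls, cache_len = generate([1, 2, 3], max_new_tokens=10)
print(tokens, calls, cache_len)
```

Each decode call receives a single token yet depends on everything already in the cache, which is why the steps of one request cannot run in parallel.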

Key Characteristics

  • Generates one output token at a time for each active request
  • Reuses KV cache from prefill and appends new cache entries
  • Often determines streaming speed and tokens per second
  • Memory-bandwidth-bound in many serving workloads because each step reads the model weights and the cached keys and values to produce one token per request
  • Optimized by batching, speculative decoding, and efficient cache management
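
To make the memory-bound characteristic concrete, here is a rough back-of-the-envelope sketch. It assumes a decode step is limited only by memory bandwidth and ignores compute, kernel overhead, and communication, so it gives a floor rather than a prediction. The function name and all numbers are illustrative assumptions, not measurements of any particular model or GPU.

```python
def decode_step_floor_seconds(weight_bytes, kv_bytes_per_token, context_lens,
                              bandwidth_bytes_per_s):
    """Bandwidth-only lower bound for one decode step over a batch.

    Weights are read once per step and shared by the batch; each request
    additionally reads its own cached keys and values.
    """
    kv_read = kv_bytes_per_token * sum(context_lens)
    return (weight_bytes + kv_read) / bandwidth_bytes_per_s


# Hypothetical numbers for illustration only.
WEIGHT_BYTES = 16e9          # roughly 8B parameters stored in 16-bit precision
KV_BYTES_PER_TOKEN = 131072  # 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 2 bytes
BANDWIDTH = 2e12             # 2 TB/s of memory bandwidth

single = decode_step_floor_seconds(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, [4096], BANDWIDTH)
batched = decode_step_floor_seconds(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, [4096] * 16, BANDWIDTH)

print(f"batch 1:  {single * 1e3:.2f} ms/step, {1 / single:.0f} tokens/s in total")
print(f"batch 16: {batched * 1e3:.2f} ms/step, {16 / batched:.0f} tokens/s in total")
```

Under these assumptions, batching raises aggregate throughput because the weight reads are shared, while the KV cache reads grow with both batch size and context length.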

Common Use Cases

  1. Measuring output generation speed after the first token appears
  2. Debugging slow streaming responses from an LLM service
  3. Sizing GPU memory for long outputs and many concurrent users
  4. Evaluating speculative decoding and continuous batching improvements
  5. Separating prompt-processing latency from generation latency
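
For the memory-sizing use case, a minimal sketch of the usual KV cache arithmetic follows. It assumes standard multi-head or grouped-query attention with one key and one value vector per layer and KV head, and it assumes every request reserves cache for its full token budget. Paged or quantized caches behave differently. The model shape and memory figures are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added by one token (factor 2 = key plus value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(free_bytes, tokens_per_request, per_token_bytes):
    """How many requests fit if each reserves its full token budget."""
    return free_bytes // (tokens_per_request * per_token_bytes)


# Hypothetical model shape and memory budget.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
free_for_cache = 40 * 2**30   # 40 GiB left after weights and activations
budget = 4096                 # prompt tokens plus output tokens per request

print(per_token, "bytes of KV cache per token")
print(max_concurrent_requests(free_for_cache, budget, per_token), "concurrent requests")
```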

Example
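
The sketch below is one way to separate prompt-processing latency from decode speed on the client side. `fake_token_stream` is a hypothetical stand-in for a real streaming client, and its delays are made-up values. Only the timing logic is the point.

```python
import time


def fake_token_stream(num_tokens, prefill_s=0.05, step_s=0.01):
    """Hypothetical stand-in for a streaming LLM client."""
    time.sleep(prefill_s)        # prompt processing before the first token
    for i in range(num_tokens):
        if i > 0:
            time.sleep(step_s)   # one decode step per further token
        yield f"tok{i}"


def record_arrivals(stream):
    """Return the arrival time of each token, in seconds since the request started."""
    start = time.perf_counter()
    arrivals = []
    for _token in stream:
        arrivals.append(time.perf_counter() - start)
    return arrivals


def latency_metrics(arrivals):
    """Split arrival times into time to first token and time per output token."""
    if not arrivals:
        raise ValueError("stream produced no tokens")
    n = len(arrivals)
    ttft = arrivals[0]
    # Mean gap between consecutive tokens; undefined for a single token.
    tpot = (arrivals[-1] - ttft) / (n - 1) if n > 1 else None
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": arrivals[-1], "tokens": n}


metrics = latency_metrics(record_arrivals(fake_token_stream(20)))
print(metrics)
```

With a real service, replace `fake_token_stream` with the client's token iterator. A high `ttft_s` points at queueing or prefill, while a high `tpot_s` points at the decode phase.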

Frequently Asked Questions

Why is decoding slower than prefill per token?

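Prefill processes the prompt tokens in parallel, while decoding runs one forward pass per generated token because each token depends on the previous ones. Every pass rereads the model weights and the KV cache to produce a single token per request, so the hardware does less useful work per byte of memory traffic.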

Does decode phase affect TTFT?

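Only marginally. TTFT is dominated by queueing and prefill. Depending on the serving stack, the first token is sampled at the end of prefill or in the first decode step. Decode latency mostly affects inter-token latency, streaming speed, and total response time.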

What improves decode performance?

Continuous batching, efficient KV cache memory management, optimized attention kernels, and speculative decoding can all help.
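
As a rough illustration of why continuous batching helps, the following toy simulation counts decode steps for the same workload under static and continuous batching. It assumes every decode step costs the same regardless of batch occupancy, ignores prefill, and requires every output length to be at least 1. The function names and workload are hypothetical.

```python
def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = list(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into free slots before each decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One decode step produces one token for every running request.
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lens = [2, 8, 3, 8]  # hypothetical output lengths in tokens
print("static:    ", static_batching_steps(lens, batch_size=2), "steps")
print("continuous:", continuous_batching_steps(lens, batch_size=2), "steps")
```

In the static case, short requests leave their slot idle until the longest request in the batch finishes. Continuous batching refills the slot immediately, so the same tokens are produced in fewer steps.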

Why does output length matter for decoding?

Each additional output token requires another decode step, so longer answers increase total generation latency.
