What is Prefill?
Prefill is the LLM inference phase that processes the full input prompt in parallel and produces the initial key-value cache before token-by-token decoding begins.
How It Works
Prefill is the first major compute phase in autoregressive LLM inference. During prefill, the model reads all input tokens, computes attention over the prompt, and stores key-value states that the decode phase can reuse. Prefill cost grows with input length, so long prompts, large retrieved contexts, and long conversation histories all make it expensive; in standard transformers the attention portion grows roughly quadratically with the number of prompt tokens. This is why prompt length, context caching, RAG chunk selection, and batching strategy can dominate time to first token even when the generated output is short.
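The split between prefill and decode can be illustrated with a toy single-head attention layer in plain Python. This is a minimal sketch, not how any particular serving engine is implemented: the function names (`prefill`, `decode_step`, `attend`, `matvec`), the random weights, and the tiny dimensions are all assumptions made for illustration.

```python
import math
import random

def matvec(x, w):
    """Multiply row vector x (length d) by matrix w (d x d)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def prefill(prompt, wq, wk, wv):
    """Process every prompt token; return per-token outputs and the KV cache."""
    keys = [matvec(x, wk) for x in prompt]
    values = [matvec(x, wv) for x in prompt]
    # Causal attention: token i sees keys/values 0..i. A real engine computes
    # all positions in batched matrix multiplies rather than a Python loop.
    outputs = [attend(matvec(x, wq), keys[:i + 1], values[:i + 1])
               for i, x in enumerate(prompt)]
    return outputs, (keys, values)

def decode_step(x_new, cache, wq, wk, wv):
    """Process one new token, reusing the cached keys and values."""
    keys, values = cache
    keys = keys + [matvec(x_new, wk)]
    values = values + [matvec(x_new, wv)]
    return attend(matvec(x_new, wq), keys, values), (keys, values)

# Hypothetical toy setup: 4-dimensional embeddings, 5 prompt tokens.
random.seed(0)
d = 4
def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

wq, wk, wv = rand_matrix(), rand_matrix(), rand_matrix()
prompt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
new_token = [random.gauss(0, 1) for _ in range(d)]

_, cache = prefill(prompt, wq, wk, wv)                    # one pass over the prompt
step_out, cache = decode_step(new_token, cache, wq, wk, wv)

# Reference: recompute everything from scratch over all 6 tokens.
full_out, _ = prefill(prompt + [new_token], wq, wk, wv)
matches = all(math.isclose(a, b, abs_tol=1e-9)
              for a, b in zip(step_out, full_out[-1]))
```

The decode step only computes a query, key, and value for the new token, yet its output should agree with the last row of a full recomputation. That reuse is what the KV cache built during prefill buys.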
Key Characteristics
- Processes the entire input prompt before the first generated token
- Builds the KV cache reused by later decoding steps
- Highly sensitive to input token count and context length
- Often the main contributor to time to first token (TTFT) for long prompts
- Can be optimized with context caching, prompt trimming, and efficient batching
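To make the sensitivity to input length concrete, the following back-of-envelope sketch estimates prefill compute and KV cache size. It assumes a dense decoder-only transformer, uses the common approximation of about 2 FLOPs per parameter per token for the forward pass, and counts attention score and value products without any savings from causal masking. The function names and the model shape in the usage lines are hypothetical, not taken from any specific model.

```python
def prefill_flops(n_tokens, n_params, n_layers, d_model):
    """Rough forward-pass FLOPs for prefilling n_tokens (an estimate only)."""
    linear = 2 * n_params * n_tokens                     # weight matmuls: linear in tokens
    attention = 4 * n_layers * n_tokens ** 2 * d_model   # QK^T and AV: quadratic in tokens
    return linear + attention

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_params=7_000_000_000, n_layers=32, d_model=4096)
short = prefill_flops(1_000, **shape)
long_ = prefill_flops(32_000, **shape)
growth = long_ / short   # exceeds 32 because of the quadratic attention term

cache_mib = kv_cache_bytes(1_000, n_layers=32, n_kv_heads=8, head_dim=128) / 2**20
```

Under these assumptions, a 32x longer prompt costs more than 32x the compute, while the KV cache grows linearly with the number of tokens.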
Common Use Cases
- Diagnosing high time to first token for long-context prompts
- Estimating serving cost for RAG requests with many retrieved chunks
- Applying context caching for repeated system prompts or shared prefixes
- Designing prompt budgets for chat applications
- Separating input-processing latency from output-generation latency
Example
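One way to separate input-processing latency from output-generation latency is to timestamp a token stream: the wait before the first token approximates prefill (plus any queueing and network time), and the gaps between later tokens approximate decode. The sketch below is an illustration under stated assumptions. `split_latency`, `FakeClock`, and `fake_stream` are hypothetical names, and the simulated stream stands in for whatever streaming client is actually in use.

```python
import time

def split_latency(stream, clock=time.perf_counter):
    """Consume a token stream and report TTFT and mean time per later token."""
    start = clock()
    first = last = None
    count = 0
    for _ in stream:
        last = clock()
        if first is None:
            first = last
        count += 1
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "per_token_s": None}
    per_token = (last - first) / (count - 1) if count > 1 else None
    return {"tokens": count, "ttft_s": first - start, "per_token_s": per_token}

class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""
    def __init__(self):
        self.t = 0.0
    def __call__(self):
        return self.t
    def advance(self, seconds):
        self.t += seconds

def fake_stream(clock, prefill_s, step_s, n_tokens):
    """Simulated server: one long wait for prefill, then short decode steps."""
    clock.advance(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            clock.advance(step_s)
        yield f"tok{i}"

clock = FakeClock()
report = split_latency(fake_stream(clock, prefill_s=0.5, step_s=0.02, n_tokens=11),
                       clock=clock)
print(report)
```

With a real client, pass the streaming iterator directly and leave `clock` at its default. A TTFT that rises with prompt length while the per-token time stays flat points at prefill rather than decode.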
Frequently Asked Questions
Why does prefill affect time to first token?
The model must process the input prompt and build reusable attention state before it can emit the first output token.
Is prefill parallelizable?
Largely, yes. All prompt tokens can be processed together in one forward pass, whereas decoding produces one token at a time. Prefill can still be expensive for long contexts.
How can prefill cost be reduced?
Reduce unnecessary context, use better RAG selection, cache shared prefixes, and choose serving engines optimized for long prompts.
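Caching shared prefixes can be sketched as a lookup for the longest already-processed prefix of a request, so that only the remaining tokens need prefill. The class below is a simplified illustration, not the design of any specific engine: `PrefixCache` and its methods are hypothetical, and it records only which block-aligned prefixes have been seen, where a real system would store the corresponding KV tensors.

```python
class PrefixCache:
    """Tracks block-aligned token prefixes whose KV state is assumed cached."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._prefixes = set()

    def _aligned_lengths(self, n):
        # Prefix lengths that end on a block boundary: block_size, 2*block_size, ...
        return range(self.block_size, n + 1, self.block_size)

    def insert(self, token_ids):
        """Record every block-aligned prefix of a processed prompt."""
        for n in self._aligned_lengths(len(token_ids)):
            self._prefixes.add(tuple(token_ids[:n]))

    def cached_prefix_len(self, token_ids):
        """Length of the longest block-aligned prefix already in the cache."""
        best = 0
        for n in self._aligned_lengths(len(token_ids)):
            if tuple(token_ids[:n]) not in self._prefixes:
                break
            best = n
        return best

# Hypothetical token IDs: a 10-token system prompt shared by two requests.
system_prompt = list(range(10))
cache = PrefixCache(block_size=4)
cache.insert(system_prompt + [100, 101])          # first request is fully prefilled

request = system_prompt + [200, 201, 202]         # second request shares the prefix
reused = cache.cached_prefix_len(request)
to_prefill = len(request) - reused
```

In this sketch the second request reuses 8 of its 13 tokens. The last two system-prompt tokens fall in a block that also contains request-specific tokens, which is why aligning shared prefixes to block boundaries matters in block-based caches.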
Is prefill the same as prompt tokenization?
No. Tokenization converts text to token IDs; prefill is the model forward pass over those input tokens.