What is Context Caching?
Context Caching is the practice of reusing repeated prompt context or computed model state so an LLM service does not recompute the same input tokens for every request.
How It Works
Context caching is valuable when many requests share a long prefix, such as a system prompt, policy document, tool specification, few-shot examples, or fixed RAG context. Instead of paying the full prefill cost every time, the serving layer can reuse cached tokenization, prompt processing, or key-value (KV) cache state, depending on the platform. This can lower time to first token (TTFT) and cost for long prompts, but a cache hit is only correct when the prefix matches exactly, the model version and tokenizer are unchanged, the requester is permitted to see the cached content, and stale entries have been invalidated. A stale or improperly shared cache creates privacy and correctness risks.
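One way to picture the exact-match requirement is a cache key that covers every input the cached state depends on. The sketch below is illustrative only: the class name `PrefixCacheKey` and its fields are assumptions, not the API of any particular serving platform.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrefixCacheKey:
    """Hypothetical cache key; the field names are illustrative assumptions."""
    model_version: str
    tokenizer_version: str
    permission_scope: str  # e.g. "public" or a tenant identifier
    prefix_token_ids: Tuple[int, ...]

    def digest(self) -> str:
        # Serialize every field deterministically, then hash the result.
        # Changing any single field produces a different key, so state
        # built for one model, tokenizer, or scope is never looked up
        # under another.
        payload = json.dumps(
            [
                self.model_version,
                self.tokenizer_version,
                self.permission_scope,
                list(self.prefix_token_ids),
            ],
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Putting the permission scope inside the key, rather than checking it after lookup, means a request from a different scope simply misses the cache instead of relying on a separate authorization step.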
Key Characteristics
- Reuses repeated context to reduce redundant prefill work
- Most useful for long shared prefixes and high request reuse
- Can lower TTFT and serving cost and raise throughput
- Requires strict cache keys for model, tokenizer, prompt prefix, and permissions
- Needs invalidation when source content, policy, or model version changes
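The invalidation requirement can be sketched as a cache that remembers which source each entry was built from, so that a change to a policy or document drops every dependent entry. This is a minimal in-memory illustration; the class `InvalidatingCache` and the source identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Optional, Set


class InvalidatingCache:
    """Toy cache that tracks which sources each entry depends on."""

    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}
        # source_id -> keys of entries that were built from that source
        self._by_source: Dict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, state: Any, sources: Iterable[str]) -> None:
        self._entries[key] = state
        for source_id in sources:
            self._by_source[source_id].add(key)

    def get(self, key: str) -> Optional[Any]:
        return self._entries.get(key)

    def invalidate_source(self, source_id: str) -> int:
        """Drop every entry built from source_id; return how many were removed."""
        removed = 0
        for key in self._by_source.pop(source_id, set()):
            if key in self._entries:
                del self._entries[key]
                removed += 1
        return removed
```

A production system would more likely combine this with versioned keys and expiry times; the sketch only shows the dependency-tracking idea.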
Common Use Cases
- Caching long system prompts shared by all users
- Reusing tool schemas or policy text across agent requests
- Reducing RAG latency when a document set is repeatedly queried
- Improving TTFT for assistants with stable instruction blocks
- Separating public cacheable context from user-private context
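The last use case, keeping public context apart from private context, mostly comes down to prompt layout: stable, non-sensitive blocks go first, and anything user-specific comes after them. The sketch below assumes a simple text prompt; `PromptParts` and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptParts:
    shared_prefix: str   # identical for every user; a candidate for shared caching
    private_suffix: str  # user-specific; kept out of any shared cache


def build_prompt(
    system_prompt: str,
    tool_schemas: List[str],
    user_profile: str,
    user_message: str,
) -> PromptParts:
    # Stable blocks go first, in a fixed order, so the prefix is
    # identical across users and can match a cached entry exactly.
    shared = "\n\n".join([system_prompt, *tool_schemas])
    # User-specific content follows the shared prefix so it never
    # changes the prefix that the cache is keyed on.
    private = "\n\n".join([user_profile, user_message])
    return PromptParts(shared_prefix=shared, private_suffix=private)
```

If user data were interleaved with the system prompt instead, every request would have a different prefix and the shared entry would never be hit.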
Example
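The following is a toy simulation of prefix reuse, not a real serving implementation. It splits token IDs into fixed-size blocks, hashes each block together with everything before it, and reuses the longest run of blocks already seen. The block size, the class `ToyPrefixCache`, and the placeholder standing in for KV state are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple

BLOCK_SIZE = 4  # tokens per cached block; an arbitrary illustrative value


def block_hashes(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block chained with the hash of all preceding blocks."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        text = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(text.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    def __init__(self) -> None:
        # block hash -> stand-in for the KV state of that block
        self._blocks: Dict[str, str] = {}

    def prefill(self, token_ids: Sequence[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for one request."""
        hashes = block_hashes(token_ids)
        reused_blocks = 0
        for h in hashes:
            if h not in self._blocks:
                break  # reuse stops at the first block that differs
            reused_blocks += 1
        # "Compute" and store the remaining full blocks.
        for h in hashes[reused_blocks:]:
            self._blocks[h] = "kv-state-placeholder"
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused
```

With a 12-token shared prefix such as `list(range(100, 112))`, a first call `prefill(shared + [1, 2, 3])` returns `(0, 15)` because nothing is cached yet, and a second call `prefill(shared + [7, 8, 9, 10, 11])` returns `(12, 5)`: the three shared blocks are reused and only the new suffix is computed.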
Frequently Asked Questions
What does context caching save?
It can save repeated tokenization, prefill computation, or KV cache construction depending on the serving platform.
When does context caching help most?
It helps when many requests share a long stable prefix and only a smaller suffix changes per user request.
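A rough way to see why prefix length matters is to estimate the share of prefill tokens that a cache can skip. The function below is a back-of-the-envelope sketch that assumes prefill cost is proportional to token count and that a hit reuses the entire prefix; real costs depend on the model and platform, and the function name is hypothetical.

```python
def prefill_tokens_saved_fraction(
    prefix_tokens: int, suffix_tokens: int, hit_rate: float
) -> float:
    """Estimate the fraction of prefill tokens skipped, averaged over requests."""
    total = prefix_tokens + suffix_tokens
    if prefix_tokens < 0 or suffix_tokens < 0 or total == 0:
        raise ValueError("token counts must be non-negative and not both zero")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # On a hit, only the suffix is processed; on a miss, everything is.
    return hit_rate * prefix_tokens / total
```

Under these assumptions, an 8,000-token prefix with a 200-token suffix and a 90% hit rate skips about 88% of prefill tokens, while a 200-token prefix with an 8,000-token suffix skips only about 2%.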
Can context caching leak data?
Yes. Cache keys and permission boundaries must prevent private context from being reused across unauthorized users.
Why does model version matter for context caching?
Cached model state is only valid for the same model, tokenizer, and compatible serving configuration.