What is Latency?
Latency is the elapsed time between sending a request and reaching a response milestone in an AI system, such as the first token, the final token, or a completed tool result.
How It Works
Latency is the user's waiting time, but in LLM systems it has several layers. A request may spend time in routing, admission control, queueing, tokenization, prefill, decoding, safety checks, tool calls, retrieval, and network transfer. Reporting only an average hides tail behavior; p95 and p99 latency (the values that 95% and 99% of requests complete within) often matter more for product reliability. Latency must also be interpreted alongside throughput, because heavy batching can increase an individual request's wait time even as overall system capacity improves.
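The layered view above can be made concrete by timing each stage separately. The following is a minimal sketch, not a production tracer: the StageTimer class, the handle_request function, and the three stage names are hypothetical, and the sleeps stand in for real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates elapsed wall-clock seconds per named stage."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())


def handle_request(timer):
    # Hypothetical stages; each sleep is a placeholder for real work.
    with timer.stage("queueing"):
        time.sleep(0.01)
    with timer.stage("prefill"):
        time.sleep(0.02)
    with timer.stage("decoding"):
        time.sleep(0.03)


timer = StageTimer()
handle_request(timer)
for name, seconds in timer.durations.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
print(f"total: {timer.total() * 1000:.1f} ms")
```

A per-stage breakdown like this shows whether a slow request was waiting in a queue or spending time in the model, which a single end-to-end number cannot.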
Key Characteristics
- Measures elapsed time for a request, stream milestone, or completed operation
- Includes model compute and non-model overhead when measured end to end
- Should be reported with percentiles such as p50, p95, and p99
- Affected by prompt length, output length, batching, queueing, hardware, and network
- Trades off against throughput, cost, and sometimes answer quality
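The trade-off between latency and throughput can be illustrated with a toy batching model. This is a sketch under simplifying assumptions that do not come from any real serving system: requests arrive at a fixed interval, a batch runs only once it is full, and a batch costs a fixed overhead plus a per-request cost. The function name batch_stats and all numbers are hypothetical.

```python
def batch_stats(batch_size, arrival_interval_s, fixed_cost_s, per_request_cost_s):
    """Return (worst-case latency, saturated capacity) for a toy batching model."""
    # The first request in a batch waits for the remaining ones to arrive.
    fill_wait_s = (batch_size - 1) * arrival_interval_s
    # The whole batch is processed as one unit.
    service_time_s = fixed_cost_s + batch_size * per_request_cost_s
    worst_latency_s = fill_wait_s + service_time_s
    capacity_rps = batch_size / service_time_s
    return worst_latency_s, capacity_rps


results = {}
for size in (1, 4, 16):
    results[size] = batch_stats(size, 0.010, 0.100, 0.005)
    latency_s, capacity_rps = results[size]
    print(
        f"batch={size:2d}  worst latency={latency_s * 1000:.0f} ms  "
        f"capacity={capacity_rps:.1f} req/s"
    )
```

In this model, larger batches raise capacity because the fixed overhead is shared, while the earliest request in each batch waits longer.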
Common Use Cases
- Tracking end-to-end response time for an AI assistant
- Separating time to first token (TTFT) from total completion latency
- Monitoring p95 and p99 regressions after a deployment
- Setting SLOs for retrieval, tool calls, and model generation
- Comparing serving configurations under realistic load
Example
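A minimal sketch of measuring two latency milestones, first token and final token, on a streamed response. The fake_token_stream generator is a hypothetical stand-in for a streaming model client, not a real API; its delays are arbitrary.

```python
import time


def fake_token_stream(tokens, first_delay_s, per_token_delay_s):
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay_s)  # simulates queueing and prefill
    for i, token in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay_s)  # simulates decoding
        yield token


def measure_stream(stream):
    """Consume a token stream and record first-token and total latency."""
    start = time.perf_counter()
    ttft_s = None
    count = 0
    for _ in stream:
        if ttft_s is None:
            ttft_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    return {"ttft_s": ttft_s, "total_s": total_s, "tokens": count}


tokens = ["Latency", " is", " elapsed", " time", "."]
result = measure_stream(fake_token_stream(tokens, 0.03, 0.005))
print(f"first token after {result['ttft_s'] * 1000:.1f} ms")
print(f"completed after {result['total_s'] * 1000:.1f} ms ({result['tokens']} tokens)")
```

The timer starts before the stream is consumed, so the first-token figure includes whatever happens ahead of generation.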
Frequently Asked Questions
Why are p95 and p99 latency important?
They show tail behavior. A service with good average latency can still feel unreliable if a small but steady share of requests are slow outliers.
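A short sketch of how an average can hide the tail. The sample is synthetic (94 fast requests and 6 slow ones), and the percentile helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: mostly fast, a few slow outliers.
latencies_ms = [200] * 94 + [4000] * 6

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms  p50={p50_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
# prints: mean=428 ms  p50=200 ms  p95=4000 ms  p99=4000 ms
```

The mean and median both look acceptable here, while p95 and p99 reveal that some requests take twenty times longer than the typical one.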
Is latency the same as TTFT?
No. TTFT is one latency milestone. Total latency measures when the whole response or operation completes.
What increases LLM latency?
Long prompts, long outputs, queueing, tool calls, retrieval, cold starts, large models, and inefficient batching can all increase latency.
How should latency be optimized?
Measure the full breakdown first, then reduce unnecessary context, tune batching, cache repeated work, optimize routing, and set clear SLOs.
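As one illustration of caching repeated work, the sketch below memoizes a slow retrieval step with Python's functools.lru_cache. The retrieve_context function and its 50 ms sleep are hypothetical placeholders for a real search call, and an in-process cache like this only suits results that are safe to reuse.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=256)
def retrieve_context(query):
    """Hypothetical slow retrieval step; the sleep stands in for a search call."""
    time.sleep(0.05)
    return f"documents for: {query}"


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


cold_result, cold_s = timed(retrieve_context, "refund policy")
warm_result, warm_s = timed(retrieve_context, "refund policy")

print(f"cold call: {cold_s * 1000:.1f} ms")
print(f"warm call: {warm_s * 1000:.3f} ms")
```

Timing the cold and warm calls separately keeps the measurement honest: the saving only applies to requests that actually repeat an earlier query.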