What is Latency?

Latency is the elapsed time between sending a request and reaching a response or milestone in an AI system, such as the first token, the final token, or a completed tool result.

How It Works

Latency is what the user experiences as waiting time, but in LLM systems it is made up of several stages. A request may spend time in routing, admission control, queueing, tokenization, prefill, decoding, safety checks, tool calls, retrieval, and network transfer. Reporting only an average hides tail behavior; p95 and p99 latency often matter more for product reliability. Latency must be interpreted alongside throughput, because heavy batching can increase each request's wait time even as overall system capacity improves.
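
The batching trade-off can be illustrated with a toy cost model. The sketch below assumes that a batch costs a fixed overhead plus a per-request cost, and that every request in a batch finishes when the batch finishes. The function name `batch_stats` and all numbers are hypothetical, chosen only to show the shape of the trade-off, not measured from any real serving stack.

```python
def batch_stats(batch_size: int, fixed_ms: float = 40.0, per_item_ms: float = 5.0):
    """Toy model: one batch costs a fixed overhead plus a per-request cost."""
    batch_time_ms = fixed_ms + per_item_ms * batch_size
    # Assumption: every request in the batch completes when the batch completes.
    latency_ms = batch_time_ms
    throughput_rps = batch_size / (batch_time_ms / 1000.0)
    return latency_ms, throughput_rps


for size in (1, 8, 32):
    latency_ms, throughput_rps = batch_stats(size)
    print(f"batch={size:>2}  latency={latency_ms:.0f} ms  throughput={throughput_rps:.1f} req/s")
```

Under these assumed costs, larger batches raise both numbers: the system serves more requests per second, while each individual request waits longer.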

Key Characteristics

Measures elapsed time for a request, stream milestone, or completed operation
Includes model compute and non-model overhead when measured end to end
Should be reported with percentiles such as p50, p95, and p99
Affected by prompt length, output length, batching, queueing, hardware, and network
Trades off against throughput, cost, and sometimes answer quality
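
To make the percentile point concrete, here is a minimal sketch that computes p50, p95, and p99 with the nearest-rank method and compares them with the mean. The `percentile` helper and the sample latencies are illustrative assumptions; production monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 94 fast requests and 6 slow outliers.
latencies_ms = [200] * 94 + [4000] * 6

print("mean:", statistics.mean(latencies_ms))
for pct in (50, 95, 99):
    print(f"p{pct}:", percentile(latencies_ms, pct))
```

In this made-up sample the mean and p50 both look acceptable, while p95 and p99 land on the slow outliers.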

Common Use Cases

Tracking end-to-end response time for an AI assistant
Separating TTFT (time to first token) from total completion latency
Monitoring p95 and p99 regressions after a deployment
Setting SLOs for retrieval, tool calls, and model generation
Comparing serving configurations under realistic load

Example

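
The following is a minimal sketch of measuring two latency milestones, TTFT and total completion time, for a streamed response. `fake_token_stream` is a hypothetical stand-in for a streaming model client, with sleeps simulating prefill and decode; `measure_stream` is likewise an illustrative helper, not part of any library.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model client."""
    time.sleep(prefill_s)  # simulated queueing and prefill before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated decode step between tokens
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> dict:
    """Record time to first token (TTFT) and total latency for one stream."""
    start = time.perf_counter()
    ttft_s = None
    n_tokens = 0
    for _ in stream:
        if ttft_s is None:
            ttft_s = time.perf_counter() - start  # first token arrived
        n_tokens += 1
    total_s = time.perf_counter() - start  # stream fully consumed
    return {"ttft_s": ttft_s, "total_s": total_s, "tokens": n_tokens}


result = measure_stream(fake_token_stream(n_tokens=5, prefill_s=0.05, per_token_s=0.01))
print(result)
```

With a real client, the same pattern applies: start the clock before sending the request, record the first chunk's arrival, and stop the clock when the stream ends.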

Frequently Asked Questions

Why are p95 and p99 latency important?

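They show tail behavior. p95 is the latency that 95% of requests stay under, so roughly 1 request in 20 is slower than it; for p99 it is 1 in 100. A service with good average latency can still feel unreliable if those slow outliers are far worse than the typical request.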

Is latency the same as TTFT?

No. TTFT measures only the time until the first output token arrives, which makes it one latency milestone. Total latency measures when the whole response or operation completes.

What increases LLM latency?

Long prompts, long outputs, queueing, tool calls, retrieval, cold starts, large models, and inefficient batching can all increase latency.

How should latency be optimized?

Measure the full breakdown first, then reduce unnecessary context, tune batching, cache repeated work, optimize routing, and set clear SLOs.
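
As one way to measure the breakdown, the sketch below times named stages of a single request with a context manager and reports each stage's share of the total. `StageTimer`, the stage names, and the sleeps standing in for real work are all hypothetical; a production system would more likely rely on a tracing or metrics library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of one request."""

    def __init__(self):
        self.totals_s = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals_s[name] += time.perf_counter() - start

    def report(self):
        """Return (stage, seconds, share of total) tuples, slowest first."""
        total = sum(self.totals_s.values()) or 1.0
        rows = sorted(self.totals_s.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, secs, secs / total) for name, secs in rows]


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.02)  # placeholder for a retrieval call
with timer.stage("generation"):
    time.sleep(0.05)  # placeholder for model prefill and decode
with timer.stage("safety_check"):
    time.sleep(0.01)  # placeholder for a post-generation check

for name, secs, share in timer.report():
    print(f"{name:<14}{secs * 1000:7.1f} ms  {share:5.1%}")
```

Sorting by time spent puts the largest contributor first, which indicates where optimization effort is likely to pay off.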

Related Terms

TTFT

Throughput

Prefill

Decode Phase

Related Articles

Voice AI Engineering [2026]: Low-Latency Agent Design

LLM Inference Complete Guide [2026]: From Tokenization and KV Cache to Text Generation