What is Tokens per Second?
Tokens per Second is a throughput metric that measures how many output tokens an LLM generates per second during the decode phase.
How It Works
Tokens per second, often abbreviated TPS, is the most visible speed metric once a response starts streaming. It captures generation speed during decoding, and it should not be confused with time to first token (TTFT) or with total system throughput. A model can have excellent tokens per second yet deliver a poor user experience if TTFT is high; a service can also show high aggregate throughput while each individual stream feels slow. Accurate reporting should specify whether TPS is measured per request, per GPU, per batch, or across the whole service.
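The gap between per-stream and aggregate TPS can be made concrete with a minimal Python sketch. The `StreamStats` record and both helper functions are hypothetical names introduced for illustration, and the numbers are made up rather than measured.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    """Decode statistics for one request (hypothetical record, for illustration)."""
    output_tokens: int      # tokens generated during decode
    decode_seconds: float   # time this request spent decoding


def per_stream_tps(s: StreamStats) -> float:
    """Speed a single user sees once text is streaming."""
    return s.output_tokens / s.decode_seconds


def aggregate_tps(streams: list[StreamStats], window_seconds: float) -> float:
    """Total output tokens across all streams divided by the wall-clock window."""
    return sum(s.output_tokens for s in streams) / window_seconds


# Illustrative numbers: 8 concurrent streams, each producing 200 tokens in 10 s.
streams = [StreamStats(output_tokens=200, decode_seconds=10.0) for _ in range(8)]
print(per_stream_tps(streams[0]))    # 20.0 tokens/s seen by each user
print(aggregate_tps(streams, 10.0))  # 160.0 tokens/s across the service
```

Both figures describe the same run, which is why a report that says only "160 tokens per second" is ambiguous without the scope.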
Key Characteristics
- Measures output generation speed during decoding
- Can be reported per request, per user stream, per GPU, or service-wide
- Affected by model size, hardware, quantization, batching, and output length
- Does not include the full cost of prefill or queueing unless explicitly stated
- Should be interpreted together with TTFT, end-to-end latency, and aggregate throughput
Common Use Cases
- Comparing model serving engines under the same workload
- Estimating how fast users will see streamed text
- Benchmarking quantized models against full-precision models
- Tracking decode performance after enabling continuous batching
- Monitoring production regressions in generation speed
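For the streamed-text use case, a rough estimate of when the last token arrives can be sketched as below. This assumes the first token arrives at TTFT and the remaining tokens arrive at a constant decode rate, which real systems only approximate; the function name and the input values are illustrative.

```python
def estimated_response_seconds(ttft_seconds: float,
                               output_tokens: int,
                               decode_tps: float) -> float:
    """Rough time until the last token arrives.

    Assumption: the first token lands at TTFT and every later token
    arrives at a constant decode rate.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_seconds + (output_tokens - 1) / decode_tps


# Illustrative: 0.5 s TTFT, 301 output tokens, 30 tokens/s decode speed.
print(estimated_response_seconds(0.5, 301, 30.0))  # 10.5
```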
Example
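A minimal sketch of how TTFT, decode-only TPS, and end-to-end TPS could be derived from token arrival timestamps. The function name, the returned keys, and the timestamps are assumptions made for illustration; the decode-only formula (tokens after the first, divided by the time between first and last token) is one common convention, not the only one.

```python
def summarize_stream(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, decode TPS, and end-to-end TPS from token arrival times.

    All times are seconds on the same clock (for example time.perf_counter()).
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    return {
        "ttft_s": first - request_start,
        # Decode-only: tokens after the first, over the time from first to last token.
        "decode_tps": (n - 1) / (last - first),
        # End-to-end: all tokens over the full request time, including queueing and prefill.
        "end_to_end_tps": n / (last - request_start),
    }


# Illustrative timestamps: first token at 2.0 s, then one token every 0.05 s.
times = [2.0 + 0.05 * i for i in range(101)]
stats = summarize_stream(0.0, times)
print({k: round(v, 2) for k, v in stats.items()})
# {'ttft_s': 2.0, 'decode_tps': 20.0, 'end_to_end_tps': 14.43}
```

In this made-up run the decode rate is 20 tokens per second, but the 2-second wait for the first token pulls the end-to-end rate down to about 14 tokens per second.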
Frequently Asked Questions
Is higher tokens per second always better?
Not by itself. Users also care about TTFT, total latency, answer quality, and whether speed is measured per request or service-wide.
Does TPS include input prompt processing?
Usually no. TPS often measures decode speed only, so prefill and queueing should be reported separately.
Why can TPS vary between requests?
It varies with output length, active batch size, hardware load, KV cache pressure, and sampling settings.
How should TPS benchmarks be reported?
Report model, hardware, batch size, input length, output length, precision, serving engine, and whether TPS is per stream or aggregate.
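One way to keep those fields together is a small record that is stored alongside each result. The sketch below is only an illustration: `TpsReport` and its field names are hypothetical, not a standard schema, and every value is a placeholder rather than a measured result.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TpsReport:
    """Hypothetical benchmark record; field names are illustrative, not a standard."""
    model: str
    hardware: str
    serving_engine: str
    precision: str
    batch_size: int
    input_tokens: int
    output_tokens: int
    tps: float
    tps_scope: str  # "per_stream" or "aggregate"

    def __post_init__(self) -> None:
        # Reject reports that leave the measurement scope ambiguous.
        if self.tps_scope not in ("per_stream", "aggregate"):
            raise ValueError("tps_scope must be 'per_stream' or 'aggregate'")


# Placeholder values only; not a measured result.
report = TpsReport(
    model="example-model",
    hardware="example-gpu x1",
    serving_engine="example-engine",
    precision="fp16",
    batch_size=8,
    input_tokens=512,
    output_tokens=256,
    tps=20.0,
    tps_scope="per_stream",
)
print(json.dumps(asdict(report), indent=2))
```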