What is Tokens per Second?

Tokens per Second is a throughput metric that measures how many output tokens an LLM generates per second during the decode phase.

How It Works

Tokens per second, often abbreviated TPS, is the most visible speed metric once a response starts streaming. It captures generation speed during decoding and should not be confused with time to first token (TTFT) or with total system throughput. A model can post excellent tokens per second and still deliver a poor user experience if TTFT is high; a service can also report high aggregate throughput while each individual stream feels slow. Accurate reporting should state whether TPS is measured per request, per GPU, per batch, or across the whole service.
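The gap between per-stream and aggregate TPS is easiest to see with numbers. The sketch below is illustrative only: `StreamResult`, `per_stream_tps`, and `aggregate_tps` are hypothetical names, and the workload (8 concurrent streams, 200 tokens each, 10 seconds) is an assumed example rather than a measurement.

```python
from dataclasses import dataclass


@dataclass
class StreamResult:
    output_tokens: int      # tokens generated for this request
    decode_seconds: float   # time this request spent decoding


def per_stream_tps(result: StreamResult) -> float:
    """Decode speed as seen by a single user stream."""
    return result.output_tokens / result.decode_seconds


def aggregate_tps(results: list[StreamResult], wall_clock_seconds: float) -> float:
    """Tokens produced by all streams over the shared wall-clock window."""
    return sum(r.output_tokens for r in results) / wall_clock_seconds


# Assumed workload: 8 concurrent streams, each emitting 200 tokens in 10 s.
streams = [StreamResult(output_tokens=200, decode_seconds=10.0) for _ in range(8)]
print(per_stream_tps(streams[0]))    # 20.0
print(aggregate_tps(streams, 10.0))  # 160.0
```

Both figures describe the same run: the service can honestly report 160 tokens per second while each user watches text arrive at 20.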

Key Characteristics

Measures output generation speed during decoding
Can be reported per request, per user stream, per GPU, or service-wide
Affected by model size, hardware, quantization, batching, and output length
Does not include the full cost of prefill or queueing unless explicitly stated
Should be interpreted together with TTFT, latency, and throughput

Common Use Cases

Comparing model serving engines under the same workload
Estimating how fast users will see streamed text
Benchmarking quantized models against full-precision models
Tracking decode performance after enabling continuous batching
Monitoring production regressions in generation speed

Example

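A minimal sketch of measuring decode TPS for one streamed response from token arrival times. It assumes you already record a timestamp for each output token; the function name `ttft_and_decode_tps` and the simulated timestamps are hypothetical and not tied to any particular serving engine or client library.

```python
def ttft_and_decode_tps(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tokens per second) for one streamed response.

    token_times holds the arrival time of each output token, in seconds.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft = token_times[0] - request_start
    # Decode covers the gaps between tokens, so n tokens span n - 1 intervals.
    decode_seconds = token_times[-1] - token_times[0]
    decode_tps = (len(token_times) - 1) / decode_seconds
    return ttft, decode_tps


# Simulated stream: first token 0.5 s after the request, then one token every 25 ms.
start = 100.0
arrivals = [start + 0.5 + 0.025 * i for i in range(201)]
ttft, tps = ttft_and_decode_tps(start, arrivals)
print(f"TTFT: {ttft:.2f} s, decode TPS: {tps:.1f}")  # TTFT: 0.50 s, decode TPS: 40.0
```

In a real client the timestamps would typically come from a monotonic clock such as `time.perf_counter()`, sampled as each streamed chunk arrives. Note that a chunk may carry more than one token, so count tokens rather than chunks.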

Frequently Asked Questions

Is higher tokens per second always better?

Not by itself. Users also care about TTFT, total latency, answer quality, and whether speed is measured per request or service-wide.
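To illustrate, total latency for one request can be approximated as TTFT plus the time to decode the remaining tokens. The sketch below uses that simplification (it ignores variation in decode rate over the response), and both deployments and their numbers are hypothetical.

```python
def estimated_total_latency(ttft_s: float, decode_tps: float, output_tokens: int) -> float:
    """Approximate seconds until the last token arrives for one request."""
    # The first token arrives at TTFT; the remaining tokens arrive at the decode rate.
    return ttft_s + (output_tokens - 1) / decode_tps


# Two hypothetical deployments, each returning 100 output tokens.
fast_start = estimated_total_latency(ttft_s=0.2, decode_tps=40.0, output_tokens=100)
fast_decode = estimated_total_latency(ttft_s=2.0, decode_tps=80.0, output_tokens=100)
print(f"fast start: {fast_start:.3f} s, fast decode: {fast_decode:.3f} s")
```

Under these assumptions the deployment with half the TPS finishes first (roughly 2.7 s against 3.2 s) because its TTFT is much lower. The advantage reverses only for longer outputs.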

Does TPS include input prompt processing?

Usually no. TPS often measures decode speed only, so prefill and queueing should be reported separately.

Why can TPS vary between requests?

It varies with output length, active batch size, hardware load, KV cache pressure, and sampling settings.

How should TPS benchmarks be reported?

Report model, hardware, batch size, input length, output length, precision, serving engine, and whether TPS is per stream or aggregate.
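One way to keep those fields together is a small record type that refuses ambiguous scopes. This is only a sketch: `TpsReport` and its field names are hypothetical, and every value in the example is a placeholder rather than a measurement.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TpsReport:
    model: str
    hardware: str
    serving_engine: str
    precision: str
    batch_size: int
    input_tokens: int
    output_tokens: int
    tps: float
    scope: str  # "per_stream" or "aggregate"

    def __post_init__(self) -> None:
        # Reject reports that do not say which kind of TPS they contain.
        if self.scope not in ("per_stream", "aggregate"):
            raise ValueError("scope must be 'per_stream' or 'aggregate'")


# Placeholder values for illustration only.
report = TpsReport(
    model="example-model",
    hardware="1x example-gpu",
    serving_engine="example-engine",
    precision="fp16",
    batch_size=8,
    input_tokens=512,
    output_tokens=256,
    tps=20.0,
    scope="per_stream",
)
print(asdict(report))
```

Making `scope` a required, validated field means a number cannot be published without stating whether it describes one stream or the whole service.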

Related Terms

Decode Phase

TTFT

Throughput

Latency

Related Articles

LLM Inference Complete Guide [2026]: From Tokenization and KV Cache to Text Generation