What is Speculative Decoding?

Speculative decoding is an LLM inference technique in which a faster draft model proposes several candidate tokens and a larger target model verifies them together in a single forward pass.

How It Works

Speculative decoding targets the sequential bottleneck of autoregressive generation, in which each new token normally requires its own forward pass through the large model. Instead, a smaller or cheaper draft model proposes several tokens ahead, and the target model scores all of them in a single forward pass. The target accepts the longest prefix of the proposal that passes verification (an exact match under greedy decoding, or a probabilistic acceptance test under sampling), discards everything after the first rejected token, and supplies its own token at that position, so every verification pass yields at least one token. When the draft model is accurate enough, this can reduce decode latency without changing the target model's output distribution. The practical gains depend on draft quality, verification overhead, batching, hardware, and sampling settings.

Key Characteristics

  • Uses a draft model or draft mechanism to propose future tokens
  • Lets the target model verify several proposed tokens in a single forward pass, so fewer target passes are needed per generated token
  • Can preserve the target model distribution when implemented correctly
  • Most useful when decode is the bottleneck and draft acceptance is high
  • Adds engineering complexity around sampling, batching, and fallback behavior
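
The distribution-preserving property can be illustrated with the standard accept-or-resample rule used when sampling. The sketch below is a minimal illustration, not a production implementation: `p` and `q` are assumed to be the target and draft probabilities over a tiny vocabulary, and the function names are hypothetical. `output_distribution` computes the exact marginal of the emitted token so the property can be checked without random sampling.

```python
import random
from typing import List


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(p - q, 0): the distribution to resample from after a rejection."""
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p and q are identical, so a rejection cannot occur
        return list(p)
    return [d / total for d in diff]


def verify_token(x: int, p: List[float], q: List[float], rng: random.Random) -> int:
    """Verify draft token x, which is assumed to have been sampled from q.

    Accept x with probability min(1, p[x] / q[x]); otherwise emit a token
    drawn from the residual distribution.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact marginal of the emitted token when the draft token is drawn from q."""
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x]/q[x])
    reject_prob = 1.0 - sum(accept_mass)
    residual = residual_distribution(p, q)
    return [a + reject_prob * r for a, r in zip(accept_mass, residual)]


if __name__ == "__main__":
    target_probs = [0.5, 0.3, 0.2]  # illustrative values only
    draft_probs = [0.2, 0.5, 0.3]
    print(output_distribution(target_probs, draft_probs))
```

For any valid `p` and `q`, the marginal returned by `output_distribution` equals `p` up to floating-point error, which is why a poor draft model costs speed rather than quality.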

Common Use Cases

  1. Reducing latency for high-volume chat completion workloads
  2. Accelerating long-form generation when output length dominates cost
  3. Testing draft models for the same model family or domain
  4. Improving serving efficiency without changing the public target model
  5. Benchmarking decode optimizations beyond continuous batching

Example
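
The following is a minimal sketch of the draft-then-verify loop under greedy decoding. Both "models" are toy functions invented for illustration (`target_next` and `draft_next` are hypothetical, as is every other name here), and the verification step calls the toy target once per position, whereas a real serving system would obtain all of those predictions from one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large model: the next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the draft model: agrees with the target except after a 2."""
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


def plain_greedy(prompt: List[Token], num_new: int, target: NextToken) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_greedy(
    prompt: List[Token], num_new: int, k: int, draft: NextToken, target: NextToken
) -> Tuple[List[Token], int]:
    """Generate num_new tokens, drafting k at a time; returns (tokens, target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. Draft k tokens autoregressively with the cheap model.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. One verification pass: the target's greedy choice at each of k + 1 positions.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest matching prefix, then append the target's own token:
        #    a correction at the first mismatch, or a bonus token if all k matched.
        accepted = 0
        while accepted < k and proposal[accepted] == verified[accepted]:
            accepted += 1
        tokens.extend(proposal[:accepted])
        tokens.append(verified[accepted])
    return tokens[: len(prompt) + num_new], target_passes


if __name__ == "__main__":
    out, passes = speculative_greedy([0], 12, 4, draft_next, target_next)
    print(out == plain_greedy([0], 12, target_next), passes)
```

In this toy setup the speculative output is identical to plain greedy decoding while using 3 verification passes instead of 12 target calls. Real acceptance rates, and therefore real savings, depend on how well the draft model tracks the target.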


Frequently Asked Questions

Does speculative decoding change model quality?

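A correct implementation preserves the target model's output distribution: under greedy decoding the output matches ordinary decoding, and under sampling the acceptance test is constructed so that accepted and resampled tokens together follow the target distribution. Real systems should still test sampling behavior carefully, because implementation shortcuts and numerical differences between batched and single-token passes can introduce small deviations.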

When does speculative decoding help most?

It helps when decode latency dominates and the draft model proposes tokens that the target model accepts frequently.

Is the draft model always smaller?

Usually, but draft tokens can also come from specialized heads, n-gram methods, or other cheaper proposal mechanisms.
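
As one illustration of a model-free proposal mechanism, the sketch below drafts tokens by n-gram lookup: it finds the most recent earlier occurrence of the last few tokens and proposes whatever followed them. The function name and parameters are hypothetical, and this is only one simple variant of the idea; the proposed tokens would still be verified by the target model as usual.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def propose_from_ngrams(tokens: Sequence[T], ngram_size: int = 2, num_draft: int = 4) -> List[T]:
    """Propose up to num_draft tokens by copying what followed an earlier
    occurrence of the trailing n-gram. Returns [] when there is no match."""
    if ngram_size < 1 or len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan from the most recent candidate backwards, excluding the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow_start = start + ngram_size
            return list(tokens[follow_start:follow_start + num_draft])
    return []


if __name__ == "__main__":
    context = "the cat sat on the mat . the cat".split()
    print(propose_from_ngrams(context))
```

This kind of proposer costs almost nothing to run, so it tends to pay off on text that repeats earlier context (for example, quoting or editing an input) and proposes nothing when no match exists.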

What limits speculative decoding speedups?

Low acceptance rates, verification overhead, memory pressure, batching interactions, and hardware utilization can limit gains.
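
A rough cost model makes the first two limits concrete. The sketch below assumes, purely for illustration, that each drafted token is accepted independently with probability `alpha`, that one verification pass costs the same as one ordinary target step, and that each draft step costs `draft_cost_ratio` times a target step. It ignores memory pressure and batching effects, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification pass when gamma tokens are
    drafted and each is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the stated assumptions.

    One speculative step costs gamma draft steps plus one target pass,
    measured in units of a single target step.
    """
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / step_cost


if __name__ == "__main__":
    print(estimated_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.05))
    print(estimated_speedup(alpha=0.3, gamma=4, draft_cost_ratio=0.3))
```

Under these assumptions, a cheap and accurate draft (the first call) yields an estimate well above 1, while an expensive draft with a low acceptance rate (the second call) yields an estimate below 1, meaning speculation would slow decoding down.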
