What is PagedAttention?

PagedAttention is an LLM serving technique that manages key-value cache memory in fixed-size blocks, similar to virtual memory paging, to reduce waste and fragmentation.

How It Works

PagedAttention was introduced with vLLM to make LLM serving more memory efficient. In autoregressive decoding, each active sequence needs KV cache memory that grows by one token's worth of keys and values per generated token. Naive allocation reserves one contiguous region per sequence, often sized for the maximum possible length, so it can waste large amounts of GPU memory when actual sequence lengths vary. PagedAttention splits the KV cache into fixed-size blocks and keeps a per-sequence block table that maps logical sequence positions to physical blocks, which do not need to be contiguous. Blocks can then be allocated on demand, shared between sequences with a common prefix, and freed or evicted individually. This makes higher concurrency and better throughput possible under memory pressure.
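
The mapping from logical positions to physical blocks can be sketched in a few lines of Python. This is a toy model of the bookkeeping only, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, the block size of 16 tokens is an illustrative value, and no tensors or attention kernels are involved.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() returns block 0 first, then 1, 2, ...
        self.free_blocks = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block: int) -> None:
        self.free_blocks.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    block_table: list = field(default_factory=list)  # logical block -> physical block
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position: int) -> tuple:
        """Translate a logical token position to (physical block, offset)."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self) -> None:
        # Return every block to the pool so other sequences can reuse it.
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    a.append_token()
for _ in range(5):
    b.append_token()
for _ in range(15):
    a.append_token()

print(a.block_table)  # [0, 1, 3]: the blocks of one sequence need not be adjacent
print(a.locate(33))   # (3, 1)
b.free()
print(len(allocator.free_blocks))  # 5
```

Because sequence `b` takes block 2 while `a` is still growing, `a` ends up with non-adjacent blocks, and the block table is what keeps its logical order intact.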

Key Characteristics

  • Manages KV cache in fixed-size blocks instead of large contiguous reservations
  • Reduces memory waste caused by variable sequence lengths
  • Improves serving concurrency when KV cache is the limiting resource
  • Enables efficient scheduling for continuous batching workloads
  • Associated with vLLM but conceptually useful for LLM serving systems broadly

Common Use Cases

  1. Serving many concurrent LLM requests on limited GPU memory
  2. Reducing KV cache fragmentation in chat workloads
  3. Improving throughput for long-output generation
  4. Supporting continuous batching in production inference
  5. Benchmarking memory-efficient LLM serving engines

Example
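
The following sketch compares how many requests fit in a fixed KV cache pool under up-front contiguous reservation versus block-based allocation. It is a simplified capacity estimate, not a benchmark: the pool size, block size, length limit, and request lengths are made-up values, the function names are hypothetical, and each request's final length is assumed to be known, whereas a real engine grows sequences block by block.

```python
import math


def contiguous_capacity(total_tokens: int, max_seq_len: int) -> int:
    """Requests that fit when each reserves max_seq_len token slots up front."""
    return total_tokens // max_seq_len


def paged_capacity(total_tokens: int, block_size: int, seq_lens: list) -> int:
    """Requests that fit, in arrival order, when each takes only the blocks it needs."""
    free_blocks = total_tokens // block_size
    admitted = 0
    for length in seq_lens:
        needed = math.ceil(length / block_size)
        if needed > free_blocks:
            break
        free_blocks -= needed
        admitted += 1
    return admitted


# Hypothetical workload: a pool of 16,384 token slots, a 2,048-token limit,
# and requests whose actual lengths are mostly far below that limit.
seq_lens = [300, 1200, 150, 800, 500, 2048, 90, 640] * 4

naive = contiguous_capacity(16_384, 2_048)
paged = paged_capacity(16_384, 16, seq_lens)
print(naive)  # 8
print(paged)  # 21
```

Under these assumed numbers the same pool holds 8 requests with contiguous reservation and 21 with paged allocation. The gap depends entirely on how far actual lengths fall below the reserved maximum.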


Frequently Asked Questions

Why does KV cache memory waste happen?

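Requests have different sequence lengths, and the final length is not known when a request arrives. Naive allocation therefore reserves contiguous memory for the longest possible sequence, which is usually more than the request actually uses.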
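
A rough calculation shows the scale of the waste for a single request. The model shape below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed 7B-class configuration chosen for illustration, and the 2,048-token reservation, 300-token request, and 16-token block size are likewise hypothetical.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


MIB = 1024 * 1024

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

max_seq_len = 2048  # slots reserved up front by a naive allocator (assumed)
actual_len = 300    # tokens the request really used (assumed)
block_size = 16     # tokens per block in a paged allocator (assumed)

# Contiguous reservation: everything beyond actual_len is unused.
contiguous_waste = (max_seq_len - actual_len) * per_token

# Paged allocation: only the tail of the last block is unused.
paged_slots = math.ceil(actual_len / block_size) * block_size
paged_waste = (paged_slots - actual_len) * per_token

print(per_token / MIB)         # 0.5
print(contiguous_waste / MIB)  # 874.0
print(paged_waste / MIB)       # 2.0
```

With block-based allocation the unused space per sequence is bounded by one block, regardless of how large the maximum sequence length is.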

Is PagedAttention only for vLLM?

It is strongly associated with vLLM, but the broader idea of paged KV cache management can inform other serving systems.

Does PagedAttention improve model quality?

No. It is a serving efficiency technique. It should preserve model behavior while improving memory utilization and throughput.

When is PagedAttention most valuable?

It is most valuable when GPU memory capacity and KV cache fragmentation, rather than compute, limit concurrency or long-context serving.
