Jump to content

PagedAttention

From Wikipedia, the free encyclopedia

PagedAttention is an attention algorithm for efficient serving of large language models (LLMs). It was introduced in 2023 by Woosuk Kwon and colleagues in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention,[1] alongside the vLLM serving engine. The method stores the key–value cache used during autoregressive decoding in fixed-size blocks that can be mapped to non-contiguous physical memory, borrowing ideas from virtual memory, paging, and operating system design.[2][3]

Background

[edit]

In transformer inference, the key–value cache grows with sequence length and the number of concurrent requests. Kwon et al. argued that earlier serving systems typically reserved contiguous cache regions in advance, which caused reserved space, internal fragmentation, and external fragmentation. In their experiments, the paper reported that the effective memory utilization of previous systems could fall as low as 20.4%.[2]

Description

[edit]

PagedAttention partitions the cache of each sequence into fixed-size KV blocks. A request's cache is represented as a sequence of logical blocks, while a block table maps those logical blocks to physical GPU-memory blocks. As a result, neighboring logical blocks do not need to be contiguous in physical memory, and new blocks can be allocated on demand as generation proceeds.[2]

The design also makes it easier to share cache state across related decoding paths. In vLLM, physical blocks can be reference-counted and shared among requests or branches, with block-granularity copy-on-write used when a shared block must be modified. The original paper applied this design to parallel sampling, beam search, and prompts with shared prefixes.[2]

Mathematical formulation

[edit]

For a query token in causal self-attention, the standard attention output can be written as where , , and are the query, key, and value vectors, and is the attention dimension.[2]

If the cache is partitioned into blocks of size , the key and value blocks may be written as PagedAttention then performs the computation blockwise: where is the vector of attention scores for the -th KV block. In the formulation given by Kwon et al., this preserves the causal attention calculation while allowing the key and value blocks to reside in non-contiguous physical memory.[2]

Performance and use

[edit]

The vLLM paper reported that, on its evaluated workloads, the use of PagedAttention and the associated memory-management design improved serving throughput by 2–4× over the compared baselines, including FasterTransformer and Orca, while preserving model outputs. In experiments on OPT-13B with the Alpaca trace, the paper also reported memory savings of 6.1–9.8% for parallel sampling and 37.6–55.2% for beam search through KV-block sharing.[2]

A 2024 survey of LLM serving systems described PagedAttention as having become an industry norm in LLM serving frameworks, citing support in TGI, vLLM, and TensorRT-LLM.[3]

Limitations and alternatives

[edit]

Subsequent work has described trade-offs in the approach. The 2025 vAttention paper argued that PagedAttention requires attention kernels to be rewritten to support paging and increases software complexity, portability issues, redundancy, and execution overhead, proposing instead a memory manager that keeps the cache contiguous in virtual memory while relying on demand paging for physical allocation.[4]

vAttention

[edit]

Unlike PagedAttention, vAttention does not introduce a different attention rule; it retains the standard attention computation In the notation of Prabhu et al., the key and value tensors for a request seen so far are , where is the context length seen so far, is the number of KV heads on a worker, and is the dimension of each KV head.[4]

In systems prior to PagedAttention, the K cache (or V cache) at each layer of a worker is typically allocated as a 4D tensor of shape where is batch size and is the maximum context length supported by the model.[4]

vAttention preserves this contiguous virtual-memory view while deferring physical-memory allocation to runtime. A serving framework maintains separate K and V tensors for each layer, so vAttention reserves virtual-memory buffers on a worker, where is the number of layers managed by that worker.[4]

The maximum size of one virtual-memory buffer is where is the maximum size of a single request's per-layer K cache (or V cache) on a worker. The paper defines where is the number of bytes needed to store one element.[4]

In this formulation, vAttention keeps the KV cache contiguous in virtual memory and relies on demand paging for physical allocation, rather than modifying the attention kernel to operate over non-contiguous KV-cache blocks.[4]

See also

[edit]

References

[edit]
  1. ^ Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (September 12, 2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". arXiv:2309.06180 [cs.LG].
  2. ^ a b c d e f g Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". Proceedings of the 29th Symposium on Operating Systems Principles. Association for Computing Machinery. pp. 611–626. arXiv:2309.06180. doi:10.1145/3600006.3613165.
  3. ^ a b Li, Baolin; Jiang, Yankai; Gadepally, Vijay; Tiwari, Devesh (July 17, 2024). "LLM Inference Serving: Survey of Recent Advances and Opportunities". arXiv:2407.12391 [cs.DC].
  4. ^ a b c d e f Prabhu, Ramya; Nayak, Ajay; Mohan, Jayashree; Ramjee, Ramachandran; Panwar, Ashish (2025). "vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention". Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. Association for Computing Machinery. pp. 1133–1150. doi:10.1145/3669940.3707256.