Platform - Model Serving + PPO
- SlideDeck: W8.2-Model Serving-team6-t5
- Version: current
- Lead team: team-6
- Notes: vLLM, KV cache, model serving survey
In this session, our readings cover:
Required Readings:
Efficient Memory Management for Large Language Model Serving with PagedAttention
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
- [Submitted on 12 Sep 2023]
- High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM’s source code is publicly available at this https URL
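To make the virtual-memory analogy concrete, here is a minimal, hypothetical sketch of a block-table-based KV cache manager in the spirit of PagedAttention (the class and method names are made up, not vLLM's actual implementation): logical token positions map to fixed-size physical blocks through a per-request block table, so blocks are allocated on demand and the only internal waste is in the last partially filled block of each request.

```python
# Minimal sketch of a PagedAttention-style KV cache manager (illustrative only;
# class/method names are hypothetical, not vLLM's actual API).

class PagedKVCacheManager:
    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size                          # tokens per KV block
        self.free_blocks = list(range(num_physical_blocks))   # pool of physical block ids
        self.block_tables = {}                                # request_id -> [physical block ids]
        self.lengths = {}                                     # request_id -> tokens cached so far

    def add_request(self, request_id: str):
        self.block_tables[request_id] = []
        self.lengths[request_id] = 0

    def append_token(self, request_id: str):
        """Reserve KV space for one new token; allocate a block only when the last one is full."""
        if self.lengths[request_id] % self.block_size == 0:    # no block yet, or last block full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller should preempt or swap a request")
            self.block_tables[request_id].append(self.free_blocks.pop())
        self.lengths[request_id] += 1

    def free_request(self, request_id: str):
        """Return all physical blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]


if __name__ == "__main__":
    mgr = PagedKVCacheManager(num_physical_blocks=8, block_size=16)
    mgr.add_request("req-0")
    for _ in range(40):          # 40 tokens -> 3 blocks; waste is under one block
        mgr.append_token("req-0")
    print(mgr.block_tables["req-0"], len(mgr.free_blocks))
    mgr.free_request("req-0")
```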
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
- Covering: vLLM, continuous batching, chunked prefill, fair scheduling, KV cache management, and disaggregated serving.
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
- In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.
- https://arxiv.org/pdf/2312.15234
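Since the survey's scheduling discussion centers on continuous (iteration-level) batching, here is a toy sketch of that loop under simplified assumptions (a random stand-in for the model's decoding step and a fixed batch-size cap instead of a KV-memory budget): requests join and leave the batch at every iteration rather than once per batch.

```python
# Toy sketch of continuous (iteration-level) batching. All names are
# illustrative; a real system would run the model once per step for the whole
# batch and admit requests against a KV-memory budget, not a fixed batch size.
from collections import deque
from dataclasses import dataclass, field
import random

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def generate_one_token(req: Request) -> int:
    return random.randint(0, 31999)   # placeholder for one decoding step

def serve(requests, max_batch_size=4):
    waiting, running, finished = deque(requests), [], []
    while waiting or running:
        # Admit new requests at every iteration, not only when the batch drains.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        for req in running:
            req.generated.append(generate_one_token(req))
        # Retire finished requests immediately so their slots are reused.
        still_running = []
        for req in running:
            (finished if len(req.generated) >= req.max_new_tokens else still_running).append(req)
        running = still_running
    return finished

if __name__ == "__main__":
    done = serve([Request(i, max_new_tokens=random.randint(3, 10)) for i in range(8)])
    print([(r.rid, len(r.generated)) for r in done])
```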
A Survey on Large Language Model Acceleration based on KV Cache Management
- [Submitted on 27 Dec 2024 (v1), last revised 2 Jan 2025 (this version, v2)]
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen
- Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications.
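As an illustration of one token-level strategy from the survey's taxonomy (KV cache quantization), here is a minimal NumPy sketch, not drawn from any specific system: K/V tensors are stored as int8 with one scale per head and dequantized on use, cutting cache memory roughly 4x versus float32 at some accuracy cost.

```python
# Minimal sketch of 8-bit KV cache quantization (illustrative; real systems use
# per-channel or per-token schemes, fused kernels, and careful outlier handling).
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Symmetric int8 quantization with one scale per head.
    kv: float32 array of shape [num_heads, seq_len, head_dim]."""
    scale = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0 + 1e-8
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = rng.standard_normal((8, 1024, 128)).astype(np.float32)   # toy K cache
    q, s = quantize_kv(k)
    err = np.abs(dequantize_kv(q, s) - k).mean()
    print(f"bytes: {k.nbytes} -> {q.nbytes + s.nbytes}, mean abs error: {err:.4f}")
```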
PPO readings
A simple blog post: Preference Tuning LLMs: PPO, DPO, GRPO — A Simple Guide
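For quick reference, these are the standard PPO and DPO objectives the guide covers, written in the usual form from the literature (the post's own notation may differ slightly):

```latex
% PPO clipped surrogate objective (Schulman et al., 2017), with the probability
% ratio r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t):
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]

% DPO loss (Rafailov et al., 2023), comparing a preferred response y_w with a
% rejected response y_l under the policy \pi_\theta and frozen reference \pi_{\mathrm{ref}}:
L_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\Big(
      \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\right]
```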
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
- [Submitted on 23 Jul 2024]
- Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng
- With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
- [Submitted on 20 May 2024 (v1), last revised 24 Nov 2024 (this version, v4)]
- Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, Yu Cao
- As large language models (LLMs) continue to grow by scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention due to its outstanding performance. However, unlike pretraining or fine-tuning a single model, scaling reinforcement learning from human feedback (RLHF) for training large language models poses coordination challenges across four models. We present OpenRLHF, an open-source framework enabling efficient RLHF scaling. Unlike existing RLHF frameworks that co-locate four models on the same GPUs, OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed, leveraging improved resource utilization and diverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts, which ensures user-friendliness. OpenRLHF implements RLHF, DPO, rejection sampling, and other alignment techniques. Empowering state-of-the-art LLM development, OpenRLHF’s code is available at \url{this https URL}.
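To make the four-model coordination concrete, below is a toy sketch of a single PPO-for-RLHF step using placeholder per-token quantities; the function and variable names are hypothetical and are not OpenRLHF's API. The actor is the model being trained, the frozen reference model supplies a per-token KL penalty, the reward model scores the finished response, and the critic provides the value estimates used for advantages.

```python
# Toy numbers stand in for per-token quantities produced by the four RLHF models
# (actor, reference, reward model, critic). Names and shapes are illustrative,
# not OpenRLHF's actual interfaces.
import numpy as np

def ppo_rlhf_step(actor_logprobs, old_logprobs, ref_logprobs, values,
                  reward_score, kl_coef=0.1, clip_eps=0.2, gamma=1.0, lam=0.95):
    """Compute PPO losses for one generated response (all inputs are per-token arrays)."""
    T = len(actor_logprobs)
    # 1) Shape rewards: KL penalty against the reference model at every token,
    #    plus the reward-model score added on the final token.
    kl = old_logprobs - ref_logprobs
    rewards = -kl_coef * kl
    rewards[-1] += reward_score
    # 2) GAE advantages from the critic's value estimates (bootstrap value 0 at the end).
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    # 3) Clipped PPO policy loss plus a squared-error value loss (a real trainer would
    #    re-evaluate the critic here; we reuse `values` for brevity).
    ratio = np.exp(actor_logprobs - old_logprobs)
    policy_loss = -np.mean(np.minimum(ratio * advantages,
                                      np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages))
    value_loss = np.mean((values - returns) ** 2)
    return policy_loss, value_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    old = rng.normal(-2.0, 0.5, size=16)          # actor logprobs at sampling time
    new = old + rng.normal(0.0, 0.05, size=16)    # actor logprobs after some updates
    ref = old + rng.normal(0.0, 0.10, size=16)    # frozen reference-model logprobs
    vals = rng.normal(0.0, 0.2, size=16)          # critic value estimates
    print(ppo_rlhf_step(new, old, ref, vals, reward_score=1.0))
```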
More reading:
Multiple ML systems readings
- [Scheduling] Chunked Prefill (OSDI’24): This is perhaps the most widely adopted scheduling policy in today’s LLM serving systems; the idea is simple and straightforward but works very well, refining continuous batching (OSDI’22). A minimal scheduler sketch appears after this list.
- [Disaggregated Serving] Splitwise (ISCA’24) / DistServe (OSDI’24): These two papers share a similar idea: separating prefill and decode across different nodes based on their stage-specific characteristics. The approach is intuitive and is being merged into vLLM.
- [KV Cache, Tooling] SGLang (NIPS’24): A widely used serving framework and an alternative to vLLM; it can also be viewed as a programming language tailored to LLM application developers, greatly simplifying the code they need to write. At its core is RadixAttention, designed for efficient KV cache reuse.
- [Disaggregated Serving] Helix (ASPLOS’25): This proposes an optimized LLM sharding strategy for heterogeneous clusters to achieve optimal resource allocation.
- [Disaggregated Serving] ServerlessLLM (OSDI’24): This proposes efficient live migration of LLM inference in the cloud without sacrificing serving efficiency.
- [Scheduling] SJF (NIPS’24): This proposes a statistics-based online algorithm to approximate shortest-job-first scheduling in online LLM inference.
- [Offloading] FlexGen (ICML’23): This proposes the first offloading strategy specifically for inference systems.
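As referenced in the chunked-prefill item above, here is a minimal sketch of the token-budget idea under simplified assumptions (toy request objects, no real model, illustrative names): each scheduling step fills a fixed token budget with all pending decode tokens first, then slices in as much prefill work as still fits, so long prompts no longer stall ongoing decodes.

```python
# Toy sketch of chunked-prefill scheduling. Each step has a fixed token budget;
# decodes (1 token per running request) are packed first, and the remaining
# budget is spent on chunks of pending prefills.
from dataclasses import dataclass

@dataclass
class Req:
    rid: int
    prompt_remaining: int   # prompt tokens not yet prefilled
    decode_remaining: int   # output tokens not yet generated

def schedule_step(requests, token_budget=256):
    """Return (rid, kind, num_tokens) work items for one model iteration."""
    plan, budget = [], token_budget
    # 1) Decode phase: every request already past prefill gets exactly one token.
    for r in requests:
        if r.prompt_remaining == 0 and r.decode_remaining > 0 and budget > 0:
            plan.append((r.rid, "decode", 1))
            r.decode_remaining -= 1
            budget -= 1
    # 2) Prefill phase: spend leftover budget on prompt chunks, splitting a long prompt if needed.
    for r in requests:
        if r.prompt_remaining > 0 and budget > 0:
            chunk = min(r.prompt_remaining, budget)
            plan.append((r.rid, "prefill", chunk))
            r.prompt_remaining -= chunk
            budget -= chunk
    return plan

if __name__ == "__main__":
    reqs = [Req(0, prompt_remaining=0, decode_remaining=5),
            Req(1, prompt_remaining=1000, decode_remaining=8),
            Req(2, prompt_remaining=300, decode_remaining=4)]
    for step in range(3):
        print(step, schedule_step(reqs))
```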