Model Serving for Agents
- Notes: Agents with efficient model serving
In this session, our readings cover:
Readings: DEPLOYMENT & SERVING
Core Component: Production Infrastructure - Deploying and Serving Agents at Scale
Understanding the infrastructure and systems for deploying agents in production. Key Concepts: Model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, monitoring and interpretability
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Platform - Model Serving | W8.2-Model Serving-team6-t5 | 25course |
| More Model Serving - SGlang + Chunked Prefill | W12.2-Model-Serving | 25course |
| Model Serving - Efficiency Inference | W14.2.ModelServing | 25course |
| Model Interpretability for FM | W13.2-GenAI-Interpretability | 25course |
| LLM Interpretability, Trust and Knowledge Conflicts | W10-T6-LLMInterpretibility | 24course |
2025 HIGH-IMPACT PAPERS on a related topic:
Multiple ML systems readings
- [Scheduling] Chunked Prefill (OSDI’24): Perhaps the most widely adopted scheduling policy in today’s LLM serving systems; the idea is simple and straightforward, yet it works very well. It builds on Continuous Batching (OSDI’22); see the scheduling sketch after this list.
- [Disaggregated Serving] Splitwise (ISCA’24) / DistServe (OSDI’24): These two papers share a similar idea, separating prefill/decode across different nodes based on stage-specific characteristics. These are also intuitive ideas and are being merged into vLLM.
- [KV Cache, Tooling] SGLang (NeurIPS’24): A widely used serving framework and an alternative to vLLM. It also functions as a programming language tailored to LLM application developers, greatly simplifying the code they need to write. At its core is RadixAttention, designed for efficient KV cache reuse.
- [Disaggregated Serving] Helix (ASPLOS’25): This proposes an optimized LLM sharding strategy for heterogeneous clusters to achieve optimal resource allocation.
- [Disaggregated Serving] ServerlessLLM (OSDI’24): This proposes efficient live migration of LLM inference in the cloud, preserving serving efficiency during migration.
- [Scheduling] SJF (NeurIPS’24): This proposes a learning-to-rank-based online algorithm that approximates shortest-job-first scheduling for online LLM inference (see “Efficient LLM Scheduling by Learning to Rank” below).
- [Offloading] FlexGen (ICML’23): This proposes one of the first offloading strategies designed specifically for high-throughput LLM inference on limited hardware.
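To make the chunked-prefill idea concrete, here is a minimal, framework-agnostic scheduling sketch (the names and the token budget are hypothetical, not vLLM's or the OSDI'24 system's actual API): each model step first reserves one token per running decode request, then fills the remaining token budget with a chunk of a waiting prompt, so long prefills are spread over several steps instead of stalling decodes.

```python
from dataclasses import dataclass
from collections import deque

TOKEN_BUDGET = 512  # max tokens processed per model step (hypothetical value)

@dataclass
class Request:
    rid: int
    prompt_len: int
    prefilled: int = 0          # prompt tokens already prefilled
    done: bool = False

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len

def schedule_step(running: list, waiting: deque) -> list:
    """Build one batch: decode tokens first, then chunks of pending prefills."""
    batch, budget = [], TOKEN_BUDGET

    # 1) Every running request in the decode phase contributes exactly one token.
    for req in running:
        if req.in_decode and not req.done and budget > 0:
            batch.append((req.rid, "decode", 1))
            budget -= 1

    # 2) Spend the remaining budget on prefill chunks, possibly splitting a
    #    long prompt across several steps (the "chunked" part).
    while budget > 0 and waiting:
        req = waiting[0]
        chunk = min(budget, req.prompt_len - req.prefilled)
        batch.append((req.rid, "prefill", chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.in_decode:               # prompt fully prefilled -> starts decoding
            running.append(waiting.popleft())
    return batch

# Toy run: a 1200-token prompt is prefilled over multiple steps while an
# already-running request keeps decoding one token per step, unblocked.
running = [Request(rid=0, prompt_len=8, prefilled=8)]
waiting = deque([Request(rid=1, prompt_len=1200)])
for step in range(3):
    print(step, schedule_step(running, waiting))
```

The key property is that per-step decode latency stays bounded even while a very long prompt is being prefilled, since prefill work only consumes whatever budget the decodes leave over.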
Auditing Prompt Caching in Language Model APIs
- [Submitted on 11 Feb 2025]
- https://arxiv.org/abs/2502.07776
- Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
- Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users’ prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI’s embedding model is a decoder-only Transformer, which was previously not publicly known.
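The audit described above boils down to a timing hypothesis test. The sketch below shows the general shape of such a test, not the authors' exact procedure: `time_request` is a placeholder (simulated here so the snippet runs; a real audit would call the provider's API and measure time to first token), and a one-sided Mann-Whitney U test checks whether replayed prompts are answered significantly faster than fresh ones.

```python
import random
import statistics
from scipy.stats import mannwhitneyu  # one-sided nonparametric test

def time_request(prompt: str) -> float:
    """Placeholder for 'send `prompt` to an LLM API and return time-to-first-token
    (seconds)'. Simulated here so the sketch runs: repeated prompts hit a cache."""
    cached = prompt in time_request._seen
    time_request._seen.add(prompt)
    base = 0.12 if cached else 0.45       # simulated cache-hit vs. cache-miss latency
    return base + random.gauss(0, 0.02)
time_request._seen = set()

def audit_prompt_caching(prefix: str, trials: int = 50, alpha: float = 0.01) -> bool:
    """Return True if replayed prompts are answered significantly faster than
    fresh prompts, i.e. timing evidence consistent with prompt caching."""
    warm, cold = [], []
    for i in range(trials):
        victim = f"{prefix} request #{i}"
        time_request(victim)               # first call seeds the cache
        warm.append(time_request(victim))  # replay of an identical prompt
        cold.append(time_request(f"{prefix} fresh #{i}-{random.random()}"))
    # H0: warm and cold latencies come from the same distribution.
    _, p_value = mannwhitneyu(warm, cold, alternative="less")
    print(f"median warm={statistics.median(warm):.3f}s "
          f"cold={statistics.median(cold):.3f}s  p={p_value:.2e}")
    return p_value < alpha

print("caching detected:", audit_prompt_caching("Translate to French:"))
```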
More Readings:
Orca: A Distributed Serving System for Transformer-Based Generative Models
- Continuous Batching: https://www.usenix.org/system/files/osdi22-yu.pdf
- Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University
- Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Since these models generate a next token in an autoregressive manner, one has to run the model multiple times to process an inference request where each iteration of the model generates a single output token for the request. However, existing systems for inference serving do not perform well on this type of workload that has a multi-iteration characteristic, due to their inflexible scheduling mechanism that cannot change the current batch of requests being processed; requests that have finished earlier than other requests in a batch cannot return to the client, while newly arrived requests have to wait until the current batch completely finishes. In this paper, we propose iteration-level scheduling, a new scheduling mechanism that schedules execution at the granularity of iteration (instead of request) where the scheduler invokes the execution engine to run only a single iteration of the model on the batch. In addition, to apply batching and iteration-level scheduling to a Transformer model at the same time, we suggest selective batching, which applies batching only to a selected set of operations. Based on these two techniques, we have implemented a distributed serving system called ORCA, with additional designs for scalability to models with hundreds of billions of parameters. Our evaluation on a GPT-3 175B model shows that ORCA can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9× throughput improvement at the same level of latency.
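A toy sketch of the iteration-level scheduling idea (illustrative only, not ORCA's implementation, and it omits selective batching): the batch is re-formed after every model iteration, so completed requests return immediately and newly arrived ones join without waiting for the whole batch to drain.

```python
import random
from collections import deque

class Request:
    def __init__(self, rid: int, target_len: int):
        self.rid, self.target_len, self.generated = rid, target_len, 0

def run_iteration(batch):
    """Stand-in for one forward pass: every request in the batch emits one token."""
    for req in batch:
        req.generated += 1

def serve(requests, max_batch: int = 4):
    """Iteration-level scheduling: the batch is re-formed after every iteration."""
    waiting, running, finished = deque(requests), [], []
    while waiting or running:
        # Admit new requests as soon as a slot frees up (no waiting for the
        # whole batch to drain, unlike request-level batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        run_iteration(running)
        # Return finished requests immediately instead of holding them in the batch.
        still_running = []
        for req in running:
            (finished if req.generated >= req.target_len else still_running).append(req)
        running = still_running
    return finished

done = serve([Request(i, random.randint(2, 30)) for i in range(10)])
print([(r.rid, r.generated) for r in done])
```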
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- FlexGen: https://arxiv.org/pdf/2303.06865 [Submitted on 13 Mar 2023 (v1), last revised 12 Jun 2023 (this version, v2)]
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
- The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is publicly available.
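A toy sketch of the tiered-placement problem FlexGen optimizes, using a greedy heuristic instead of FlexGen's actual linear-programming formulation; the capacities, bandwidths, and tensor sizes below are made-up numbers.

```python
TIERS = [  # hypothetical capacities (GB) and bandwidths (GB/s), ordered fastest-first
    {"name": "gpu",  "capacity": 16,   "bandwidth": 900.0},
    {"name": "cpu",  "capacity": 128,  "bandwidth": 25.0},
    {"name": "disk", "capacity": 2048, "bandwidth": 2.0},
]

def place(tensors):
    """tensors: list of (name, size_gb, accesses_per_token). Greedily keeps the
    tensors that move the most bytes per generated token on the fastest tier and
    returns (placement, rough per-token transfer time for off-GPU tensors)."""
    free = {t["name"]: t["capacity"] for t in TIERS}
    placement, transfer_time = {}, 0.0
    for name, size, accesses in sorted(tensors, key=lambda t: -t[1] * t[2]):
        for tier in TIERS:
            if free[tier["name"]] >= size:
                free[tier["name"]] -= size
                placement[name] = tier["name"]
                if tier["name"] != "gpu":       # off-GPU data must be streamed in
                    transfer_time += size * accesses / tier["bandwidth"]
                break
    return placement, transfer_time

# Example: a model too large for the GPU spills to CPU memory; activations stay on GPU.
tensors = [("weights", 60, 1.0), ("kv_cache", 40, 1.0), ("activations", 2, 1.0)]
print(place(tensors))
```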
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
- NEO: https://arxiv.org/pdf/2411.01142
- [Submitted on 2 Nov 2024]
- Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
- Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on A10G GPU.
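The core decision in a GPU/CPU-offloading server like NEO is how much attention work to push to the host so that the pipelined GPU and CPU sides stay balanced. The sketch below illustrates that decision with a crude cost model and an exhaustive split search; the costs and the rule are assumptions for illustration, not NEO's load-aware scheduler.

```python
def split_batch(kv_lens, gpu_cost=1.0, cpu_cost=8.0):
    """kv_lens: per-request KV-cache length (proxy for attention cost).
    Tries offloading the k shortest-context requests to the CPU for every k and
    returns (step_time, gpu_indices, cpu_indices) minimizing the slower side,
    assuming the two sides overlap in a pipeline and per-token attention costs
    gpu_cost / cpu_cost (arbitrary illustrative units)."""
    order = sorted(range(len(kv_lens)), key=lambda i: kv_lens[i])
    best = None
    for k in range(len(kv_lens) + 1):
        cpu_set, gpu_set = order[:k], order[k:]
        cpu_time = sum(kv_lens[i] for i in cpu_set) * cpu_cost
        gpu_time = sum(kv_lens[i] for i in gpu_set) * gpu_cost
        step_time = max(cpu_time, gpu_time)   # pipelined: the slower side dominates
        if best is None or step_time < best[0]:
            best = (step_time, gpu_set, cpu_set)
    return best

# Example: long-context requests stay on the GPU; a few short ones move to the CPU.
print(split_batch([4096, 2048, 512, 512, 256, 128]))
```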
Efficient LLM Scheduling by Learning to Rank
- Shortest Job First: https://arxiv.org/pdf/2408.15792
- [Submitted on 28 Aug 2024]
- Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
- In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption – we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is publicly available.
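A minimal sketch of why relative ranking is enough: if a predictor can merely order requests by expected output length, serving the predicted-shortest first approximates SJF and cuts average completion time. The `predicted_rank_score` heuristic below is a stand-in for the paper's learned ranker, and the workload is synthetic.

```python
import random

def predicted_rank_score(prompt: str) -> float:
    """Hypothetical ranker: higher score = expected longer generation. A real
    system would use a small learned model over prompt features (learning to rank)."""
    return len(prompt) + random.uniform(-5, 5)

def schedule(queue):
    """Serve requests in order of predicted generation length (approximate SJF)."""
    return sorted(queue, key=predicted_rank_score)

def avg_completion_time(order, true_lens):
    """Average completion time if requests run back-to-back in `order`."""
    t, total = 0, 0
    for prompt in order:
        t += true_lens[prompt]
        total += t
    return total / len(order)

true_lens = {  # true output lengths in tokens, unknown to the scheduler
    "write a long open-ended essay about serving systems": 600,
    "what is 2 + 2?": 10,
    "summarize this paragraph briefly": 80,
}
queue = list(true_lens)  # FCFS = arrival order, with the long job first
print("FCFS avg completion:", avg_completion_time(queue, true_lens))
print("rank-based ~SJF    :", avg_completion_time(schedule(queue), true_lens))
```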
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
- [Submitted on 1 Aug 2024 (v1), last revised 14 Oct 2024 (this version, v2)]
- Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
- While the scaling laws of large language models (LLMs) training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-n, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets, and that smaller models paired with advanced inference algorithms yield Pareto-optimal cost-performance trade-offs. For instance, the Llemma-7B model, equipped with our novel tree search algorithm, consistently outperforms Llemma-34B with standard majority voting on the MATH benchmark across all FLOPs budgets. We hope these findings contribute to a broader understanding of inference scaling laws for LLMs.
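Two of the inference strategies compared above, majority voting and weighted voting, are easy to sketch. The sampler and verifier scores below are synthetic stand-ins; the point is only to show how extra inference compute (a larger n) is spent instead of a larger model.

```python
import random
from collections import Counter, defaultdict

def sample_answer():
    """Hypothetical sampler: returns (answer, verifier_score). The correct answer
    '42' is sampled 40% of the time and tends to receive a higher score."""
    if random.random() < 0.4:
        return "42", random.uniform(0.6, 1.0)
    return random.choice(["41", "43", "7"]), random.uniform(0.0, 0.7)

def majority_vote(samples):
    return Counter(ans for ans, _ in samples).most_common(1)[0][0]

def weighted_vote(samples):
    scores = defaultdict(float)
    for ans, score in samples:
        scores[ans] += score               # weight each vote by its verifier score
    return max(scores, key=scores.get)

n = 16                                     # the inference-compute knob: samples drawn
samples = [sample_answer() for _ in range(n)]
print("majority:", majority_vote(samples), " weighted:", weighted_vote(samples))
```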
Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing
- KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
- With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.
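A minimal sketch of the routing setup: a lightweight router assigns each query to a single model, trading quality against cost. The keyword heuristic and model table below are hypothetical; the paper studies learned routers over challenging reasoning tasks.

```python
MODELS = {  # hypothetical cost per 1K tokens (USD)
    "small-fast-model": 0.0002,
    "medium-reasoning-model": 0.003,
    "large-frontier-model": 0.03,
}

def route(query: str) -> str:
    """Pick a single model per query from a crude difficulty estimate; a learned
    router would replace this keyword heuristic with a classifier over query features."""
    hard_markers = ("prove", "step by step", "derive", "algorithm")
    if any(marker in query.lower() for marker in hard_markers):
        return "large-frontier-model"
    if len(query.split()) > 30:
        return "medium-reasoning-model"
    return "small-fast-model"

for q in [
    "What is the capital of France?",
    "Prove that the sum of two even numbers is even.",
    " ".join(["please"] * 40) + " summarize this long report carefully",
]:
    print(f"{route(q):<24} <- {q[:45]}")
```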
