Model Serving for Agents

Efficiency

In this session, our readings cover:

Readings: DEPLOYMENT & SERVING

Core Component: Production Infrastructure - Deploying and Serving Agents at Scale

Understanding the infrastructure and systems for deploying agents in production. Key Concepts: Model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, monitoring and interpretability
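One of the key concepts listed above, KV cache optimization, can be illustrated with a minimal sketch. This is a toy illustration in pure Python (all names are invented for illustration, not taken from vLLM or any real serving system): during autoregressive decoding, the key/value projections of past tokens are cached so each step only computes projections for the newly generated token instead of re-projecting the whole sequence.

```python
# Minimal sketch of KV caching in autoregressive decoding.
# Toy model: keys/values are single numbers per token; real systems
# cache per-layer, per-head tensors (e.g., vLLM's paged KV cache).

class KVCache:
    def __init__(self):
        self.keys = []    # one entry per past token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def project(token_id):
    # Stand-in for the model's K/V projection of one token.
    return (float(token_id), float(token_id) * 2.0)

def decode_step(cache, new_token):
    # Without a cache we would re-project ALL past tokens every step
    # (O(n) work per step); with the cache, only the new token.
    k, v = project(new_token)
    cache.append(k, v)
    # Attention would now read cache.keys / cache.values for context.
    return len(cache)

cache = KVCache()
for t in [101, 7, 42]:
    decode_step(cache, t)
print(len(cache))  # 3 cached key/value pairs after 3 decode steps
```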

Topic | Slide Deck | Previous Semester
Platform - Model Serving | W8.2-Model Serving-team6-t5 | 25course
More Model Serving - SGLang + Chunked Prefill | W12.2-Model-Serving | 25course
Model Serving - Efficiency Inference | W14.2.ModelServing | 25course

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
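Sarathi-Serve's central idea, chunked prefill, splits a long prompt's prefill into bounded chunks so that the decode steps of other in-flight requests can be interleaved into every batch, keeping per-iteration latency predictable. A minimal scheduling sketch (the token budget, field names, and chunk policy here are illustrative assumptions, not the paper's exact algorithm):

```python
# Sketch of chunked-prefill batching: each iteration has a token
# budget; ongoing decodes (1 token each) are admitted first, and
# the remaining budget is filled with a chunk of a pending prefill.

TOKEN_BUDGET = 8  # illustrative per-iteration token budget

def schedule_iteration(decode_reqs, prefill_queue):
    batch = [("decode", r) for r in decode_reqs]  # 1 token per decode
    budget = TOKEN_BUDGET - len(batch)
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining"])
        req["remaining"] -= chunk
        batch.append(("prefill", req["id"], chunk))
        if req["remaining"] == 0:
            prefill_queue.pop(0)  # prefill done; request can start decoding
    return batch

decodes = ["a", "b", "c"]                   # 3 in-flight decode requests
prefills = [{"id": "d", "remaining": 20}]   # one 20-token prompt
iters = 0
while prefills:
    schedule_iteration(decodes, prefills)
    iters += 1
print(iters)  # 20 prompt tokens at 5-token chunks -> 4 iterations
```

Note how the prompt never monopolizes an iteration: the three decodes run in every batch, which is the throughput-latency tradeoff the paper targets.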

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

More Readings:

Multiple ML systems readings

Neo: https://arxiv.org/pdf/2411.01142

Orca: A Distributed Serving System for Transformer-Based Generative Models

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Shortest Job First: https://arxiv.org/pdf/2408.15792

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing