Model Serving for Agents
- Notes: Agents with efficient model serving
In this session, our readings cover:
Readings: DEPLOYMENT & SERVING
Core Component: Production Infrastructure - Deploying and Serving Agents at Scale
Understanding the infrastructure and systems for deploying agents in production. Key Concepts: Model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, monitoring and interpretability
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Platform - Model Serving | W8.2-Model Serving-team6-t5 | 25course |
| More Model Serving - SGlang + Chunked Prefill | W12.2-Model-Serving | 25course |
| Model Serving - Efficiency Inference | W14.2.ModelServing | 25course |
| Model Interpretability for FM | W13.2-GenAI-Interpretability | 25course |
| LLM Interpretability, Trust and Knowledge Conflicts | W10-T6-LLMInterpretibility | 24course |
2025 HIGH-IMPACT PAPERS on a related topic:
Multiple ML systems readings
- [Scheduling] Chunked Prefill (OSDI’24): Perhaps the most widely adopted scheduling policy in today’s LLM serving systems; the idea is simple and straightforward, yet it works very well. It builds on Continuous Batching (OSDI’22); see the scheduling sketch after this list.
- [Disaggregated Serving] Splitwise (ISCA’24) / DistServe (OSDI’24): These two papers share a similar idea, separating prefill/decode across different nodes based on stage-specific characteristics. These are also intuitive ideas and are being merged into vLLM.
- [KV Cache, Tooling] SGLang (NeurIPS’24): A widely used serving framework and an alternative to vLLM. It also functions as a programming language tailored to LLM application developers, greatly simplifying the code they need to write. At its core is RadixAttention, designed for efficient KV cache reuse.
- [Disaggregated Serving] Helix (ASPLOS’25): This proposes an optimized LLM sharding strategy for heterogeneous clusters to achieve optimal resource allocation.
- [Disaggregated Serving] ServerlessLLM (OSDI’24): This proposes efficient live migration of LLM inference in the cloud, preserving serving efficiency during migration.
- [Scheduling] SJF (NeurIPS’24): This proposes a learning-to-rank-based online algorithm that approximates shortest-job-first scheduling for online LLM inference (see “Efficient LLM Scheduling by Learning to Rank” below).
- [Offloading] FlexGen (ICML’23): This proposes one of the first offloading strategies designed specifically for high-throughput LLM inference on limited hardware.
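To make the chunked-prefill idea concrete, here is a minimal, framework-agnostic scheduling sketch (the names and the token budget are hypothetical, not vLLM's or the OSDI'24 system's actual API): each model step first reserves one token per running decode request, then fills the remaining token budget with a chunk of a waiting prompt, so long prefills are spread over several steps instead of stalling decodes.

```python
from dataclasses import dataclass
from collections import deque

TOKEN_BUDGET = 512  # max tokens processed per model step (hypothetical value)

@dataclass
class Request:
    rid: int
    prompt_len: int
    prefilled: int = 0          # prompt tokens already prefilled
    done: bool = False

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len

def schedule_step(running: list, waiting: deque) -> list:
    """Build one batch: decode tokens first, then chunks of pending prefills."""
    batch, budget = [], TOKEN_BUDGET

    # 1) Every running request in the decode phase contributes exactly one token.
    for req in running:
        if req.in_decode and not req.done and budget > 0:
            batch.append((req.rid, "decode", 1))
            budget -= 1

    # 2) Spend the remaining budget on prefill chunks, possibly splitting a
    #    long prompt across several steps (the "chunked" part).
    while budget > 0 and waiting:
        req = waiting[0]
        chunk = min(budget, req.prompt_len - req.prefilled)
        batch.append((req.rid, "prefill", chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.in_decode:               # prompt fully prefilled -> starts decoding
            running.append(waiting.popleft())
    return batch

# Toy run: a 1200-token prompt is prefilled over multiple steps while an
# already-running request keeps decoding one token per step, unblocked.
running = [Request(rid=0, prompt_len=8, prefilled=8)]
waiting = deque([Request(rid=1, prompt_len=1200)])
for step in range(3):
    print(step, schedule_step(running, waiting))
```

The key property is that per-step decode latency stays bounded even while a very long prompt is being prefilled, since prefill work only consumes whatever budget the decodes leave over.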
Auditing Prompt Caching in Language Model APIs
- [Submitted on 11 Feb 2025]
- https://arxiv.org/abs/2502.07776
- Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
- Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users’ prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI’s embedding model is a decoder-only Transformer, which was previously not publicly known.
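The audit described above boils down to a timing hypothesis test. The sketch below shows the general shape of such a test, not the authors' exact procedure: `time_request` is a placeholder (simulated here so the snippet runs; a real audit would call the provider's API and measure time to first token), and a one-sided Mann-Whitney U test checks whether replayed prompts are answered significantly faster than fresh ones.

```python
import random
import statistics
from scipy.stats import mannwhitneyu  # one-sided nonparametric test

def time_request(prompt: str) -> float:
    """Placeholder for 'send `prompt` to an LLM API and return time-to-first-token
    (seconds)'. Simulated here so the sketch runs: repeated prompts hit a cache."""
    cached = prompt in time_request._seen
    time_request._seen.add(prompt)
    base = 0.12 if cached else 0.45       # simulated cache-hit vs. cache-miss latency
    return base + random.gauss(0, 0.02)
time_request._seen = set()

def audit_prompt_caching(prefix: str, trials: int = 50, alpha: float = 0.01) -> bool:
    """Return True if replayed prompts are answered significantly faster than
    fresh prompts, i.e. timing evidence consistent with prompt caching."""
    warm, cold = [], []
    for i in range(trials):
        victim = f"{prefix} request #{i}"
        time_request(victim)               # first call seeds the cache
        warm.append(time_request(victim))  # replay of an identical prompt
        cold.append(time_request(f"{prefix} fresh #{i}-{random.random()}"))
    # H0: warm and cold latencies come from the same distribution.
    _, p_value = mannwhitneyu(warm, cold, alternative="less")
    print(f"median warm={statistics.median(warm):.3f}s "
          f"cold={statistics.median(cold):.3f}s  p={p_value:.2e}")
    return p_value < alpha

print("caching detected:", audit_prompt_caching("Translate to French:"))
```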
More Readings:
Orca: A Distributed Serving System for Transformer-Based Generative Models
- Continuous Batching: https://www.usenix.org/system/files/osdi22-yu.pdf
- Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University
- Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Since these models generate a next token in an autoregressive manner, one has to run the model multiple times to process an inference request where each iteration of the model generates a single output token for the request. However, existing systems for inference serving do not perform well on this type of workload that has a multi-iteration characteristic, due to their inflexible scheduling mechanism that cannot change the current batch of requests being processed; requests that have finished earlier than other requests in a batch cannot return to the client, while newly arrived requests have to wait until the current batch completely finishes. In this paper, we propose iteration-level scheduling, a new scheduling mechanism that schedules execution at the granularity of iteration (instead of request) where the scheduler invokes the execution engine to run only a single iteration of the model on the batch. In addition, to apply batching and iteration-level scheduling to a Transformer model at the same time, we suggest selective batching, which applies batching only to a selected set of operations. Based on these two techniques, we have implemented a distributed serving system called ORCA, with additional designs for scalability to models with hundreds of billions of parameters. Our evaluation on a GPT-3 175B model shows that ORCA can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9× throughput improvement at the same level of latency.
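A toy sketch of the iteration-level scheduling idea (illustrative only, not ORCA's implementation, and it omits selective batching): the batch is re-formed after every model iteration, so completed requests return immediately and newly arrived ones join without waiting for the whole batch to drain.

```python
import random
from collections import deque

class Request:
    def __init__(self, rid: int, target_len: int):
        self.rid, self.target_len, self.generated = rid, target_len, 0

def run_iteration(batch):
    """Stand-in for one forward pass: every request in the batch emits one token."""
    for req in batch:
        req.generated += 1

def serve(requests, max_batch: int = 4):
    """Iteration-level scheduling: the batch is re-formed after every iteration."""
    waiting, running, finished = deque(requests), [], []
    while waiting or running:
        # Admit new requests as soon as a slot frees up (no waiting for the
        # whole batch to drain, unlike request-level batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        run_iteration(running)
        # Return finished requests immediately instead of holding them in the batch.
        still_running = []
        for req in running:
            (finished if req.generated >= req.target_len else still_running).append(req)
        running = still_running
    return finished

done = serve([Request(i, random.randint(2, 30)) for i in range(10)])
print([(r.rid, r.generated) for r in done])
```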
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- FlexGen: https://arxiv.org/pdf/2303.06865 [Submitted on 13 Mar 2023 (v1), last revised 12 Jun 2023 (this version, v2)]
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
- The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is publicly available.
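A toy sketch of the tiered-placement problem FlexGen optimizes, using a greedy heuristic instead of FlexGen's actual linear-programming formulation; the capacities, bandwidths, and tensor sizes below are made-up numbers.

```python
TIERS = [  # hypothetical capacities (GB) and bandwidths (GB/s), ordered fastest-first
    {"name": "gpu",  "capacity": 16,   "bandwidth": 900.0},
    {"name": "cpu",  "capacity": 128,  "bandwidth": 25.0},
    {"name": "disk", "capacity": 2048, "bandwidth": 2.0},
]

def place(tensors):
    """tensors: list of (name, size_gb, accesses_per_token). Greedily keeps the
    tensors that move the most bytes per generated token on the fastest tier and
    returns (placement, rough per-token transfer time for off-GPU tensors)."""
    free = {t["name"]: t["capacity"] for t in TIERS}
    placement, transfer_time = {}, 0.0
    for name, size, accesses in sorted(tensors, key=lambda t: -t[1] * t[2]):
        for tier in TIERS:
            if free[tier["name"]] >= size:
                free[tier["name"]] -= size
                placement[name] = tier["name"]
                if tier["name"] != "gpu":       # off-GPU data must be streamed in
                    transfer_time += size * accesses / tier["bandwidth"]
                break
    return placement, transfer_time

# Example: a model too large for the GPU spills to CPU memory; activations stay on GPU.
tensors = [("weights", 60, 1.0), ("kv_cache", 40, 1.0), ("activations", 2, 1.0)]
print(place(tensors))
```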
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
- NEO: https://arxiv.org/pdf/2411.01142
- [Submitted on 2 Nov 2024]
- Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
- Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on A10G GPU.
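The core decision in a GPU/CPU-offloading server like NEO is how much attention work to push to the host so that the pipelined GPU and CPU sides stay balanced. The sketch below illustrates that decision with a crude cost model and an exhaustive split search; the costs and the rule are assumptions for illustration, not NEO's load-aware scheduler.

```python
def split_batch(kv_lens, gpu_cost=1.0, cpu_cost=8.0):
    """kv_lens: per-request KV-cache length (proxy for attention cost).
    Tries offloading the k shortest-context requests to the CPU for every k and
    returns (step_time, gpu_indices, cpu_indices) minimizing the slower side,
    assuming the two sides overlap in a pipeline and per-token attention costs
    gpu_cost / cpu_cost (arbitrary illustrative units)."""
    order = sorted(range(len(kv_lens)), key=lambda i: kv_lens[i])
    best = None
    for k in range(len(kv_lens) + 1):
        cpu_set, gpu_set = order[:k], order[k:]
        cpu_time = sum(kv_lens[i] for i in cpu_set) * cpu_cost
        gpu_time = sum(kv_lens[i] for i in gpu_set) * gpu_cost
        step_time = max(cpu_time, gpu_time)   # pipelined: the slower side dominates
        if best is None or step_time < best[0]:
            best = (step_time, gpu_set, cpu_set)
    return best

# Example: long-context requests stay on the GPU; a few short ones move to the CPU.
print(split_batch([4096, 2048, 512, 512, 256, 128]))
```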
Efficient LLM Scheduling by Learning to Rank
- Shortest Job First: https://arxiv.org/pdf/2408.15792
- [Submitted on 28 Aug 2024]
- Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
- In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption – we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is publicly available.
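A minimal sketch of why relative ranking is enough: if a predictor can merely order requests by expected output length, serving the predicted-shortest first approximates SJF and cuts average completion time. The `predicted_rank_score` heuristic below is a stand-in for the paper's learned ranker, and the workload is synthetic.

```python
import random

def predicted_rank_score(prompt: str) -> float:
    """Hypothetical ranker: higher score = expected longer generation. A real
    system would use a small learned model over prompt features (learning to rank)."""
    return len(prompt) + random.uniform(-5, 5)

def schedule(queue):
    """Serve requests in order of predicted generation length (approximate SJF)."""
    return sorted(queue, key=predicted_rank_score)

def avg_completion_time(order, true_lens):
    """Average completion time if requests run back-to-back in `order`."""
    t, total = 0, 0
    for prompt in order:
        t += true_lens[prompt]
        total += t
    return total / len(order)

true_lens = {  # true output lengths in tokens, unknown to the scheduler
    "write a long open-ended essay about serving systems": 600,
    "what is 2 + 2?": 10,
    "summarize this paragraph briefly": 80,
}
queue = list(true_lens)  # FCFS = arrival order, with the long job first
print("FCFS avg completion:", avg_completion_time(queue, true_lens))
print("rank-based ~SJF    :", avg_completion_time(schedule(queue), true_lens))
```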
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
- [Submitted on 1 Aug 2024 (v1), last revised 14 Oct 2024 (this version, v2)]
- Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
- While the scaling laws of large language models (LLMs) training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-n, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets, and that smaller models paired with advanced inference algorithms yield Pareto-optimal cost-performance trade-offs. For instance, the Llemma-7B model, equipped with our novel tree search algorithm, consistently outperforms Llemma-34B with standard majority voting on the MATH benchmark across all FLOPs budgets. We hope these findings contribute to a broader understanding of inference scaling laws for LLMs.
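Two of the inference strategies compared above, majority voting and weighted voting, are easy to sketch. The sampler and verifier scores below are synthetic stand-ins; the point is only to show how extra inference compute (a larger n) is spent instead of a larger model.

```python
import random
from collections import Counter, defaultdict

def sample_answer():
    """Hypothetical sampler: returns (answer, verifier_score). The correct answer
    '42' is sampled 40% of the time and tends to receive a higher score."""
    if random.random() < 0.4:
        return "42", random.uniform(0.6, 1.0)
    return random.choice(["41", "43", "7"]), random.uniform(0.0, 0.7)

def majority_vote(samples):
    return Counter(ans for ans, _ in samples).most_common(1)[0][0]

def weighted_vote(samples):
    scores = defaultdict(float)
    for ans, score in samples:
        scores[ans] += score               # weight each vote by its verifier score
    return max(scores, key=scores.get)

n = 16                                     # the inference-compute knob: samples drawn
samples = [sample_answer() for _ in range(n)]
print("majority:", majority_vote(samples), " weighted:", weighted_vote(samples))
```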
Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing
- KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
- With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.
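A minimal sketch of the routing setup: a lightweight router assigns each query to a single model, trading quality against cost. The keyword heuristic and model table below are hypothetical; the paper studies learned routers over challenging reasoning tasks.

```python
MODELS = {  # hypothetical cost per 1K tokens (USD)
    "small-fast-model": 0.0002,
    "medium-reasoning-model": 0.003,
    "large-frontier-model": 0.03,
}

def route(query: str) -> str:
    """Pick a single model per query from a crude difficulty estimate; a learned
    router would replace this keyword heuristic with a classifier over query features."""
    hard_markers = ("prove", "step by step", "derive", "algorithm")
    if any(marker in query.lower() for marker in hard_markers):
        return "large-frontier-model"
    if len(query.split()) > 30:
        return "medium-reasoning-model"
    return "small-fast-model"

for q in [
    "What is the capital of France?",
    "Prove that the sum of two even numbers is even.",
    " ".join(["please"] * 40) + " summarize this long report carefully",
]:
    print(f"{route(q):<24} <- {q[:45]}")
```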
