Model Serving for Agents

Efficiency

In this session, our readings cover:

Readings: DEPLOYMENT & SERVING

Core Component: Production Infrastructure - Deploying and Serving Agents at Scale

Understanding the infrastructure and systems for deploying agents in production. Key Concepts: Model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, monitoring and interpretability
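One of the key concepts listed above, KV cache optimization, can be illustrated with a minimal sketch. This is a toy illustration in pure Python (all names are invented for illustration, not taken from vLLM or any real serving system): during autoregressive decoding, the key/value projections of past tokens are cached so each step only computes projections for the newly generated token instead of re-projecting the whole sequence.

```python
# Minimal sketch of KV caching in autoregressive decoding.
# Toy model: keys/values are single numbers per token; real systems
# cache per-layer, per-head tensors (e.g., vLLM's paged KV cache).

class KVCache:
    def __init__(self):
        self.keys = []    # one entry per past token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def project(token_id):
    # Stand-in for the model's K/V projection of one token.
    return (float(token_id), float(token_id) * 2.0)

def decode_step(cache, new_token):
    # Without a cache we would re-project ALL past tokens every step
    # (O(n) work per step); with the cache, only the new token.
    k, v = project(new_token)
    cache.append(k, v)
    # Attention would now read cache.keys / cache.values for context.
    return len(cache)

cache = KVCache()
for t in [101, 7, 42]:
    decode_step(cache, t)
print(len(cache))  # 3 cached key/value pairs after 3 decode steps
```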

Topic | Slide Deck | Previous Semester
Platform - Model Serving | W8.2-Model Serving-team6-t5 | 25course
More Model Serving - SGLang + Chunked Prefill | W12.2-Model-Serving | 25course
Model Serving - Efficiency Inference | W14.2.ModelServing | 25course

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
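Sarathi-Serve's central idea, chunked prefill, splits a long prompt's prefill into bounded chunks so that the decode steps of other in-flight requests can be interleaved into every batch, keeping per-iteration latency predictable. A minimal scheduling sketch (the token budget, field names, and chunk policy here are illustrative assumptions, not the paper's exact algorithm):

```python
# Sketch of chunked-prefill batching: each iteration has a token
# budget; ongoing decodes (1 token each) are admitted first, and
# the remaining budget is filled with a chunk of a pending prefill.

TOKEN_BUDGET = 8  # illustrative per-iteration token budget

def schedule_iteration(decode_reqs, prefill_queue):
    batch = [("decode", r) for r in decode_reqs]  # 1 token per decode
    budget = TOKEN_BUDGET - len(batch)
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining"])
        req["remaining"] -= chunk
        batch.append(("prefill", req["id"], chunk))
        if req["remaining"] == 0:
            prefill_queue.pop(0)  # prefill done; request can start decoding
    return batch

decodes = ["a", "b", "c"]                   # 3 in-flight decode requests
prefills = [{"id": "d", "remaining": 20}]   # one 20-token prompt
iters = 0
while prefills:
    schedule_iteration(decodes, prefills)
    iters += 1
print(iters)  # 20 prompt tokens at 5-token chunks -> 4 iterations
```

Note how the prompt never monopolizes an iteration: the three decodes run in every batch, which is the throughput-latency tradeoff the paper targets.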

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

More Readings:

Multiple ML systems readings

Neo: https://arxiv.org/pdf/2411.01142

Orca: A Distributed Serving System for Transformer-Based Generative Models

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Shortest Job First: https://arxiv.org/pdf/2408.15792

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing