Model Serving for Agents

Efficiency

In this session, our readings cover:

Readings: DEPLOYMENT & SERVING

Core Component: Production Infrastructure - Deploying and Serving Agents at Scale

Understanding the infrastructure and systems for deploying agents in production. Key concepts: model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, and monitoring and interpretability.
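To make these concepts concrete before diving into the readings, the sketch below shows batched generation with vLLM's offline Python API. It is a minimal, hypothetical illustration rather than part of the course materials: the model name is a placeholder, and `enable_chunked_prefill` assumes a vLLM version that exposes that engine argument.

```python
# Minimal sketch of serving a model with vLLM's offline Python API.
# Assumes `pip install vllm`, a CUDA GPU, and a vLLM version that exposes
# the enable_chunked_prefill engine argument; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    enable_chunked_prefill=True,   # split long prompt prefills into smaller chunks
    gpu_memory_utilization=0.90,   # VRAM fraction for weights + paged KV cache
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain KV cache reuse in one sentence.",
    "Why does continuous batching improve serving throughput?",
]

# vLLM schedules both requests together (continuous batching) and manages
# their KV caches in fixed-size blocks via PagedAttention.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

For online serving, the same engine is typically exposed through vLLM's OpenAI-compatible HTTP server (`vllm serve <model>` in recent releases) rather than the offline API shown here.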

| Topic | Slide Deck | Previous Semester |
| --- | --- | --- |
| Platform - Model Serving | W8.2-Model Serving-team6-t5 | 25course |
| More Model Serving - SGLang + Chunked Prefill | W12.2-Model-Serving | 25course |
| Model Serving - Inference Efficiency | W14.2.ModelServing | 25course |
| Model Interpretability for FM | W13.2-GenAI-Interpretability | 25course |
| LLM Interpretability, Trust and Knowledge Conflicts | W10-T6-LLMInterpretibility | 24course |

ML systems readings:

Auditing Prompt Caching in Language Model APIs

More Readings:

Orca: A Distributed Serving System for Transformer-Based Generative Models

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Neo: https://arxiv.org/pdf/2411.01142

Shortest Job First: https://arxiv.org/pdf/2408.15792

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing