More Model Serving Readings - SGLang + Chunked Prefill


In this session, our readings cover:

Readings on Efficient Model Serving:

SGLang: Efficient Execution of Structured Language Model Programs

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

More Readings:

Orca: A Distributed Serving System for Transformer-Based Generative Models

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Neo: https://arxiv.org/pdf/2411.01142

Shortest Job First: https://arxiv.org/pdf/2408.15792
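To give a flavor of the chunked-prefill idea in the Sarathi-Serve reading above: instead of running a long prompt's prefill as one monolithic batch (which stalls ongoing decodes), the scheduler splits prefills into chunks and piggybacks them onto decode iterations under a per-iteration token budget. The sketch below is a toy illustration under assumed names (`Request`, `schedule_step`, the budget value); it is not code from any of the listed systems.

```python
# Toy sketch of chunked-prefill scheduling in the spirit of Sarathi-Serve.
# All names and the token budget are illustrative, not from the paper's code.

from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prefill_left: int        # prompt tokens not yet prefilled
    decoding: bool = False   # True once prefill is finished

def schedule_step(requests, token_budget):
    """Build one iteration's batch: admit all decodes first (one token
    each), then spend the leftover budget on prefill chunks, so decode
    latency is never stalled behind a long prefill."""
    batch = []
    budget = token_budget
    # 1) ongoing decodes generate one token each
    for r in requests:
        if r.decoding and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    # 2) remaining budget goes to prefill chunks
    for r in requests:
        if not r.decoding and budget > 0:
            chunk = min(r.prefill_left, budget)
            batch.append((r.rid, "prefill", chunk))
            r.prefill_left -= chunk
            budget -= chunk
            if r.prefill_left == 0:
                r.decoding = True   # switch to decode next iteration
    return batch
```

With a budget of 8 tokens, one decoding request, and a fresh 20-token prompt, the first iteration batches one decode token plus a 7-token prefill chunk; the prompt finishes prefilling over the next two iterations instead of monopolizing a whole batch.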