More Model Serving Readings - SGLang + Chunked Prefill


In this session, our readings cover:

Readings on Efficient Model Serving:

SGLang: Efficient Execution of Structured Language Model Programs

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
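The core idea behind Sarathi-Serve is chunked prefill: rather than running a long prompt's prefill as one monolithic pass (which stalls ongoing decodes), each iteration gets a fixed token budget, ongoing decodes are admitted first, and the leftover budget is filled with a chunk of a pending prefill. The sketch below illustrates that scheduling idea only; the function, budget value, and request shapes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of chunked-prefill scheduling (not the actual
# Sarathi-Serve code). Each iteration has a fixed token budget; decode
# requests cost one token each, and the remaining budget is filled with
# a chunk of a pending prompt's prefill tokens.

def schedule_iteration(decode_reqs, prefill_queue, token_budget=512):
    """Return (decode batch, prefill chunks) for one batch iteration."""
    batch_decode = list(decode_reqs)           # 1 token per decode request
    budget = token_budget - len(batch_decode)  # budget left for prefill work
    prefill_chunks = []
    for req_id, remaining_tokens in prefill_queue:
        if budget <= 0:
            break
        chunk = min(remaining_tokens, budget)  # take only what fits
        prefill_chunks.append((req_id, chunk))
        budget -= chunk
    return batch_decode, prefill_chunks

decodes = ["d0", "d1", "d2"]
prefills = [("p0", 1000), ("p1", 300)]
batch, chunks = schedule_iteration(decodes, prefills, token_budget=512)
# "p0"'s 1000-token prefill is split: only 509 tokens fit this iteration,
# so decodes keep making progress every iteration instead of waiting.
```

Because a long prefill is spread across several iterations, decode latency stays bounded by the per-iteration budget rather than by the longest prompt in the system.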

More Readings:

Splitwise: https://arxiv.org/pdf/2311.18677


DistServe: https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf

Neo: https://arxiv.org/pdf/2411.01142

Shortest Job First: https://arxiv.org/pdf/2408.15792