Platform - Model Serving

In this session, our readings cover:

Required Readings:

Efficient Memory Management for Large Language Model Serving with PagedAttention

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

A Survey on Large Language Model Acceleration based on KV Cache Management

More Readings:

Multiple ML systems readings