Model serving - Efficiency + PPO

Customization Serving

PPO readings

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Model serving readings:

Splitwise: https://arxiv.org/pdf/2311.18677

DistServe: https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf

Orca: A Distributed Serving System for Transformer-Based Generative Models

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

More readings

Generative AI on the Edge: Architecture and Performance Evaluation

Efficient Transformers: A Survey

A Survey on Model Compression for Large Language Models

Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption