Model Serving - Efficiency + PPO

Customization Serving

Required Readings:

A Survey on Model Compression for Large Language Models

Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

PPO Readings:

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

More Readings:

Generative AI on the Edge: Architecture and Performance Evaluation