Agent Efficiency - Model Serving + PPO

Customization, Serving

In this session, our readings cover:

Required Readings:

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Efficient Transformers: A Survey
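As a companion to the FlashAttention readings, here is a minimal sketch contrasting naive attention, which materializes the full sequence-by-sequence score matrix, with PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style tiled kernel on supported GPUs. The shapes and tolerances are illustrative assumptions, not values from the papers.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Scores for every query/key pair are written out: O(S^2) activation memory.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

batch, heads, seq, dim = 2, 8, 1024, 64
q, k, v = (torch.randn(batch, heads, seq, dim) for _ in range(3))

out_naive = naive_attention(q, k, v)
# The fused kernel computes the same exact result tile by tile, never storing
# the full attention matrix in HBM (the IO-awareness the FlashAttention papers target).
out_fused = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(out_naive, out_fused, atol=1e-4))
```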

PPO Readings:

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
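For orientation before these readings, here is a minimal sketch of the two objectives they compare: PPO's clipped surrogate loss and the DPO loss. The tensor names and the beta/epsilon values are illustrative assumptions, not settings taken from the papers.

```python
import torch
import torch.nn.functional as F

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # PPO: limit how far the policy ratio can move the objective per update.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO: widen the gap between chosen and rejected responses, measured
    # relative to a frozen reference policy, with no explicit reward model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```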

More Readings:

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
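DeepSeek-R1 is trained with GRPO, which replaces a learned critic with group-relative advantages: sample several responses per prompt, score each with a rule-based reward, and normalize the rewards within the group. Below is a minimal sketch of that advantage computation; the reward values are made up for illustration.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: shape (num_prompts, group_size), one scalar reward per sampled response.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # prompt 1: two correct answers
                        [0.0, 0.0, 1.0, 0.0]])  # prompt 2: one correct answer
print(group_relative_advantages(rewards))
```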