Agents Optimization
In this session, our readings cover:
Required Readings: MODEL TRAINING & OPTIMIZATION
Core Component: Improving the Agent Brain - Training, Fine-tuning, and Optimization
Techniques for improving model capabilities and efficiency.
Key Concepts (training): Data preparation, instruction tuning, LoRA/DoRA, parameter-efficient fine-tuning, scaling laws, efficiency optimization
Key Concepts (safety & alignment): Evaluation frameworks, guardrails, alignment (RLHF, PPO, DPO), risk assessment, jailbreaking defense, fairness, bias mitigation, toxicity prevention, agent safety protocols
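As a quick reference for the DPO item in the alignment concepts above, the loss for one preference pair can be sketched in a few lines. This is an illustrative toy, not code from any slide deck; the function name and the β = 0.1 default are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit reward margins relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * margin difference): minimized when the policy
    # prefers the chosen response more strongly than the reference does.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Unlike PPO-based RLHF, no reward model or on-policy sampling is needed: the preference dataset and the frozen reference model define the objective directly.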
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Platform - Model Customization (Instruction Tuning/LoRA) | W8.1-LoRA-Team5 | 25course |
| LLM Alignment - PPO | W11.2-team6-PPO | 25course |
| LLM Post-training | W14.3.DPO | 25course |
| Open Source LLM - Mistral Data Preparation | W4-OpenSourceLLM | 24course |
| Scaling Law and Efficiency | W11-ScalinglawEfficientLLM | 24course |
| LLM Fine Tuning | W14-LLM-FineTuning | 24course |
| Model Editing and Disgorgement | W10-T5-ModelEditing | 24course |
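To make the LoRA/fine-tuning rows above concrete, here is a minimal NumPy sketch of a low-rank adapter on a single linear layer. The toy sizes, variable names, and scaling convention (α / r) are assumptions for illustration; real adapters sit inside each attention/MLP projection of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: input/output width 16, rank 4.
d_in, d_out, r, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus low-rank update; only A and B (r * (d_in + d_out)
    # parameters) would receive gradients during fine-tuning.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer reproduces the frozen layer
# exactly at the start of training.
```

The parameter saving is the point: here A and B together hold 128 values versus 256 in W, and the gap widens rapidly at realistic widths.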
2025 HIGH-IMPACT PAPERS on this topic
- c. Kimi k1.5: Scaling Reinforcement Learning with LLMs (January 2025)
- Link: https://arxiv.org/abs/2501.12599
Contribution: Alternative approach to scaling reasoning via RL
- Complements DeepSeek-R1 with different architectural choices
- Emphasizes scaling strategies for RL training
- Addresses computational efficiency in large-scale RL
- Related: Kimi K2.5: Visual Agentic Intelligence (submitted 2 Feb 2026)
- Abstract: Kimi K2.5 is an open-source multimodal agentic model designed to advance general agentic intelligence. It emphasizes joint optimization of text and vision so that the two modalities enhance each other, using techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. On this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show state-of-the-art results across coding, vision, reasoning, and agentic tasks, and Agent Swarm reduces latency by up to 4.5× over single-agent baselines. The post-trained Kimi K2.5 checkpoint is released to support future research and real-world applications of agentic intelligence.
- d. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
- Referenced in: https://github.com/zjunlp/LLMAgentPapers
- Taxonomy of agentic RL approaches
- Training methods: GRPO, PPO variations, RLVR
- Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
- Challenges: Reward hacking, sample efficiency, exploration-exploitation
- Applications: Reasoning, planning, multi-agent coordination
- Key Papers Covered:
- GRPO (Group Relative Policy Optimization)
- History Resampling Policy Optimization (SRPO)
- PVPO (Pre-Estimated Value-Based Policy Optimization)
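As a rough illustration of what separates GRPO from PPO-style methods in the list above: GRPO drops the learned value baseline and scores each sampled response against its own group's statistics, so no critic network is needed. A minimal sketch (function name and ε guard are assumptions):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages over one group of sampled responses.

    `rewards` holds the scalar reward of each of the G responses sampled
    for the same prompt; each advantage is the reward standardized by the
    group mean and standard deviation (eps guards against a zero-variance
    group).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

These advantages then plug into a PPO-style clipped policy-gradient objective; the group baseline replaces the value-network estimate, which is what makes the method attractive for large-scale RL training.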
