Agent Brain - Reasoning
- Notes: world model
In this session, our readings cover:
Required Readings: REASONING & COGNITION
Core Component: Advanced Reasoning Capabilities of the Agent Brain
Exploring how agents reason through complex problems, including code generation, mathematical reasoning, and domain-specific reasoning.
Key Concepts: Chain-of-thought reasoning, code generation, mathematical reasoning, self-examination, test-time compute scaling
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Advanced LLM - Code Reasoning | W4.1-Gen AI-code | 25course |
| Advanced LLM - Math Reasoning | W4.2-LLM-Math-Reasoning | 25course |
| Inference Test Time Scaling Law | Week14.1-T5-Test-Time-Scaling | 25course |
| Self-exam LLM and Reasoning | W12-team-2-self-exam-LLM | 24course |
2025 HIGH-IMPACT PAPERS on this topic
- a. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025)
- Authors: DeepSeek-AI (198 authors)
- Venue: Nature (September 2025) + arXiv
- arXiv: https://arxiv.org/abs/2501.12948
- Nature: https://www.nature.com/articles/s41586-025-09422-z
- HuggingFace: https://huggingface.co/papers/2501.12948
- GitHub: https://github.com/deepseek-ai/DeepSeek-R1
- Pure RL approach: shows that reasoning capability emerges without supervised demonstrations
- Remarkable results: AIME 2024 pass@1 accuracy jumped from 15.6% to 71.0%, reaching 86.7% with majority voting, matching OpenAI o1
- Emergent behaviors: Self-reflection, verification, strategy adaptation, “aha moments”
- Open source: Released models from 1.5B to 671B parameters
- Industry impact: Triggered the “reasoning model” race across all major labs
- Key Innovation: Demonstrates that advanced reasoning patterns emerge naturally through GRPO (Group Relative Policy Optimization) without human-labeled trajectories. The paper also shows that performance scales with thinking time: the model learns to “think longer” on harder problems.
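The core mechanism behind GRPO named above can be illustrated with a small sketch (an assumption-laden simplification: it shows only the group-relative advantage computation and omits the clipped policy-gradient objective and KL penalty used in the paper). For each prompt, a group of responses is sampled, scored with a verifiable reward, and rewards are normalized within the group, so no learned value network (critic) is needed:

```python
# Sketch of GRPO's group-relative advantage (simplified; hypothetical helper,
# not the paper's implementation).
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled response relative to its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:  # all responses scored the same; no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: 4 sampled answers to one problem, reward 1.0 if correct else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# → [1.0, -1.0, -1.0, 1.0]: correct answers are reinforced, wrong ones penalized
```

Because the baseline is the group mean rather than a critic's value estimate, the method stays cheap at the scale of RL training runs like DeepSeek-R1's.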
- b. Reasoning Language Models: A Blueprint (January 2025)
- https://arxiv.org/abs/2501.11223
- Surveys reinforcement learning approaches for reasoning
- Connects to DeepSeek-R1, Kimi k1.5, and other reasoning models
- Comprehensive taxonomy of RLVR (Reinforcement Learning with Verifiable Rewards)
- Discusses emergent reasoning patterns and distillation to smaller models
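The RLVR idea mentioned above can be sketched in a few lines (an illustrative assumption, not the blueprint's actual code): the reward is computed by a deterministic checker rather than a learned reward model, so it cannot be gamed by plausible-sounding but wrong outputs. Here the assumed convention is that the model ends its chain of thought with a final `Answer:` line:

```python
# Minimal verifiable-reward sketch for math-style tasks (hypothetical
# convention: the completion ends with "Answer: <value>").
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the gold answer, else 0.0."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

reward = verifiable_reward("Let x = 2, so 3x + 1 = 7.\nAnswer: 7", "7")
# → 1.0
```

Binary rewards of this kind are what make the "verifiable" setting tractable for large-scale RL, and they are also the scoring signal behind pass@1 and majority-voting metrics such as those reported for DeepSeek-R1.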
- c. Kimi k1.5: Scaling Reinforcement Learning with LLMs (January 2025)
- Link: https://arxiv.org/abs/2501.12599
- Contribution: An alternative approach to scaling reasoning via RL
- Complements DeepSeek-R1 with different architectural choices
- Emphasizes scaling strategies for RL training
- Addresses computational efficiency in large-scale RL
