Agent - Planning
- SlideDeck: 2026-SP-W8.2-AgenticPlanReason.pdf
- Version: current
- Notes: Agent planning
In this session, our readings cover:
Required Readings: PLANNING & ORCHESTRATION
a. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
- Referenced in: https://github.com/zjunlp/LLMAgentPapers
- Taxonomy of agentic RL approaches
- Training methods: GRPO, PPO variations, RLVR
- Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
- Challenges: Reward hacking, sample efficiency, exploration-exploitation
- Applications: Reasoning, planning, multi-agent coordination
- Key Papers Covered:
- GRPO (Group Relative Policy Optimization)
- SRPO (History Resampling Policy Optimization)
- PVPO (Pre-Estimated Value-Based Policy Optimization)
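The common idea behind the group-based policy-optimization methods listed above (GRPO and its relatives) is to replace a learned value model with a group-relative baseline. A minimal sketch of that advantage computation, with illustrative names and numbers of my own (not from the survey):

```python
# Sketch of the group-relative advantage at the heart of GRPO
# (Group Relative Policy Optimization): sample a group of G responses
# per prompt, score each one, and normalize each reward by the group's
# mean and standard deviation -- no separate value network needed.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one prompt, scored by a
# hypothetical binary verifier (as in RLVR-style setups).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses scoring above the group mean get positive advantages and are reinforced; those below get negative advantages, which is what makes the method sensitive to reward hacking within a group.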
b. End-to-End Test-Time Training for Long Context
- Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
- Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
- Comments: Code: this https URL
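The core mechanism in the abstract above is a test-time loop: while reading the context, keep doing next-token prediction and update the model's weights, so the context is compressed into the weights instead of a growing KV cache. A toy sketch of that control flow, using an illustrative character-bigram model in place of the paper's Transformer (model, data, and learning rate are my assumptions, not the paper's):

```python
# Toy test-time training (TTT) loop: read a context token by token,
# take one SGD step of next-token cross-entropy per position, and
# keep only the updated weights. This illustrates the loop's shape,
# not the paper's architecture or meta-learned initialization.
import math
from collections import defaultdict

def ttt_read(context, lr=0.5):
    """Read `context` online, updating bigram logits as we go."""
    logits = defaultdict(float)  # (prev, next) -> score; the "weights"
    losses = []
    vocab = set(context)
    for prev, nxt in zip(context, context[1:]):
        # Forward pass: softmax over the vocabulary given `prev`.
        zs = {t: logits[(prev, t)] for t in vocab}
        m = max(zs.values())
        exps = {t: math.exp(z - m) for t, z in zs.items()}
        total = sum(exps.values())
        probs = {t: e / total for t, e in exps.items()}
        losses.append(-math.log(probs[nxt]))
        # Backward pass: one SGD step on the cross-entropy loss.
        for t in vocab:
            grad = probs[t] - (1.0 if t == nxt else 0.0)
            logits[(prev, t)] -= lr * grad
    return logits, losses

# On a repetitive context, the loss on later tokens falls as the
# pattern is absorbed into the weights.
weights, losses = ttt_read("abababababab")
```

Because the state carried forward is a fixed-size set of weights rather than a cache that grows with context length, inference latency stays constant regardless of context length, which is the property the abstract highlights relative to full attention.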
