Agent - Planning / Test-time scaling
- Notes: Agent planning
In this session, our readings cover:
Required Readings: PLANNING & ORCHESTRATION
Core Component: Agent Planning Module - Goal Decomposition and Strategy Formation
How agents break down complex tasks, form plans, and orchestrate multi-step workflows, leveraging world models when available. Key Concepts: Task decomposition, planning algorithms (with/without world models), agent workflows, domain-specific planning strategies, plan-then-act vs. continuous replanning
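The plan-then-act vs. continuous-replanning distinction can be made concrete with a small control-loop sketch. This is a minimal illustration, not any specific framework's API; `plan_fn` and `execute` are hypothetical stand-ins for an LLM planner and a tool executor:

```python
from typing import Callable, List

Planner = Callable[[str, List[str]], List[str]]   # (goal, history) -> remaining steps
Executor = Callable[[str], bool]                  # step -> success?

def plan_then_act(goal: str, plan_fn: Planner, execute: Executor) -> bool:
    """Plan once up front, then execute the whole plan without revisiting it."""
    plan = plan_fn(goal, [])
    return all(execute(step) for step in plan)

def continuous_replanning(goal: str, plan_fn: Planner, execute: Executor,
                          max_steps: int = 20) -> bool:
    """Re-plan after every action, conditioning on observed outcomes."""
    history: List[str] = []
    for _ in range(max_steps):
        plan = plan_fn(goal, history)   # fresh plan given what actually happened
        if not plan:                    # empty plan signals the goal is reached
            return True
        ok = execute(plan[0])           # commit to the first step only
        history.append(f"{plan[0]}: {'ok' if ok else 'failed'}")
    return False
```

Plan-then-act is cheaper (one planning call) but brittle when a step fails; continuous replanning pays a planning call per step in exchange for robustness.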
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Agent - Planning / World Model | W10.1-Team 3-Planning | 25course |
| Test time scaling | Week14.1-T5-Test-Time-Scaling | 25course |
| Platform - Prompt Engineering Tools / Compression | W5.1.Team5-Prompt | 25course |
| Prompt Engineering | W11-team-2-prompt-engineering-2 | 24course |
| LLM Alignment - PPO | W11.2-team6-PPO | 25course |
| LLM Post-training | W14.3.DPO | 25course |
| Scaling Law and Efficiency | W11-ScalinglawEfficientLLM | 24course |
| LLM Fine Tuning | W14-LLM-FineTuning | 24course |
2025 HIGH-IMPACT PAPERS on this topic
- a. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
- Referenced in: https://github.com/zjunlp/LLMAgentPapers
- Taxonomy of agentic RL approaches
- Training methods: GRPO, PPO variations, RLVR
- Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
- Challenges: Reward hacking, sample efficiency, exploration-exploitation
- Applications: Reasoning, planning, multi-agent coordination
- Key Papers Covered:
- GRPO (Group Relative Policy Optimization); a sketch of its group-relative advantage follows this list
- History Resampling Policy Optimization (SRPO)
- PVPO (Pre-Estimated Value-Based Policy Optimization)
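The estimator at the heart of GRPO fits in a few lines: each response's reward is normalized against the mean and standard deviation of the group sampled for the same prompt, which removes the need for a learned critic. A minimal sketch (the binary verifier rewards in the example are an assumption, echoing RLVR-style training; the full clipped-ratio objective is omitted):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO's group-relative advantage: z-score each response's reward
    within its own sampled group, replacing a learned value network."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Toy example: 4 responses sampled for one prompt, scored 1 (verified correct) or 0.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))
# -> [ 1. -1. -1.  1.]   correct answers receive positive advantage
```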
- b. EnCompass: Separating Search from Agent Workflows (December 2025)
- arXiv: https://arxiv.org/abs/2512.03571
- Press: https://techxplore.com/news/2025-12-ai-agents-results-large-language.html
- Key Innovation: Separates the search strategy from the workflow code
- Performance: 15-40% accuracy boost on code repository translation
- Search strategies: backtracking, parallel exploration, beam search (best: two-level beam search); a generic sketch follows below
- Use cases: code translation, digital grid transformation rules
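Because EnCompass decouples the search strategy from the workflow itself, a generic search procedure can be swapped in over workflow states. Below is a minimal single-level beam-search sketch; the paper's best-performing two-level variant and its scoring function are specific to the paper, so `expand` and `score` here are hypothetical LLM-backed callables:

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")  # a partial workflow state (e.g., a partially translated repo)

def beam_search(start: S,
                expand: Callable[[S], Iterable[S]],
                score: Callable[[S], float],
                beam_width: int = 3, depth: int = 4) -> S:
    """Keep only the `beam_width` best-scoring partial solutions at each
    depth, trading breadth of exploration against the cost of LLM calls."""
    beam: List[S] = [start]
    for _ in range(depth):
        children = [c for s in beam for c in expand(s)]
        if not children:
            break
        beam = heapq.nlargest(beam_width, children, key=score)
    return max(beam, key=score)
```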
- c. Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling (December 2025)
- Link: https://arxiv.org/abs/2512.14474
- Two-Phase Paradigm (a minimal sketch follows after this list):
- Modeling Phase: LLM constructs explicit model (entities, state variables, actions, constraints)
- Solution Phase: Generate plan based on explicit model
- Reduces constraint violations across all tested domains (listed below)
- Outperforms Chain-of-Thought and ReAct
- Critical finding: Many planning failures stem from representational deficiencies, not reasoning limitations
- Domains tested: medical scheduling, route planning, resource allocation, logic puzzles, procedural synthesis
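One way to picture the two-phase paradigm: phase 1 has the LLM write down an explicit model, and phase 2 can then mechanically check a candidate plan against that model's constraints instead of trusting free-form reasoning. The class and field names below are illustrative assumptions, not the paper's actual representation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]

@dataclass
class ProblemModel:
    """Phase 1 artifact: entities, state variables, actions, constraints."""
    entities: List[str]
    init_state: State
    actions: Dict[str, Callable[[State], State]]   # action name -> state transition
    constraints: List[Callable[[State], bool]]     # invariants valid states must keep

def check_plan(model: ProblemModel, plan: List[str]) -> bool:
    """Phase 2: simulate the plan on the explicit model and reject it at the
    first constraint violation."""
    state = dict(model.init_state)
    for name in plan:
        state = model.actions[name](state)
        if not all(ok(state) for ok in model.constraints):
            return False
    return True
```

This matches the paper's critical finding: once the representation is explicit, many "reasoning" failures become checkable constraint violations.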
More Readings:
Agent Planning with World Knowledge Model
- Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
- Submitted 23 May 2024 (v1); last revised 3 Jan 2026 (v4)
- NeurIPS 2024
- Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, the paper introduces a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, the agent model is steered to self-synthesize knowledge from both expert and sampled trajectories; the WKM then provides prior task knowledge to guide global planning and dynamic state knowledge to assist local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs (Mistral-7B, Gemma-7B, and Llama-3-8B) demonstrate superior performance compared to various strong baselines. The analysis also illustrates that the WKM effectively alleviates the blind trial-and-error and hallucinatory-action issues, providing strong support for the agent's understanding of the world. Other interesting findings: 1) instance-level task knowledge generalizes better to unseen tasks, 2) a weak WKM can guide a strong agent model's planning, and 3) unified WKM training has promising potential for further development. Code is available; see the paper for the link.
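The abstract's split between prior task knowledge (global) and dynamic state knowledge (local) suggests an agent loop roughly like the sketch below. Every callable is a hypothetical stand-in for the trained agent model, the WKM, and the environment; the actual training and inference pipeline is in the paper:

```python
from typing import Callable, List

def wkm_guided_episode(task: str,
                       task_knowledge: Callable[[str], str],
                       state_knowledge: Callable[[List[str]], str],
                       agent_step: Callable[[str, str, str], str],
                       env_step: Callable[[str], str],
                       max_steps: int = 15) -> List[str]:
    """Global planning is conditioned on prior task knowledge (computed once);
    local planning refreshes dynamic state knowledge at every step."""
    prior = task_knowledge(task)                     # WKM: prior task knowledge
    trajectory: List[str] = []
    for _ in range(max_steps):
        dynamic = state_knowledge(trajectory)        # WKM: dynamic state knowledge
        action = agent_step(task, prior, dynamic)    # agent picks the next action
        if action == "finish":
            break
        trajectory.append(f"{action} -> {env_step(action)}")
    return trajectory
```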
