Test time scaling
- SlideDeck: 2026-SP-W8.1-Test-Time-Scaling.pdf
- Version: current
- Notes: Agents planning
In this session, our readings cover:
Required Readings: TTS &
Core Component: Agent Planning Module - Goal Decomposition and Strategy Formation
How agents break down complex tasks, form plans, and orchestrate multi-step workflows, leveraging world models when available. Key Concepts: Task decomposition, planning algorithms (with/without world models), agent workflows, domain-specific planning strategies, plan-then-act vs. continuous replanning
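To make the plan-then-act vs. continuous replanning contrast concrete, here is a minimal sketch under stated assumptions. The 1-D world, `make_plan`, and `slippery_env` are all hypothetical illustrations invented for this example, not taken from any of the readings. The environment pushes the agent back by two cells at time step 2, so a plan computed once goes stale.

```python
from typing import Callable, List

Env = Callable[[int, int, int], int]  # (position, action, t) -> new position

def make_plan(position: int, goal: int) -> List[int]:
    """Hypothetical planner: a list of unit moves from position toward goal."""
    step = 1 if goal > position else -1
    return [step] * abs(goal - position)

def slippery_env(position: int, action: int, t: int) -> int:
    """Hypothetical environment: at time step 2 the agent slips back by 2."""
    new_position = position + action
    return new_position - 2 if t == 2 else new_position

def plan_then_act(start: int, goal: int, env: Env) -> int:
    """Plan once up front, then execute every action without observing."""
    position = start
    for t, action in enumerate(make_plan(start, goal)):
        position = env(position, action, t)
    return position

def continuous_replanning(start: int, goal: int, env: Env,
                          max_steps: int = 20) -> int:
    """Replan from the observed position before every single action."""
    position = start
    for t in range(max_steps):
        plan = make_plan(position, goal)
        if not plan:  # empty plan means the goal is reached
            break
        position = env(position, plan[0], t)
    return position

final_open_loop = plan_then_act(0, 5, slippery_env)
final_closed_loop = continuous_replanning(0, 5, slippery_env)
```

In this toy setting the one-shot plan ends short of the goal, while the replanning loop recovers from the slip at the cost of calling the planner at every step. That extra planner cost is the usual trade-off between the two strategies.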
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Agent - Planning / World Model | W10.1-Team 3-Planning | 25course |
| Test time scaling | Week14.1-T5-Test-Time-Scaling | 25course |
| Platform - Prompting Engineering Tools / Compression | W5.1.Team5-Prompt | 25course |
| Prompt Engineering | W11-team-2-prompt-engineering-2 | 24course |
| LLM Alignment - PPO | W11.2-team6-PPO | 25course |
| LLM Post-training | W14.3.DPO | 25course |
| Scaling Law and Efficiency | W11-ScalinglawEfficientLLM | 24course |
| LLM Fine Tuning | W14-LLM-FineTuning | 24course |
2025 HIGH-IMPACT PAPERS on this topic
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, Chen Ma
- As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions. Our repository is available on this https URL
a. EnCompass: Separating Search from Agent Workflows (December 2025)
- arXiv: https://arxiv.org/abs/2512.03571
- Press: https://techxplore.com/news/2025-12-ai-agents-results-large-language.html
Key Innovation: Separates search strategy from workflow code
- Performance: 15-40% accuracy boost on code repository translation
- Search strategies: Backtracking, parallel exploration, beam search (best: two-level beam search)
Use Cases: Code translation, digital grid transformation rules
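The idea of keeping the search strategy separate from the workflow code can be sketched as follows. This is a toy illustration and not the EnCompass API: the `Step` type, `beam_search`, `greedy`, and the `SCORES` table are all assumptions made up for the example. The workflow only declares what candidates each step can produce, and the search strategy is swapped without touching it.

```python
from typing import Callable, List, Sequence

# A workflow step maps a partial result to its candidate continuations.
Step = Callable[[str], List[str]]
Scorer = Callable[[str], float]

def beam_search(steps: Sequence[Step], score: Scorer,
                start: str, width: int) -> str:
    """Generic strategy: keep the `width` best partial results per step."""
    beam = [start]
    for step in steps:
        candidates = [nxt for state in beam for nxt in step(state)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:width]
    return max(beam, key=score)

def greedy(steps: Sequence[Step], score: Scorer, start: str) -> str:
    """Greedy search is beam search with a beam of width 1."""
    return beam_search(steps, score, start, width=1)

# Hypothetical workflow: each step appends one of two tokens.
def extend(state: str) -> List[str]:
    return [state + "a", state + "b"]

# Hypothetical scores, chosen so the locally best first move is a trap.
SCORES = {"a": 2.0, "b": 1.0, "aa": 2.0, "ab": 3.0, "ba": 1.0, "bb": 10.0}

def score(state: str) -> float:
    return SCORES.get(state, 0.0)

workflow = [extend, extend]
greedy_result = greedy(workflow, score, "")
beam_result = beam_search(workflow, score, "", width=2)
```

Here the `workflow` list is unchanged between the two runs; only the strategy differs. With these made-up scores, greedy search commits to the locally best first token and misses the best final result, while a beam of width 2 keeps the alternative alive long enough to find it.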
b. Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling (December 2025)
- Link: https://arxiv.org/abs/2512.14474
Two-Phase Paradigm:
- Modeling Phase: LLM constructs explicit model (entities, state variables, actions, constraints)
- Solution Phase: Generate plan based on explicit model
- Reduces constraint violations across medical scheduling, route planning, resource allocation, logic puzzles
- Outperforms Chain-of-Thought and ReAct
- Critical finding: Many planning failures stem from representational deficiencies, not reasoning limitations
- Domains Tested: Medical scheduling, route planning, resource allocation, logic puzzles, procedural synthesis
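The two-phase paradigm above can be sketched as a small data structure plus a solver. This is a minimal sketch, not the paper's implementation: in the paper both phases are carried out by an LLM, whereas here the model is written by hand and the solution phase is replaced by brute-force enumeration. `ProblemModel`, `solve`, and the scheduling example are hypothetical names and data.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Optional

Plan = Dict[str, int]  # entity -> value of its state variable (a time slot)

@dataclass
class ProblemModel:
    """Phase 1 output: an explicit model of the problem."""
    entities: List[str]
    slots: List[int]  # domain of each entity's state variable
    constraints: List[Callable[[Plan], bool]]

    def violations(self, plan: Plan) -> int:
        """Count how many constraints the plan breaks."""
        return sum(1 for check in self.constraints if not check(plan))

def solve(model: ProblemModel) -> Optional[Plan]:
    """Phase 2 stand-in: return the first plan with zero violations."""
    for assignment in product(model.slots, repeat=len(model.entities)):
        plan = dict(zip(model.entities, assignment))
        if model.violations(plan) == 0:
            return plan
    return None

# Hypothetical scheduling problem: two patients, two appointment slots.
model = ProblemModel(
    entities=["alice", "bob"],
    slots=[9, 10],
    constraints=[
        lambda p: len(set(p.values())) == len(p),  # no double booking
        lambda p: p["alice"] != 9,                 # alice is unavailable at 9
    ],
)
plan = solve(model)
```

Because the constraints live in the model and not in free-form reasoning text, any candidate plan can be checked mechanically with `model.violations`, whichever component proposed it.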
