Agent - latent space model
- SlideDeck: 2026-SP-S9.2-latent_space.pdf
- Version: current
- Notes: Understanding environments for Agents
In this session, our readings cover:
Required Readings: WORLD MODELS & ENVIRONMENT UNDERSTANDING
Core Component: Internal Representations - How Agents Model Their Environment
World models enable agents to build internal representations of their environment, predict outcomes, and simulate consequences before taking action. This bridges perception and planning.
Key Concepts: Environment modeling, state representation, predictive models, simulation-based planning, model-based reasoning
World Model Role in Agent Architecture:
- Input: Receives data from Perception (Phase 3) and Memory (Phase 4)
- Function: Builds internal representation of environment dynamics and causal relationships
- Output: Informs Planning (Phase 7) by enabling agents to predict action consequences
- Use Cases: Robotics, game playing, strategic decision-making, healthcare interventions
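The role described above can be boiled down to a loop: the agent simulates candidate action sequences inside its internal model and only then acts. The sketch below is a minimal illustration of that idea with a toy deterministic transition model on a number line; all names (`transition`, `plan_with_model`) are illustrative, not from any specific paper.

```python
from itertools import product

def transition(state, action):
    """Toy deterministic world model: agent moves on a number line."""
    return state + {"left": -1, "right": +1, "stay": 0}[action]

def reward(state, goal):
    """Closer to the goal is better."""
    return -abs(goal - state)

def plan_with_model(state, goal, actions=("left", "right", "stay"), horizon=3):
    """Exhaustively simulate action sequences inside the model and
    return the first action of the best imagined rollout, so no real
    action is taken until its consequences have been predicted."""
    best_first, best_return = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        s, ret = state, 0.0
        for a in seq:
            s = transition(s, a)
            ret += reward(s, goal)
        if ret > best_return:
            best_return, best_first = ret, seq[0]
    return best_first
```

Real systems replace the exhaustive search with learned models and smarter planners (e.g. sampling-based or gradient-based), but the structure of the loop is the same.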
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Agent - Planning / World Model | W10.1-Team 3-Planning | 25course |
2025 HIGH-IMPACT PAPERS on this topic
Agent Planning with World Knowledge Model
- Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
- [Submitted on 23 May 2024 (v1), last revised 3 Jan 2026 (this version, v4)]
- NeurIPS 2024
Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide global planning and dynamic state knowledge to assist local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs (Mistral-7B, Gemma-7B, and Llama-3-8B) demonstrate that our method achieves superior performance compared to various strong baselines. We also analyze how WKM effectively alleviates the blind trial-and-error and hallucinatory-action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) instance-level task knowledge generalizes better to unseen tasks, 2) a weak WKM can guide a strong agent model's planning, and 3) unified WKM training has promising potential for further development. The code is available at this https URL.
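The abstract's split between "prior task knowledge" (global, before the task) and "dynamic state knowledge" (local, during the task) can be sketched as a prompt-building step. The knowledge tables, similarity function, and prompt format below are toy stand-ins, not the paper's actual WKM implementation.

```python
# Toy stand-in knowledge bases (illustrative, not from the paper).
TASK_KNOWLEDGE = {
    "clean-room": "Pick up objects before wiping surfaces.",
}
STATE_KNOWLEDGE = [
    ({"holding": None, "dirty": True}, "A free hand is needed to pick up."),
    ({"holding": "cloth", "dirty": True}, "Wipe the nearest dirty surface."),
]

def similarity(a, b):
    """Fraction of matching key-value pairs between two state dicts."""
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)

def build_prompt(task, state):
    """Combine global prior task knowledge with the most similar piece
    of local state knowledge, mirroring WKM's two knowledge types."""
    prior = TASK_KNOWLEDGE[task]
    hint = max(STATE_KNOWLEDGE, key=lambda kv: similarity(kv[0], state))[1]
    return f"Task knowledge: {prior}\nState knowledge: {hint}"
```

In the paper both kinds of knowledge come from a trained parametric model rather than a lookup table; the sketch only shows how the two signals combine at planning time.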
- a. AgentGym-RL: Training Agents for Long-Horizon Decision Making (September 2025)
- https://github.com/WooooDyy/LLM-Agent-Paper-List
- RL version of AgentGym for learning from interactive environments
- Interactive frontend for trajectory visualization, multi-turn RL
- b. DreamerV3: Mastering Diverse Control Tasks through World Models
- Nature (April 2025) / arXiv / GitHub
- A general reinforcement-learning algorithm that outperforms specialized expert algorithms across diverse tasks by learning a model of the environment and improving its behaviour by imagining future scenarios.
- Dreamer succeeds across domains ranging from robot locomotion and manipulation tasks, through Atari games, procedurally generated ProcGen levels, and DMLab tasks, to the complex and infinite world of Minecraft.
- First algorithm to collect diamonds in Minecraft from scratch without human data or curricula
- Uses Recurrent State-Space Model (RSSM) for latent imagination and planning
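The "latent imagination" idea in the RSSM can be illustrated in a few lines: a deterministic recurrent state and a stochastic latent are rolled forward by the model alone, with no environment calls. The update rules below are toy scalar stand-ins for DreamerV3's learned networks, kept only to show the structure of an imagined rollout.

```python
import random

def rssm_step(h, z, action, rng):
    """One imagined step: update the deterministic state from (h, z, action),
    then sample the next stochastic latent from a state-dependent prior.
    The linear update is a toy stand-in for the GRU and prior networks."""
    h_next = 0.9 * h + 0.1 * (z + action)
    z_next = h_next + rng.gauss(0.0, 0.1)
    return h_next, z_next

def imagine(h0, z0, actions, seed=0):
    """Roll a trajectory entirely inside the latent model, so behaviour
    can be evaluated and improved without touching the environment."""
    rng = random.Random(seed)
    h, z, traj = h0, z0, []
    for a in actions:
        h, z = rssm_step(h, z, a, rng)
        traj.append((h, z))
    return traj
```

In DreamerV3 the actor and critic are trained on exactly such imagined trajectories, which is what makes the method sample-efficient.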
- c. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- arXiv / GitHub / Meta AI
- The first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments.
- Post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset enables zero-shot deployment on Franka arms without collecting any data from those environments.
- V-JEPA 2-AC achieves reach = 100%, manipulation = 60–80% compared to Cosmos’s reach = 80%, manipulation = 0–20%, while being 15× faster (16 seconds/action vs 4 minutes).
- Predicts in representation space rather than pixel space—key innovation for efficient planning
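Planning in representation space means candidate actions are scored by how close their *predicted embedding* lands to a goal embedding, with no pixel reconstruction. The encoder and predictor below are toy stand-ins (observations are 2-D points used directly), not V-JEPA 2's actual networks.

```python
def encode(obs):
    """Toy encoder: observations are already 2-D points."""
    return obs

def predict_latent(z, action):
    """Toy action-conditioned predictor operating in latent space."""
    return (z[0] + action[0], z[1] + action[1])

def l2(a, b):
    """Euclidean distance between two latents."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_action(obs, goal_obs, candidates):
    """Choose the action whose predicted next latent is closest to the
    goal's latent; no pixels are ever generated or compared."""
    z, z_goal = encode(obs), encode(goal_obs)
    return min(candidates, key=lambda a: l2(predict_latent(z, a), z_goal))
```

Because the distance is computed between compact embeddings rather than full images, many candidate actions can be evaluated cheaply, which is where the reported speedup over pixel-space world models comes from.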
More Readings:
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- DeepMind Blog
- RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
- Thanks to its VLM backbone, RT-2 can plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
- Uses PaLM-E and PaLI-X backbones; demonstrates chain-of-thought reasoning for multi-stage semantic reasoning
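RT-2 represents robot actions as text tokens by discretizing each continuous action dimension into 256 bins, so the VLM can emit actions the same way it emits words. The round-trip sketch below illustrates that encoding; the value ranges are illustrative.

```python
N_BINS = 256  # RT-2 uses 256 bins per action dimension

def to_token(value, lo, hi):
    """Map a continuous value in [lo, hi] to a bin index in [0, N_BINS-1]."""
    frac = (value - lo) / (hi - lo)
    return min(N_BINS - 1, max(0, int(frac * N_BINS)))

def from_token(token, lo, hi):
    """Decode a bin index back to the center of its bin."""
    return lo + (token + 0.5) * (hi - lo) / N_BINS
```

The quantization error is bounded by half a bin width, which for a range of width 2 is under 0.004 per dimension, so fine enough for arm control while keeping the action vocabulary small.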
Video Understanding with Large Language Models: A Survey
- Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
- [Submitted on 29 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v4)]
- With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this https URL.
