Agent - World Model
- Notes: Understanding environments for Agents
In this session, our readings cover:
Required Readings: WORLD MODELS & ENVIRONMENT UNDERSTANDING
Core Component: Internal Representations - How Agents Model Their Environment
World models enable agents to build internal representations of their environment, predict outcomes, and simulate consequences before taking action. This bridges perception and planning.
Key Concepts: Environment modeling, state representation, predictive models, simulation-based planning, model-based reasoning
World Model Role in Agent Architecture (a minimal interface sketch follows this list):
- Input: Receives data from Perception (Phase 3) and Memory (Phase 4)
- Function: Builds internal representation of environment dynamics and causal relationships
- Output: Informs Planning (Phase 7) by enabling agents to predict action consequences
- Use Cases: Robotics, game playing, strategic decision-making, healthcare interventions
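To make the input/function/output split above concrete, below is a minimal sketch of a world-model interface in Python. The class, the method names, and the nearest-neighbour prediction are illustrative assumptions rather than anything taken from the papers below; a real agent would learn a dynamics model instead of looking transitions up.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Transition:
    """A single observed (state, action, next_state, reward) tuple."""
    state: np.ndarray
    action: int
    next_state: np.ndarray
    reward: float


class WorldModel:
    """Illustrative world-model interface: ingest experience from
    perception/memory, then simulate rollouts for the planner."""

    def __init__(self):
        self.transitions: List[Transition] = []

    def update(self, transition: Transition) -> None:
        """Ingest a new observation coming from Perception (Phase 3)
        and Memory (Phase 4)."""
        self.transitions.append(transition)

    def predict(self, state: np.ndarray, action: int) -> np.ndarray:
        """Predict the next state for a candidate action. Here a
        nearest-neighbour lookup stands in for a learned dynamics model."""
        candidates = [t for t in self.transitions if t.action == action]
        if not candidates:
            return state
        best = min(candidates, key=lambda t: float(np.linalg.norm(t.state - state)))
        return best.next_state

    def rollout(self, state: np.ndarray, plan: List[int]) -> np.ndarray:
        """Simulate an action sequence without touching the real
        environment; Planning (Phase 7) scores the resulting states."""
        for action in plan:
            state = self.predict(state, action)
        return state
```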
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Agent - Planning / World Model | W10.1-Team 3-Planning | 25course |
2025 HIGH-IMPACT PAPERS on this topic
- a. DreamerV3: Mastering Diverse Control Tasks through World Models
- Nature (April 2025) / arXiv / GitHub - A general reinforcement-learning algorithm that outperforms specialized expert algorithms across diverse tasks by learning a model of the environment and improving its behaviour by imagining future scenarios.
- Dreamer succeeds across domains ranging from robot locomotion and manipulation, through Atari games, procedurally generated ProcGen levels, and DMLab tasks, to the complex and effectively infinite world of Minecraft.
- First algorithm to collect diamonds in Minecraft from scratch without human data or curricula
- Uses a Recurrent State-Space Model (RSSM) for latent imagination and planning (a toy latent-rollout sketch follows below)
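To give a feel for latent imagination, here is a toy RSSM-style rollout in PyTorch. It is a heavily simplified sketch and not the DreamerV3 implementation: the module sizes and the diagonal-Gaussian prior are illustrative assumptions (DreamerV3 itself uses categorical latents plus reconstruction, reward, and continuation heads).

```python
import torch
import torch.nn as nn


class TinyRSSM(nn.Module):
    """Toy recurrent state-space model: a deterministic GRU path plus a
    stochastic latent sampled from a learned prior (simplified sketch)."""

    def __init__(self, action_dim: int, deter_dim: int = 64, stoch_dim: int = 16):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)
        self.prior_net = nn.Linear(deter_dim, 2 * stoch_dim)  # mean and log-std

    def imagine_step(self, deter, stoch, action):
        """One imagined step: advance the recurrent state, then sample the
        next stochastic latent from the prior (no real observation needed)."""
        deter = self.gru(torch.cat([stoch, action], dim=-1), deter)
        mean, log_std = self.prior_net(deter).chunk(2, dim=-1)
        stoch = mean + log_std.exp() * torch.randn_like(mean)
        return deter, stoch

    def imagine_rollout(self, deter, stoch, actions):
        """Roll an action sequence forward purely in latent space."""
        trajectory = []
        for action in actions:  # each action: (batch, action_dim) tensor
            deter, stoch = self.imagine_step(deter, stoch, action)
            trajectory.append((deter, stoch))
        return trajectory
```

A Dreamer-style agent trains actor and critic heads on such imagined latent trajectories, which is how it improves its behaviour without further environment interaction.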
- b. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- arXiv / GitHub / Meta AI - The first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments.
- Post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset enables zero-shot deployment on Franka arms without collecting any data from those environments.
- V-JEPA 2-AC achieves reach = 100%, manipulation = 60–80% compared to Cosmos’s reach = 80%, manipulation = 0–20%, while being 15× faster (16 seconds/action vs 4 minutes).
- Predicts in representation space rather than pixel space, a key innovation for efficient planning (see the latent-space planning sketch below)
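The representation-space point can be made concrete with a tiny random-shooting planner that never decodes back to pixels. This is a simplified stand-in for the model-predictive control used with action-conditioned world models such as V-JEPA 2-AC; the `encode` and `predict` callables, the cost function, and the sampling scheme are assumptions for illustration, not Meta AI's actual API.

```python
import numpy as np


def plan_in_latent_space(encode, predict, current_obs, goal_obs,
                         horizon=5, action_dim=4, num_samples=256, rng=None):
    """Random-shooting MPC in representation space.

    encode(obs) -> latent vector (np.ndarray)
    predict(latent, action) -> next latent vector (np.ndarray)
    """
    rng = np.random.default_rng() if rng is None else rng
    z_start, z_goal = encode(current_obs), encode(goal_obs)

    best_cost, best_plan = np.inf, None
    for _ in range(num_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        z = z_start
        for a in actions:
            z = predict(z, a)                     # roll forward in latent space only
        cost = float(np.linalg.norm(z - z_goal))  # distance to the goal embedding
        if cost < best_cost:
            best_cost, best_plan = cost, actions
    return best_plan[0]  # execute the first action, then replan
```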
- c. NVIDIA Cosmos: World Foundation Model Platform for Physical AI
- NVIDIA Cosmos Technical Report - Open world foundation models (WFMs), guardrails, and data processing libraries to accelerate the development of physical AI for autonomous vehicles (AVs), robots, and video analytics AI agents.
- WFMs are purpose-built for physical AI research and development and can generate physics-based videos from a combination of inputs such as text, images, and video, as well as robot sensor or motion data.
- Cosmos Reason—a new open, customizable, 7-billion-parameter reasoning VLM for physical AI and robotics—lets robots and vision AI agents reason like humans using prior knowledge, physics understanding and common sense.
- Early adopters include 1X, Agility Robotics, Figure AI, Skild AI, Boston Dynamics
- d. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- DeepMind Blog
- RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
- Thanks to its VLM backbone, RT-2 can plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
- Uses PaLM-E and PaLI-X backbones; demonstrates chain-of-thought reasoning for multi-stage semantic tasks (a simplified action-tokenization sketch follows below)
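A core ingredient of VLA models like RT-2 is treating robot actions as text: each action dimension is discretized into 256 bins so the policy can emit actions as ordinary output tokens. The sketch below shows only that encode/decode step; the normalization bounds and the mapping of bin ids into the model's vocabulary are simplified assumptions.

```python
import numpy as np

NUM_BINS = 256  # RT-2 discretizes each action dimension into 256 bins


def action_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous action vector to integer bin ids that a VLA model
    can emit as text tokens (the bounds here are illustrative)."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    return np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int).tolist()


def tokens_to_action(bins, low=-1.0, high=1.0):
    """Decode emitted bin ids back into a continuous action for the controller."""
    bins = np.asarray(bins, dtype=np.float64)
    return low + bins / (NUM_BINS - 1) * (high - low)


# Example: an 8-dimensional action (e.g. end-effector pose delta plus gripper)
tokens = action_to_tokens([0.1, -0.4, 0.0, 0.25, 0.0, 0.0, 1.0, -1.0])
recovered = tokens_to_action(tokens)
```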
More Readings:
Video Understanding with Large Language Models: A Survey
- Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
- [Submitted on 29 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v4)]
- With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this https URL.
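As a concrete reading of the survey's taxonomy, the sketch below shows the simplest "Video Analyzer x LLM" pattern with the LLM acting as Summarizer: an off-the-shelf vision model turns sampled frames into captions, and the LLM answers questions over that text. The `captioner` and `llm` callables are placeholders, not a specific library's API.

```python
from typing import Callable, List


def video_analyzer_x_llm(frames: List, captioner: Callable[[object], str],
                         llm: Callable[[str], str], question: str) -> str:
    """Minimal "Video Analyzer x LLM" pipeline (LLM as Summarizer):
    frames -> captions -> LLM answer. All components are placeholders."""
    captions = [f"Frame {i}: {captioner(frame)}" for i, frame in enumerate(frames)]
    prompt = (
        "You are given per-frame captions of a video.\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```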
