Agent - World Model
- Notes: Understanding environments for Agents
In this session, our readings cover:
Required Readings: WORLD MODELS & ENVIRONMENT UNDERSTANDING
Core Component: Internal Representations - How Agents Model Their Environment
World models enable agents to build internal representations of their environment, predict outcomes, and simulate consequences before taking action. This bridges perception and planning.
Key Concepts: Environment modeling, state representation, predictive models, simulation-based planning, model-based reasoning
World Model Role in Agent Architecture (a minimal interface sketch follows this list):
- Input: Receives data from Perception (Phase 3) and Memory (Phase 4)
- Function: Builds internal representation of environment dynamics and causal relationships
- Output: Informs Planning (Phase 7) by enabling agents to predict action consequences
- Use Cases: Robotics, game playing, strategic decision-making, healthcare interventions
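To make the input/function/output split above concrete, below is a minimal sketch of a world-model interface in Python. The class, the method names, and the nearest-neighbour prediction are illustrative assumptions rather than anything taken from the papers below; a real agent would learn a dynamics model instead of looking transitions up.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Transition:
    """A single observed (state, action, next_state, reward) tuple."""
    state: np.ndarray
    action: int
    next_state: np.ndarray
    reward: float


class WorldModel:
    """Illustrative world-model interface: ingest experience from
    perception/memory, then simulate rollouts for the planner."""

    def __init__(self):
        self.transitions: List[Transition] = []

    def update(self, transition: Transition) -> None:
        """Ingest a new observation coming from Perception (Phase 3)
        and Memory (Phase 4)."""
        self.transitions.append(transition)

    def predict(self, state: np.ndarray, action: int) -> np.ndarray:
        """Predict the next state for a candidate action. Here a
        nearest-neighbour lookup stands in for a learned dynamics model."""
        candidates = [t for t in self.transitions if t.action == action]
        if not candidates:
            return state
        best = min(candidates, key=lambda t: float(np.linalg.norm(t.state - state)))
        return best.next_state

    def rollout(self, state: np.ndarray, plan: List[int]) -> np.ndarray:
        """Simulate an action sequence without touching the real
        environment; Planning (Phase 7) scores the resulting states."""
        for action in plan:
            state = self.predict(state, action)
        return state
```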
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Agent - Planning / World Model | W10.1-Team 3-Planning | 25course |
2025 HIGH-IMPACT PAPERS on this topic
- a. DreamerV3: Mastering Diverse Control Tasks through World Models
- Nature (April 2025) / arXiv / GitHub - A general reinforcement-learning algorithm that outperforms specialized expert algorithms across diverse tasks by learning a model of the environment and improving its behaviour by imagining future scenarios.
- Dreamer succeeds across domains ranging from robot locomotion and manipulation, through Atari games, procedurally generated ProcGen levels, and DMLab tasks, to the complex and effectively infinite world of Minecraft.
- First algorithm to collect diamonds in Minecraft from scratch without human data or curricula
- Uses a Recurrent State-Space Model (RSSM) for latent imagination and planning (a toy latent-rollout sketch follows below)
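To give a feel for latent imagination, here is a toy RSSM-style rollout in PyTorch. It is a heavily simplified sketch and not the DreamerV3 implementation: the module sizes and the diagonal-Gaussian prior are illustrative assumptions (DreamerV3 itself uses categorical latents plus reconstruction, reward, and continuation heads).

```python
import torch
import torch.nn as nn


class TinyRSSM(nn.Module):
    """Toy recurrent state-space model: a deterministic GRU path plus a
    stochastic latent sampled from a learned prior (simplified sketch)."""

    def __init__(self, action_dim: int, deter_dim: int = 64, stoch_dim: int = 16):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)
        self.prior_net = nn.Linear(deter_dim, 2 * stoch_dim)  # mean and log-std

    def imagine_step(self, deter, stoch, action):
        """One imagined step: advance the recurrent state, then sample the
        next stochastic latent from the prior (no real observation needed)."""
        deter = self.gru(torch.cat([stoch, action], dim=-1), deter)
        mean, log_std = self.prior_net(deter).chunk(2, dim=-1)
        stoch = mean + log_std.exp() * torch.randn_like(mean)
        return deter, stoch

    def imagine_rollout(self, deter, stoch, actions):
        """Roll an action sequence forward purely in latent space."""
        trajectory = []
        for action in actions:  # each action: (batch, action_dim) tensor
            deter, stoch = self.imagine_step(deter, stoch, action)
            trajectory.append((deter, stoch))
        return trajectory
```

A Dreamer-style agent trains actor and critic heads on such imagined latent trajectories, which is how it improves its behaviour without further environment interaction.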
- b. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- arXiv / GitHub / Meta AI - The first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments.
- Post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset enables zero-shot deployment on Franka arms without collecting any data from those environments.
- V-JEPA 2-AC achieves reach = 100%, manipulation = 60–80% compared to Cosmos’s reach = 80%, manipulation = 0–20%, while being 15× faster (16 seconds/action vs 4 minutes).
- Predicts in representation space rather than pixel space, a key innovation for efficient planning (see the latent-space planning sketch below)
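The representation-space point can be made concrete with a tiny random-shooting planner that never decodes back to pixels. This is a simplified stand-in for the model-predictive control used with action-conditioned world models such as V-JEPA 2-AC; the `encode` and `predict` callables, the cost function, and the sampling scheme are assumptions for illustration, not Meta AI's actual API.

```python
import numpy as np


def plan_in_latent_space(encode, predict, current_obs, goal_obs,
                         horizon=5, action_dim=4, num_samples=256, rng=None):
    """Random-shooting MPC in representation space.

    encode(obs) -> latent vector (np.ndarray)
    predict(latent, action) -> next latent vector (np.ndarray)
    """
    rng = np.random.default_rng() if rng is None else rng
    z_start, z_goal = encode(current_obs), encode(goal_obs)

    best_cost, best_plan = np.inf, None
    for _ in range(num_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        z = z_start
        for a in actions:
            z = predict(z, a)                     # roll forward in latent space only
        cost = float(np.linalg.norm(z - z_goal))  # distance to the goal embedding
        if cost < best_cost:
            best_cost, best_plan = cost, actions
    return best_plan[0]  # execute the first action, then replan
```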
- c. NVIDIA Cosmos: World Foundation Model Platform for Physical AI
- NVIDIA Cosmos Technical Report - Open world foundation models (WFMs), guardrails, and data processing libraries to accelerate the development of physical AI for autonomous vehicles (AVs), robots, and video analytics AI agents.
- WFMs are purpose-built for physical AI research and development and can generate physics-based videos from a combination of inputs such as text, images, and video, as well as robot sensor or motion data.
- Cosmos Reason—a new open, customizable, 7-billion-parameter reasoning VLM for physical AI and robotics—lets robots and vision AI agents reason like humans using prior knowledge, physics understanding and common sense.
- Early adopters include 1X, Agility Robotics, Figure AI, Skild AI, Boston Dynamics
- d. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- DeepMind Blog
- RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
- Thanks to its VLM backbone, RT-2 can plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
- Uses PaLM-E and PaLI-X backbones; demonstrates chain-of-thought reasoning for multi-stage semantic tasks (a simplified action-tokenization sketch follows below)
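A core ingredient of VLA models like RT-2 is treating robot actions as text: each action dimension is discretized into 256 bins so the policy can emit actions as ordinary output tokens. The sketch below shows only that encode/decode step; the normalization bounds and the mapping of bin ids into the model's vocabulary are simplified assumptions.

```python
import numpy as np

NUM_BINS = 256  # RT-2 discretizes each action dimension into 256 bins


def action_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous action vector to integer bin ids that a VLA model
    can emit as text tokens (the bounds here are illustrative)."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    return np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int).tolist()


def tokens_to_action(bins, low=-1.0, high=1.0):
    """Decode emitted bin ids back into a continuous action for the controller."""
    bins = np.asarray(bins, dtype=np.float64)
    return low + bins / (NUM_BINS - 1) * (high - low)


# Example: an 8-dimensional action (e.g. end-effector pose delta plus gripper)
tokens = action_to_tokens([0.1, -0.4, 0.0, 0.25, 0.0, 0.0, 1.0, -1.0])
recovered = tokens_to_action(tokens)
```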
More Readings:
Video Understanding with Large Language Models: A Survey
- Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
- [Submitted on 29 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v4)]
- With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this https URL.
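As a concrete reading of the survey's taxonomy, the sketch below shows the simplest "Video Analyzer x LLM" pattern with the LLM acting as Summarizer: an off-the-shelf vision model turns sampled frames into captions, and the LLM answers questions over that text. The `captioner` and `llm` callables are placeholders, not a specific library's API.

```python
from typing import Callable, List


def video_analyzer_x_llm(frames: List, captioner: Callable[[object], str],
                         llm: Callable[[str], str], question: str) -> str:
    """Minimal "Video Analyzer x LLM" pipeline (LLM as Summarizer):
    frames -> captions -> LLM answer. All components are placeholders."""
    captions = [f"Frame {i}: {captioner(frame)}" for i, frame in enumerate(frames)]
    prompt = (
        "You are given per-frame captions of a video.\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```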
