Tongzhou Mu 🤖🦾🦿@tongzhou_mu
Everyone is talking about "World Models" for robotics, following the buzz from GTC 2026.
But the research landscape is shifting so fast it’s difficult to keep up.
In my view, two dominant paradigms currently ground video world models in robot control.
---
Paradigm 1: Use the Video Model as a Simulator
The first major approach is using video world models to simulate reality. In this framework, the model predicts "what happens next" in either pixel space or latent space, conditioned on text prompts or robot actions. Much like traditional analytical simulators (e.g., IsaacSim, MuJoCo, ManiSkill), these learned simulators are used for data synthesis, planning, and evaluation.
1.1 Synthesizing Data for Policy Training
A representative work is DreamGen [1]. Given an initial frame and a language instruction, a fine-tuned video model synthesizes clips of a robot completing a task. An inverse dynamics model then labels these videos with actions to train a separate robot policy. GR00T N1 [2] uses a similar strategy. Alternatively, models can act as interactive simulators where agents (like UniSim [4]) or humans (like Interactive World Simulator [3]) generate data through interaction.
Key Advantages: thousands of hours of "synthetic experience" at a fraction of the cost of real-world data collection, plus the ability to safely simulate rare, dangerous edge cases.
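To make the recipe concrete, here is a minimal sketch of that generate-then-label pipeline. Everything below (function names, the 7-DoF action space, the stub models) is illustrative, not DreamGen's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_video(first_frame, instruction, horizon=16):
    """Stub for the fine-tuned video model: returns a (T, H, W, C) clip."""
    return np.stack([first_frame] * horizon)  # placeholder rollout

def inverse_dynamics(frame_t, frame_t1):
    """Stub IDM: infers the action that transitions frame_t -> frame_t1."""
    return rng.normal(size=7)  # e.g., a 7-DoF arm action (hypothetical)

def synthesize_dataset(first_frame, instruction):
    video = generate_video(first_frame, instruction)
    # Label consecutive frame pairs with actions -> (observation, action) pairs
    return [(video[t], inverse_dynamics(video[t], video[t + 1]))
            for t in range(len(video) - 1)]

frame0 = rng.random((64, 64, 3))
dataset = synthesize_dataset(frame0, "pick up the red cube")
print(len(dataset), "synthetic (observation, action) pairs for policy training")
```

A separate robot policy is then trained on these pairs by imitation learning.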
1.2 Inference-Time Planning
Instead of following a fixed path, robots can use video models to "imagine" multiple future outcomes. In V-JEPA 2 [5], an action-conditioned video model evaluates different action sequences to find the best next step. This "imagination-based planning" is also a core theme in CLASP [6], SWIM [7], VLP [8], GPC [9], DreamDojo [10], and Cosmos Policy [11]. The challenge remains fitting this heavy computation into real-time control budgets.
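Here is what such imagination-based planning can look like in sketch form, using simple random shooting; the world model, cost function, and dimensions below are toy stand-ins, not V-JEPA 2's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, HORIZON, NUM_CANDIDATES = 8, 4, 5, 64

def world_model(state, action):
    """Stub: predicts the next latent state given (state, action)."""
    return state + 0.1 * action.sum() * np.ones_like(state)

def goal_cost(state, goal):
    """Stub: distance between an imagined final state and the goal."""
    return np.linalg.norm(state - goal)

def plan(state, goal):
    # Sample candidate action sequences and imagine each one's outcome
    candidates = rng.normal(size=(NUM_CANDIDATES, HORIZON, ACTION_DIM))
    costs = []
    for seq in candidates:
        s = state
        for a in seq:
            s = world_model(s, a)
        costs.append(goal_cost(s, goal))
    # Execute only the first action of the best sequence (receding horizon)
    return candidates[int(np.argmin(costs))][0]

action = plan(np.zeros(STATE_DIM), np.ones(STATE_DIM))
print("next action:", action)
```

Every imagined rollout is a full forward pass through the world model, which is exactly why the compute budget is the bottleneck.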
1.3 Policy Evaluation
Video models allow us to test policies before they ever touch physical hardware. Veo Robotics [12] demonstrates that these models can accurately predict the relative performance of candidate policies and support "red teaming" to expose safety violations. This approach is also seen in IRASim [13], 1XWM [14], Ctrl-World [15], and others.
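In sketch form, evaluation amounts to rolling candidate policies out inside the learned model and ranking them; all components here are illustrative stubs:

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(obs, action):   # stub learned dynamics
    return obs + 0.05 * action

def success(obs, goal):         # stub task-success check
    return float(np.linalg.norm(obs - goal) < 0.5)

def evaluate(policy, goal, episodes=20, horizon=30):
    scores = []
    for _ in range(episodes):
        obs = rng.normal(size=goal.shape)
        for _ in range(horizon):
            obs = world_model(obs, policy(obs, goal))
        scores.append(success(obs, goal))
    return np.mean(scores)

policies = {
    "greedy": lambda obs, goal: np.clip(goal - obs, -1, 1),
    "random": lambda obs, goal: rng.normal(size=goal.shape),
}
goal = np.ones(3)
ranking = sorted(policies, key=lambda name: -evaluate(policies[name], goal))
print("predicted ranking (best first):", ranking)
```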
Summary of Paradigm 1: While powerful, there is no "free lunch." These methods depend on prediction accuracy. Our physical world is complex, and teaching video models to handle every edge case without hallucinating physics remains a significant challenge.
---
Paradigm 2: Use the Video Model as a Policy
The second, more integrated paradigm is using the generative video model as the policy (decision-maker) itself. Because the model's native output is video rather than robot actions, several methods have been developed to extract control signals from it.
2.1 Generating Video and Action Jointly
A straightforward idea is to add an action decoder to the video model backbone and run video and action denoising jointly during inference. Representative works include DreamZero [16], Cosmos Policy [11], Motus [17], PAD [18], GR-1 [19], and GR-2 [20] (note that the GR series are not diffusion models). This method leverages the rich spatiotemporal priors of pre-trained models with minimal architecture changes.
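A toy sketch of the joint-denoising idea: a shared backbone fuses video and action token streams, and both are denoised in lockstep. The linear stubs below stand in for real diffusion transformers:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_V, D_A, STEPS = 8, 32, 7, 10  # frames, video/action latent dims, denoise steps

def backbone(video_tokens, action_tokens):
    """Stub shared backbone: fuses video and action token streams."""
    joint = np.concatenate([video_tokens, action_tokens], axis=-1)
    return joint.mean(axis=-1, keepdims=True)  # toy fused feature

def denoise_step(x, fused, t):
    """Stub denoiser: nudges noisy tokens toward the fused feature."""
    return x - (x - fused) / (STEPS - t)

video = rng.normal(size=(T, D_V))    # video tokens start as pure noise
actions = rng.normal(size=(T, D_A))  # ...and so does the action chunk
for t in range(STEPS):
    fused = backbone(video, actions)
    video = denoise_step(video, fused, t)      # denoise video tokens...
    actions = denoise_step(actions, fused, t)  # ...and action tokens jointly
print("denoised action chunk:", actions.shape)
```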
2.2 Extracting Visual Representations for Action Generation
Rather than running full generation, many methods use video models to extract deep visual representations that guide action generation. Example works include VPDD [21], VPP [22], UVA [23], UWM [24], Video Policy [25], and DiT4DiT [26]. A major advantage here is that you don't necessarily need to run multiple denoising steps on giant models, which makes real-time control easier, though it remains unclear whether the full potential of the video models is being utilized.
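Sketch-wise, the recipe is: one (noisy) forward pass through the video model, pool intermediate features, decode actions with a small head. Module names and shapes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def video_model_features(obs_frames, noise_level=0.1):
    """Stub: a single noisy forward pass; returns pooled intermediate
    features, avoiding the cost of full multi-step generation."""
    noisy = obs_frames + noise_level * rng.normal(size=obs_frames.shape)
    return noisy.reshape(len(noisy), -1).mean(axis=0)  # pooled feature

def action_head(features, action_dim=7):
    """Stub lightweight policy head (trained in practice; random here)."""
    w = rng.normal(size=(features.shape[0], action_dim)) * 0.01
    return features @ w

obs = rng.random((4, 16, 16, 3))  # last 4 camera frames
feat = video_model_features(obs)
print("action:", action_head(feat))
```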
2.3 Open-loop Video Generation + Video-to-Action Translation
A rising trend involves generating a "desired future" video and using a separate inverse dynamics model to translate that video into actions. UniPi [27] pioneered this, followed by This&That [28], TesserAct [29], and 1XWM Self-Learning [30]. Some methods generate videos of humans completing tasks (Dreamitate [31], Gen2Act [32], LVP [33]) and translate those to robot actions. This approach allows video models to do exactly what they were trained for: video generation.
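A minimal sketch of that generate-once, translate, execute recipe, in the spirit of UniPi but with stub models and a hypothetical action space:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_plan_video(obs, instruction, horizon=8):
    """Stub text-conditioned video model: an imagined task completion."""
    return np.stack([obs + 0.1 * t for t in range(horizon)])

def inverse_dynamics(f0, f1):
    """Stub IDM: the action that realizes the transition f0 -> f1."""
    return (f1 - f0).mean() * np.ones(7)

obs = rng.random((16, 16, 3))
plan = generate_plan_video(obs, "open the drawer")
actions = [inverse_dynamics(plan[t], plan[t + 1]) for t in range(len(plan) - 1)]
# Open-loop: all actions run without re-observing the world, which is
# exactly where the hallucination risk discussed in 2.4 comes from.
print(len(actions), "actions executed open-loop")
```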
2.4 Closed-loop Video Generation + Video-to-Action Translation
Open-loop generation often leads to hallucinations: the model might "see" the robot picking up an apple that isn't actually there. Closed-loop generation avoids this by constantly conditioning on the latest real-world observations, replacing generated frames with real ones in the next call. Recently, mimic-video [34] and LingBot-VA [35] reached real-time speeds using KV caching and partial denoising. Most notably, the DVA [36] model released this month achieves real-time generation with full video denoising, i.e., denoising from pure noise all the way to a clean video at every control step. This approach seems especially promising to me because it reduces robot control to a problem of real-time video generation, which can directly benefit from large-scale video pre-training.
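To illustrate the loop structure (not any specific system's implementation), here is a sketch where each control step re-conditions the video model on real frames; all functions are toy stubs:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_next_frames(context_frames, k=2):
    """Stub video model: predicts k future frames from real context."""
    return [context_frames[-1] + 0.05 * rng.normal(size=context_frames[-1].shape)
            for _ in range(k)]

def inverse_dynamics(f0, f1):
    return (f1 - f0).mean() * np.ones(7)  # stub video-to-action translation

def env_step(action):
    return rng.random((8, 8, 3))  # stub camera observation from the real world

obs = rng.random((8, 8, 3))
context = [obs]
for step in range(5):
    imagined = generate_next_frames(context)
    action = inverse_dynamics(context[-1], imagined[0])
    obs = env_step(action)             # act in the real world
    context = (context + [obs])[-4:]   # replace imagination with reality
print("ran 5 closed-loop control steps")
```

Because the context is refreshed with real frames every step, the model's imagination can never drift more than one generation call away from reality.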
---
To me, the key takeaway from this evolution is how we have begun bridging the gap between the digital and physical worlds. Instead of trying to manually program every physical law, we are leveraging the implicit physics embedded in billions of web videos.
Whether we use these models as simulators or as direct policies, the objective is the same: giving robots "physical common sense." By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.
[References in the comment]