Ken Macken retweeted

New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting for new verbs, nouns, and environments. If the world model can "dream" the right future in pixels, the robot can execute it well in motors. We call it "DreamZero", our first World Action Model (WAM).
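To make the "dream in pixels, act in motors" idea concrete, here is a minimal PyTorch sketch of the pattern as described above: a video world model rolls the future forward in latent space conditioned on a text prompt, and an inverse-dynamics head decodes the imagined rollout into motor commands. Every class and method name here (VideoWorldModel, ActionDecoder, dream) is an illustrative assumption, not the DreamZero API.

```python
# Hypothetical sketch of a World Action Model (WAM); illustrative only,
# not released DreamZero code.
import torch
import torch.nn as nn


class VideoWorldModel(nn.Module):
    """Video world model: 'dreams' future frame latents from a text prompt."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.frame_encoder = nn.Sequential(          # pixels -> latent
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        self.dynamics = nn.GRUCell(latent_dim * 2, latent_dim)  # rollout step

    def dream(self, frame: torch.Tensor, prompt_emb: torch.Tensor,
              horizon: int = 8) -> torch.Tensor:
        """Roll the world model forward `horizon` steps in latent space."""
        z = self.frame_encoder(frame)
        futures = []
        for _ in range(horizon):
            z = self.dynamics(torch.cat([z, prompt_emb], dim=-1), z)
            futures.append(z)
        return torch.stack(futures, dim=1)           # (B, horizon, latent)


class ActionDecoder(nn.Module):
    """Inverse dynamics: map consecutive imagined latents to motor commands."""

    def __init__(self, latent_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(latent_dim * 2, action_dim)

    def forward(self, futures: torch.Tensor) -> torch.Tensor:
        pairs = torch.cat([futures[:, :-1], futures[:, 1:]], dim=-1)
        return self.head(pairs)                      # (B, horizon-1, action)


# Zero-shot prompting: dream the future in pixels, then execute in motors.
world_model = VideoWorldModel()
decoder = ActionDecoder()
frame = torch.randn(1, 3, 224, 224)                  # current camera image
prompt = torch.randn(1, 256)                         # embedded text instruction
actions = decoder(world_model.dream(frame, prompt))  # motor commands to run
```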
Our team had tons of fun at the lab typing anything we liked into an open text prompt and watching the robot perform tasks it was never trained on: an emergent capability we didn't quite expect. Obviously not GPT-3 reliable yet, but we are marching into the GPT-2 era.
Discoveries:
- The model and the data recipe co-evolve. Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity >> repetitions.
- X-embodiment is extremely hard. Pixels are the answer. Different robot morphologies traditionally have a hard time sharing knowledge. But if we put video first, pixels become the universal bridge connecting different hardware - even first-person human videos.
DreamZero shows significant robot2robot and human2robot transfer. With only 55 trajectories on *new*, unseen hardware (~30 min of teleop), it adapts quickly and retains its zero-shot prompting ability (see the adaptation sketch after this list).
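A hedged sketch of what few-shot adaptation to unseen hardware could look like, reusing the illustrative classes from the sketch above: freeze the pretrained world-model backbone and fit only a fresh action head on a handful of teleop trajectories. The recipe and every name here are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical few-shot adaptation to a new embodiment. Assumes the
# illustrative VideoWorldModel / ActionDecoder sketch above; synthetic
# tensors stand in for the ~55 real teleop trajectories.
import torch
from torch.utils.data import DataLoader, TensorDataset

frames = torch.randn(55, 3, 224, 224)   # one camera frame per trajectory
prompts = torch.randn(55, 256)          # embedded text instructions
targets = torch.randn(55, 7, 7)         # teleop action chunks (horizon-1, dof)
loader = DataLoader(TensorDataset(frames, prompts, targets), batch_size=8)

# Freeze the pixel backbone so zero-shot prompting survives adaptation;
# only the new embodiment's action head is trained.
for p in world_model.parameters():
    p.requires_grad_(False)
new_head = ActionDecoder(action_dim=7)  # fresh head for the new morphology
opt = torch.optim.AdamW(new_head.parameters(), lr=1e-4)

for epoch in range(10):
    for frame, prompt, target in loader:
        pred = new_head(world_model.dream(frame, prompt))
        loss = torch.nn.functional.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Freezing the backbone is one plausible way to reconcile fast adaptation with retained zero-shot prompting; the paper may use a different mechanism.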
Yesterday I posted about the "Second Pre-training Paradigm": world models are the next-gen foundation of Physical AI, not language backbones.
Today, we are proving it works. And 2026 has just begun.
Paper: World Action Models are Zero-Shot Policies.
Read it now: (thread)