Yunhao (Andy) Ge
@GeYunhao
141 posts

Research Scientist @NVIDIA GEAR Lab | CS PhD @USC, Ex Visiting PhD @Stanford, Amazon ML Fellow @Amazon, intern @Google, @Microsoft | VLA, World Foundation Model

Santa Clara · Joined January 2021
208 Following · 500 Followers
Pinned Tweet
Yunhao (Andy) Ge@GeYunhao·
Words in. Worlds imagined. Actions out. 🤖🌎
DreamZero lets robots dream in pixels and act—via joint video + action prediction.
🔥 2× better generalization than VLAs
⚡ 14B @ 7 Hz
🤝 Cross-embodiment transfer (w/ 10–20 min video)
🦾 New robot, 30 min play, zero-shot skills intact
Joel Jang@jang_yoel

Introducing DreamZero 🤖🌎 from @nvidia
> A 14B “World Action Model” that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots
> The key? Jointly predicting video & actions in the same diffusion forward pass
Project Page: dreamzero0.github.io 🧵 (1/10)

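For readers wondering what "joint video + action prediction in the same diffusion forward pass" means mechanically: the denoiser takes noised future-video latents and a noised action chunk together and predicts both in one pass. Below is a minimal toy sketch of that interface in Python; all module names and shapes are hypothetical placeholders, an illustration of the idea rather than the DreamZero architecture.

import torch
import torch.nn as nn

class ToyWorldActionDenoiser(nn.Module):
    # Toy stand-in for a "world action model": one forward pass denoises
    # future video latents and an action chunk jointly. All dimensions are
    # placeholders, not the real model's.
    def __init__(self, video_dim=64, action_dim=7, hidden=256):
        super().__init__()
        self.video_in = nn.Linear(video_dim, hidden)
        self.action_in = nn.Linear(action_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.video_out = nn.Linear(hidden, video_dim)
        self.action_out = nn.Linear(hidden, action_dim)

    def forward(self, noisy_video, noisy_actions):
        # Concatenate video tokens and action tokens into one sequence so
        # the same pass that denoises the imagined video also denoises actions.
        v = self.video_in(noisy_video)      # (B, T_video, hidden)
        a = self.action_in(noisy_actions)   # (B, T_action, hidden)
        h = self.backbone(torch.cat([v, a], dim=1))
        return self.video_out(h[:, : v.shape[1]]), self.action_out(h[:, v.shape[1]:])

# One joint pass over 8 future frame latents and a 16-step action chunk.
model = ToyWorldActionDenoiser()
video_pred, action_pred = model(torch.randn(1, 8, 64), torch.randn(1, 16, 7))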
Yunhao (Andy) Ge retweeted
Danfei Xu@danfei_xu·
Introducing EgoVerse: an ecosystem for robot learning from egocentric human data. Built and tested by 4 research labs + 3 industry partners, EgoVerse enables both science and scaling.
1300+ hrs, 240 scenes, 2000+ tasks, and growing.
Dataset design, findings, and ecosystem 🧵
Yunhao (Andy) Ge@GeYunhao·
Honored to contribute to DreamZero & Cosmos Policy 🤖 Seeing them featured in Jensen’s keynote made it all feel real — super proud of the team. GR00T N2, let’s GOOOO! 🚀
Yunhao (Andy) Ge retweeted
Hong-Xing (Koven) Yu@Koven_Yu·
🤩Video world models are cool, but it is cooler if they can simulate any 3D physical actions in real time! Introducing RealWonder⚡️: Now you can simulate 3D physical action (robot actions, 3D forces, force fields, etc.) consequences from a single image in real time! 🧵1/6
Yunhao (Andy) Ge retweeted
Yunzhu Li@YunzhuLiYZ·
For a long time, I was skeptical about action-conditioned video prediction for robotics. Many models look impressive, but once you ask them to handle long-horizon manipulation with real physical interaction, things quickly fall apart (e.g., Genie is amazing but mostly focused on navigation).

This project changed my mind. I'm beyond excited to share Interactive World Simulator, a project we have been working on for the past ~1.5 years 🤖 One of the first world models that produces convincing results for long-horizon robotic manipulation involving complex physical interactions, across a diverse range of objects (rigid objects, deformables, ropes, object piles). It directly unlocks scalable data generation for robotic policy training and policy evaluation.

Try it yourself (no installation needed): yixuanwang.me/interactive_wo… Play directly with the simulator in your browser.

Key Takeaways:
1️⃣ 15 Hz long-horizon action-conditioned video prediction for 10+ minutes on a single RTX 4090 GPU
2️⃣ Visual and dynamic fidelity: people often ask how much sim data equals one real data point. In our experiments, it turns out to be close to one-to-one using the Interactive World Simulator
3️⃣ Stress testing matters: we emphasize interactive stress testing to understand robustness and stability and to build trust in the simulator
4️⃣ The model is trained with only ~6 hours of real-world random interaction data on a single GPU. Imagine what happens if we scale this 1000× or even 1M×

Huge credit to @YXWangBot, who led this effort with countless hours of work on data collection, training recipes, and system design. I'm incredibly proud of the work he did here! Enjoy the demos and videos. We also fully open-sourced the codebase for anyone interested in applying this to their own tasks.
#Robotics #RobotLearning #WorldModels #EmbodiedAI
Yixuan Wang@YXWangBot

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator: An action-conditioned world model that supports stable long-horizon interaction.
3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090🔥
4/ Why this matters: it unlocks two critical robotics applications: 🚀 Scalable data generation for policy training 🧪 Faithful policy evaluation
5/ You can play with our world model NOW at yixuanwang.me/interactive_wo…. NO git clone, NO pip install, NO python. Just click and play!
NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera
More details coming 👇 (1/9)
#Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning

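The "15 Hz for 10+ minutes" numbers above describe a closed-loop autoregressive rollout: each step feeds recently predicted frames plus the next action back into the model. Here is a generic Python sketch of that loop, where predict_next_frame is a hypothetical placeholder for any action-conditioned world model, not the released Interactive World Simulator API.

import numpy as np

def rollout(predict_next_frame, first_frame, actions, context_len=4):
    # Generic action-conditioned rollout: at each step the model sees the
    # last `context_len` (predicted) frames plus one action and returns the
    # next frame. Long-horizon stability depends on these predictions not
    # drifting as they are fed back in.
    frames = [first_frame]
    for action in actions:
        frames.append(predict_next_frame(frames[-context_len:], action))
    return frames

# Dummy stand-in model so the sketch runs end to end.
dummy_model = lambda ctx, act: ctx[-1] + 0.01 * np.random.randn(*ctx[-1].shape)
video = rollout(dummy_model, np.zeros((64, 64, 3)), actions=[np.zeros(7)] * 150)
print(len(video) - 1, "predicted frames ~=", (len(video) - 1) / 15, "seconds at 15 FPS")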
Yunhao (Andy) Ge retweeted
Joel Jang@jang_yoel·
DreamZero is #1 on both MolmoSpaces and RoboArena 🏆

What makes this notable: DreamZero-DROID is trained from scratch using only the DROID dataset. No pretraining on large-scale robot data, unlike competing VLAs. This demonstrates the strength of video-model backbones for generalist robot policies (VAMs/WAMs).

More broadly, training only on real data and evaluating on (1) transparent, distributed benchmarks like RoboArena or (2) scalable sim-benchmarks like MolmoSpaces is an exciting step toward fairer and more reproducible evaluation of generalist policies, one that the community can hillclimb together to measure progress.

Special thanks to the Ai2 MolmoSpaces team (@notmahi @omarrayyann @YejinKim4 Max Argus) and the RoboArena team (@pranav_atreya) for helping with the set-up and getting these evaluations! Special shout out to @youliangtan @NadunRanawakaA @chuning_zhu, who led these efforts from the GEAR side :)

+ We also release our DreamZero-AgiBot checkpoint & post-training code to enable very efficient few-shot adaptation. Post-train on just ~30 minutes of play data for your specific robot, and see the robot do basic language following and pick-and-place 🤗 (See YAM experiments in our paper for more detail).
++ We also provide the entire codebase & preprocessed dataset to replicate the DreamZero-DROID checkpoint.

🌐 dreamzero0.github.io
💻 github.com/dreamzero0/dre…
RoboArena: robo-arena.github.io/leaderboard
MolmoSpaces: molmospaces.allen.ai/leaderboard
Yunhao (Andy) Ge@GeYunhao·
Human video is the most scalable source of physical intelligence. EgoScale answers a fundamental question, showing that human video exhibits a clear scaling law for dexterous manipulation.
Jim Fan@DrJimFan

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.

We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate. Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans.

Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.

The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:

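The "near-perfect log-linear scaling law (R² = 0.998)" quoted above is a straight-line fit between the log of human-video hours and action-prediction loss. A hedged illustration of such a fit follows; the numbers are invented placeholders, not the EgoScale measurements.

import numpy as np

# Hypothetical (hours of human video, action-prediction loss) pairs.
hours = np.array([500, 1_000, 2_500, 5_000, 10_000, 20_000], dtype=float)
loss = np.array([0.92, 0.85, 0.76, 0.69, 0.62, 0.55])

# Log-linear model: loss ~= a * log(hours) + b
a, b = np.polyfit(np.log(hours), loss, deg=1)
pred = a * np.log(hours) + b

# Coefficient of determination; the tweet reports 0.998 on the real data.
r2 = 1.0 - np.sum((loss - pred) ** 2) / np.sum((loss - loss.mean()) ** 2)
print(f"slope={a:.4f}, intercept={b:.4f}, R^2={r2:.3f}")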
Yunhao (Andy) Ge retweeted
Ruijie Zheng@ruijie_zheng12·
Proud to introduce EgoScale: We pretrained a GR00T VLA model on 20K+ hours of egocentric human video and discovered that robot dexterity can be scaled, not with more robots, but with more human data. A thread on 🧵what we learned. 👇
Yunhao (Andy) Ge retweeted
Zhang Heqing@zhang_heqing·
Who needs a roommate when you have a robot that makes the bed? Neat, quick, and totally in charge—this is the future we ordered! #Robot #BedMaking #FunnyRobot
Yunhao (Andy) Ge retweeted
Xingjian Bai@SimulatedAnneal·
Do causal video diffusers really need dense causal attention at every layer, every denoising step? We looked inside and found: no. Causality is separable from denoising. Here are two surprising observations that hold across architectures, training objectives, and scales.
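For context on what "dense causal attention at every layer, every denoising step" refers to: causal video diffusers typically mask attention at the frame level so tokens in frame t attend only to frames up to t. Here is a minimal, generic Python sketch of that mask (not the authors' code).

import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    # Boolean mask, True = attention allowed. Every token in frame t may
    # attend to all tokens in frames <= t, i.e. the dense causal pattern.
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_idx[:, None] >= frame_idx[None, :]

# Example: 4 frames, 2 tokens per frame -> an 8x8 block-lower-triangular mask.
print(frame_causal_mask(4, 2).int())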
Yunhao (Andy) Ge retweeted
Seonghyeon Ye@SeonghyeonYe·
VLAs (from VLMs) ❌ => WAMs (from Video Models) ✅

Why WAMs?
1️⃣ World Physics: VLMs know the internet, but Video Models implicitly model the physical laws essential for manipulation.
2️⃣ The "GPT Direction": VLAs are like BERT (rely heavily on task-specific post-training). WAMs are like GPT (pre-train & prompt), unlocking incredible zero-shot transfer!

What I want to see in 2026:
📈 Scaling Laws: We will see much clearer scaling laws for robotics compared to VLAs.
🤝 Human-to-Robot Transfer: Unlocking massive transfer capabilities using video as a shared representation space.
🤖 Zero-Shot Mastery: Moving from short-horizon tasks to long-horizon, dexterous manipulation without task-specific demonstrations.

We recently open-sourced the checkpoints, training and inference code. Dive into the research! 👇
📄 Paper: arxiv.org/abs/2602.15922
💻 Code: github.com/dreamzero0/dre…
🤗 HF: huggingface.co/GEAR-Dreams/Dr…
Yunhao (Andy) Ge retweeted
Shenyuan Gao@ShenyuanGao·
🤖 How can we enable zero-shot generalization to unseen scenarios for robot world models?

Thrilled to share DreamDojo 🌎 — an interactive robot world model pretrained on 44K hours of human egocentric videos, the largest and most diverse dataset to date for robot world model learning. Our model not only excels in generalization, but also supports real-time interaction at 10 FPS after distillation. It enables several important applications, including live teleoperation, policy evaluation, and model-based planning at test time.

🔗 Project: dreamdojo-world.github.io
📰 Paper: arxiv.org/abs/2602.06949
🤗 Code & models & datasets: github.com/NVIDIA/DreamDo…

#WorldModels #Robotics #EmbodiedAI #RL #AI #NVIDIA
Sharing more details in the thread 🧵
Yunhao (Andy) Ge retweeted
Jim Fan@DrJimFan·
Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill.

Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible. Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning *inside* a dream.

We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, retry—across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains—at a scale no robot fleet could match.

The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first.

Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, but done through gradient descent and generalizes far beyond the post-training dataset.

A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities:
- Live teleoperation *inside* a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090.
- Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor.
- Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains +17% real-world success out of the box on a fruit packing task.

We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too.

2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:
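The "sample multiple action proposals → simulate them all in parallel → pick the best future" step above is standard sampling-based planning inside a learned world model. A hedged Python sketch of the loop follows, where world_model and score_fn are hypothetical placeholders rather than the DreamDojo API.

import numpy as np

def plan_with_world_model(world_model, score_fn, state, num_proposals=16, horizon=20, action_dim=7):
    # Random-shooting planner: propose action sequences, imagine each rollout
    # with the world model, and keep the sequence whose imagined future scores best.
    best_actions, best_score = None, -np.inf
    for _ in range(num_proposals):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        trajectory = world_model(state, actions)   # imagined rollout in pixel space
        score = score_fn(trajectory)               # e.g. estimated task success
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions

# Dummy placeholders so the sketch executes.
dummy_world_model = lambda s, acts: [s for _ in acts]
dummy_score = lambda traj: float(np.random.rand())
best = plan_with_world_model(dummy_world_model, dummy_score, state=np.zeros((64, 64, 3)))
print(best.shape)  # (20, 7) action sequence chosen for execution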
Yunhao (Andy) Ge retweeted
Unitree@UnitreeRobotics·
Unitree Spring Festival Gala Robots — a Full Release of Additional Details 🥳
Dozens of G1 robots achieved the world’s first fully autonomous humanoid robot cluster Kung Fu performance (with quick movement), pushing motion limits and setting multiple world firsts! H2 made striking appearances at both the Beijing main venue and the Yiwu sub-venue, clad in the Monkey King’s heavy armor and riding a “somersault cloud” played by B2W quadruped robot dogs, delivering New Year blessings from the clouds.
Yunhao (Andy) Ge retweeted
Yuke Zhu@yukez·
📢 New paper from GEAR team @NVIDIARobotics We released DreamZero, a World Action Model that turns video world models into zero-shot robot policies. Built on a pretrained video diffusion backbone, it jointly predicts future video frames and actions. 🌐 dreamzero0.github.io
Joel Jang@jang_yoel

Introducing DreamZero 🤖🌎 from @nvidia
> A 14B “World Action Model” that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots
> The key? Jointly predicting video & actions in the same diffusion forward pass
Project Page: dreamzero0.github.io 🧵 (1/10)

Yunhao (Andy) Ge retweeted
youliang@youliangtan·
Task generalization remains one of the hardest problems in robot learning. In Dream0, we show video generation priors can transfer surprisingly well to robotics. dreamzero0.github.io ☁️ We ranked among the top on RoboArena (DROID) even without robot data pretraining! (wip) 🦓
Seonghyeon Ye@SeonghyeonYe

We just gave robots "imagination," and the results are wild. 🤯 This robot wasn't trained to untie shoes or shake hands. It's never seen these tasks before. It simply "dreams" the future outcome, then acts to make it real. 🧵👇

Yunhao (Andy) Ge retweeted
Jim Fan@DrJimFan·
New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting capability for new verbs, nouns, and environments. If the world model can "dream" the right future in pixels, then the robot can execute well in motors.

We call it "DreamZero", our first World Action Model (WAM). Our team had tons of fun at the lab typing anything we like into an open text prompt, and watching the robot perform tasks it was never trained on. An emergent capability we didn't quite expect. Obviously not GPT-3 reliable yet, but we are marching into the GPT-2 era.

Discoveries:
- Model and data recipe co-evolve. Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity >> repetitions.
- X-embodiment is extremely hard. Pixels are the answer. Different robot morphologies traditionally have a hard time sharing knowledge well. But if we put video first, pixels become the universal bridge connecting different hardware - even videos of human first-person view. DreamZero shows significant robot2robot and human2robot transfer. With only 55 trajectories on a *new*, unseen hardware (~30 min of teleop), it adapts quickly and retains zero-shot prompting ability.

Yesterday I posted about the "Second Pre-training Paradigm": world models are the next-gen foundation of Physical AI, not language backbones. Today, we are proving it works. And 2026 has just begun.

Paper: World Action Models are Zero-Shot Policies. Read it now: (thread)
Yunhao (Andy) Ge retweeted
Seonghyeon Ye@SeonghyeonYe·
We just gave robots "imagination," and the results are wild. 🤯 This robot wasn't trained to untie shoes or shake hands. It's never seen these tasks before. It simply "dreams" the future outcome, then acts to make it real. 🧵👇
Yunhao (Andy) Ge retweeted
Joel Jang@jang_yoel·
Introducing DreamZero 🤖🌎 from @nvidia
> A 14B “World Action Model” that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots
> The key? Jointly predicting video & actions in the same diffusion forward pass
Project Page: dreamzero0.github.io 🧵 (1/10)
Yunhao (Andy) Ge retweeted
Moo Jin Kim@moo_jin_kim·
We release Cosmos Policy 💫: a state-of-the-art robot policy built on a video diffusion model backbone.
- policy + world model + value function — in 1 model
- no architectural changes to the base video model
- SOTA in LIBERO (98.5%), RoboCasa (67.1%), & ALOHA tasks (93.6%) 🧵👇