Tongzhou Mu 🤖🦾🦿@tongzhou_mu
Everyone is talking about "World Models" for robotics, following the buzz from GTC 2026.
But the research landscape is shifting so fast it’s difficult to keep up.
In my view, two dominant paradigms currently ground video world models in robot control.
---
Paradigm 1: Use the Video Model as a Simulator
The first major approach is using video world models to simulate reality. In this framework, the model predicts "what happens next" in either pixel space or latent space, conditioned on text prompts or robot actions. Much like traditional analytical simulators (e.g., IsaacSim, MuJoCo, ManiSkill), these learned simulators are used for data synthesis, planning, and evaluation.
1.1 Synthesizing Data for Policy Training
A representative work is DreamGen [1]. Given an initial frame and a language instruction, a fine-tuned video model synthesizes clips of a robot completing a task. An inverse dynamics model then labels these videos with actions to train a separate robot policy. GR00T N1 [2] uses a similar strategy. Alternatively, models can act as interactive simulators where agents (like UniSim [4]) or humans (like Interactive World Simulator [3]) generate data through interaction.
Key Advantages: thousands of hours of "synthetic experience" at a fraction of the cost of real-world data collection, plus the ability to safely simulate rare, dangerous edge cases.
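To make the recipe concrete, here is a minimal sketch of that generate-then-label pipeline. Everything below (function names, the 7-DoF action space, the stub models) is illustrative, not DreamGen's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_video(first_frame, instruction, horizon=16):
    """Stub for the fine-tuned video model: returns a (T, H, W, C) clip."""
    return np.stack([first_frame] * horizon)  # placeholder rollout

def inverse_dynamics(frame_t, frame_t1):
    """Stub IDM: infers the action that transitions frame_t -> frame_t1."""
    return rng.normal(size=7)  # e.g., a 7-DoF arm action (hypothetical)

def synthesize_dataset(first_frame, instruction):
    video = generate_video(first_frame, instruction)
    # Label consecutive frame pairs with actions -> (observation, action) pairs
    return [(video[t], inverse_dynamics(video[t], video[t + 1]))
            for t in range(len(video) - 1)]

frame0 = rng.random((64, 64, 3))
dataset = synthesize_dataset(frame0, "pick up the red cube")
print(len(dataset), "synthetic (observation, action) pairs for policy training")
```

A separate robot policy is then trained on these pairs by imitation learning.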
1.2 Inference-Time Planning
Instead of following a fixed path, robots can use video models to "imagine" multiple future outcomes. In V-JEPA 2 [5], an action-conditioned video model evaluates different action sequences to find the best next step. This "imagination-based planning" is also a core theme in CLASP [6], SWIM [7], VLP [8], GPC [9], DreamDojo [10], and Cosmos Policy [11]. The challenge remains fitting this heavy computation into real-time control budgets.
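Here is what such imagination-based planning can look like in sketch form, using simple random shooting; the world model, cost function, and dimensions below are toy stand-ins, not V-JEPA 2's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, HORIZON, NUM_CANDIDATES = 8, 4, 5, 64

def world_model(state, action):
    """Stub: predicts the next latent state given (state, action)."""
    return state + 0.1 * action.sum() * np.ones_like(state)

def goal_cost(state, goal):
    """Stub: distance between an imagined final state and the goal."""
    return np.linalg.norm(state - goal)

def plan(state, goal):
    # Sample candidate action sequences and imagine each one's outcome
    candidates = rng.normal(size=(NUM_CANDIDATES, HORIZON, ACTION_DIM))
    costs = []
    for seq in candidates:
        s = state
        for a in seq:
            s = world_model(s, a)
        costs.append(goal_cost(s, goal))
    # Execute only the first action of the best sequence (receding horizon)
    return candidates[int(np.argmin(costs))][0]

action = plan(np.zeros(STATE_DIM), np.ones(STATE_DIM))
print("next action:", action)
```

Every imagined rollout is a full forward pass through the world model, which is exactly why the compute budget is the bottleneck.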
1.3 Policy Evaluation
Video models allow us to test policies before they ever touch physical hardware. Veo Robotics [12] demonstrates that these models can accurately predict the relative performance of candidate policies and support "red teaming" to expose safety violations. This approach is also seen in IRASim [13], 1XWM [14], Ctrl-World [15], and others.
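In sketch form, evaluation amounts to rolling candidate policies out inside the learned model and ranking them; all components here are illustrative stubs:

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(obs, action):   # stub learned dynamics
    return obs + 0.05 * action

def success(obs, goal):         # stub task-success check
    return float(np.linalg.norm(obs - goal) < 0.5)

def evaluate(policy, goal, episodes=20, horizon=30):
    scores = []
    for _ in range(episodes):
        obs = rng.normal(size=goal.shape)
        for _ in range(horizon):
            obs = world_model(obs, policy(obs, goal))
        scores.append(success(obs, goal))
    return np.mean(scores)

policies = {
    "greedy": lambda obs, goal: np.clip(goal - obs, -1, 1),
    "random": lambda obs, goal: rng.normal(size=goal.shape),
}
goal = np.ones(3)
ranking = sorted(policies, key=lambda name: -evaluate(policies[name], goal))
print("predicted ranking (best first):", ranking)
```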
Summary of Paradigm 1: While powerful, there is no "free lunch." These methods depend on prediction accuracy. Our physical world is complex, and teaching video models to handle every edge case without hallucinating physics remains a significant challenge.
---
Paradigm 2: Use the Video Model as a Policy
The second, more integrated paradigm is using the generative video model as the policy (decision-maker) itself. Because the model's native output is video rather than robot actions, several methods have been developed to extract control signals from it.
2.1 Generating Video and Action Jointly
A straightforward idea is to add an action decoder to the video model backbone and run video and action denoising jointly during inference. Representative works include DreamZero [16], Cosmos Policy [11], Motus [17], PAD [18], GR-1 [19], and GR-2 [20] (note that the GR series are not diffusion models). This method leverages the rich spatiotemporal priors of pre-trained models with minimal architecture changes.
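A toy sketch of the joint-denoising idea: a shared backbone fuses video and action token streams, and both are denoised in lockstep. The linear stubs below stand in for real diffusion transformers:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_V, D_A, STEPS = 8, 32, 7, 10  # frames, video/action latent dims, denoise steps

def backbone(video_tokens, action_tokens):
    """Stub shared backbone: fuses video and action token streams."""
    joint = np.concatenate([video_tokens, action_tokens], axis=-1)
    return joint.mean(axis=-1, keepdims=True)  # toy fused feature

def denoise_step(x, fused, t):
    """Stub denoiser: nudges noisy tokens toward the fused feature."""
    return x - (x - fused) / (STEPS - t)

video = rng.normal(size=(T, D_V))    # video tokens start as pure noise
actions = rng.normal(size=(T, D_A))  # ...and so does the action chunk
for t in range(STEPS):
    fused = backbone(video, actions)
    video = denoise_step(video, fused, t)      # denoise video tokens...
    actions = denoise_step(actions, fused, t)  # ...and action tokens jointly
print("denoised action chunk:", actions.shape)
```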
2.2 Extracting Visual Representations for Action Generation
Rather than running full generation, many methods use video models to extract deep visual representations that guide action generation. Example works include VPDD [21], VPP [22], UVA [23], UWM [24], Video Policy [25], and DiT4DiT [26]. A major advantage here is that you don't necessarily need to run multiple denoising steps on giant models, which makes real-time control easier, though it remains unclear whether the full potential of the video models is being utilized.
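Sketch-wise, the recipe is: one (noisy) forward pass through the video model, pool intermediate features, decode actions with a small head. Module names and shapes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def video_model_features(obs_frames, noise_level=0.1):
    """Stub: a single noisy forward pass; returns pooled intermediate
    features, avoiding the cost of full multi-step generation."""
    noisy = obs_frames + noise_level * rng.normal(size=obs_frames.shape)
    return noisy.reshape(len(noisy), -1).mean(axis=0)  # pooled feature

def action_head(features, action_dim=7):
    """Stub lightweight policy head (trained in practice; random here)."""
    w = rng.normal(size=(features.shape[0], action_dim)) * 0.01
    return features @ w

obs = rng.random((4, 16, 16, 3))  # last 4 camera frames
feat = video_model_features(obs)
print("action:", action_head(feat))
```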
2.3 Open-loop Video Generation + Video-to-Action Translation
A rising trend involves generating a "desired future" video and using a separate inverse dynamics model to translate that video into actions. UniPi [27] pioneered this, followed by This&That [28], TesserAct [29], and 1XWM Self-Learning [30]. Some methods generate videos of humans completing tasks (Dreamitate [31], Gen2Act [32], LVP [33]) and translate those to robot actions. This approach allows video models to do exactly what they were trained for: video generation.
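A minimal sketch of that generate-once, translate, execute recipe, in the spirit of UniPi but with stub models and a hypothetical action space:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_plan_video(obs, instruction, horizon=8):
    """Stub text-conditioned video model: an imagined task completion."""
    return np.stack([obs + 0.1 * t for t in range(horizon)])

def inverse_dynamics(f0, f1):
    """Stub IDM: the action that realizes the transition f0 -> f1."""
    return (f1 - f0).mean() * np.ones(7)

obs = rng.random((16, 16, 3))
plan = generate_plan_video(obs, "open the drawer")
actions = [inverse_dynamics(plan[t], plan[t + 1]) for t in range(len(plan) - 1)]
# Open-loop: all actions run without re-observing the world, which is
# exactly where the hallucination risk discussed in 2.4 comes from.
print(len(actions), "actions executed open-loop")
```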
2.4 Closed-loop Video Generation + Video-to-Action Translation
Open-loop generation often leads to hallucinations: the model might "see" the robot picking up an apple that isn't actually there. Closed-loop generation avoids this by constantly conditioning on the latest real-world observations, replacing generated frames with real ones in the next call. Recently, mimic-video [34] and LingBot-VA [35] reached real-time speeds using KV caching and partial denoising. Most notably, the DVA [36] model released this month achieves real-time generation with full video denoising, i.e., denoising from pure noise all the way to a clean video at every control step. This approach seems especially promising to me because it reduces robot control to a problem of real-time video generation, which can directly benefit from large-scale video pre-training.
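To illustrate the loop structure (not any specific system's implementation), here is a sketch where each control step re-conditions the video model on real frames; all functions are toy stubs:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_next_frames(context_frames, k=2):
    """Stub video model: predicts k future frames from real context."""
    return [context_frames[-1] + 0.05 * rng.normal(size=context_frames[-1].shape)
            for _ in range(k)]

def inverse_dynamics(f0, f1):
    return (f1 - f0).mean() * np.ones(7)  # stub video-to-action translation

def env_step(action):
    return rng.random((8, 8, 3))  # stub camera observation from the real world

obs = rng.random((8, 8, 3))
context = [obs]
for step in range(5):
    imagined = generate_next_frames(context)
    action = inverse_dynamics(context[-1], imagined[0])
    obs = env_step(action)             # act in the real world
    context = (context + [obs])[-4:]   # replace imagination with reality
print("ran 5 closed-loop control steps")
```

Because the context is refreshed with real frames every step, the model's imagination can never drift more than one generation call away from reality.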
---
To me, the key takeaway from this evolution is how we have begun bridging the gap between the digital and physical worlds. Instead of trying to manually program every physical law, we are leveraging the implicit physics embedded in billions of web videos.
Whether we use these models as simulators or as direct policies, the objective is the same: giving robots "physical common sense." By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.
[References in the comment]