Eric Chan

62 posts

Eric Chan

@ericryanchan

Chief Scientist of Rhoda AI; prev. PhD Student, Stanford University

Joined December 2020
108 Following · 754 Followers
Pinned Tweet
Eric Chan@ericryanchan
Today, we announce our team’s progress in pursuing a different type of foundation model for robotics: the Direct Video Action Model (DVA), our best effort to turn robotics into a generative modeling problem we can scale. Technical blog: rhoda.ai/research/direc…
12 replies · 26 reposts · 198 likes · 20.1K views
Eric Chan retweeted
Qianqian Wang@QianqianWang5
The shell game is a fun challenge that cannot be solved by looking at a single frame. The model has to track every move, from the moment the object is hidden. Excited to share this!
Rhoda AI@RhodaAI

Here’s something we’ve never seen done before. Real-world tasks are long and ambiguous. Solving them requires visual memory and state tracking. Most robot policies only see the last few frames. Ours doesn't. We put our DVA, FutureVision, to the perfect testbed: the shell game 🐚. The DVA nails it.

2 replies · 7 reposts · 70 likes · 11K views
Eric Chan retweeted
Rhoda AI@RhodaAI
Here’s something we’ve never seen done before. Real-world tasks are long and ambiguous. Solving them requires visual memory and state tracking. Most robot policies only see the last few frames. Ours doesn't. We put our DVA, FutureVision, to the perfect testbed: the shell game 🐚. The DVA nails it.
8 replies · 38 reposts · 233 likes · 83.6K views
Eric Chan@ericryanchan
@isskoro @rhodaai Glad to have you in the Bay Area and super excited to work together again :)
0 replies · 0 reposts · 1 like · 145 views
Ivan Skorokhodov@isskoro
After almost 3 years at Snap, several dozen research and engineering projects, and countless wild moments lived through together with the incredible team, I've decided to start a new adventure in robotics and join @rhodaai to bring the technological singularity a little closer.
9 replies · 1 repost · 68 likes · 3.5K views
Rhoda AI@RhodaAI
1/ We are speed-running industrial robotics. It took us just 19 days from the first day of data collection to filming a 2.5-hour continuous run of our model autonomously breaking down industrial containers, with zero human intervention. The data efficiency of our DVA model is fundamentally changing how fast we bring robots out of the lab and into the factory. Autonomous operation after just 3 hours of data collection at a customer factory.
11 replies · 37 reposts · 167 likes · 24.7K views
Eric Chan@ericryanchan
This is exciting because those 19 days include identifying how to do the task, teleoperation practice, many changes to the task (adding task segments such as trash picking, pushing and pulling of the boxes, and randomized starts), and gradually changing the environment by adding ball tables and roller tables as they arrived. It was exciting to see that the very first models we trained, with less than an afternoon of data collection, were already performing the task quite reasonably!
0 replies · 0 reposts · 5 likes · 273 views
Eric Chan retweeted
Tongzhou Mu 🤖🦾🦿@tongzhou_mu
Everyone is talking about "World Models" for robotics, following the buzz from GTC 2026. But the research landscape is shifting so fast it’s difficult to keep up. In my view, here are the two dominant paradigms currently grounding the video world models in robot control.

---

Paradigm 1: Use the Video Model as a Simulator

The first major approach is using video world models to simulate reality. In this framework, the model predicts "what happens next" in either pixel space or latent space, conditioned on text prompts or robot actions. Much like traditional analytical simulators (e.g., IsaacSim, MuJoCo, ManiSkill), these learned simulators are used for data synthesis, planning, and evaluation.

1.1 Synthesizing Data for Policy Training
A representative work is DreamGen [1]. Given an initial frame and a language instruction, a fine-tuned video model synthesizes clips of a robot completing a task. An inverse dynamics model then labels these videos with actions to train a separate robot policy. GR00T N1 [2] uses a similar strategy. Alternatively, models can act as interactive simulators where agents (like UniSim [4]) or humans (like Interactive World Simulator [3]) generate data through interaction. Key advantages: thousands of hours of "synthetic experience" at a lower cost and the ability to safely simulate rare, dangerous edge cases.

1.2 Inference-Time Planning
Instead of following a fixed path, robots can use video models to "imagine" multiple future outcomes. In V-JEPA 2 [5], an action-conditioned video model evaluates different action sequences to find the best next step. This "imagination-based planning" is also a core theme in CLASP [6], SWIM [7], VLP [8], GPC [9], DreamDojo [10], and Cosmos Policy [11]. The challenge remains fitting this heavy computation into real-time control budgets.

1.3 Policy Evaluation
Video models allow us to test policies before they ever touch physical hardware. Veo Robotics [12] demonstrates that these models can accurately predict relative performance and perform "red teaming" to expose safety violations. This approach is also seen in IRASim [13], 1XWM [14], Ctrl-World [15], and others.

Summary of Paradigm 1: While powerful, there is no "free lunch." These methods depend on prediction accuracy. Our physical world is complex, and teaching video models to handle every edge case without hallucinating physics remains a significant challenge.

---

Paradigm 2: Use the Video Model as a Policy

The second, more integrated paradigm is using the generative video model as the policy (decision-maker) itself. Because the native outputs are videos rather than robot actions, several methods have been developed to obtain control signals.

2.1 Generating Video and Action Jointly
A straightforward idea is to add an action decoder to the video model backbone and run video and action denoising jointly during inference. Representative works include DreamZero [16], Cosmos Policy [11], Motus [17], PAD [18], GR-1 [19], and GR-2 [20] (note that the GR series are not diffusion models). This method leverages the rich spatiotemporal priors of pre-trained models with minimal architecture changes.

2.2 Extracting Visual Representations for Action Generation
Rather than full generation, many methods use video models to extract deep visual representations to guide action generation. Example works include VPDD [21], VPP [22], UVA [23], UWM [24], Video Policy [25], and DiT4DiT [26]. A major advantage here is that you don’t necessarily need to run multiple denoising steps on giant models, making real-time control easier, though it remains unclear if the full potential of the video models is being utilized.

2.3 Open-loop Video Generation + Video-to-Action Translation
A rising trend involves generating a "desired future" video and using a separate inverse dynamics model to translate that video into actions. UniPi [27] pioneered this, followed by This&That [28], TesserAct [29], and 1XWM Self-Learning [30]. Some methods generate videos of humans completing tasks (Dreamitate [31], Gen2Act [32], LVP [33]) and translate those to robot actions. This approach allows video models to do exactly what they were trained for: video generation.

2.4 Closed-loop Video Generation + Video-to-Action Translation
Open-loop generation often leads to hallucinations: the model might "see" the robot picking up an apple that isn't actually there. Closed-loop generation avoids this by constantly conditioning on the latest real-world observations, replacing generated frames with real ones in the next call. Recently, mimic-video [34] and LingBot-VA [35] reached real-time speeds using KV caching and partial denoising. Most notably, the DVA [36] model released this month manages real-time generation with full video denoising, which means denoising pure noise all the way to clean video for every step. This approach seems really promising to me, because it reduces robot control into a problem of real-time video generation, which can directly benefit from large-scale video pre-training.

---

To me, the key takeaway from this evolution is how we have begun bridging the gap between the digital and physical worlds. Instead of trying to manually program every physical law, we are leveraging the implicit physics embedded in billions of web videos. Whether we use these models as simulators or as direct policies, the objective is the same: providing robots with a “physical common sense.” By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.

[References in the comment]
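To make paradigm 2.4 above concrete, here is a minimal Python sketch of a closed-loop video-generation policy with video-to-action translation. Every interface in it (the video model's `generate` call, the inverse dynamics model's `predict` call, the context length, the horizon) is a hypothetical placeholder, not the actual API of DVA or of any system cited in the thread.

```python
from collections import deque


class ClosedLoopVideoActionPolicy:
    """Minimal sketch of paradigm 2.4: closed-loop video generation plus
    video-to-action translation. All interfaces are hypothetical."""

    def __init__(self, video_model, inverse_dynamics, context_len=256, horizon=8):
        self.video_model = video_model            # hypothetical: generates future frames from past frames
        self.inverse_dynamics = inverse_dynamics  # hypothetical: maps (frame_t, frame_t+1) -> action
        self.context = deque(maxlen=context_len)  # rolling window of *real* observations only
        self.horizon = horizon                    # number of future frames to imagine per step

    def step(self, observation, instruction):
        # Condition on the latest real frame; generated frames never enter the context,
        # which is what keeps the loop "closed" and limits hallucination drift.
        self.context.append(observation)

        # Imagine a short clip of the task progressing from the current state.
        imagined = self.video_model.generate(
            context=list(self.context),
            prompt=instruction,
            num_frames=self.horizon,
        )

        # Translate consecutive imagined frames into low-level actions.
        frames = [observation, *imagined]
        actions = [
            self.inverse_dynamics.predict(prev, nxt)
            for prev, nxt in zip(frames[:-1], frames[1:])
        ]

        # Execute only the first action (MPC-style); the next call replans from the
        # real observation that execution produces.
        return actions[0]
```

Dropping the re-conditioning on real frames (i.e., feeding generated frames back into the context) would turn this sketch into the open-loop variant described in 2.3.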
9 replies · 82 reposts · 571 likes · 37.3K views
Rhoda AI@RhodaAI
Most robot demos are “golden runs”: a perfect take selected from many attempts. But real-world deployment is about continuous operation. Watch our DVA model tackle a real-world decanting task for 1.5 hours straight: uncut, with zero human intervention. 🧵👇
4 replies · 10 reposts · 45 likes · 4K views
Eric Chan@ericryanchan
@rhodaai The most challenging part of a real-world task is handling all of the edge cases. A powerful base model is needed to achieve high robustness without requiring a lot of robot data.
0 replies · 0 reposts · 4 likes · 91 views
Eric Chan retweeted
Yilun Du@du_yilun
Robot video foundation models can build very powerful robot manipulation policies! These policies enable complex, dexterous manipulation, solve tasks that require long-term visual memory, and do in-context demonstration learning!
Rhoda AI@RhodaAI

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

0 replies · 3 reposts · 24 likes · 2.5K views
Eric Chan retweeted
Stephen James@stepjamUK
Excited to see @rhodaai come out of stealth! As their advisor, I've had a front-row seat to their work on Direct Video-Action Models, which reformulate robot control as video generation. The data efficiency here is super promising: complex industrial tasks learned from just ~10 hours of robot data. Big things ahead!
Rhoda AI@RhodaAI

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

2 replies · 2 reposts · 13 likes · 1.4K views
Eric Chan@ericryanchan
Vincent has been an inspiration for me since I started in AI; it's not an exaggeration to say that I wouldn't have done research at all if it were not for him. Thank you for the kind words!
Vincent Sitzmann@vincesitzmann

These are very impressive results! The Rhoda team has decisively gotten "video models for robotics" to work. They train a generalist real-time, causal video model that they then quickly fine-tune using task-specific data to generate video plans (1/n)

0 replies · 0 reposts · 11 likes · 3.1K views
Eric Chan@ericryanchan
@QianqianWang5 is a brilliant researcher and we're very lucky to have her on the team! I'm especially excited about her explorations into handling long context, since it's so important for pushing generalization and task complexity.
Qianqian Wang@QianqianWang5

Very excited to share our exploration of a new robotics foundation model at Rhoda AI. We train a causal video model from scratch, unlocking new capabilities for robust, long-horizon closed-loop robot control. Learn more: rhoda.ai/research/direc…

0 replies · 0 reposts · 21 likes · 651 views
Vincent Sitzmann@vincesitzmann
These are very impressive results! The Rhoda team has decisively gotten "video models for robotics" to work. They train a generalist real-time, causal video model that they then quickly fine-tune using task-specific data to generate video plans (1/n)
Rhoda AI@RhodaAI

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

1 reply · 2 reposts · 40 likes · 7.5K views
Eric Chan@ericryanchan
But the long context also gives a very natural way of doing one-shot learning: we can simply shove the example demonstration into the context window. This may eventually let us do real tasks without any robot data at all! x.com/rhoda_ai_/stat…
Rhoda AI@RhodaAI

Because we support long-context visual memory, our robots can learn on the fly. Show the robot a single human demonstration, and it understands both the intent and the motion. It can even extrapolate to novel objects and environments it's never seen before. 🧺✍️

0 replies · 1 repost · 6 likes · 1.1K views
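As a rough sketch of the one-shot, in-context idea in the tweet above, the demonstration clip can simply occupy the front of a long visual context, with real observations appended behind it during the rollout. The names below (`video_policy.act`, `env.step`, `demo_frames`) are hypothetical placeholders, not Rhoda's actual interface.

```python
def rollout_with_demo(video_policy, env, demo_frames, instruction, max_steps=500):
    """One-shot, in-context conditioning: a single demonstration clip sits at the
    front of the context window; real observations are appended behind it."""
    context = list(demo_frames)      # hypothetical: frames of one human/robot demonstration
    obs = env.reset()
    for _ in range(max_steps):
        context.append(obs)          # long-context model sees demo + full rollout history
        action = video_policy.act(context=context, prompt=instruction)  # hypothetical API
        obs, done = env.step(action)
        if done:
            break
    return obs
```

The only conditioning signal here is the demonstration itself; no task-specific robot training data is assumed, which is the "no robot data at all" ambition the tweet points to.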
Eric Chan@ericryanchan
Another key advantage is that our models gain the ability to handle long context almost for free, by training on lots of long videos. For robotics, this is important for handling long-context tasks. x.com/rhoda_ai_/stat…
Rhoda AI@RhodaAI

Most robots have "amnesia": they only see a few frames at a time. 🧠 In contrast, our model natively supports hundreds of frames of visual context, enabling it to:
→ Keep track of the world state
→ Handle complex, multi-step tasks end-to-end

1 reply · 1 repost · 6 likes · 1.6K views
Eric Chan@ericryanchan
@tianyuanzhang99 Thank you, Tianyuan! Yes, we see the same benefits—it’s absolutely more scalable because there is so much more data, and in addition, the ability to simulate next states opens up a world of possibilities for planning, eval, and inference time scaling!
0 replies · 0 reposts · 1 like · 134 views
Tianyuan Zhang@tianyuanzhang99
Congrats! Excited to see video-generated actions being deployed in the real world. The video model learns both world simulation and action planning. More importantly, it's not data-bounded yet, but compute-bounded.
Eric Chan@ericryanchan

@startupjag @rhodaai Incredibly excited to introduce a new type of foundation model for robotics. At its core, robotics is a data problem, but that doesn't mean collecting data directly is the only solution.

1 reply · 0 reposts · 21 likes · 6K views
Eric Chan@ericryanchan
Thrilled to announce what we’ve been working on for the last 17 months, at the intersection of real-time video generation and robotics! We’ve published a technical blog that showcases some of the things we’ve learned along the way.
Rhoda AI@RhodaAI

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

1 reply · 1 repost · 14 likes · 1.9K views
Eric Chan retweeted
Vinod Khosla@vkhosla
The bar for robotics isn’t lab demos — it’s autonomous operation in real production environments. What impressed me about @rhodaai was seeing that level of performance with remarkably little robot training data. Pretraining on internet-scale video to build a strong physical prior may seem unconventional today, but approaches like this are what will ultimately unlock general-purpose robotics.
Jagdeep Singh@startupjag

After operating in stealth for the last 18 months @rhodaai , we’re excited today to finally show the world what we’ve been working on. We believe we’re on a path to physical AGI with the launch of our brand new foundation model, the Direct Video Action (DVA) model.

23 replies · 38 reposts · 292 likes · 68.8K views