Julen Urain

285 posts


@robotgradient

Robotics Tinkerer. RS @Amazon FAR. Prev: @META (FAIR), @DFKI, @TUDarmstadt https://t.co/RQpq7Prbln X https://t.co/umZQeDjJv4

Joined November 2017
1.4K Following · 1.2K Followers
Julen Urain retweeted
Yixuan Wang
Yixuan Wang@YXWangBot·
1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator: an action-conditioned world model that supports stable long-horizon interaction.
3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090 🔥
4/ Why this matters: it unlocks two critical robotics applications: 🚀 scalable data generation for policy training 🧪 faithful policy evaluation
5/ You can play with our world model NOW at yixuanwang.me/interactive_wo…. NO git clone, NO pip install, NO python. Just click and play!
NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera.
More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning
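For readers skimming the thread, a minimal sketch of what an action-conditioned, interactive world-model rollout loop looks like in code; the WorldModel class and its methods are hypothetical placeholders for illustration, not the released system's API:

```python
# Sketch of an interactive, action-conditioned world-model rollout loop.
# All names here are hypothetical placeholders.
import time

class WorldModel:
    def reset(self, first_frame):
        """Initialize internal state from a real starting frame."""
        raise NotImplementedError

    def step(self, action):
        """Predict the next frame in pixel space, conditioned on the action."""
        raise NotImplementedError

def interactive_rollout(model, first_frame, get_user_action, fps=15, minutes=10):
    model.reset(first_frame)
    frames = []
    for _ in range(int(fps * 60 * minutes)):
        action = get_user_action()          # keyboard / gamepad / policy output
        frames.append(model.step(action))   # purely generated, no real camera
        time.sleep(1.0 / fps)               # pace the loop at the target frame rate
    return frames
```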
Julen Urain retweeted
Jitendra MALIK
Jitendra MALIK@JitendraMalikCV·
Pretraining with dynamics models of motor behavior (aka world models) from video will be much more central to robotics than VLMs. There are multiple choices of representations (e.g. 3D? JEPA?) but we will figure this out by and by. Exciting times!
Jim Fan@DrJimFan

- Project website: dreamdojo-world.github.io
- Paper: arxiv.org/abs/2602.06949
- Code repo and model ckpts: github.com/NVIDIA/DreamDo…
This is a huge team effort at NVIDIA. All credits go to the wonderful teams who poured their hearts into it!

Julen Urain retweeted
Wenlong Huang
Wenlong Huang@wenlong_huang·
Fully agreed with the sentiment that much of computer vision research (concretely, those not for “human consumption”) should be grounded in robotics. But as a robotics researcher, I think the more nuanced question is: how can we *rethink* these intermediate representations for embodied intelligence rather than discarding them?

Why? The challenge, as also pointed out in Vincent’s article, is precisely the lack of perception-action data at scale. This is why intermediate representations IMO are *preferable rather than obsolete*, because they open up training from scalable data sources. This can include even the vision/language encoders people love and use in robot learning — it’s hard to imagine training low-level visual representation or high-level language understanding purely from limited robot data.

The same goes for intermediate representations at the structure level — world modeling, learning from Internet videos, learning from humans, and simulation — many of which still rely on 3D representations too.
Vincent Sitzmann@vincesitzmann

In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision (mapping imagery to intermediate representations: 3D, flow, segmentation...) is about to go away. vincentsitzmann.com/blog/bitter_le…

Julen Urain retweeted
Jitendra MALIK
Jitendra MALIK@JitendraMalikCV·
At the RI seminar at CMU yesterday, I presented a 3 level analysis of robot skills & discussed the pros and cons of teleoperation, simulation, and learning from videos, before presenting our research. Enjoy! youtube.com/watch?v=ry8iti…
Julen Urain
Julen Urain@robotgradient·
@artemZholus It reminds me of TD learning or even GAIL. I am not convinced by bootstrapping for generative models.
Artem Zholus
Artem Zholus@artemZholus·
I am reading the drifting models paper and I am very excited about it! One observation: I think the drifting field is an approximation of the gradient field of a linear critic from Wasserstein GANs, or at least they are closely related. What do you think?
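For context on the comparison: a minimal sketch, in standard Wasserstein-GAN notation (not taken from the drifting-models paper), of the critic and the vector field it induces:

```latex
% Kantorovich-Rubinstein dual of the Wasserstein-1 distance,
% with a 1-Lipschitz critic f:
\[
W_1(p_{\mathrm{data}}, p_g)
  = \sup_{\lVert f \rVert_{L} \le 1}
    \mathbb{E}_{x \sim p_{\mathrm{data}}}[f(x)]
    - \mathbb{E}_{x \sim p_{g}}[f(x)]
\]
% The critic induces a vector field on samples: nudging a generated
% sample along the critic's gradient increases its score, i.e. moves
% it toward the data distribution:
\[
x \;\leftarrow\; x + \eta \, \nabla_{x} f(x)
\]
```

The conjecture above is that the learned drifting field plays a role analogous to the critic gradient ∇ₓ f(x).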
Chris Paxton
Chris Paxton@chris_j_paxton·
One side note is that @notmahi is one of the relatively few people who has been consistently NOT training foundation models, but instead aiming to train tiny models that actually just work anywhere, and this is kind of the obvious endgame of that philosophy
C Zhang@ChongZitaZhang

okay, actually yes

Julen Urain retweeted
Hao Zhang
Hao Zhang@HaoZhang623·
As video world models become increasingly powerful, do we still need explicit 3D?

A commonly misunderstood point is this: video world models are not “just 2D.” Their ability to maintain multi-view consistency, temporal stability, and realistic interaction necessarily implies that their latent knowledge encodes 3D world structure. Without some notion of 3D, consistency itself would not be possible. The real distinction, therefore, is not whether a model has 3D but whether that 3D exists implicitly or explicitly.

Implicit 3D lives inside latent spaces and network weights. It supports generation, but it is difficult to localize, edit, constrain, or reason about. It allows the world to exist, but not to be used. Explicit 3D, in contrast, exists as structure and state: it is addressable, editable, composable, and transferable. Its purpose is not better visual fidelity, but operability: allowing the world to be manipulated, controlled, and executed.

From this perspective, video and 3D are not competing paradigms but a layered system: 2D/video is the interface to human perception; 3D is the interface to the physical world. They can reinforce each other, but neither forms a closed loop on its own.

In practice, data, not model architecture, sets the upper bound of world models. Explicit 3D may not be the final user-facing representation, but it is likely the most effective pathway toward scalable, high-quality, and controllable data. Through explicit 3D/4D representations, worlds can be constructed systematically: interactions can be programmatically sampled, states and actions can be composed, rendered into images and videos, and fed back to train video world models. Seen this way, 3D is not the destination; it is the starting point for scaling.

What truly drives progress forward is never the model itself. Whether we capture the world or imagine new ones, whether data comes from observation or intent, whether we model what is or what should be: the direction of the world is ultimately determined by human choice and purpose. Models may extend the world, but humans decide where it goes. #Genie3 #worldmodel
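A minimal sketch of the data-generation loop described in the post above (explicit 3D worlds, programmatically sampled interactions, rendered video, training data for video world models); every class and method name is a hypothetical placeholder, not a specific tool:

```python
# Sketch of using explicit 3D/4D scenes as a scalable data source
# for training video world models. All names are illustrative placeholders.

def generate_world_model_data(scene_library, renderer, n_episodes, horizon):
    dataset = []
    for _ in range(n_episodes):
        scene = scene_library.sample()              # explicit, editable 3D world state
        trajectory = []
        for _ in range(horizon):
            action = scene.sample_interaction()     # programmatically composed interaction
            scene.step(action)                      # advance the explicit world state
            frame = renderer.render(scene)          # project 3D state to pixels
            trajectory.append((frame, action))
        dataset.append(trajectory)
    return dataset                                  # (video, action) pairs for training
```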
Julen Urain
Julen Urain@robotgradient·
There has been a clear trend in the last months of moving from VLA-type approaches to Video Generative Models + Inverse Dynamics Models (VAM). While the main reason for this recent growth is probably the latest improvements in video generative models, I believe this shift is relevant for robotics. While VLAs distill the foundation models' knowledge through latent representations that intertwine semantic and spatial information, VAMs distill this knowledge in a more explicit way, representing it spatially. I believe this spatial grounding of VAMs might lead to far larger generalization capabilities with respect to VLAs, and I am optimistic about even more 3D spatially grounded foundation models, in the direction of @wenlong_huang's recent point-world.github.io
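To make the contrast concrete, a minimal sketch of the video-prediction-plus-inverse-dynamics (VAM-style) decomposition described above, versus a VLA that maps observation and instruction directly to an action; all class and method names are hypothetical placeholders, not any released system:

```python
# Sketch of a Video-prediction + Inverse-Dynamics (VAM-style) controller.
# All classes and methods here are hypothetical placeholders.

class VideoGenerativeModel:
    def predict_future_frames(self, frame, instruction, horizon):
        """Generate `horizon` future frames conditioned on the current
        observation and a language instruction (pixel-space prediction)."""
        raise NotImplementedError

class InverseDynamicsModel:
    def infer_action(self, frame_t, frame_t1):
        """Recover the low-level action that takes frame_t to frame_t1."""
        raise NotImplementedError

def vam_step(video_model, idm, current_frame, instruction, horizon=8):
    # 1) Imagine how the scene should evolve to satisfy the instruction.
    imagined = video_model.predict_future_frames(current_frame, instruction, horizon)
    # 2) Ground the imagined pixels back into executable actions.
    pairs = zip([current_frame] + imagined[:-1], imagined)
    actions = [idm.infer_action(a, b) for a, b in pairs]
    return actions  # execute the first action(s), then re-plan
```

A VLA, by contrast, would map (observation, instruction) to an action in a single forward pass through a shared latent space, without an explicit pixel-space prediction in between.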
Julen Urain
Julen Urain@robotgradient·
@junjungoal Very cool! Happy to see the value-function-based approach work that well! A very refreshing approach compared to end-to-end generative model approaches :)
Jun Yamada
Jun Yamada@junjungoal·
How can a closed-loop policy safely and robustly grasp novel objects in cluttered environments? We introduce Grasp-MPC: a hybrid of model-based control and data-driven approaches for generalisable and safe 6DoF closed-loop grasping. 🧵👇 (1/N)
Julen Urain
Julen Urain@robotgradient·
@DrJimFan Dwarf Fortress is the perfect fit for this 🥹🥹
Jim Fan
Jim Fan@DrJimFan·
The famed Stanford Smallville is officially open-source! 25 AI agents inhabit a digital Westworld, unaware that they are living in a simulation. They go to work, gossip, organize socials, make new friends, and even fall in love. Each has a unique personality and backstory.

Smallville is among the most inspiring AI agent experiments in 2023. We often talk about a single LLM's emergent abilities, but multi-agent emergence could be way more complex and fascinating at scale. A population of AIs can play out the evolution of an entire civilization. Endless new possibilities ahead. Gaming will be the first to feel the impact.

Github: github.com/joonspk-resear…
Paper: arxiv.org/abs/2304.03442
Authors: @joon_s_pk @joseph_c_obrien @carriejcai @merrierm @percyliang @msbernst
Julen Urain
Julen Urain@robotgradient·
While I really liked the article, it feels to me that this physical common sense can be better captured by predicting next observations (i.e. world models) and planning over them, rather than by training a policy to predict the next action (i.e. behavioral cloning).
Andy Zeng@andyzengineer

x.com/i/article/2016…
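A minimal sketch of the contrast drawn in the post above: planning over a learned next-observation (world) model versus directly predicting the next action with behavioral cloning; dynamics, reward_of, bc_policy, and action_space are hypothetical placeholders:

```python
# Sketch: plan over imagined next observations vs. clone actions directly.
# All objects passed in are hypothetical placeholders.

def plan_with_world_model(dynamics, reward_of, obs, action_space,
                          horizon=10, n_samples=256):
    """Random-shooting planning: imagine next observations, score them, act."""
    best_score, best_first_action = float("-inf"), None
    for _ in range(n_samples):
        candidate = [action_space.sample() for _ in range(horizon)]
        imagined_obs, score = obs, 0.0
        for a in candidate:
            imagined_obs = dynamics.predict_next_obs(imagined_obs, a)
            score += reward_of(imagined_obs)
        if score > best_score:
            best_score, best_first_action = score, candidate[0]
    return best_first_action  # execute, observe, re-plan

def act_with_behavioral_cloning(bc_policy, obs):
    """Behavioral cloning: directly predict the next action from the observation."""
    return bc_policy.predict_action(obs)
```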

Julen Urain
Julen Urain@robotgradient·
@jparkerholder The spatio-temporal consistency looks superb! If solid, the implications for robotics are huge! Imagine generating training environments on-the-fly from natural language. Excited to see how this evolves toward embodied agent training.
Julen Urain
Julen Urain@robotgradient·
@drfeifei 100% on board with 3D/4D world models! Generative 3D environments could unlock much broader domain randomization and edge-case coverage. I am still curious how the physics fidelity compares to hand-crafted sims for contact-rich manipulation tasks.
Fei-Fei Li
Fei-Fei Li@drfeifei·
The dream of robots helping people live and work better in the physical world begins with helping robots to become more spatially intelligent by learning from the infinitely diverse and intricate environments of the 3D/4D worlds 🤖🤩
World Labs@theworldlabs

World generation is a bottleneck for robotics. We’re exploring how generative 3D worlds can reduce manual simulation setup and enable broader, more realistic evaluation 🧵

Wenlong Huang
Wenlong Huang@wenlong_huang·
What if we can simulate an *interactive 3D world*, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts env dynamics given RGB-D capture and robot actions. 🌐 point-world.github.io from @Stanford @nvidia
Julen Urain retweeted
Irmak Guzey
Irmak Guzey@irmakkguzey·
We just released AINA, a framework for learning robot policies from Aria 2 demos, and are now open-sourcing the code: github.com/facebookresear…. It includes:
✅ Aria 2 data processing into 3D observations like shown
✅ Training of point-based policies
✅ Calibration
Give it a try!
Julen Urain
Julen Urain@robotgradient·
This was very challenging and very cool to see evolve! I personally was not sure if it would work, but @irmakkguzey pushed so hard to show it does. Learning dexterous robot policies with only human video data, using the egocentric view from Aria 2 glasses, chill and easy 😁
Irmak Guzey@irmakkguzey

Dexterous manipulation by directly observing humans - a dream in AI for decades - is hard due to visual and embodiment gaps. With simple yet powerful hardware - Aria 2 glasses 👓 - and our new work AINA 🪞, we are now one significant step closer to achieving this dream.

Julen Urain retweeted
Bingyi Kang
Bingyi Kang@bingyikang·
After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀
Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video.
In pursuit of minimal modeling, DA3 reveals two key insights:
💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture.
✨ A single depth-ray representation is enough. No complex 3D tasks.
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series.
The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen. 👇(1/n) #DepthAnything3
Julen Urain
Julen Urain@robotgradient·
@Ed__Johns Super impressive, huge congratulations 😊
Edward Johns
Edward Johns@Ed__Johns·
I'm very excited to finally announce one of the most ambitious projects we've worked on — which makes the front cover of Science Robotics today: ☀️ Learning a Thousand Tasks in a Day ⭐️ Everyday tasks — like those below — can now be learned from a single demonstration each...
Stone Tao
Stone Tao@Stone_Tao·
sim2real experiment died because a quaternion flipped into a subspace my RL policy had never seen before… that’s a new one for me 🥲