Julen Urain

285 posts


@robotgradient

Robotics Tinkerer. RS @Amazon FAR. Prev: @META (FAIR), @DFKI, @TUDarmstadt https://t.co/RQpq7Prbln X https://t.co/umZQeDjJv4

Joined November 2017
1.4K Following · 1.2K Followers
Julen Urain retweeted
Yixuan Wang
Yixuan Wang@YXWangBot·
1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator: an action-conditioned world model that supports stable long-horizon interaction.
3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090 🔥
4/ Why this matters: it unlocks two critical robotics applications: 🚀 scalable data generation for policy training 🧪 faithful policy evaluation
5/ You can play with our world model NOW at yixuanwang.me/interactive_wo…. NO git clone, NO pip install, NO python. Just click and play!
NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera.
More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning
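For readers skimming the thread, a minimal sketch of what an action-conditioned, interactive world-model rollout loop looks like in code; the WorldModel class and its methods are hypothetical placeholders for illustration, not the released system's API:

```python
# Sketch of an interactive, action-conditioned world-model rollout loop.
# All names here are hypothetical placeholders.
import time

class WorldModel:
    def reset(self, first_frame):
        """Initialize internal state from a real starting frame."""
        raise NotImplementedError

    def step(self, action):
        """Predict the next frame in pixel space, conditioned on the action."""
        raise NotImplementedError

def interactive_rollout(model, first_frame, get_user_action, fps=15, minutes=10):
    model.reset(first_frame)
    frames = []
    for _ in range(int(fps * 60 * minutes)):
        action = get_user_action()          # keyboard / gamepad / policy output
        frames.append(model.step(action))   # purely generated, no real camera
        time.sleep(1.0 / fps)               # pace the loop at the target frame rate
    return frames
```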
Julen Urain retweeted
Jitendra MALIK
Jitendra MALIK@JitendraMalikCV·
Pretraining with dynamics models of motor behavior (aka world models) from video will be much more central to robotics than VLMs. There are multiple choices of representations (e.g. 3D? JEPA?) but we will figure this out by and by. Exciting times!
Jim Fan@DrJimFan

- Project website: dreamdojo-world.github.io
- Paper: arxiv.org/abs/2602.06949
- Code repo and model ckpts: github.com/NVIDIA/DreamDo…
This is a huge team effort at NVIDIA. All credits go to the wonderful teams who poured their hearts into it!

Julen Urain retweeted
Wenlong Huang
Wenlong Huang@wenlong_huang·
Fully agreed with the sentiment that much of computer vision research (concretely, those not for “human consumption”) should be grounded in robotics. But as a robotics researcher, I think the more nuanced question is: how can we *rethink* these intermediate representations for embodied intelligence rather than discarding them?

Why? The challenge, as also pointed out in Vincent’s article, is precisely the lack of perception-action data at scale. This is why intermediate representations IMO are *preferable rather than obsolete*, because they open up training from scalable data sources. This can include even the vision/language encoders people love and use in robot learning — it’s hard to imagine training low-level visual representation or high-level language understanding purely from limited robot data.

The same goes for intermediate representations at the structure level — world modeling, learning from Internet videos, learning from humans, and simulation — many of which still rely on 3D representations too.
Vincent Sitzmann@vincesitzmann

In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision (mapping imagery to intermediate representations: 3D, flow, segmentation...) is about to go away. vincentsitzmann.com/blog/bitter_le…

Julen Urain retweeted
Jitendra MALIK
Jitendra MALIK@JitendraMalikCV·
At the RI seminar at CMU yesterday, I presented a 3 level analysis of robot skills & discussed the pros and cons of teleoperation, simulation, and learning from videos, before presenting our research. Enjoy! youtube.com/watch?v=ry8iti…
Julen Urain
Julen Urain@robotgradient·
@artemZholus It reminds me of TD learning or even GAIL. I am not convinced by bootstrapping for generative models.
Artem Zholus
Artem Zholus@artemZholus·
I am reading the drifting models paper and I am very excited about it! One observation: I think the drifting field is an approximation of the gradient field of a linear critic from Wasserstein GANs, or at least they are closely related. What do you think?
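For context on the comparison: a minimal sketch, in standard Wasserstein-GAN notation (not taken from the drifting-models paper), of the critic and the vector field it induces:

```latex
% Kantorovich-Rubinstein dual of the Wasserstein-1 distance,
% with a 1-Lipschitz critic f:
\[
W_1(p_{\mathrm{data}}, p_g)
  = \sup_{\lVert f \rVert_{L} \le 1}
    \mathbb{E}_{x \sim p_{\mathrm{data}}}[f(x)]
    - \mathbb{E}_{x \sim p_{g}}[f(x)]
\]
% The critic induces a vector field on samples: nudging a generated
% sample along the critic's gradient increases its score, i.e. moves
% it toward the data distribution:
\[
x \;\leftarrow\; x + \eta \, \nabla_{x} f(x)
\]
```

The conjecture above is that the learned drifting field plays a role analogous to the critic gradient ∇ₓ f(x).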
Chris Paxton
Chris Paxton@chris_j_paxton·
One side note is that @notmahi is one of the relatively few people who has been consistently NOT training foundation models, but instead aiming to train tiny models that actually just work anywhere, and this is kind of the obvious endgame of that philosophy
C Zhang@ChongZitaZhang

okay, actually yes

Julen Urain retweeted
Hao Zhang
Hao Zhang@HaoZhang623·
As video world models become increasingly powerful, do we still need explicit 3D?

A commonly misunderstood point is this: video world models are not “just 2D.” Their ability to maintain multi-view consistency, temporal stability, and realistic interaction necessarily implies that their latent knowledge encodes 3D world structure. Without some notion of 3D, consistency itself would not be possible. The real distinction, therefore, is not whether a model has 3D but whether that 3D exists implicitly or explicitly.

Implicit 3D lives inside latent spaces and network weights. It supports generation, but it is difficult to localize, edit, constrain, or reason about. It allows the world to exist, but not to be used. Explicit 3D, in contrast, exists as structure and state: it is addressable, editable, composable, and transferable. Its purpose is not better visual fidelity, but operability: allowing the world to be manipulated, controlled, and executed.

From this perspective, video and 3D are not competing paradigms but a layered system: 2D/video is the interface to human perception; 3D is the interface to the physical world. They can reinforce each other, but neither forms a closed loop on its own.

In practice, data, not model architecture, sets the upper bound of world models. Explicit 3D may not be the final user-facing representation, but it is likely the most effective pathway toward scalable, high-quality, and controllable data. Through explicit 3D/4D representations, worlds can be constructed systematically: interactions can be programmatically sampled, states and actions can be composed, rendered into images and videos, and fed back to train video world models. Seen this way, 3D is not the destination; it is the starting point for scaling.

What truly drives progress forward is never the model itself. Whether we capture the world or imagine new ones, whether data comes from observation or intent, whether we model what is or what should be: the direction of the world is ultimately determined by human choice and purpose. Models may extend the world, but humans decide where it goes. #Genie3 #worldmodel
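A minimal sketch of the data-generation loop described in the post above (explicit 3D worlds, programmatically sampled interactions, rendered video, training data for video world models); every class and method name is a hypothetical placeholder, not a specific tool:

```python
# Sketch of using explicit 3D/4D scenes as a scalable data source
# for training video world models. All names are illustrative placeholders.

def generate_world_model_data(scene_library, renderer, n_episodes, horizon):
    dataset = []
    for _ in range(n_episodes):
        scene = scene_library.sample()              # explicit, editable 3D world state
        trajectory = []
        for _ in range(horizon):
            action = scene.sample_interaction()     # programmatically composed interaction
            scene.step(action)                      # advance the explicit world state
            frame = renderer.render(scene)          # project 3D state to pixels
            trajectory.append((frame, action))
        dataset.append(trajectory)
    return dataset                                  # (video, action) pairs for training
```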
Julen Urain
Julen Urain@robotgradient·
There has been a clear trend in the last months of moving from VLA-type approaches to Video Generative Models + Inverse Dynamics Models (VAM). While the main reason for this recent growth is probably the latest improvements in video generative models, I believe this shift is relevant for robotics. While VLAs distill the foundation models' knowledge through latent representations that intertwine semantic and spatial information, VAMs distill this knowledge in a more explicit way, representing it spatially. I believe this spatial grounding of VAMs might lead to far larger generalization capabilities with respect to VLAs, and I am optimistic about even more 3D spatially grounded foundation models, in the direction of @wenlong_huang's recent point-world.github.io
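To make the contrast concrete, a minimal sketch of the video-prediction-plus-inverse-dynamics (VAM-style) decomposition described above, versus a VLA that maps observation and instruction directly to an action; all class and method names are hypothetical placeholders, not any released system:

```python
# Sketch of a Video-prediction + Inverse-Dynamics (VAM-style) controller.
# All classes and methods here are hypothetical placeholders.

class VideoGenerativeModel:
    def predict_future_frames(self, frame, instruction, horizon):
        """Generate `horizon` future frames conditioned on the current
        observation and a language instruction (pixel-space prediction)."""
        raise NotImplementedError

class InverseDynamicsModel:
    def infer_action(self, frame_t, frame_t1):
        """Recover the low-level action that takes frame_t to frame_t1."""
        raise NotImplementedError

def vam_step(video_model, idm, current_frame, instruction, horizon=8):
    # 1) Imagine how the scene should evolve to satisfy the instruction.
    imagined = video_model.predict_future_frames(current_frame, instruction, horizon)
    # 2) Ground the imagined pixels back into executable actions.
    pairs = zip([current_frame] + imagined[:-1], imagined)
    actions = [idm.infer_action(a, b) for a, b in pairs]
    return actions  # execute the first action(s), then re-plan
```

A VLA, by contrast, would map (observation, instruction) to an action in a single forward pass through a shared latent space, without an explicit pixel-space prediction in between.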
Julen Urain
Julen Urain@robotgradient·
@junjungoal Very cool! Happy to see the value-function-based approach work that well! A very refreshing approach compared to end-to-end generative model approaches :)
Jun Yamada
Jun Yamada@junjungoal·
How can a closed-loop policy safely and robustly grasp novel objects in cluttered environments? We introduce Grasp-MPC: a hybrid of model-based control and data-driven approaches for generalisable and safe 6DoF closed-loop grasping. 🧵👇 (1/N)
Julen Urain
Julen Urain@robotgradient·
@DrJimFan Dwarf Fortress is the perfect fit for this 🥹🥹
Jim Fan
Jim Fan@DrJimFan·
The famed Stanford Smallville is officially open-source! 25 AI agents inhabit a digital Westworld, unaware that they are living in a simulation. They go to work, gossip, organize socials, make new friends, and even fall in love. Each has a unique personality and backstory.

Smallville is among the most inspiring AI agent experiments in 2023. We often talk about a single LLM's emergent abilities, but multi-agent emergence could be way more complex and fascinating at scale. A population of AIs can play out the evolution of an entire civilization. Endless new possibilities ahead. Gaming will be the first to feel the impact.

Github: github.com/joonspk-resear…
Paper: arxiv.org/abs/2304.03442
Authors: @joon_s_pk @joseph_c_obrien @carriejcai @merrierm @percyliang @msbernst
Julen Urain
Julen Urain@robotgradient·
While I really liked the article, it feels to me that this physical common sense can be better captured by predicting next observations (i.e. world models) and planning over them, rather than by training a policy to predict the next action (i.e. behavioral cloning).
Andy Zeng@andyzengineer

x.com/i/article/2016…
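A minimal sketch of the contrast drawn in the post above: planning over a learned next-observation (world) model versus directly predicting the next action with behavioral cloning; dynamics, reward_of, bc_policy, and action_space are hypothetical placeholders:

```python
# Sketch: plan over imagined next observations vs. clone actions directly.
# All objects passed in are hypothetical placeholders.

def plan_with_world_model(dynamics, reward_of, obs, action_space,
                          horizon=10, n_samples=256):
    """Random-shooting planning: imagine next observations, score them, act."""
    best_score, best_first_action = float("-inf"), None
    for _ in range(n_samples):
        candidate = [action_space.sample() for _ in range(horizon)]
        imagined_obs, score = obs, 0.0
        for a in candidate:
            imagined_obs = dynamics.predict_next_obs(imagined_obs, a)
            score += reward_of(imagined_obs)
        if score > best_score:
            best_score, best_first_action = score, candidate[0]
    return best_first_action  # execute, observe, re-plan

def act_with_behavioral_cloning(bc_policy, obs):
    """Behavioral cloning: directly predict the next action from the observation."""
    return bc_policy.predict_action(obs)
```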

Julen Urain
Julen Urain@robotgradient·
@jparkerholder The spatio-temporal consistency looks superb! If solid, the implications for robotics are huge! Imagine generating training environments on-the-fly from natural language. Excited to see how this evolves toward embodied agent training.
Julen Urain
Julen Urain@robotgradient·
@drfeifei 100% on board with 3D/4D world models! Generative 3D environments could unlock much broader domain randomization and edge-case coverage. I am still curious how the physics fidelity compares to hand-crafted sims for contact-rich manipulation tasks.
Fei-Fei Li
Fei-Fei Li@drfeifei·
The dream of robots helping people live and work better in the physical world begins with helping robots to become more spatially intelligent by learning from the infinitely diverse and intricate environments of the 3D/4D worlds 🤖🤩
World Labs@theworldlabs

World generation is a bottleneck for robotics. We’re exploring how generative 3D worlds can reduce manual simulation setup and enable broader, more realistic evaluation 🧵

Wenlong Huang
Wenlong Huang@wenlong_huang·
What if we can simulate an *interactive 3D world*, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts env dynamics given RGB-D capture and robot actions. 🌐 point-world.github.io from @Stanford @nvidia
Julen Urain retweeted
Irmak Guzey
Irmak Guzey@irmakkguzey·
We just released AINA, a framework for learning robot policies from Aria 2 demos, and are now open-sourcing the code: github.com/facebookresear…. It includes:
✅ Aria 2 data processing into 3D observations like shown
✅ Training of point-based policies
✅ Calibration
Give it a try!
Julen Urain
Julen Urain@robotgradient·
This was very challenging and very cool to see evolve! I personally was not sure if it would work, but @irmakkguzey pushed so hard to show it does. Learning dexterous robot policies with only human video data, using the egocentric view from Aria 2 glasses, chill and easy 😁
Irmak Guzey@irmakkguzey

Dexterous manipulation by directly observing humans - a dream in AI for decades - is hard due to visual and embodiment gaps. With simple yet powerful hardware - Aria 2 glasses 👓 - and our new work AINA 🪞, we are now one significant step closer to achieving this dream.

Julen Urain retweeted
Bingyi Kang
Bingyi Kang@bingyikang·
After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀
Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video.
In pursuit of minimal modeling, DA3 reveals two key insights:
💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture.
✨ A single depth-ray representation is enough. No complex 3D tasks.
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series.
The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen. 👇(1/n) #DepthAnything3
Julen Urain
Julen Urain@robotgradient·
@Ed__Johns Super impressive, huge congratulations 😊
Edward Johns
Edward Johns@Ed__Johns·
I'm very excited to finally announce one of the most ambitious projects we've worked on — which makes the front cover of Science Robotics today: ☀️ Learning a Thousand Tasks in a Day ⭐️ Everyday tasks — like those below — can now be learned from a single demonstration each...
Stone Tao
Stone Tao@Stone_Tao·
sim2real experiment died because a quaternion flipped into a subspace my RL policy had never seen before… that’s a new one for me 🥲