Yanjiang Guo
@Yanjiang_Guo
90 posts

CS PhD & EE Undergrad @Tsinghua_Uni. Visiting PhD Student @Stanford.

Joined November 2021
426 Following · 1K Followers
Pinned Tweet
Yanjiang Guo @Yanjiang_Guo
Excited to share VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model

We explore improving VLA inside a learned world model, and find that the key is to jointly improve VLA and WM! Website: sites.google.com/view/vlaw-arxiv
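A minimal, hypothetical sketch of the iterative co-improvement loop described above: alternate between (1) improving the VLA policy on rollouts imagined inside the learned world model and (2) refitting the world model on interaction data collected by the improved policy. Every class, method, and number below is an illustrative stand-in, not the paper's API.

```python
import random


class WorldModel:
    """Stand-in learned world model: predicts the next observation from an action."""

    def step(self, obs, action):
        # Placeholder dynamics; the real model would be a learned video/dynamics network.
        return obs + action

    def fit(self, real_trajectories):
        # Placeholder update from newly collected real interaction data.
        pass


class VLAPolicy:
    """Stand-in vision-language-action policy."""

    def act(self, obs, instruction):
        # Placeholder action; a real policy conditions on images + language.
        return random.uniform(-1.0, 1.0)

    def improve(self, imagined_rollouts):
        # Placeholder policy update (e.g. RL or filtered imitation) on rollouts
        # generated inside the world model.
        pass


def co_improve(policy, world_model, num_iterations=3, horizon=5):
    for it in range(num_iterations):
        # (1) Imagine rollouts inside the world model and improve the policy on them.
        imagined, obs = [], 0.0
        for _ in range(horizon):
            action = policy.act(obs, instruction="pick up the cube")
            next_obs = world_model.step(obs, action)
            imagined.append((obs, action, next_obs))
            obs = next_obs
        policy.improve(imagined)

        # (2) Collect real data with the improved policy and refit the world model,
        #     so imagined rollouts stay accurate in the next iteration.
        real = [(o, a, o + a) for (o, a, _) in imagined]  # placeholder "real" data
        world_model.fit(real)
        print(f"iteration {it}: {len(imagined)} imagined steps, {len(real)} real steps")


if __name__ == "__main__":
    co_improve(VLAPolicy(), WorldModel())
```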
Yanjiang Guo retweeted
Joy He-Yueya @JoyHeYueya
Scientists often make breakthroughs by synthesizing ideas across papers. In our new paper, we ask whether a language model can anticipate this process: given two parent papers, can it generate the core insight of a future paper built on them? 🧵⬇️
Yanjiang Guo retweeted
Chelsea Finn @chelseabfinn
LLM post-training used to mean fine-tuning to a downstream task. Robotics has been stuck in this setting, needing task-specific fine-tuning for best performance. π07 changes this: it works out of the box & outperforms fine-tuned specialists. Details: pi.website/pi07
Yanjiang Guo retweeted
ali @aliuahma
i'm so excited to finally share what we've been working on @rhodaai! taking lessons from success in LLMs, we identify autoregressive video generation as a scalable objective for training robot policies. 1/n
Jagdeep Singh @startupjag

After operating in stealth for the last 18 months @rhodaai , we’re excited today to finally show the world what we’ve been working on. We believe we’re on a path to physical AGI with the launch of our brand new foundation model, the Direct Video Action (DVA) model.

Yanjiang Guo retweeted
Yunzhu Li @YunzhuLiYZ
For a long time, I was skeptical about action-conditioned video prediction for robotics. Many models look impressive, but once you ask them to handle long-horizon manipulation with real physical interaction, things quickly fall apart (e.g., Genie is amazing but mostly focused on navigation). This project changed my mind.

I'm beyond excited to share Interactive World Simulator, a project we have been working on for the past ~1.5 years 🤖 One of the first world models that produces convincing results for long-horizon robotic manipulation involving complex physical interactions, across a diverse range of objects (rigid objects, deformables, ropes, object piles). It directly unlocks scalable data generation for robotic policy training and policy evaluation.

Try it yourself (no installation needed): yixuanwang.me/interactive_wo… Play directly with the simulator in your browser.

Key Takeaways:
1️⃣ 15 Hz long-horizon action-conditioned video prediction for 10+ minutes on a single RTX 4090 GPU
2️⃣ Visual and dynamic fidelity: people often ask how much sim data equals one real data point. In our experiments, it turns out to be close to one-to-one using the Interactive World Simulator
3️⃣ Stress testing matters: we emphasize interactive stress testing to understand robustness and stability and to build trust in the simulator
4️⃣ The model is trained with only ~6 hours of real-world random interaction data on a single GPU. Imagine what happens if we scale this 1000× or even 1M×

Huge credit to @YXWangBot, who led this effort with countless hours of work on data collection, training recipes, and system design. I'm incredibly proud of the work he did here! Enjoy the demos and videos. We also fully open-sourced the codebase for anyone interested in applying this to their own tasks. #Robotics #RobotLearning #WorldModels #EmbodiedAI
Yixuan Wang @YXWangBot

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator: an action-conditioned world model that supports stable long-horizon interaction.
3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090🔥
4/ Why this matters: it unlocks two critical robotics applications: 🚀 Scalable data generation for policy training 🧪 Faithful policy evaluation
5/ You can play with our world model NOW at yixuanwang.me/interactive_wo… NO git clone, NO pip install, NO python. Just click and play!
NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera
More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning

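The core interaction pattern behind an action-conditioned world simulator like the one announced above can be sketched as a simple closed loop: at each control step the user or a policy supplies an action, the model predicts the next frame purely in pixel space, and the loop repeats at a fixed rate (the thread quotes 15 FPS on one RTX 4090). The model below is a toy placeholder for illustration, not the released system.

```python
import time

import numpy as np


class ToyActionConditionedWorldModel:
    """Placeholder: predicts the next 64x64 RGB frame from the current frame and an action."""

    def predict_next_frame(self, frame, action):
        # A real system runs a learned video-prediction network; here we just
        # shift pixels by the action so the example stays self-contained.
        dx = int(action[0] * 5)
        return np.roll(frame, shift=dx, axis=1)


def interactive_rollout(model, num_steps=30, hz=15.0):
    frame = np.zeros((64, 64, 3), dtype=np.uint8)
    dt = 1.0 / hz
    for step in range(num_steps):
        action = np.array([np.sin(step / 5.0), 0.0])  # stand-in for user/policy input
        start = time.time()
        frame = model.predict_next_frame(frame, action)
        elapsed = time.time() - start
        # Sleep off the remaining budget to hold the target control rate.
        time.sleep(max(0.0, dt - elapsed))
    return frame


if __name__ == "__main__":
    interactive_rollout(ToyActionConditionedWorldModel())
```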
RoboPapers @RoboPapers
Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision-language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with @chris_j_paxton and @DJiafei!
Yanjiang Guo retweeted
Marcel Torné @marceltornev
We equipped PI policies with memory! And taught our robots to do long-horizon real-world tasks such as preparing the items for a recipe, cooking a grilled cheese, and cleaning the kitchen!
Physical Intelligence @physical_int

We’ve developed a memory system for our models that provides both short-term visual memory and long-term semantic memory. Our approach allows us to train robots to perform long and complex tasks, like cleaning up a kitchen or preparing a grilled cheese sandwich from scratch 👇

Yanjiang Guo retweeted
Chelsea Finn @chelseabfinn
We added short-term visual memory + long-term text memory to pi models. 🤖
Enables robots to:
- complete tasks up to 15 min long
- cook grilled cheese while keeping track of time
- adapt in-context
Paper & videos: pi.website/memory
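As a rough illustration of the two-tier memory these posts describe, a sliding window of recent frames (short-term visual memory) can sit alongside an append-only log of language notes (long-term semantic/text memory), with the policy conditioned on both at every step. Names and structure here are assumptions for illustration, not the pi implementation.

```python
from collections import deque


class TwoTierMemory:
    """Short-term visual memory (sliding window of frames) plus long-term text memory."""

    def __init__(self, visual_window: int = 8):
        self.frames = deque(maxlen=visual_window)  # short-term: recent camera frames
        self.text_log = []                         # long-term: appended language notes

    def observe(self, frame):
        self.frames.append(frame)

    def remember(self, note: str):
        # e.g. "bread placed in pan at 12:03, flip in two minutes"
        self.text_log.append(note)

    def policy_context(self):
        # What the policy would be conditioned on at every control step.
        return {"recent_frames": list(self.frames), "semantic_memory": list(self.text_log)}


if __name__ == "__main__":
    mem = TwoTierMemory(visual_window=8)
    for t in range(20):
        mem.observe(f"frame_{t}")  # placeholder for an image tensor
    mem.remember("bread placed in pan, flip in 2 minutes")
    ctx = mem.policy_context()
    print(len(ctx["recent_frames"]), "recent frames;", len(ctx["semantic_memory"]), "note(s)")
```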
Yanjiang Guo retweeted
Lihan Zha @LihanZha
Today's state-of-the-art VLAs struggle to generalize zero-shot to new robot embodiments, despite training on extensive multi-embodiment data. We introduce Language-Action Pre-training (LAP) and LAP-3B, the first VLA to achieve substantial zero-shot transfer to unseen real-world robot embodiments, by simply aligning the action representation with language. Everything is open-sourced! Try it out on your own robot: 🌐 lap-vla.github.io
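One common way to "align action representation with language", shown purely as a hedged illustration (the post does not spell out LAP's exact scheme), is to discretize each action dimension into bins and emit the bin indices as plain text, so actions live in the same token space the language model already handles.

```python
import numpy as np

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range


def action_to_text(action: np.ndarray) -> str:
    """Map a continuous action vector to a space-separated string of bin indices."""
    bins = np.clip(((action - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).round(), 0, NUM_BINS - 1)
    return " ".join(str(int(b)) for b in bins)


def text_to_action(text: str) -> np.ndarray:
    """Invert the mapping: bin indices back to the centers of their bins."""
    bins = np.array([int(tok) for tok in text.split()], dtype=np.float64)
    return LOW + (bins + 0.5) / NUM_BINS * (HIGH - LOW)


if __name__ == "__main__":
    a = np.array([0.12, -0.5, 0.99])
    s = action_to_text(a)
    print(s, "->", text_to_action(s))
```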
Yanjiang Guo retweeted
Tian Gao @TianGao_19
Long-tail scenarios remain a major challenge for autonomous driving. Unusual events, like accidents or construction zones, are underrepresented in driving data, yet require semantic and commonsense reasoning grounded in control. We propose SteerVLA, a framework that uses VLM reasoning to steer a driving policy via grounded, fine-grained language instructions.
Paper: arxiv.org/abs/2602.08440
Website: steervla.github.io
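The two-level structure the SteerVLA post describes, a VLM that reasons about rare scenes and emits a grounded language instruction which a low-level driving policy then conditions on, can be sketched roughly as below. Both components are stubs and the interface is an assumption, not the paper's API.

```python
def vlm_reason(scene_description: str) -> str:
    """Stand-in for VLM reasoning over camera input; returns a fine-grained instruction."""
    if "construction" in scene_description:
        return "merge left within 30 meters and slow to 20 km/h past the cones"
    return "keep lane and maintain current speed"


def driving_policy(scene_description: str, instruction: str) -> dict:
    """Stand-in low-level policy conditioned on the instruction; returns controls."""
    slow = "slow" in instruction
    return {
        "steer": -0.2 if "merge left" in instruction else 0.0,
        "throttle": 0.1 if slow else 0.4,
    }


if __name__ == "__main__":
    scene = "two lanes, construction cones blocking the right lane ahead"
    instruction = vlm_reason(scene)                # infrequent, high-level reasoning
    controls = driving_policy(scene, instruction)  # frequent, low-level control
    print(instruction, controls)
```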