Kyle Stachowicz

81 posts

@KyleStachowicz

Robot learning @berkeley_ai @physical_int

Berkeley, CA · Joined August 2018

279 Following · 872 Followers

Kyle Stachowicz@KyleStachowicz·5d

@KyleVedder @vriishin @saurabhtwq @KyleMorgenstein robokyle group chat when

Kyle Vedder@KyleVedder·5d

@vriishin @saurabhtwq @KyleMorgenstein don't forget @KyleStachowicz 😹

saurabh@saurabhtwq·5d

would love to follow more people who are working in VLMs, world models, and robotics. any recs?

Kyle Stachowicz@KyleStachowicz·8 Mar

@gf_256 tl2x pipeline

cts🌸@gf_256·8 Mar

Claude and Codex max walk into a bar. That’s the end of the joke. Claude hit the usage limit

Kyle Stachowicz@KyleStachowicz·4 Mar

Partial observability means that a robot policy - even with infinitely many demonstrations - will still be worse than the demonstrator. With MEM, we built a recipe to close this gap. Fantastic work led by @KarlPertsch @marceltornev @DannyDriess.

Physical Intelligence@physical_int

We’ve developed a memory system for our models that provides both short-term visual memory and long-term semantic memory. Our approach allows us to train robots to perform long and complex tasks, like cleaning up a kitchen or preparing a grilled cheese sandwich from scratch 👇

Kyle Stachowicz@KyleStachowicz·1 Mar

@chris_j_paxton heelys were always the final form of human locomotion

Chris Paxton@chris_j_paxton·1 Mar

What is the advantage of wheels though

Chris Paxton@chris_j_paxton

It's cool to see this robot stand up

Kyle Stachowicz retweeted

Tian Gao@TianGao_19·16 Feb

Long-tail scenarios remain a major challenge for autonomous driving. Unusual events—like accidents or construction zones—are underrepresented in driving data, yet require semantic and commonsense reasoning grounded in control. We propose SteerVLA, a framework that uses VLM reasoning to steer a driving policy via grounded, fine-grained language instructions. Paper: arxiv.org/abs/2602.08440 Website: steervla.github.io

Kyle Stachowicz@KyleStachowicz·9 Feb

@liu730chaoqi Have you found reconstruction fidelity to be a bottleneck at all? During FAST I had a tough time getting flat FSQ tokenizers precise enough for our more dexterous tasks...(see the scaling curves wrt tokens)

Kyle Stachowicz@KyleStachowicz·9 Feb

@liu730chaoqi Hah...maybe we should have given it a bit more discussion in the original paper (e.g., the version where you transpose the ordering is unsurprisingly terrible)

Chaoqi Liu@liu730chaoqi·8 Feb

Yes, I really like FAST’s approach to action ordering—though it only gets a brief mention in the paper. I see OAT as a small step toward better understanding this space. Hopefully, it encourages more careful and principled thinking around how we generate action chunks. 😀

Kyle Stachowicz@KyleStachowicz

@liu730chaoqi Great writeup! Seems to clean up one of my least favorite parts of FAST (variable-width tokens are awful for decoding, and I suspect for learning signal) while keeping the token ordering that makes it work in the first place! Looking forward to trying it out :)

Kyle Stachowicz@KyleStachowicz·8 Feb

Chaoqi Liu@liu730chaoqi·8 Feb

x.com/i/article/2020…

Kyle Stachowicz retweeted

Chris Paxton@chris_j_paxton·19 Jan

It's interesting (and good) to see the McDonald's worker as the "hard mode" of robot intelligence. Lots of people don't, I think, have a good picture of what work will be hard for robots and what won't. High-mix, chaotic work like this is the dream, but it's hard to pull off

Packy McCormick@packyM

Spend an hour reading this weekend and I think you’ll know more about robotics than 99% of people, including some people who invest in robotics. notboring.co/p/robot-steps

Kyle Stachowicz@KyleStachowicz·10 Jan

@alz_zyd_ I think it's actually held up fine? The divergence in the slide is from training off-policy; training on-policy (i.e. RL/reasoning) takes you out of this regime completely because it allows correcting past mistakes, and it's empirically what has made long-CoT responses viable.

alz@alz_zyd_·9 Jan

this argument sounded smart at the time but somehow turned out to be pretty bad

Yann LeCun@ylecun

I have claimed that Auto-Regressive LLMs are exponentially diverging diffusion processes. Here is the argument: Let e be the probability that any generated token exits the tree of "correct" answers. Then the probability that an answer of length n is correct is (1-e)^n 1/

Kyle Stachowicz@KyleStachowicz·3 Jan

@ID_AA_Carmack I mean, if your code is bottlenecked on indexing operations rather than the billion-parameter matmuls you probably screwed up already, and going from i64 to u32 isn't going to fix it...

John Carmack@ID_AA_Carmack·3 Jan

Pytorch made the right call standardizing on signed 64 bit indexes. I would probably still be rather pointlessly making case by case decisions to use int32 if it were an option. Some old habits linger.

Kyle Stachowicz@KyleStachowicz·30 Dec

@arnie_hacker @drfeifei If you have a world model good enough to replace real-world evals you've probably solved robotics already. Imperfect world models probably give a passable metric, but I'm skeptical that just hill-climbing WM evals will lead to practical gains.

Arnie Ramesh@arnie_hacker·29 Dec

@KyleStachowicz @drfeifei How well do world models work for evals?

Arnie Ramesh@arnie_hacker·29 Dec

Evals for robotics is so far behind. Afaik it’s running your trained model on tasks you define. BEHAVIOUR benchmark from @drfeifei is promising, but models trained on real-world data require testing with real-world observations. Bullish on 3DGS-based simulators for this reason.

Jim Fan@DrJimFan

Everyone's freaking out about vibe coding. In the holiday spirit, allow me to share my anxiety on the wild west of robotics. 3 lessons I learned in 2025.

1. Hardware is ahead of software, but hardware reliability severely limits software iteration speed. We've seen exquisite engineering arts like Optimus, e-Atlas, Figure, Neo, G1, etc. Our best AI has not squeezed all the juice out of these frontier hardware. The body is more capable than what the brain can command. Yet babysitting these robots demands an entire operation team. Unlike humans, robots don't heal from bruises. Overheating, broken motors, bizarre firmware issues haunt us daily. Mistakes are irreversible and unforgiving. My patience was the only thing that scaled.

2. Benchmarking is still an epic disaster in robotics. LLM normies thought MMLU & SWE-Bench are common sense. Hold your 🍺 for robotics. No one agrees on anything: hardware platform, task definition, scoring rubrics, simulator, or real world setups. Everyone is SOTA, by definition, on the benchmark they define on the fly for each news announcement. Everyone cherry-picks the nicest looking demo out of 100 retries. We gotta do better as a field in 2026 and stop treating reproducibility and scientific discipline as second-class citizens.

3. VLM-based VLA feels wrong. VLA stands for "vision-language-action" model and has been the dominant approach for robot brains. Recipe is simple: take a pretrained VLM checkpoint and graft an action module on top. But if you think about it, VLMs are hyper-optimized to hill-climb benchmarks like visual question answering. This implies two problems: (1) most parameters in VLMs are for language & knowledge, not for physics; (2) visual encoders are actively tuned to *discard* low-level details, because Q&A only requires high-level understanding. But minute details matter a lot for dexterity. There's no reason for VLA's performance to scale as VLM parameters scale. Pretraining is misaligned.

Video world model seems to be a much better pretraining objective for robot policy. I'm betting big on it.

Kyle Stachowicz@KyleStachowicz·23 Dec

@liangpan_t Tokenization can overfit too. This is more a property of the data - you can get multimodal behaviors when you train on much larger data. The paper looks at small-ish sim datasets, but they're right that the main benefit of diffusion is *not* that it picks up all of the modes.

Liang Pan@liangpan_t·22 Dec

So the only way to learn multimodal behaviors is to tokenize actions and then use cross-entropy loss?

Chaoyi Pan@ChaoyiPan

Generative models (diffusion/flow) are taking over robotics 🤖. But do we really need to model the full action distribution to control a robot? We suspected the success of Generative Control Policies (GCPs) might be "Much Ado About Noising." We rigorously tested the myths. 🧵👇

Kyle Stachowicz@KyleStachowicz·23 Dec

@chris_j_paxton This! I think there was a while where it was unclear how much pretraining really helps (e.g. some of the ablations in π0), but since the π0.5 release we've dialed in our pretraining recipe a lot and it's enabled some awesome results (π0.6, olympics tasks, human video transfer).

Chris Paxton@chris_j_paxton·22 Dec

The aside here, that training "from scratch" didn't work but fine-tuning did, is extremely important

Physical Intelligence@physical_int

All videos are autonomous. We also tested training "from scratch" (from a VLM initialization), but this failed on all tasks, indicating that fine-tuning our models is essential for success. For more, check out our blog post: pi.website/blog/olympics

Kyle Stachowicz@KyleStachowicz·22 Dec

You can do a shocking amount of dexterous tasks with really simple hardware, with a great pretrained foundation model and high-quality data collection strategies :)

Physical Intelligence@physical_int

We got our robots to wash pans, clean windows, make peanut butter sandwiches, and more! Fine-tuning our latest model enables all of these tasks, and this has interesting implications for robotics, Moravec's paradox, and the future of large models in embodied AI. More below!

Kyle Stachowicz@KyleStachowicz·18 Dec

@chris_j_paxton @thisismyhat If you try to sell crappy shovels the miners will be the first to call you on it

Chris Paxton@chris_j_paxton·18 Dec

@thisismyhat There's still definitely such a thing as bad data sadly

Brian Cheung@thisismyhat·17 Dec

There's no such thing as bad data, only bad learning.

Physical Intelligence@physical_int

We discovered an emergent property of VLAs like π0/π0.5/π0.6: as we scale up pre-training, the model learns to align human videos and robot data! This gives us a simple way to leverage human videos. Once π0.5 knows how to control robots, it can naturally learn from human video.

Kyle Stachowicz@KyleStachowicz·18 Dec

@satpalsr @JieWang_ZJUI It’s hard to make a robust sensor that gives rich touch feedback, but a wrist camera gives you lots of the same local feedback.

Satpal Singh Rathore@7xpal·18 Dec

@JieWang_ZJUI What's the reason behind 'Secret of VLAs is always the wrist camera'?

Jie Wang@JieWang_ZJUI·18 Dec

Very cool emergent capability of human-robot co-training! But I have to point out: We haven’t got a free lunch of learning in the wild YouTube video.

1) It’s still human augmentation.
2) Secret of VLAs is always the wrist camera.
3) Teleoperators have to shape their hands like grippers. This restricts the flexibility and dexterity.

How it works may be because VLA resolution is low enough (224/448), so hands “look like” grippers for near-sighted policy. Hand data is more like a trajectory planner for ideal visual behavior. Robot data is still essential to ground actions. The t-SNE embedding visualization is most exciting! I like how PI presents correlation of human and robot data here. Looking forward to seeing more contact-rich tasks, excited to see we went one step further with scaling and co-training.

Physical Intelligence@physical_int

Kyle Stachowicz@KyleStachowicz·17 Dec

I got *soooo* excited when @simar_kareer showed me this figure for the first time a few months back. Better pretraining directly unlocks better transfer, and not just between robots - humans are just another embodiment. And there's lots more in the pipeline at π!

Physical Intelligence@physical_int

This also shows up in the representations learned by the model. We plot the model’s representations of human and robot images. As pre-training is scaled up, the representation of humans and robots become more aligned: to a scaled-up model, human videos "look" like robot demos.

Kyle Stachowicz@KyleStachowicz·28 Nov

@KyleVedder
$ mv experimental/kyle experimental/kyles
$ mkdir experimental/kylev
looking forward to having you on board soon!

Kyle Vedder@KyleVedder·28 Nov

Personal Update: After a year at Dyna Robotics I'm joining Physical Intelligence as a Researcher! I'm very proud of my contributions at Dyna, including the DYNA-1 Reward Model and continuous demos. I'm excited to start a new chapter focused on building *self improving* systems for robots -- I think this is the most important challenge in robot learning, and it's still completely unsolved. PS: I'm also going to be at NeurIPS, so come say hi!
