Kevin Black

126 posts

Kevin Black

@kvablack

phd @berkeley_ai, research @physical_int

Katılım Mart 2018

130 Takip Edilen3K Takipçiler

Kevin Black@kvablack·20h

@Miles_Brundage @davidshustin @jasminewsun no ill will towards the analysis itself, but just to correct the record, it's not affiliated with Physical Intelligence in any way :)

English

Miles Brundage@Miles_Brundage·1d

@davidshustin @jasminewsun Yeah Claude struggled for me as well lol but Codex found it. It's this one x.com/SourishJasti/s…

Sourish Jasti@SourishJasti

1/ General-purpose robotics is the rare technological frontier where the US / China started at roughly the same time and there's no clear winner yet. To better understand the landscape, @zoeytang_1007, @intelchentwo, @vishnuman0 and I spent the last ~8 weeks creating a deep dive on humanoid robotics hardware and flew to China to see the supply chain firsthand. Here's everything we've created + our takeaways about the components, humanoid comparisons, supply chains, and geopolitics👇

English

324

Miles Brundage@Miles_Brundage·1d

My robotics timeline estimates have changed much more in recent years than my AI ones x.com/bayeslord/stat…

bayes@bayeslord

People's timelines for when AI will start affecting the physical world are way too long

English

126

28.9K

Kevin Black@kvablack·12 Mar

oh you're using VLAs? everyone's using GRPs now. just kidding we're all on LBMs. world models are the future so we developed our own WAM. we're using DVAs. we were using UWMs but our robot caught on fire so we switched to DreamUMVLAPs. we're shipping a robot that passes butter.

English

474

29.6K

Kevin Black@kvablack·28 Şub

@chris_j_paxton @notmahi what evidence is there that the aux loss stuff made a huge difference? from my reading of the paper, there are no abaltions that test a Wan backbone with no video prediction loss

English

521

Chris Paxton@chris_j_paxton·28 Şub

Well dreamzero is: - a much bigger model - has this clever auxiliary loss (predicting video) which probably makes its smaller amount of data go a lot farther unfortunately not enough information yet to tell. at least from the comparisons in the paper. it seems like the aux loss stuff made a huge difference (see figure here) but we don't KNOW that pi-0.5 at 14b params wouldn't do well. although it sure seems like it made a difference. i think there's a lot of work to do on the exact best data mixture.

English

3.6K

Mahi Shafiullah 🏠🤖@notmahi·28 Şub

MolmoSpaces leaderboard is now open for submissions! When we created this benchmark for zero-shot real-to-sim eval in diverse homes, we didn’t expect things to heat up so quickly. But it did, thanks to @jang_yoel and team at GEAR toppling PI to take the crown on task-general category. Congrats 🎉 You can evaluate and submit your model to this leaderboard: molmospaces.allen.ai/leaderboard

Joel Jang@jang_yoel

𝐃𝐫𝐞𝐚𝐦𝐙𝐞𝐫𝐨 𝐢𝐬 #𝟏 𝐨𝐧 𝐛𝐨𝐭𝐡 𝐌𝐨𝐥𝐦𝐨𝐒𝐩𝐚𝐜𝐞𝐬 𝐚𝐧𝐝 𝐑𝐨𝐛𝐨𝐀𝐫𝐞𝐧𝐚 🏆 𝗪𝗵𝗮𝘁 𝗺𝗮𝗸𝗲𝘀 𝘁𝗵𝗶𝘀 𝗻𝗼𝘁𝗮𝗯𝗹𝗲: DreamZero-DROID is trained 𝑓𝑟𝑜𝑚 𝑠𝑐𝑟𝑎𝑡𝑐ℎ using only the DROID dataset. No pretraining on large-scale robot data, unlike competing VLAs. This demonstrates the strength of video-model backbones for generalist robot policies (VAMs/WAMs). More broadly, training 𝑜𝑛𝑙𝑦 on real data and evaluating on (1) transparent, distributed benchmarks like 𝐑𝐨𝐛𝐨𝐀𝐫𝐞𝐧𝐚 or (2) scalable sim-benchmarks like 𝐌𝐨𝐥𝐦𝐨𝐒𝐩𝐚𝐜𝐞𝐬 is an exciting step toward fairer and more reproducible evaluation of generalist policies, one that the community can hillclimb together to measure progress. Special thanks to the Ai2 MolmoSpaces team (@notmahi @omarrayyann @YejinKim4 Max Argus) and the RoboArena team (@pranav_atreya) for helping with the set-up and getting these evaluations! Special shout out to @youliangtan @NadunRanawakaA @chuning_zhu, who led these efforts from the GEAR side :) + We also release our DreamZero-AgiBot checkpoint & post-training code to enable very efficient few-shot adaptation. Post-train on just ~30 minutes of play data for your specific robot, and see the robot do basic language following and pick-and-place 🤗(See YAM experiments in our paper for more detail). ++ We also provide the entire codebase & preprocessed dataset to replicate the DreamZero-DROID checkpoint. 🌐 dreamzero0.github.io 💻 github.com/dreamzero0/dre… RoboArena: robo-arena.github.io/leaderboard MolmoSpaces: molmospaces.allen.ai/leaderboard

English

4.4K

Kevin Black@kvablack·27 Şub

@chris_j_paxton I used 238 words/min, which is an average reading speed

English

Chris Paxton@chris_j_paxton·27 Şub

@kvablack How do you compare language hours here? Just time taken to speak vs video?

English

Chris Paxton@chris_j_paxton·26 Şub

We don't have the right pretraining data for robotics, it's important to change very low level features for robotics tasks, and robotics models are probably still too small and trained on too little data

Dominique Paul@DominiqueCAPaul

Interesting to see @physical_int move partner data into pretraining and get far better results than plain SFT. Two ideas as a consequence: ▶︎ RL > SFT, and Pi hasn’t cracked it yet. ▶︎ If expanding to new use cases requires retraining, not just SFT, then this is good news for compute providers & Nvidia

English

6.6K

Kevin Black@kvablack·27 Şub

@chris_j_paxton the sheer number of hours is still absolutely miniscule compared to language, which is more significant I think x.com/kvablack/statu…

Kevin Black@kvablack

My favorite slide that I made for my talk last weekend -- a very silly thought experiment in which we compare language datasets to robotics datasets (in the most shallow way possible). Yes it is to scale; I learned that the maximum shape size in Keynote is 20,000pts

English

885

Chris Paxton@chris_j_paxton·26 Şub

people talk about the sheer number of hours and of tokens in robotics data, but it's just not the same as language. language data is very semantically rich; there's no fluff. in a video dataset, it's basically all fluff; a small subset of the image is enough to determine all that's relevant to the task. big gap to close.

English

863

Kevin Black@kvablack·25 Şub

this is probably our most important release so far, even though it has nothing to do with research progress

Physical Intelligence@physical_int

General-purpose AI models are behind some of the most exciting applications we now can't live without. We envision that an analogous “physical intelligence layer” built with models like π0.6 will similarly spur a new wave of applications for the physical world. We’ve recently begun working with a handful of companies that have deployed their robots to do real-world, useful things. pi.website/blog/partner/?…

English

8.4K

Kevin Black retweetledi

SAIL Media@readsail·3 Şub

Robots have a "latency" problem. 🤖 💨 @kvablack explains how to use diffusion models and "Action Chunking" to make robot movements seamless—even when the AI is still "thinking." Watch the full clip on YT! Link in replies.

English

Kevin Black@kvablack·17 Ara

@kenbwork sure, I mean that "the literal error bar is symmetric when it consists of +-1 SEM". I think most would know that's what I (or Generalist) mean when we say "plotting the standard error".

English

280

Kenny Workman@kenbwork·17 Ara

@kvablack Again, think conflating a specific use of the SE (Normal approximating confidence intervals using CLT) with what the SE is. SE is a *number*: \sqrt{\Var{\hat{\theta}}}}. There are many other ways to get asymmetric error bars from this number

English

254

Kevin Black@kvablack·17 Ara

I know I'm the only robot learning researcher to ever care about statistical rigor, but technically you shouldn't use standard error for a binary success rate. The binomial distribution isn't symmetrical 😅

Generalist@GeneralistAI

More pretraining improves GEN-0 real-robot performance (via blind A/B evals with closed-loop rollouts). Improvements are significant in the low-data regime, but the best models thrive with both pretraining and ample post-training. See blog addendum: generalistai.com/blog/nov-04-20…

English

170

24.7K

Kevin Black@kvablack·17 Ara

@kenbwork I mean that the literal error bar is symmetric about the sample mean when it's based on SE

English

312

Kenny Workman@kenbwork·17 Ara

@kvablack SE doesn't assume symmetry in general. You might be conflating CLT-based estimates of SE with the definition of the SE itself?

English

285

Kevin Black@kvablack·17 Ara

@kenbwork you're right that it depends on how they're pooling though. if they're averaging multiple proportions then it's no longer binomial. not sure what you can do then besides do a lot more trials. or maybe just presenting the data per-task (unpooled) is better.

English

401

Kevin Black@kvablack·17 Ara

@kenbwork the SE is symmetric bc it relies on the CLT, which is fine for arbitrary distributions and a large enough sample size. but if you have a smaller sample size and you know the distribution is binomial, you can do better (e.g., Wilson score interval)

English

818

Kevin Black@kvablack·17 Ara

@aliuahma if you look it up it seems like the rule of thumb is np>10, which it doesn't seem like they have. but in practice I don't see a reason to ever use the normal approximation, especially with proportions near 0 or 1

English

827

ali@aliuahma·17 Ara

@kvablack isn't it okay as long as your sample proportion is calculated from a large number of trials? by CLT the distribution of p_hat approaches normality so using standard error is suitable

English

Kevin Black@kvablack·9 Ara

@Christian061145 it all depends on your constraints. inference-time RTC is still more convenient. however, we already do a lot of post-training so we may as well add something there, and this simple method seems to work well enough. I'm sure ppl will find other methods that work better.

English

280

0xCC@Christian061145·9 Ara

@kvablack Nice work! Do you think it will replace the more complex RTC alternatives in the near future?

English

474

Kevin Black@kvablack·8 Ara

Last week I presented real-time chunking (RTC) at NeurIPS, and we did a live coffee demo the very same evening. To celebrate, we're releasing a (very short) follow-up paper describing a training-time variant of RTC, which is what we've actually been using in our demos!

English

440

33.3K

Kevin Black@kvablack·9 Ara

@m0hitsharma "d" in the paper includes network, but yeah, you need to know the rough range of delays at training time

English

393

Mohit Sharma@m0hitsharma·9 Ara

Cool paper, next is what happens if you run a huge (30+ B model) which doesn’t even run locally. We have network delay and inference delay, and they can always be larger than the assumed “d” in the paper (ie delay during training), what do you do then? Also choosing hparams correctly is critical.

Kevin Black@kvablack

English

825

Kevin Black@kvablack·8 Ara

Check out the paper here: arxiv.org/abs/2512.05964 And the code: github.com/Physical-Intel… Hopefully some people find this useful!

English

2.7K

Kevin Black@kvablack·8 Ara

The method is stupidly simple -- we simulate delay at training time by conditioning on action prefixes. It only takes about 8 lines of code to implement, but it works just as well as inference-time RTC without the extra computational overhead.

English

3.7K

Kevin Black@kvablack·25 Kas

hey it's me, many students

Hongyu Li@Hongyu_Lii

Robotics version: I have met many students who claim to be working on robotics but do not even know the basics of kinematics or motion planning.

English

282

45.4K

Kevin Black retweetledi

Physical Intelligence@physical_int·18 Kas

Our model can now learn from its own experience with RL! Our new π*0.6 model can more than double throughput over a base model trained without RL, and can perform real-world tasks: making espresso drinks, folding diverse laundry, and assembling boxes. More in the thread below.

English

323

2.1K

696.1K

Keşfet

@Miles_Brundage @davidshustin @jasminewsun @chris_j_paxton @notmahi @jang_yoel @kenbwork @elonmusk