Kevin Black

126 posts

@kvablack

phd @berkeley_ai, research @physical_int

Joined March 2018
130 Following · 3K Followers
Miles Brundage
Miles Brundage@Miles_Brundage·
@davidshustin @jasminewsun Yeah Claude struggled for me as well lol but Codex found it. It's this one x.com/SourishJasti/s…
Sourish Jasti@SourishJasti

1/ General-purpose robotics is the rare technological frontier where the US / China started at roughly the same time and there's no clear winner yet. To better understand the landscape, @zoeytang_1007, @intelchentwo, @vishnuman0 and I spent the last ~8 weeks creating a deep dive on humanoid robotics hardware and flew to China to see the supply chain firsthand. Here's everything we've created + our takeaways about the components, humanoid comparisons, supply chains, and geopolitics👇

Kevin Black
Kevin Black@kvablack·
oh you're using VLAs? everyone's using GRPs now. just kidding we're all on LBMs. world models are the future so we developed our own WAM. we're using DVAs. we were using UWMs but our robot caught on fire so we switched to DreamUMVLAPs. we're shipping a robot that passes butter.
Kevin Black
Kevin Black@kvablack·
@chris_j_paxton @notmahi what evidence is there that the aux loss stuff made a huge difference? from my reading of the paper, there are no ablations that test a Wan backbone with no video prediction loss
Chris Paxton
Chris Paxton@chris_j_paxton·
Well dreamzero is:
- a much bigger model
- has this clever auxiliary loss (predicting video), which probably makes its smaller amount of data go a lot farther

unfortunately not enough information yet to tell, at least from the comparisons in the paper. it seems like the aux loss stuff made a huge difference (see figure here), but we don't KNOW that pi-0.5 at 14b params wouldn't do well. although it sure seems like it made a difference. i think there's a lot of work to do on the exact best data mixture.
Mahi Shafiullah 🏠🤖
Mahi Shafiullah 🏠🤖@notmahi·
MolmoSpaces leaderboard is now open for submissions! When we created this benchmark for zero-shot real-to-sim eval in diverse homes, we didn’t expect things to heat up so quickly. But it did, thanks to @jang_yoel and team at GEAR toppling PI to take the crown on task-general category. Congrats 🎉 You can evaluate and submit your model to this leaderboard: molmospaces.allen.ai/leaderboard
Joel Jang@jang_yoel

DreamZero is #1 on both MolmoSpaces and RoboArena 🏆

What makes this notable: DreamZero-DROID is trained from scratch using only the DROID dataset. No pretraining on large-scale robot data, unlike competing VLAs. This demonstrates the strength of video-model backbones for generalist robot policies (VAMs/WAMs).

More broadly, training only on real data and evaluating on (1) transparent, distributed benchmarks like RoboArena or (2) scalable sim benchmarks like MolmoSpaces is an exciting step toward fairer and more reproducible evaluation of generalist policies, one that the community can hill-climb together to measure progress.

Special thanks to the Ai2 MolmoSpaces team (@notmahi @omarrayyann @YejinKim4 Max Argus) and the RoboArena team (@pranav_atreya) for helping with the set-up and getting these evaluations! Special shout-out to @youliangtan @NadunRanawakaA @chuning_zhu, who led these efforts from the GEAR side :)

+ We also release our DreamZero-AgiBot checkpoint & post-training code to enable very efficient few-shot adaptation. Post-train on just ~30 minutes of play data for your specific robot, and see the robot do basic language following and pick-and-place 🤗 (See YAM experiments in our paper for more detail.)

++ We also provide the entire codebase & preprocessed dataset to replicate the DreamZero-DROID checkpoint.

🌐 dreamzero0.github.io
💻 github.com/dreamzero0/dre…
RoboArena: robo-arena.github.io/leaderboard
MolmoSpaces: molmospaces.allen.ai/leaderboard

Chris Paxton
Chris Paxton@chris_j_paxton·
@kvablack How do you compare language hours here? Just time taken to speak vs video?
Chris Paxton
Chris Paxton@chris_j_paxton·
We don't have the right pretraining data for robotics; it's important to change very low-level features for robotics tasks; and robotics models are probably still too small and trained on too little data
Dominique Paul@DominiqueCAPaul

Interesting to see @physical_int move partner data into pretraining and get far better results than plain SFT. Two ideas as a consequence:
▶︎ RL > SFT, and Pi hasn't cracked it yet.
▶︎ If expanding to new use cases requires retraining, not just SFT, then this is good news for compute providers & Nvidia

Chris Paxton
Chris Paxton@chris_j_paxton·
people talk about the sheer number of hours and of tokens in robotics data, but it's just not the same as language. language data is very semantically rich; there's no fluff. in a video dataset, it's basically all fluff; a small subset of the image is enough to determine all that's relevant to the task. big gap to close.
Kevin Black reposted
SAIL Media
SAIL Media@readsail·
Robots have a "latency" problem. 🤖 💨 @kvablack explains how to use diffusion models and "Action Chunking" to make robot movements seamless—even when the AI is still "thinking." Watch the full clip on YT! Link in replies.
Kevin Black
Kevin Black@kvablack·
@kenbwork sure, I mean that "the literal error bar is symmetric when it consists of ±1 SEM". I think most would know that's what I (or Generalist) mean when we say "plotting the standard error".
Kenny Workman
Kenny Workman@kenbwork·
@kvablack Again, I think you're conflating a specific use of the SE (normal-approximation confidence intervals using the CLT) with what the SE is. SE is a *number*: \sqrt{\mathrm{Var}(\hat{\theta})}. There are many other ways to get asymmetric error bars from this number
Kevin Black
Kevin Black@kvablack·
I know I'm the only robot learning researcher to ever care about statistical rigor, but technically you shouldn't use standard error for a binary success rate. The binomial distribution isn't symmetrical 😅
Generalist@GeneralistAI

More pretraining improves GEN-0 real-robot performance (via blind A/B evals with closed-loop rollouts). Improvements are significant in the low-data regime, but the best models thrive with both pretraining and ample post-training. See blog addendum: generalistai.com/blog/nov-04-20…

Kevin Black
Kevin Black@kvablack·
@kenbwork I mean that the literal error bar is symmetric about the sample mean when it's based on SE
Kenny Workman
Kenny Workman@kenbwork·
@kvablack SE doesn't assume symmetry in general. You might be conflating CLT-based estimates of SE with the definition of the SE itself?
Kevin Black
Kevin Black@kvablack·
@kenbwork you're right that it depends on how they're pooling though. if they're averaging multiple proportions then it's no longer binomial. not sure what you can do then besides do a lot more trials. or maybe just presenting the data per-task (unpooled) is better.
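The pooling point above can be sketched numerically. This is an illustrative example with made-up task names and counts, not data from any paper: with unequal trials per task, the pooled proportion and the mean of per-task rates disagree, and only the pooled one is a binomial proportion.

```python
# Hypothetical per-task results: (successes, trials); names are made up.
tasks = {"fold_shirt": (9, 10), "make_espresso": (4, 20), "pack_box": (7, 10)}

# Pooled estimate: lump all trials into one binomial. Only valid if
# trials are exchangeable across tasks.
pooled = sum(s for s, _ in tasks.values()) / sum(n for _, n in tasks.values())

# Mean of per-task success rates: weights tasks equally, but the result
# is no longer a binomial proportion, so a binomial interval does not
# directly apply to it; reporting per-task rates unpooled sidesteps this.
per_task = {name: s / n for name, (s, n) in tasks.items()}
mean_rate = sum(per_task.values()) / len(per_task)

# With unequal trial counts the two summaries disagree:
# pooled = 20/40 = 0.5, mean_rate = (0.9 + 0.2 + 0.7) / 3 = 0.6
```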
Kevin Black
Kevin Black@kvablack·
@kenbwork the SE-based interval is symmetric because it relies on the CLT, which is fine for arbitrary distributions and a large enough sample size. but if you have a smaller sample size and you know the distribution is binomial, you can do better (e.g., the Wilson score interval)
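A minimal sketch of the comparison being discussed, using only the standard library; the success counts are illustrative, not from the Generalist evals. Near a success rate of 1 with few trials, the symmetric normal-approximation (Wald) interval can spill past 1.0, while the Wilson score interval is asymmetric and stays inside [0, 1]:

```python
import math

def normal_ci(successes, n, z=1.96):
    """Wald interval: p_hat +/- z * SE. Symmetric about p_hat;
    unreliable when n*p or n*(1-p) is small (rule of thumb: > 10)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return (p - z * se, p + z * se)

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval: asymmetric, bounded within [0, 1]."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# 18 successes out of 20 trials: success rate near 1, few trials.
lo, hi = normal_ci(18, 20)     # upper Wald bound exceeds 1.0 here
wlo, whi = wilson_ci(18, 20)   # Wilson bounds stay inside [0, 1]
```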
Kevin Black
Kevin Black@kvablack·
@aliuahma if you look it up it seems like the rule of thumb is np>10, which it doesn't seem like they have. but in practice I don't see a reason to ever use the normal approximation, especially with proportions near 0 or 1
ali
ali@aliuahma·
@kvablack isn't it okay as long as your sample proportion is calculated from a large number of trials? by CLT the distribution of p_hat approaches normality so using standard error is suitable
Kevin Black
Kevin Black@kvablack·
@Christian061145 it all depends on your constraints. inference-time RTC is still more convenient. however, we already do a lot of post-training so we may as well add something there, and this simple method seems to work well enough. I'm sure ppl will find other methods that work better.
0xCC
0xCC@Christian061145·
@kvablack Nice work! Do you think it will replace the more complex RTC alternatives in the near future?
Kevin Black
Kevin Black@kvablack·
Last week I presented real-time chunking (RTC) at NeurIPS, and we did a live coffee demo the very same evening. To celebrate, we're releasing a (very short) follow-up paper describing a training-time variant of RTC, which is what we've actually been using in our demos!
Kevin Black
Kevin Black@kvablack·
@m0hitsharma "d" in the paper includes network, but yeah, you need to know the rough range of delays at training time
Mohit Sharma
Mohit Sharma@m0hitsharma·
Cool paper. Next is: what happens if you run a huge (30+B) model which doesn't even run locally? We have network delay and inference delay, and they can always be larger than the assumed "d" in the paper (i.e., the delay during training). What do you do then? Also, choosing hparams correctly is critical.
Kevin Black@kvablack

Last week I presented real-time chunking (RTC) at NeurIPS, and we did a live coffee demo the very same evening. To celebrate, we're releasing a (very short) follow-up paper describing a training-time variant of RTC, which is what we've actually been using in our demos!

Kevin Black
Kevin Black@kvablack·
The method is stupidly simple -- we simulate delay at training time by conditioning on action prefixes. It only takes about 8 lines of code to implement, but it works just as well as inference-time RTC without the extra computational overhead.
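A toy sketch of the idea described above (simulating inference delay at training time by conditioning on an action prefix). The function name, the list-based action chunks, and the uniform delay sampling are hypothetical illustrations, not Physical Intelligence's actual implementation:

```python
import random

def make_prefix_conditioned_example(action_chunk, max_delay):
    """Sample a delay d and split the chunk: the first d actions are
    treated as already committed (in flight while the model 'thinks'),
    so the model conditions on them as a frozen prefix and is trained
    to predict only the remaining actions."""
    d = random.randint(0, max_delay)  # simulated delay, in action steps
    prefix = action_chunk[:d]         # conditioning: actions already executing
    target = action_chunk[d:]         # prediction target: the rest of the chunk
    return prefix, target

# Toy 8-step, 1-D action chunk; a real chunk would be an array of
# multi-DoF actions and the prefix would enter the model as conditioning.
chunk = [[0.1 * i] for i in range(8)]
prefix, target = make_prefix_conditioned_example(chunk, max_delay=3)
```

At inference, the delays seen in training must cover the real network-plus-inference latency, which is why the rough range of delays has to be known at training time.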
Kevin Black reposted
Physical Intelligence
Physical Intelligence@physical_int·
Our model can now learn from its own experience with RL! Our new π*0.6 model can more than double throughput over a base model trained without RL, and can perform real-world tasks: making espresso drinks, folding diverse laundry, and assembling boxes. More in the thread below.