Dan Busbridge

165 posts

Dan Busbridge

@danbusbridge

Machine Learning Research @ Apple (opinions are my own)

London, United Kingdom · Joined November 2014
913 Following · 873 Followers
Pinned Tweet
Dan Busbridge @danbusbridge
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 arxiv.org/abs/2502.08606
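For readers unfamiliar with the setup: a minimal sketch of the kind of distillation objective the thread is about, in PyTorch. The Hinton-style temperature-scaled KL loss below is illustrative; the temperature and weighting are assumptions, not the paper's settings.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL between teacher and student next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes roughly comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```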
Dan Busbridge @danbusbridge
Uncertainty methods and correctness metrics often share "mutual bias" (systematic errors from a common confounder like response length), skewing LLM evaluations. New paper from my colleagues shows that "LM-as-a-judge" evaluation is more robust and human-aligned. Important work - check it out! arxiv.org/abs/2504.13677
Andrea Santilli @teelinsan

Uncertainty quantification (UQ) is key for safe, reliable LLMs... but are we evaluating it correctly? 🚨 Our ACL2025 paper finds a hidden flaw: if both UQ methods and correctness metrics are biased by the same factor (e.g., response length), evaluations get systematically skewed

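A toy illustration of the "mutual bias" failure mode described above, assuming the shared confounder is response length: when both the uncertainty score and the correctness metric depend on length, they look correlated even though neither carries real signal about the other. The data and coefficients are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Shared confounder: response length in tokens.
length = rng.integers(10, 500, size=n)

# A UQ score and a correctness metric that both (spuriously) depend on length,
# plus independent noise -- neither actually measures the other.
uq_score = 0.002 * length + rng.normal(scale=0.2, size=n)
correctness = -0.002 * length + rng.normal(scale=0.2, size=n)

# Raw correlation makes the UQ method look informative ...
print(np.corrcoef(uq_score, correctness)[0, 1])   # strongly negative

# ... but regressing out the shared confounder removes most of the signal.
resid_uq = uq_score - np.poly1d(np.polyfit(length, uq_score, 1))(length)
resid_corr = correctness - np.poly1d(np.polyfit(length, correctness, 1))(length)
print(np.corrcoef(resid_uq, resid_corr)[0, 1])    # near zero
```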
Jeremy Bernstein @jxbz
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies
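For context, a rough sketch of the Muon-style update the tweet references: momentum followed by approximate orthogonalization of the 2D weight update via Newton-Schulz iterations. The simple 1.5/-0.5 coefficients below are illustrative assumptions, not modula's or Muon's tuned values.

```python
import torch

def orthogonalize(M, steps=5):
    """Approximately map M to the nearest (semi-)orthogonal matrix with a cubic
    Newton-Schulz iteration. Coefficients here are the plain 1.5/-0.5 variant."""
    X = M / (M.norm() + 1e-7)        # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.mT @ X
    return X

def muon_like_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight: accumulate momentum, then apply
    the orthogonalized direction."""
    momentum.mul_(beta).add_(grad)
    param.add_(orthogonalize(momentum), alpha=-lr)
```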
Dan Busbridge @danbusbridge
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!
Dan Busbridge @danbusbridge
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
Mustafa Shukor @MustafaShukor1

We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵

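A minimal sketch of the general workflow such data-mixture scaling laws enable: fit a parametric loss model on small-scale runs, then extrapolate to choose a mixture for a much larger model. The functional form, the toy run data, and `loss_model` below are illustrative assumptions, not the paper's law.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize_scalar

# Toy records from small-scale runs: (model params N, fraction of domain A, final loss).
runs = np.array([
    (1e7, 0.2, 3.90), (1e7, 0.5, 3.72), (1e7, 0.8, 3.81),
    (3e7, 0.2, 3.55), (3e7, 0.5, 3.38), (3e7, 0.8, 3.47),
    (1e8, 0.2, 3.21), (1e8, 0.5, 3.05), (1e8, 0.8, 3.15),
])
N, w, loss = runs.T

def loss_model(X, E, a, alpha, b):
    """Illustrative form: power law in model size plus a penalty in mixture weight."""
    n, frac = X
    return E + a * n ** (-alpha) + b * (frac - 0.5) ** 2

params, _ = curve_fit(loss_model, (N, w), loss, p0=(3.0, 10.0, 0.2, 1.0), maxfev=20_000)

# Extrapolate: pick the mixture weight that minimizes predicted loss at a larger scale.
best = minimize_scalar(lambda frac: loss_model((1e10, frac), *params),
                       bounds=(0.0, 1.0), method="bounded")
print(f"predicted optimal fraction of domain A at 1e10 params: {best.x:.2f}")
```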
Zeeshan Zia @ZeeshanZiaML
Distillation is pretty important for production ML systems, but its effectiveness is unpredictable: it helped enormously in Gemma 2, for example, yet hurt in MobileLLM. The "Distillation Scaling Laws" paper from Apple performs a thorough quantitative evaluation to derive broadly applicable laws for when a student is trained to match a teacher's next-token distribution. Great 1-hour talk walking us through this one paper by @danbusbridge today @icmlconf.
Dan Busbridge @danbusbridge
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo — exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A 🔗 icml.cc/virtual/2025/4…
Dan Busbridge retweeted
Jason Ramapuram @jramapuram
Stop by poster #596 at 10AM-12:30PM tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushed 8 trajectory checkpoints each for two 7B LLMs for Sigmoid Attention and a 1:1 Softmax Attention (trained with a deterministic dataloader for 1T tokens):
- Sigmoid: gs://axlearn-public/experiments/gala-7B-sigmoid-hybridnorm-alibi-sprp-2024-12-03-1002/checkpoints/
- Softmax: gs://axlearn-public/experiments/gala-7B-hybridnorm-alibi-sprp-2024-12-02-1445/checkpoints/
Inference code at github.com/apple/ml-sigmo…
Jason Ramapuram @jramapuram

Small update on SigmoidAttn (arXiv incoming).
- 1B and 7B LLM results added and stabilized.
- Hybrid Norm [on embed dim, not seq dim], `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`, stabilizes longer sequences (n=4096) and larger models (7B). H-norm used with Grok-1 for example.

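A single-head sketch of the hybrid-norm sigmoid attention residual quoted above, `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`. Multi-head splitting, ALiBi biases, and masking are omitted, and the shapes and weights are illustrative; see the linked repo for the actual implementation.

```python
import torch

def sigmoid_attention_block(x, Wq, Wk, Wv, norm):
    """Single-head residual block: x + norm(sigmoid(Q K^T / sqrt(d_qk)) V)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_qk = Q.shape[-1]
    attn = torch.sigmoid(Q @ K.transpose(-2, -1) / d_qk ** 0.5)  # no softmax normalization over keys
    return x + norm(attn @ V)                                     # norm acts over the embedding dim

# Usage with illustrative shapes.
d_model = 64
x = torch.randn(2, 16, d_model)                                   # (batch, seq, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(3))
out = sigmoid_attention_block(x, Wq, Wk, Wv, torch.nn.LayerNorm(d_model))
```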
Dan Busbridge @danbusbridge
I've been curious about how early- vs late-fusion multimodal approaches compare under controlled conditions. Great to see this studied in depth. It turns out the optimal late-fusion model has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from @MustafaShukor1 and team! Check it out: arxiv.org/abs/2504.07951
Mustafa Shukor @MustafaShukor1

We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities?
🧵

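A minimal sketch of the early- vs late-fusion distinction being studied: early fusion runs one shared transformer over the concatenated token streams, while late fusion encodes each modality separately and only combines afterwards. The modules and shapes below are illustrative, not the paper's architectures.

```python
import torch
import torch.nn as nn

d = 256
shared = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=4)
text_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
img_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

text_tokens = torch.randn(2, 32, d)   # (batch, text_seq, d) -- illustrative shapes
image_tokens = torch.randn(2, 49, d)  # (batch, image_patches, d)

# Early fusion: one shared model over the concatenated token sequence.
early = shared(torch.cat([text_tokens, image_tokens], dim=1))

# Late fusion: modality-specific encoders first, features combined only afterwards.
late = torch.cat([text_enc(text_tokens).mean(1), img_enc(image_tokens).mean(1)], dim=-1)
```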