Dan Busbridge

165 posts

Dan Busbridge

@danbusbridge

Machine Learning Research @ Apple (opinions are my own)

London, United Kingdom · Joined November 2014
913 Following · 873 Followers
Pinned Tweet
Dan Busbridge @danbusbridge
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 arxiv.org/abs/2502.08606
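For readers unfamiliar with the setup: a minimal sketch of the kind of distillation objective the thread is about, in PyTorch. The Hinton-style temperature-scaled KL loss below is illustrative; the temperature and weighting are assumptions, not the paper's settings.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL between teacher and student next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes roughly comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```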
Dan Busbridge @danbusbridge
Uncertainty methods and correctness metrics often share "mutual bias" (systematic errors from a common confounder like response length), skewing LLM evaluations. New paper from my colleagues shows that "LM-as-a-judge" evaluation is more robust and human-aligned. Important work - check it out! arxiv.org/abs/2504.13677
Andrea Santilli @teelinsan

Uncertainty quantification (UQ) is key for safe, reliable LLMs... but are we evaluating it correctly? 🚨 Our ACL2025 paper finds a hidden flaw: if both UQ methods and correctness metrics are biased by the same factor (e.g., response length), evaluations get systematically skewed

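A toy illustration of the "mutual bias" failure mode described above, assuming the shared confounder is response length: when both the uncertainty score and the correctness metric depend on length, they look correlated even though neither carries real signal about the other. The data and coefficients are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Shared confounder: response length in tokens.
length = rng.integers(10, 500, size=n)

# A UQ score and a correctness metric that both (spuriously) depend on length,
# plus independent noise -- neither actually measures the other.
uq_score = 0.002 * length + rng.normal(scale=0.2, size=n)
correctness = -0.002 * length + rng.normal(scale=0.2, size=n)

# Raw correlation makes the UQ method look informative ...
print(np.corrcoef(uq_score, correctness)[0, 1])   # strongly negative

# ... but regressing out the shared confounder removes most of the signal.
resid_uq = uq_score - np.poly1d(np.polyfit(length, uq_score, 1))(length)
resid_corr = correctness - np.poly1d(np.polyfit(length, correctness, 1))(length)
print(np.corrcoef(resid_uq, resid_corr)[0, 1])    # near zero
```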
Jeremy Bernstein @jxbz
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies
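For context, a rough sketch of the Muon-style update the tweet references: momentum followed by approximate orthogonalization of the 2D weight update via Newton-Schulz iterations. The simple 1.5/-0.5 coefficients below are illustrative assumptions, not modula's or Muon's tuned values.

```python
import torch

def orthogonalize(M, steps=5):
    """Approximately map M to the nearest (semi-)orthogonal matrix with a cubic
    Newton-Schulz iteration. Coefficients here are the plain 1.5/-0.5 variant."""
    X = M / (M.norm() + 1e-7)        # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.mT @ X
    return X

def muon_like_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight: accumulate momentum, then apply
    the orthogonalized direction."""
    momentum.mul_(beta).add_(grad)
    param.add_(orthogonalize(momentum), alpha=-lr)
```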
Dan Busbridge @danbusbridge
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!
Dan Busbridge @danbusbridge
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
Mustafa Shukor @MustafaShukor1

We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵

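A minimal sketch of the general workflow such data-mixture scaling laws enable: fit a parametric loss model on small-scale runs, then extrapolate to choose a mixture for a much larger model. The functional form, the toy run data, and `loss_model` below are illustrative assumptions, not the paper's law.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize_scalar

# Toy records from small-scale runs: (model params N, fraction of domain A, final loss).
runs = np.array([
    (1e7, 0.2, 3.90), (1e7, 0.5, 3.72), (1e7, 0.8, 3.81),
    (3e7, 0.2, 3.55), (3e7, 0.5, 3.38), (3e7, 0.8, 3.47),
    (1e8, 0.2, 3.21), (1e8, 0.5, 3.05), (1e8, 0.8, 3.15),
])
N, w, loss = runs.T

def loss_model(X, E, a, alpha, b):
    """Illustrative form: power law in model size plus a penalty in mixture weight."""
    n, frac = X
    return E + a * n ** (-alpha) + b * (frac - 0.5) ** 2

params, _ = curve_fit(loss_model, (N, w), loss, p0=(3.0, 10.0, 0.2, 1.0), maxfev=20_000)

# Extrapolate: pick the mixture weight that minimizes predicted loss at a larger scale.
best = minimize_scalar(lambda frac: loss_model((1e10, frac), *params),
                       bounds=(0.0, 1.0), method="bounded")
print(f"predicted optimal fraction of domain A at 1e10 params: {best.x:.2f}")
```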
Zeeshan Zia @ZeeshanZiaML
Distillation is pretty important for production ML systems, but its effectiveness is unpredictable: it helped enormously in Gemma 2, for example, yet hurt in MobileLLM. The "Distillation Scaling Laws" paper from Apple performs a thorough quantitative evaluation to derive broadly applicable laws for when a student is trained to match a teacher's next-token distribution. Great 1-hour talk walking us through this one paper by @danbusbridge today @icmlconf.
Dan Busbridge @danbusbridge
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo — exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A 🔗 icml.cc/virtual/2025/4…
Dan Busbridge retweeted
Jason Ramapuram @jramapuram
Stop by poster #596 at 10AM-12:30PM tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushed 8 trajectory checkpoints each for two 7B LLMs for Sigmoid Attention and a 1:1 Softmax Attention (trained with a deterministic dataloader for 1T tokens):
- Sigmoid: gs://axlearn-public/experiments/gala-7B-sigmoid-hybridnorm-alibi-sprp-2024-12-03-1002/checkpoints/
- Softmax: gs://axlearn-public/experiments/gala-7B-hybridnorm-alibi-sprp-2024-12-02-1445/checkpoints/
Inference code at github.com/apple/ml-sigmo…
Jason Ramapuram @jramapuram

Small update on SigmoidAttn (arXiv incoming).
- 1B and 7B LLM results added and stabilized.
- Hybrid Norm [on embed dim, not seq dim], `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`, stabilizes longer sequences (n=4096) and larger models (7B). H-norm used with Grok-1 for example.

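A single-head sketch of the hybrid-norm sigmoid attention residual quoted above, `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`. Multi-head splitting, ALiBi biases, and masking are omitted, and the shapes and weights are illustrative; see the linked repo for the actual implementation.

```python
import torch

def sigmoid_attention_block(x, Wq, Wk, Wv, norm):
    """Single-head residual block: x + norm(sigmoid(Q K^T / sqrt(d_qk)) V)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_qk = Q.shape[-1]
    attn = torch.sigmoid(Q @ K.transpose(-2, -1) / d_qk ** 0.5)  # no softmax normalization over keys
    return x + norm(attn @ V)                                     # norm acts over the embedding dim

# Usage with illustrative shapes.
d_model = 64
x = torch.randn(2, 16, d_model)                                   # (batch, seq, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(3))
out = sigmoid_attention_block(x, Wq, Wk, Wv, torch.nn.LayerNorm(d_model))
```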
Dan Busbridge @danbusbridge
I've been curious about how early- vs late-fusion multimodal approaches compare under controlled conditions. Great to see this studied in depth. It turns out the optimal late-fusion model has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from @MustafaShukor1 and team! Check it out: arxiv.org/abs/2504.07951
Mustafa Shukor @MustafaShukor1

We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities?
🧵

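A minimal sketch of the early- vs late-fusion distinction being studied: early fusion runs one shared transformer over the concatenated token streams, while late fusion encodes each modality separately and only combines afterwards. The modules and shapes below are illustrative, not the paper's architectures.

```python
import torch
import torch.nn as nn

d = 256
shared = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=4)
text_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
img_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

text_tokens = torch.randn(2, 32, d)   # (batch, text_seq, d) -- illustrative shapes
image_tokens = torch.randn(2, 49, d)  # (batch, image_patches, d)

# Early fusion: one shared model over the concatenated token sequence.
early = shared(torch.cat([text_tokens, image_tokens], dim=1))

# Late fusion: modality-specific encoders first, features combined only afterwards.
late = torch.cat([text_enc(text_tokens).mean(1), img_enc(image_tokens).mean(1)], dim=-1)
```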