Adam Santoro

1.2K posts

@santoroAI

Research Scientist in artificial intelligence at DeepMind

Montréal, Québec · Joined May 2016
222 Following · 9.1K Followers
Pinned Tweet
Adam Santoro (@santoroAI):
Transformers can be made sparse across their depth. When trained isoFLOP, we can match or exceed the performance of vanilla models, while saving inference FLOPs arxiv.org/abs/2404.02258
Adam Santoro retweeted
finbarr (@finbarrtimbers):
Reading the "Mixture of Depths" paper, which comes up with a novel way to conditionally apply compute depth-wise in a decoder. Basically, they use standard MoE-style expert-choice routing, but they use it to choose which tokens get to go through every block in the decoder.
[image attached]
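To make the routing idea above concrete, here is a minimal PyTorch sketch of MoD-style expert-choice routing over the sequence. It is illustrative only: the module names, the 12.5% capacity default, and the exact way the router score is applied are assumptions, not the paper's code; the thread below mentions applying such a block on every other layer of the decoder.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Expert-choice routing across the sequence: only the top-k tokens (by
    router score) pass through the wrapped block; the rest skip it on the
    residual path. Illustrative sketch, not the paper's implementation."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block            # residual branch: (B, k, D) -> (B, k, D)
        self.router = nn.Linear(d_model, 1, bias=False)
        self.capacity = capacity      # fraction of tokens processed per block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)            # (B, T) router logits
        top = scores.topk(k, dim=-1).indices           # expert choice: k tokens per sequence
        idx = top.unsqueeze(-1).expand(-1, -1, D)      # gather/scatter indices, (B, k, D)
        selected = x.gather(1, idx)                    # routed tokens
        delta = self.block(selected)                   # block's residual update for them
        weight = scores.gather(1, top).unsqueeze(-1)   # scale by router score so the router gets gradients
        out = x.clone()
        out.scatter_add_(1, idx, weight * delta)       # unselected tokens pass through unchanged
        return out


# Usage sketch: wrap a toy residual branch and run a batch through it.
toy_branch = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
mod = MoDBlock(toy_branch, d_model=64, capacity=0.125)
y = mod(torch.randn(2, 32, 64))        # (batch=2, seq=32, d_model=64)
```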
Adam Santoro retweeted
Michael Chang (@mmmbchang):
Gemini and I also got a chance to watch the @OpenAI live announcement of gpt4o, using Project Astra! Congrats to the OpenAI team, super impressive work!
Quoting Michael Chang (@mmmbchang):

It's such an honor to work on Project Astra with such an amazing team from across Gemini and Google DeepMind! While the #GoogleIO keynote was happening we had a last minute idea of watching the keynote with Project Astra. Check it out!

Adam Santoro retweeted
Google DeepMind (@GoogleDeepMind):
We watched #GoogleIO with Project Astra. 👀
Adam Santoro (@santoroAI):
@ivanleomk The FLOPs in the feedforward are not the same (MoD uses fewer), but you need to make the total training FLOPs (FLOPs-per-ffw * training steps) the same to see the effect. So, MoD trains for more steps
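A toy bookkeeping example of the isoFLOP point above; the numbers are made up for illustration, only the relationship matters.

```python
# Illustrative isoFLOP bookkeeping: match total training FLOPs, not steps.
flop_budget = 1.0e18                 # total training FLOPs both runs must spend (made-up number)

vanilla_flops_per_step = 2.0e12      # FLOPs per training step, vanilla model (made-up)
mod_flops_per_step     = 1.1e12      # MoD spends fewer FLOPs per step (made-up)

vanilla_steps = flop_budget / vanilla_flops_per_step   # 500,000 steps
mod_steps     = flop_budget / mod_flops_per_step       # ~909,000 steps

# Same total FLOPs => the MoD model trains for more steps (and sees more tokens).
print(int(vanilla_steps), int(mod_steps))
```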
Ivan Leo (@ivanleomk):
@santoroAI Hmm, maybe it's a dumb question, but if every alternate block is a MoD block, and a large chunk of a MoD model's FLOPs sit within those MoD blocks, why are the FLOPs the same as a normal transformer's if we only process 12.5-25% of the tokens per block?
Ivan Leo (@ivanleomk):
I just read the new Mixture of Depths paper for the @latentspacepod paper club this Friday. Here's what I understood and some questions I had about it (and a link to some notes I made at the end).
Adam Santoro (@santoroAI):
@ivanleomk The top-k isn't causal because whether a token is part of the top-k depends on the router weights of tokens that are after it in the sequence. During sampling you don't have these router weights since you need to produce tokens in a causal sequence
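A small sketch of the causality issue: the top-k decision for a token depends on the router scores of later tokens, so it cannot be used as-is during autoregressive sampling. The per-token predictor below is one workaround in the spirit of the "MLP router for inference time" mentioned later in this thread; it is an assumed design, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn as nn

d_model, T, k = 64, 16, 4
x = torch.randn(1, T, d_model)
router = nn.Linear(d_model, 1, bias=False)

# Training-time routing: top-k over the WHOLE sequence. Whether token t is
# selected depends on the scores of tokens t+1, t+2, ... -> not causal.
scores = router(x).squeeze(-1)                               # (1, T)
selected = torch.zeros_like(scores)
selected.scatter_(1, scores.topk(k, dim=-1).indices, 1.0)    # 1.0 where routed

# Sampling-time alternative: a small per-token predictor decides, from the
# newest token alone, whether it would have made the top-k (assumed design).
aux_predictor = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
route_newest_token = aux_predictor(x[:, -1]) > 0.5           # causal: uses no future info
```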
Ivan Leo (@ivanleomk):
5/ I'm still not fully sure how the process of top-k sampling isn't causal in nature. My understanding so far:
1. We use top-k to choose a specific set of tokens.
2. Attention is then computed as per normal on this subset of tokens.
3. A previous token doesn't have access to the new token's state, but the output of its attention was dependent on future tokens (through the top-k selection).
4. This violates the causal nature of the transformer's auto-regressive sampling.
Adam Santoro (@santoroAI):
@ivanleomk Training is not faster (it takes the same amount of FLOPs, and roughly the same wall-clock time). Rather, the resultant model is ~50% faster to step during sampling (post-training) because it requires ~50% of the FLOPs in the feedforward.
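A rough back-of-the-envelope for that inference saving, using the 12.5% capacity and alternating MoD blocks mentioned elsewhere in this thread; it lumps all block FLOPs together, so it is purely illustrative.

```python
# Rough per-step inference FLOPs relative to a vanilla model (illustrative).
capacity = 0.125            # fraction of tokens a MoD block processes
mod_block_fraction = 0.5    # every other block is a MoD block

relative_flops = mod_block_fraction * capacity + (1 - mod_block_fraction) * 1.0
print(relative_flops)       # ~0.56 -> roughly half the per-step FLOPs at sampling time
```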
Ivan Leo (@ivanleomk):
4/ This works quite well! We're able to see almost 2x faster training with MoD models in some cases, with better performance for the same number of FLOPs. But I'm still a little confused about how something could be 2x faster and yet take the same wall-clock time?
[image attached]
Paisley (@KujoJot32604166):
@santoroAI Hi Adam, great work your team is doing here! I'm trying to implement MoD in one of my projects; do you mind sharing a bit about how you dealt with RoPE? After the router, are the resampled tokens treated as a new continuous sequence, or do they keep their original positions in the sequence?
Adam Santoro (@santoroAI):
@iamgrigorev And increasing batch size, or model size, or depth, etc., each has implications for how you tune the optimizer.
Adam Santoro (@santoroAI):
@iamgrigorev Apologies for not being explicit: when I say match training FLOPs, I mean *exactly* matching. So you need to calculate the FLOPs per ffw of each model and tune the training steps accordingly
Adam Santoro retweeted
George Grigorev (@iamgrigorev):
I have implemented Mixture-of-Depths and it shows a significant memory reduction during training and a 10% speed increase. I will verify whether it achieves the same quality with 12.5% active tokens. github.com/thepowerfuldee… Thanks @haeggee for the initial code.
[image attached]
Adam Santoro (@santoroAI):
@iamgrigorev Thanks for the update! FYI if you don't make up for the lost FLOPs in some way (e.g. train isoFLOP) then performance will be worse. As you can see in the paper, wall clock/FLOPs are the same during training, not total tokens. The wins then come with inference speed
George Grigorev (@iamgrigorev):
Update on Mixture-of-Depths performance. Time to reach 10B tokens:
- With MoD: 47.3h
- Without MoD: 55.3h
Speed boost: 17%
As you can see on the plots, quality degrades on average compared to the baseline, although on the average of Piqa/Arc_easy/Sciq there is no difference.
[image attached]
Adam Santoro (@santoroAI):
@iamgrigorev @felix_red_panda @haeggee I agree, figuring out the best routing pattern per layer is an interesting thing to explore. No doubt there's something better than choosing some constant throughout the depth
George Grigorev (@iamgrigorev):
@santoroAI @felix_red_panda @haeggee By the way, I wonder if deeper layers might need more capacity_factor or vice-versa, but then it would be harder to write training code properly. Though you've shown that 12.5% vs 50% didn't show much of a difference
Adam Santoro (@santoroAI):
@felix_red_panda @iamgrigorev @haeggee All the layers will always be active; the speed increases come from having to process a fraction of the sequence instead of the full thing. That fraction is constant as you change batch size.
Adam Santoro (@santoroAI):
@iamgrigorev @haeggee Awesome! Cool to see memory reductions too; we knew they should be there but didn't measure them.
Adam Santoro (@santoroAI):
@EsotericCofe and if you haven't yet, don't forget to set the training FLOP budgets appropriately (rather than training step budgets) so that the LR schedules are correct and you don't undertrain the model
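One way to read the point above: derive the number of training steps (and therefore the LR schedule length) from the FLOP budget, so a model with cheaper steps decays its learning rate over correspondingly more steps. A minimal sketch with made-up numbers and a generic cosine schedule, not the paper's training code.

```python
import math

def steps_for_flop_budget(flop_budget: float, flops_per_step: float) -> int:
    # How many optimizer steps a model can take within a fixed FLOP budget.
    return int(flop_budget / flops_per_step)

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    # Generic cosine decay stretched over the model's own step count.
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

flop_budget = 1.0e18                                         # made-up budget
vanilla_steps = steps_for_flop_budget(flop_budget, 2.0e12)   # made-up per-step FLOPs
mod_steps     = steps_for_flop_budget(flop_budget, 1.1e12)   # MoD: cheaper steps -> more of them

# If the MoD run reused the vanilla step budget, its cosine schedule would end
# early relative to its FLOP budget and the model would be undertrained.
print(vanilla_steps, mod_steps, cosine_lr(mod_steps, mod_steps))
```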
Adam Santoro (@santoroAI):
@EsotericCofe Nice work! If you plot by flops instead of steps you'll get a better perspective on whether the implementation is working well (ideally the MoD transformer will have a better loss than vanilla throughout training, plotted by flops)
Nucleus☕️ (@EsotericCofe):
Implementation seems to work. Blue: MoD transformer; other: vanilla transformer. When I get home I'll try my hand at the MLP router for inference time.
[image attached]
Alex Hägele (@haeggee):
@santoroAI @MatPagliardini @akmohtashami_a @Olivia61368522 And actually another question: do you have a specific reason to use the scalar of the linear projection directly as a weight? I would imagine these weights to be generally close to zero because of the dimensionality. Why not pass it through e.g. a sigmoid?
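For concreteness, the two weighting choices being asked about look like this; a toy sketch with illustrative variable names, and nothing here is claimed about which choice the paper ultimately justifies.

```python
import torch
import torch.nn as nn

d_model = 64
router = nn.Linear(d_model, 1, bias=False)
token = torch.randn(1, d_model)
block_update = torch.randn(1, d_model)       # stand-in for the block's residual update

raw_weight = router(token)                   # unconstrained scalar: can be negative and,
                                             # as the question notes, tends to be near zero
sigmoid_weight = torch.sigmoid(raw_weight)   # squashed to (0, 1), ~0.5 at initialization

out_raw     = raw_weight * block_update      # weight the update by the raw projection
out_sigmoid = sigmoid_weight * block_update  # or pass it through a sigmoid first
```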
Alex Hägele (@haeggee):
In case you didn't see it yesterday: Mixture-of-Depths is a really nice idea for dynamic compute. I decided to quickly code up a MoD block in a small GPT and try it out -- if you want to play with it too (and please check correctness!), the code is here: github.com/epfml/llm-base…
[image attached]
Quoting Hassan Hayat 🔥 (@TheSeaMouse):

Why Google Deepmind's Mixture-of-Depths paper, and more generally dynamic compute methods, matter: Most of the compute is WASTED because not all tokens are equally hard to predict
