Adam Santoro

1.2K posts

@santoroAI

Research Scientist in artificial intelligence at DeepMind

Montréal, Québec · Joined May 2016
222 Following · 9.1K Followers
Pinned Tweet
Adam Santoro (@santoroAI):
Transformers can be made sparse across their depth. When trained isoFLOP, we can match or exceed the performance of vanilla models, while saving inference FLOPs arxiv.org/abs/2404.02258
Adam Santoro retweeted
finbarr (@finbarrtimbers):
Reading the "Mixture of Depths" paper, which comes up with a novel way to conditionally apply compute depth-wise in a decoder. Basically, they use standard MoE-style expert-choice routing, but they use it to choose which tokens get to go through every block in the decoder.
[image attached]
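To make the routing idea above concrete, here is a minimal PyTorch sketch of MoD-style expert-choice routing over the sequence. It is illustrative only: the module names, the 12.5% capacity default, and the exact way the router score is applied are assumptions, not the paper's code; the thread below mentions applying such a block on every other layer of the decoder.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Expert-choice routing across the sequence: only the top-k tokens (by
    router score) pass through the wrapped block; the rest skip it on the
    residual path. Illustrative sketch, not the paper's implementation."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block            # residual branch: (B, k, D) -> (B, k, D)
        self.router = nn.Linear(d_model, 1, bias=False)
        self.capacity = capacity      # fraction of tokens processed per block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)            # (B, T) router logits
        top = scores.topk(k, dim=-1).indices           # expert choice: k tokens per sequence
        idx = top.unsqueeze(-1).expand(-1, -1, D)      # gather/scatter indices, (B, k, D)
        selected = x.gather(1, idx)                    # routed tokens
        delta = self.block(selected)                   # block's residual update for them
        weight = scores.gather(1, top).unsqueeze(-1)   # scale by router score so the router gets gradients
        out = x.clone()
        out.scatter_add_(1, idx, weight * delta)       # unselected tokens pass through unchanged
        return out


# Usage sketch: wrap a toy residual branch and run a batch through it.
toy_branch = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
mod = MoDBlock(toy_branch, d_model=64, capacity=0.125)
y = mod(torch.randn(2, 32, 64))        # (batch=2, seq=32, d_model=64)
```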
Adam Santoro retweeted
Michael Chang (@mmmbchang):
Gemini and I also got a chance to watch the @OpenAI live announcement of gpt4o, using Project Astra! Congrats to the OpenAI team, super impressive work!
Quoting Michael Chang (@mmmbchang):

It's such an honor to work on Project Astra with such an amazing team from across Gemini and Google DeepMind! While the #GoogleIO keynote was happening we had a last minute idea of watching the keynote with Project Astra. Check it out!

Adam Santoro retweeted
Google DeepMind (@GoogleDeepMind):
We watched #GoogleIO with Project Astra. 👀
Adam Santoro (@santoroAI):
@ivanleomk The FLOPs in the feedforward are not the same (MoD uses fewer), but you need to make the total training FLOPs (FLOPs-per-ffw * training steps) the same to see the effect. So, MoD trains for more steps
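A toy bookkeeping example of the isoFLOP point above; the numbers are made up for illustration, only the relationship matters.

```python
# Illustrative isoFLOP bookkeeping: match total training FLOPs, not steps.
flop_budget = 1.0e18                 # total training FLOPs both runs must spend (made-up number)

vanilla_flops_per_step = 2.0e12      # FLOPs per training step, vanilla model (made-up)
mod_flops_per_step     = 1.1e12      # MoD spends fewer FLOPs per step (made-up)

vanilla_steps = flop_budget / vanilla_flops_per_step   # 500,000 steps
mod_steps     = flop_budget / mod_flops_per_step       # ~909,000 steps

# Same total FLOPs => the MoD model trains for more steps (and sees more tokens).
print(int(vanilla_steps), int(mod_steps))
```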
Ivan Leo (@ivanleomk):
@santoroAI Hmm, maybe it's a dumb question, but if every alternate block is a MoD block, and a large chunk of a MoD model's FLOPs sit within those MoD blocks, why are the FLOPs the same as a normal transformer's if we only process 12.5-25% of the tokens per block?
Ivan Leo (@ivanleomk):
I just read the new Mixture of Depths paper for the @latentspacepod paper club this Friday. Here's what I understood and some questions I had about it (and a link to some notes I made at the end).
Adam Santoro (@santoroAI):
@ivanleomk The top-k isn't causal because whether a token is part of the top-k depends on the router weights of tokens that are after it in the sequence. During sampling you don't have these router weights since you need to produce tokens in a causal sequence
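A small sketch of the causality issue: the top-k decision for a token depends on the router scores of later tokens, so it cannot be used as-is during autoregressive sampling. The per-token predictor below is one workaround in the spirit of the "MLP router for inference time" mentioned later in this thread; it is an assumed design, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn as nn

d_model, T, k = 64, 16, 4
x = torch.randn(1, T, d_model)
router = nn.Linear(d_model, 1, bias=False)

# Training-time routing: top-k over the WHOLE sequence. Whether token t is
# selected depends on the scores of tokens t+1, t+2, ... -> not causal.
scores = router(x).squeeze(-1)                               # (1, T)
selected = torch.zeros_like(scores)
selected.scatter_(1, scores.topk(k, dim=-1).indices, 1.0)    # 1.0 where routed

# Sampling-time alternative: a small per-token predictor decides, from the
# newest token alone, whether it would have made the top-k (assumed design).
aux_predictor = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
route_newest_token = aux_predictor(x[:, -1]) > 0.5           # causal: uses no future info
```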
Ivan Leo (@ivanleomk):
5/ I'm still not fully sure how the process of top-k sampling isn't causal in nature. My understanding so far:
1. We use top-k to choose a specific set of tokens.
2. Attention is then computed as per normal on this subset of tokens.
3. A previous token doesn't have access to the new token's state, but the output of its attention was dependent on future tokens (through the top-k selection).
4. This violates the causal nature of the transformer's auto-regressive sampling.
Adam Santoro (@santoroAI):
@ivanleomk Training is not faster (it takes the same amount of FLOPs, and roughly the same wall-clock time). Rather, the resultant model is ~50% faster to step during sampling (post-training) because it requires ~50% of the FLOPs in the feedforward.
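A rough back-of-the-envelope for that inference saving, using the 12.5% capacity and alternating MoD blocks mentioned elsewhere in this thread; it lumps all block FLOPs together, so it is purely illustrative.

```python
# Rough per-step inference FLOPs relative to a vanilla model (illustrative).
capacity = 0.125            # fraction of tokens a MoD block processes
mod_block_fraction = 0.5    # every other block is a MoD block

relative_flops = mod_block_fraction * capacity + (1 - mod_block_fraction) * 1.0
print(relative_flops)       # ~0.56 -> roughly half the per-step FLOPs at sampling time
```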
Ivan Leo (@ivanleomk):
4/ This works quite well! We're able to see almost 2x faster training with MoD models in some cases, with better performance for the same number of FLOPs. But I'm still a little confused about how something could be 2x faster and yet take the same wall-clock time?
[image attached]
Paisley (@KujoJot32604166):
@santoroAI Hi Adam, great work your team is doing here! I'm trying to implement MoD in one of my projects; do you mind sharing a bit about how you dealt with RoPE? After the router, are the resampled tokens treated as a new continuous sequence, or do they keep their original positions in the sequence?
Adam Santoro (@santoroAI):
@iamgrigorev And increasing batch size, or model size, or depth, etc., each has implications for how you tune the optimizer.
Adam Santoro (@santoroAI):
@iamgrigorev Apologies for not being explicit: when I say match training FLOPs, I mean *exactly* matching. So you need to calculate the FLOPs per ffw of each model and tune the training steps accordingly
Adam Santoro retweeted
George Grigorev (@iamgrigorev):
I have implemented Mixture-of-Depths and it shows a significant memory reduction during training and a 10% speed increase. I will verify whether it achieves the same quality with 12.5% active tokens. github.com/thepowerfuldee… Thanks @haeggee for the initial code.
[image attached]
Adam Santoro (@santoroAI):
@iamgrigorev Thanks for the update! FYI if you don't make up for the lost FLOPs in some way (e.g. train isoFLOP) then performance will be worse. As you can see in the paper, wall clock/FLOPs are the same during training, not total tokens. The wins then come with inference speed
George Grigorev (@iamgrigorev):
Update on Mixture-of-Depths performance. Time to reach 10B tokens:
- With MoD: 47.3h
- Without MoD: 55.3h
Speed boost: 17%
As you can see on the plots, quality degrades on average compared to the baseline, although on the average of Piqa/Arc_easy/Sciq there is no difference.
[image attached]
Adam Santoro (@santoroAI):
@iamgrigorev @felix_red_panda @haeggee I agree, figuring out the best routing pattern per layer is an interesting thing to explore. No doubt there's something better than choosing some constant throughout the depth
George Grigorev (@iamgrigorev):
@santoroAI @felix_red_panda @haeggee By the way, I wonder if deeper layers might need more capacity_factor or vice-versa, but then it would be harder to write training code properly. Though you've shown that 12.5% vs 50% didn't show much of a difference
Adam Santoro (@santoroAI):
@felix_red_panda @iamgrigorev @haeggee All the layers will always be active; the speed increases come from having to process a fraction of the sequence instead of the full thing. That fraction is constant as you change batch size.
Adam Santoro (@santoroAI):
@iamgrigorev @haeggee Awesome! Cool to see memory reductions too; we knew they should be there but didn't measure them.
Adam Santoro (@santoroAI):
@EsotericCofe and if you haven't yet, don't forget to set the training FLOP budgets appropriately (rather than training step budgets) so that the LR schedules are correct and you don't undertrain the model
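One way to read the point above: derive the number of training steps (and therefore the LR schedule length) from the FLOP budget, so a model with cheaper steps decays its learning rate over correspondingly more steps. A minimal sketch with made-up numbers and a generic cosine schedule, not the paper's training code.

```python
import math

def steps_for_flop_budget(flop_budget: float, flops_per_step: float) -> int:
    # How many optimizer steps a model can take within a fixed FLOP budget.
    return int(flop_budget / flops_per_step)

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    # Generic cosine decay stretched over the model's own step count.
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

flop_budget = 1.0e18                                         # made-up budget
vanilla_steps = steps_for_flop_budget(flop_budget, 2.0e12)   # made-up per-step FLOPs
mod_steps     = steps_for_flop_budget(flop_budget, 1.1e12)   # MoD: cheaper steps -> more of them

# If the MoD run reused the vanilla step budget, its cosine schedule would end
# early relative to its FLOP budget and the model would be undertrained.
print(vanilla_steps, mod_steps, cosine_lr(mod_steps, mod_steps))
```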
Adam Santoro (@santoroAI):
@EsotericCofe Nice work! If you plot by flops instead of steps you'll get a better perspective on whether the implementation is working well (ideally the MoD transformer will have a better loss than vanilla throughout training, plotted by flops)
Nucleus☕️ (@EsotericCofe):
Implementation seems to work. Blue: MoD transformer; other: vanilla transformer. When I get home I'll try my hand at the MLP router for inference time.
[image attached]
Alex Hägele (@haeggee):
@santoroAI @MatPagliardini @akmohtashami_a @Olivia61368522 And actually another question: do you have a specific reason to use the scalar of the linear projection directly as a weight? I would imagine these weights to be generally close to zero because of the dimensionality. Why not pass it through e.g. a sigmoid?
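For concreteness, the two weighting choices being asked about look like this; a toy sketch with illustrative variable names, and nothing here is claimed about which choice the paper ultimately justifies.

```python
import torch
import torch.nn as nn

d_model = 64
router = nn.Linear(d_model, 1, bias=False)
token = torch.randn(1, d_model)
block_update = torch.randn(1, d_model)       # stand-in for the block's residual update

raw_weight = router(token)                   # unconstrained scalar: can be negative and,
                                             # as the question notes, tends to be near zero
sigmoid_weight = torch.sigmoid(raw_weight)   # squashed to (0, 1), ~0.5 at initialization

out_raw     = raw_weight * block_update      # weight the update by the raw projection
out_sigmoid = sigmoid_weight * block_update  # or pass it through a sigmoid first
```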
Alex Hägele (@haeggee):
In case you didn't see it yesterday: Mixture-of-Depths is a really nice idea for dynamic compute. I decided to quickly code up a MoD block in a small GPT and try it out -- if you want to play with it too (and please check correctness!), the code is here: github.com/epfml/llm-base…
[image attached]
Quoting Hassan Hayat 🔥 (@TheSeaMouse):

Why Google Deepmind's Mixture-of-Depths paper, and more generally dynamic compute methods, matter: Most of the compute is WASTED because not all tokens are equally hard to predict
