Daniel Hesslow

170 posts

@DanielHesslow

Making gpus go brrr in unison at @AdaptiveML

Joined November 2011
548 Following · 288 Followers
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
GSPO is everywhere. It powers Qwen3. It "fixes" GRPO. But what is it? We made a simple visual explanation. No jargon, no ML background needed. 👇
[attached image]
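For reference, a hedged sketch of the difference usually attributed to GSPO over GRPO (not the visual explainer itself, and the exact formulation may differ from the paper): both normalize rewards within a group of samples for the same prompt, but GRPO clips a per-token importance ratio while GSPO clips one length-normalized ratio per sequence.

    import torch

    def group_advantages(rewards):
        # rewards: [group] scalar reward per sampled response for one prompt.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    def grpo_ratios(logp_new, logp_old):
        # One importance ratio PER TOKEN, shape [group, seq].
        return torch.exp(logp_new - logp_old)

    def gspo_ratios(logp_new, logp_old, mask):
        # One length-normalized importance ratio PER SEQUENCE, shape [group].
        lengths = mask.sum(-1)
        return torch.exp(((logp_new - logp_old) * mask).sum(-1) / lengths)

    def clipped_objective(ratio, adv, eps=0.2):
        # PPO-style clipping; for GRPO, expand adv over tokens so shapes match.
        return torch.minimum(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()
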
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
AI agents aren't magic🪄🙅‍♂️. They're language models trained to predict very specific text. When ChatGPT says 🔍"searching..." it's generating structured text that triggers actual tools. Here's what's really happening 👇
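As a hedged illustration of what the thread means by "structured text that triggers actual tools" (call_llm and web_search here are hypothetical stand-ins, not a real API):

    import json

    def web_search(query: str) -> str:              # stand-in tool
        return f"(search results for {query!r})"

    TOOLS = {"search": web_search}

    def run_agent(user_msg: str, call_llm, max_steps: int = 8) -> str:
        transcript = [{"role": "user", "content": user_msg}]
        for _ in range(max_steps):
            reply = call_llm(transcript)             # the model only ever emits text
            transcript.append({"role": "assistant", "content": reply})
            try:
                call = json.loads(reply)             # e.g. {"tool": "search", "arg": "weather in Paris"}
            except json.JSONDecodeError:
                return reply                         # not a tool call -> final answer
            result = TOOLS[call["tool"]](call["arg"])
            transcript.append({"role": "tool", "content": result})
        return reply
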
Daniel Hesslow retweeted
Google AI Developers @googleaidevs ·
Some interesting use cases from @AdaptiveML powered by Gemma 🧵⬇️
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
Kimi K2 is a vision into the future of synthetic tool generation and agentic model training. Leveraging 3,000+ MCP tools, the team generated 20,000+ synthetic tools and used them to train their 1T MoE model. 📜 Paper and pipeline ⬇️
[attached image]
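A very rough, hypothetical sketch of the kind of pipeline the tweet describes (the real details are in the Kimi K2 paper; llm here is a stand-in completion function, not a real API): seed on real MCP tool specs, have a model invent new specs, then collect rollouts that use them as agentic training data.

    import json, random

    def synthesize_tool(seed_specs, llm):
        # Start from a real MCP tool spec and ask a model to invent a new one.
        seed = random.choice(seed_specs)
        prompt = ("Here is a real tool spec:\n" + json.dumps(seed, indent=2) +
                  "\nWrite a new, different tool spec in the same JSON format.")
        return json.loads(llm(prompt))

    def make_training_example(task, tool_spec, llm):
        # Agent turns and simulated tool outputs are both sampled from models;
        # the resulting transcript becomes agentic training data.
        transcript = [f"Task: {task}", f"Tool: {json.dumps(tool_spec)}"]
        for _ in range(4):
            transcript.append(llm("\n".join(transcript)))
        return "\n".join(transcript)
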
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
Using Adaptive Engine, @SKtelecom tuned open models as small as Gemma 3 4B to exceed frontier performance (GPT-4.1, 3.7 Sonnet, and o4-mini) at multilingual content moderation. Our research 📃 and full results 👇
[attached image]
Daniel Hesslow retweeted
ES-FoMo@ICML2025 @ESFoMo ·
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
[attached image]
Daniel Hesslow @DanielHesslow ·
@giffmana tbh it's not amazing, but it is less terrible than most other options. cargo is great & static typing lets you refactor stuff much faster. otoh the borrow checker really can be a pain for quick prototyping, and proc macros are a terrible hack that needs to be used a lot
Lucas Beyer (bl16) @giffmana ·
Redpill me on rust. What should I read to give it a reasonable chance? Knowing that I:
- 15y solid c++ experience up to and including c++14
- think c++ is a bad language
- think adding types to Python was a mistake
- don't foam at the mouth whenever I hear haskell
- am pragmatic…
Daniel Hesslow @DanielHesslow ·
@tri_dao Specdec improving throughput is a suuuper nice finding!
Daniel Hesslow @DanielHesslow ·
@Muennighoff @zach_nussbaum I was a bit unclear: information can leak in the forward pass from future tokens into previous ones. Here's an illustration where you can completely predict the next token through leakage in the routing. Obviously a very toy example: a single expert that chooses a single token
[attached image]
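A small toy sketch of the leakage being discussed (illustrative only, not the figure from the tweet): with expert-choice routing, an expert picks its top-k tokens over the whole sequence, so whether an earlier position gets the expert can change because of a later token.

    import torch

    def expert_choice_mask(scores, k):
        # scores: [seq_len] affinity of ONE expert for each token in the sequence.
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[torch.topk(scores, k).indices] = True
        return mask

    print(expert_choice_mask(torch.tensor([0.9, 0.2, 0.1, 0.0]), k=1))
    # -> position 0 gets the expert ...
    print(expert_choice_mask(torch.tensor([0.9, 0.2, 0.1, 1.5]), k=1))
    # -> ... unless a LATER token outbids it: position 0's routing (and hence its
    #    hidden state) now depends on a token that comes after it.
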
Niklas Muennighoff @Muennighoff ·
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments
📜 arxiv.org/abs/2409.02060 🧵1/9
[attached image]
Daniel Hesslow @DanielHesslow ·
@Muennighoff @zach_nussbaum (And assigning it to the last "hello" is just a matter of counting the number of previous hellos and having the routing be a function of that.) I guess maybe you can't actually leak that much information through this. Might just be too costly to be worth exploiting
Daniel Hesslow @DanielHesslow ·
@Muennighoff @zach_nussbaum Like you should be able to have one expert that says there are no more "hello"s and assign it to the last "hello" in the sequence or smth. But y'know, sometimes the optimization process is not strong enough to exploit every loophole. Cool finding tho!
Daniel Hesslow @DanielHesslow ·
@Muennighoff Interesting about EC vs TC, how do you do expert choice with a causal model?
Niklas Muennighoff @Muennighoff ·
OLMoE Experiments
1) Expert granularity, i.e. we use 64 small experts per layer
2) Dropless token choice beats expert choice routing
3) Shared experts worse for us
4) Sparse upcycling not helpful in our regime
Check the paper for more 🙂 🧵4/9
[attached image]
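For readers unfamiliar with the EC/TC shorthand in this exchange, a hedged sketch of the two routing variants (illustrative only; OLMoE's actual implementation is in their released code):

    import torch

    def token_choice(logits, k):
        # logits: [seq_len, n_experts]; the decision for position t uses only row t,
        # so it stays causal.
        return torch.topk(logits, k, dim=-1).indices         # [seq_len, k] experts per token

    def expert_choice(logits, capacity):
        # Each expert (column) picks its top-`capacity` tokens across the whole
        # sequence, which is why it is awkward for causal LMs (see the question above).
        return torch.topk(logits, capacity, dim=0).indices    # [capacity, n_experts] tokens per expert
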
Daniel Hesslow @DanielHesslow ·
This is a very nice direction from @PyTorch! Even when we need the highest possible performance, we can still use Torch as a first step and export the IR to external codebases with production guarantees around memory etc!
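The tweet doesn't name the exact API, but torch.export (PyTorch ≥ 2.1) is the usual way to capture a model's graph IR for consumption by an external codebase, so this is a minimal sketch under that assumption rather than the thread's own code:

    import torch
    from torch.export import export

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(16, 16)

        def forward(self, x):
            return torch.relu(self.fc(x))

    ep = export(MLP(), (torch.randn(2, 16),))
    print(ep.graph)   # captured ATen-level IR that an external runtime can consume
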
Daniel Hesslow @DanielHesslow ·
You can even get the compute graphs for training, which is super useful!
[attached image]
Daniel Hesslow @DanielHesslow ·
Want to know what actually goes on inside a PyTorch function? Found a new undocumented feature that shows it 👀
[attached image]