Daniel Hesslow

170 posts

@DanielHesslow

Making gpus go brrr in unison at @AdaptiveML

Joined November 2011
548 Following · 288 Followers
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
GSPO is everywhere. It powers Qwen3. It "fixes" GRPO. But what is it? We made a simple visual explanation. No jargon, no ML background needed. 👇
[attached image]
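For reference, a hedged sketch of the difference usually attributed to GSPO over GRPO (not the visual explainer itself, and the exact formulation may differ from the paper): both normalize rewards within a group of samples for the same prompt, but GRPO clips a per-token importance ratio while GSPO clips one length-normalized ratio per sequence.

    import torch

    def group_advantages(rewards):
        # rewards: [group] scalar reward per sampled response for one prompt.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    def grpo_ratios(logp_new, logp_old):
        # One importance ratio PER TOKEN, shape [group, seq].
        return torch.exp(logp_new - logp_old)

    def gspo_ratios(logp_new, logp_old, mask):
        # One length-normalized importance ratio PER SEQUENCE, shape [group].
        lengths = mask.sum(-1)
        return torch.exp(((logp_new - logp_old) * mask).sum(-1) / lengths)

    def clipped_objective(ratio, adv, eps=0.2):
        # PPO-style clipping; for GRPO, expand adv over tokens so shapes match.
        return torch.minimum(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()
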
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
AI agents aren't magic🪄🙅‍♂️. They're language models trained to predict very specific text. When ChatGPT says 🔍"searching..." it's generating structured text that triggers actual tools. Here's what's really happening 👇
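As a hedged illustration of what the thread means by "structured text that triggers actual tools" (call_llm and web_search here are hypothetical stand-ins, not a real API):

    import json

    def web_search(query: str) -> str:              # stand-in tool
        return f"(search results for {query!r})"

    TOOLS = {"search": web_search}

    def run_agent(user_msg: str, call_llm, max_steps: int = 8) -> str:
        transcript = [{"role": "user", "content": user_msg}]
        for _ in range(max_steps):
            reply = call_llm(transcript)             # the model only ever emits text
            transcript.append({"role": "assistant", "content": reply})
            try:
                call = json.loads(reply)             # e.g. {"tool": "search", "arg": "weather in Paris"}
            except json.JSONDecodeError:
                return reply                         # not a tool call -> final answer
            result = TOOLS[call["tool"]](call["arg"])
            transcript.append({"role": "tool", "content": result})
        return reply
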
Daniel Hesslow retweeted
Google AI Developers @googleaidevs ·
Some interesting use cases from @AdaptiveML powered by Gemma 🧵⬇️
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
Kimi K2 is a vision into the future of synthetic tool generation and agentic model training. Leveraging 3,000+ MCP tools, the team generated 20,000+ synthetic tools and used them to train their 1T MoE model. 📜 Paper and pipeline ⬇️
[attached image]
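A very rough, hypothetical sketch of the kind of pipeline the tweet describes (the real details are in the Kimi K2 paper; llm here is a stand-in completion function, not a real API): seed on real MCP tool specs, have a model invent new specs, then collect rollouts that use them as agentic training data.

    import json, random

    def synthesize_tool(seed_specs, llm):
        # Start from a real MCP tool spec and ask a model to invent a new one.
        seed = random.choice(seed_specs)
        prompt = ("Here is a real tool spec:\n" + json.dumps(seed, indent=2) +
                  "\nWrite a new, different tool spec in the same JSON format.")
        return json.loads(llm(prompt))

    def make_training_example(task, tool_spec, llm):
        # Agent turns and simulated tool outputs are both sampled from models;
        # the resulting transcript becomes agentic training data.
        transcript = [f"Task: {task}", f"Tool: {json.dumps(tool_spec)}"]
        for _ in range(4):
            transcript.append(llm("\n".join(transcript)))
        return "\n".join(transcript)
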
Daniel Hesslow retweeted
Adaptive ML @AdaptiveML ·
Using Adaptive Engine, @SKtelecom tuned open models as small as Gemma 3 4B to exceed frontier performance (GPT-4.1, 3.7 Sonnet, and o4-mini) at multilingual content moderation. Our research 📃 and full results 👇
[attached image]
Daniel Hesslow retweeted
ES-FoMo@ICML2025 @ESFoMo ·
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
[attached image]
Daniel Hesslow @DanielHesslow ·
@giffmana tbh it's not amazing, but it is less terrible than most other options. cargo is great & static typing lets you refactor stuff much faster. otoh the borrow checker really can be a pain for quick prototyping, and proc macros are a terrible hack that needs to be used a lot
Lucas Beyer (bl16) @giffmana ·
Redpill me on rust. What should I read to give it a reasonable chance? Knowing that I:
- 15y solid c++ experience up to and including c++14
- think c++ is a bad language
- think adding types to Python was a mistake
- don't foam at the mouth whenever I hear haskell
- am pragmatic…
Daniel Hesslow @DanielHesslow ·
@tri_dao Specdec improving throughput is a suuuper nice finding!
Daniel Hesslow @DanielHesslow ·
@Muennighoff @zach_nussbaum I was a bit unclear: information can leak in the forward pass from future tokens into previous ones. Here's an illustration where you can completely predict the next token through leakage in the routing. Obviously a very toy example: a single expert that chooses a single token
[attached image]
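A small toy sketch of the leakage being discussed (illustrative only, not the figure from the tweet): with expert-choice routing, an expert picks its top-k tokens over the whole sequence, so whether an earlier position gets the expert can change because of a later token.

    import torch

    def expert_choice_mask(scores, k):
        # scores: [seq_len] affinity of ONE expert for each token in the sequence.
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[torch.topk(scores, k).indices] = True
        return mask

    print(expert_choice_mask(torch.tensor([0.9, 0.2, 0.1, 0.0]), k=1))
    # -> position 0 gets the expert ...
    print(expert_choice_mask(torch.tensor([0.9, 0.2, 0.1, 1.5]), k=1))
    # -> ... unless a LATER token outbids it: position 0's routing (and hence its
    #    hidden state) now depends on a token that comes after it.
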
Niklas Muennighoff @Muennighoff ·
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments
📜 arxiv.org/abs/2409.02060 🧵1/9
[attached image]
Daniel Hesslow @DanielHesslow ·
@Muennighoff @zach_nussbaum (And assigning it to the last "hello" is just a matter of counting the number of previous hellos and having the routing be a function of that.) I guess maybe you can't actually leak that much information through this. Might just be too costly to be worth exploiting
Daniel Hesslow @DanielHesslow ·
@Muennighoff @zach_nussbaum Like you should be able to have one expert that says there are no more "hello"s and assign it to the last "hello" in the sequence or smth. But y'know, sometimes the optimization process is not strong enough to exploit every loophole. Cool finding tho!
Daniel Hesslow @DanielHesslow ·
@Muennighoff Interesting about EC vs TC, how do you do expert choice with a causal model?
Niklas Muennighoff @Muennighoff ·
OLMoE Experiments
1) Expert granularity, i.e. we use 64 small experts per layer
2) Dropless token choice beats expert choice routing
3) Shared experts worse for us
4) Sparse upcycling not helpful in our regime
Check the paper for more 🙂 🧵4/9
[attached image]
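For readers unfamiliar with the EC/TC shorthand in this exchange, a hedged sketch of the two routing variants (illustrative only; OLMoE's actual implementation is in their released code):

    import torch

    def token_choice(logits, k):
        # logits: [seq_len, n_experts]; the decision for position t uses only row t,
        # so it stays causal.
        return torch.topk(logits, k, dim=-1).indices         # [seq_len, k] experts per token

    def expert_choice(logits, capacity):
        # Each expert (column) picks its top-`capacity` tokens across the whole
        # sequence, which is why it is awkward for causal LMs (see the question above).
        return torch.topk(logits, capacity, dim=0).indices    # [capacity, n_experts] tokens per expert
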
Daniel Hesslow @DanielHesslow ·
This is a very nice direction from @PyTorch! Even when we need the highest possible performance, we can still use Torch as a first step and export the IR to external codebases with production guarantees around memory etc!
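The tweet doesn't name the exact API, but torch.export (PyTorch ≥ 2.1) is the usual way to capture a model's graph IR for consumption by an external codebase, so this is a minimal sketch under that assumption rather than the thread's own code:

    import torch
    from torch.export import export

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(16, 16)

        def forward(self, x):
            return torch.relu(self.fc(x))

    ep = export(MLP(), (torch.randn(2, 16),))
    print(ep.graph)   # captured ATen-level IR that an external runtime can consume
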
Daniel Hesslow @DanielHesslow ·
You can even get the compute graphs for training, which is super useful!
[attached image]
Daniel Hesslow @DanielHesslow ·
Want to know what actually goes on inside a PyTorch function? Found a new undocumented feature that shows it 👀
[attached image]