Michael Y. Li

68 posts

@michaelyli_

CS PhD @StanfordAILab @StanfordNLP @Stanford advised by @noahdgoodman and Emily Fox. Prev: undergrad @princeton

Joined March 2024
342 Following · 611 Followers
Pinned Tweet
Michael Y. Li
Michael Y. Li@michaelyli_·
Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason?

Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches.

🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language.

New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!
30
133
905
163K
Michael Y. Li retweeted
Xavier Gonzalez
Xavier Gonzalez@xavierjgonzalez·
Fixed-point iterations for parallelizing nonlinear dynamics are all the rage:
- Newton for RNNs
- Picard for diffusion models
- Jacobi for parallel decode of LLMs
But how do these techniques relate, and when should you use them? We show you how in our new paper 🧵
6
27
169
19.8K
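For readers unfamiliar with the idea in the tweet above, here is a minimal sketch of a Jacobi-style fixed-point iteration applied to a nonlinear recurrence. It is not taken from the paper: the recurrence `step`, the function names, and the zero initial guess are all illustrative assumptions. Each sweep updates every time step at once from the previous guess, so the work inside a sweep is independent across positions.

```python
import numpy as np

def step(h_prev, x):
    # Hypothetical nonlinear recurrence: h_t = tanh(0.5 * h_{t-1} + x_t)
    return np.tanh(0.5 * h_prev + x)

def sequential_rollout(h0, xs):
    # Baseline: evaluate the recurrence one step at a time.
    hs = []
    h = h0
    for x in xs:
        h = step(h, x)
        hs.append(h)
    return np.array(hs)

def jacobi_rollout(h0, xs, num_iters):
    # Start from an arbitrary guess for the whole trajectory.
    hs = np.zeros(len(xs))
    for _ in range(num_iters):
        # Predecessor of each position, taken from the current guess.
        prev = np.concatenate(([h0], hs[:-1]))
        # All positions are updated together; no position waits on another.
        hs = step(prev, xs)
    return hs
```

After k sweeps the first k states match the sequential rollout, so len(xs) sweeps always recover the full trajectory; any speedup depends on the guess converging in far fewer sweeps than that.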
Michael Y. Li retweeted
Michael Hu
Michael Hu@michahu8·
What is the right data mix, and how do we find it as the data keeps changing? This is a core, unsolved problem in continual learning. To tackle it, we built a data mixing algo that works everywhere — pretraining, midtraining, instruction tuning Introducing: On-Policy Mix 🧵1/6
6
55
312
45.2K
Michael Y. Li retweeted
Gordon Wetzstein
Gordon Wetzstein@GordonWetzstein·
AlphaFold-based models like Boltz-2 and BioEmu train on atomic conformational structures in order to predict protein dynamics. But is it possible to train these models directly on cryo-EM map ensembles, harnessing conformational data that is typically not deposited in the PDB? Introducing CryoSampler: a new approach for fine-tuning Boltz-2 with raw supervision on cryo-EM map ensembles. 1/6🧵
2
13
31
5.5K
Michael Y. Li retweeted
az
az@probablynotaz9·
🚨 Solo-author ICML paper alert 🤫 Ever wanted to post-train your diffusion LLM with good old policy gradients, without having to deal with ELBOs or surrogates? In Simple Policy Gradients for Reasoning with Diffusion Language Models, we show how to make this tractable in a straightforward way. Our framework, Amortized GRPO (AGRPO), lets the model learn from unbiased PG updates via timestep estimation, naturally aligning with dLLM inference while remaining efficient + scalable. Paper: arxiv.org/abs/2510.04019 Code: github.com/probablyabot/a… 1/n
11
25
178
15.3K
Michael Y. Li
Michael Y. Li@michaelyli_·
@Infopulsed Thank you! And yes, we're super excited about the future directions.
0
0
0
135
EDITH
EDITH@Infopulsed·
@michaelyli__ very bullish on this line of work.... it's really incredible
1
0
2
157
Michael Y. Li retweeted
Luke Bailey
Luke Bailey@LukeBailey181·
Self-play led to superhuman Go performance, why hasn’t it for LLMs? In practice, long run self-play plateaus like RL. We study why this happens, and build a self-play algorithm that scales better. It solves as many problems with a 7B model as the pass@4 of a model 100x bigger.
29
149
1K
141.9K
Michael Y. Li retweeted
Michael Y. Li
Michael Y. Li@michaelyli_·
@samchenn_ Not eliminate entirely; use them judiciously and remove them when appropriate? Also, re our discussion earlier, bullish on them more broadly for scientific applications!
0
0
1
313
Michael Y. Li
Michael Y. Li@michaelyli_·
There's a pretty easy way to relax this design choice. You can introduce some W_e's that map the hidden states to "e"s instead of using the q's to perform scoring. And you can initialize the W_es from the W_qs. More broadly, we think there's a lot to explore in the design space of how to richly parameterize this scoring mechanism.
1
0
2
824
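The relaxation described in the tweet above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes the eviction score is a scaled dot product between a scoring vector and the cached keys, and the shapes, the names `eviction_scores` and `keep_top_k`, and the random weights are all hypothetical. The only parts taken from the tweet are that a separate `W_e` maps hidden states to "e"s used for scoring, and that `W_e` is initialized from `W_q`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, num_cached = 16, 8, 6

W_q = rng.normal(size=(d_model, d_head))  # existing query projection
W_k = rng.normal(size=(d_model, d_head))  # existing key projection
W_e = W_q.copy()  # separate scoring projection, initialized from W_q

H = rng.normal(size=(num_cached, d_model))  # hidden states of cached tokens
h_now = rng.normal(size=(d_model,))         # hidden state at the current step

def eviction_scores(h, H_cache, W_proj, W_key):
    # Assumed scoring rule: scaled dot product between the projected
    # current hidden state and each cached key.
    keys = H_cache @ W_key
    e = h @ W_proj
    return keys @ e / np.sqrt(keys.shape[-1])

def keep_top_k(scores, k):
    # Indices of the k highest-scoring cache entries (k >= 1),
    # returned in their original cache order.
    return np.sort(np.argsort(scores)[-k:])

scores_q = eviction_scores(h_now, H, W_q, W_k)  # scoring with the q's
scores_e = eviction_scores(h_now, H, W_e, W_k)  # scoring with the e's
kept = keep_top_k(scores_e, k=3)
```

At initialization the two score vectors coincide, so scoring starts out identical to the query-based version; because `W_e` is a copy, later updates to it would leave `W_q` and the attention computation unchanged.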
AiDevCraft
AiDevCraft@AiDevCraft·
Using the LM's own attention as the eviction score is the clever piece — you repurpose a signal pre-training already built, instead of bolting on a new head. The question is whether the RL signal eventually warps those scores away from their reasoning-time function, creating a tension between "what to attend to" and "what to keep."
1
0
2
937
Michael Y. Li
Michael Y. Li@michaelyli_·
@chrmanning Thanks Chris! And totally agree, we're excited about a bunch of followup directions!
0
0
0
283
Fabian Franz
Fabian Franz@fabianfranz·
@michaelyli__ Thanks for giving AI agency and working WITH the models. Very nice!
1
0
1
579
Aditya Cowsik
Aditya Cowsik@AdityaCowsik·
@michaelyli__ This really shows how much engineering you need to fully accept the bitter lesson!
1
0
7
698