Michael Y. Li (@michaelyli_) - Twitter Profile

Pinned Tweet

Michael Y. Li @michaelyli_ · 22 Apr

Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

Michael Y. Li retweeted

Xavier Gonzalez @xavierjgonzalez · 4d

Fixed-point iteration for parallelizing nonlinear dynamics is all the rage:
- Newton for RNNs
- Picard for diffusion models
- Jacobi for parallel decoding of LLMs

But how do these techniques relate, and when should you use them? We show you how in our new paper 🧵
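
For readers new to the idea, here is a minimal sketch of a Jacobi-style fixed-point iteration on a toy scalar recurrence. It is not taken from the paper: the update function `f`, the zero initial guess, and the tolerance are all assumptions chosen for illustration. Each sweep updates every timestep from the previous sweep's guess, so the per-timestep work inside a sweep is independent.

```python
import math

def f(x_prev, u):
    """Hypothetical nonlinear state update: one step of a toy scalar RNN."""
    return math.tanh(0.5 * x_prev + u)

def sequential_rollout(x0, inputs):
    """Reference: evaluate the recurrence one step at a time."""
    states, x = [], x0
    for u in inputs:
        x = f(x, u)
        states.append(x)
    return states

def jacobi_rollout(x0, inputs, tol=1e-12):
    """Fixed-point iteration over the whole trajectory.

    Every sweep recomputes all T states from the previous sweep's guess, so
    the T updates inside one sweep do not depend on each other. After sweep k
    the first k states are exact, so at most T sweeps are needed.
    """
    T = len(inputs)
    guess = [0.0] * T  # assumed initial guess for the whole trajectory
    for sweep in range(1, T + 1):
        prev = [x0] + guess[:-1]  # state feeding each timestep
        new = [f(p, u) for p, u in zip(prev, inputs)]
        if max(abs(a - b) for a, b in zip(new, guess)) < tol:
            return new, sweep  # trajectory stopped changing
        guess = new
    return guess, T

inputs = [0.1, -0.3, 0.7, 0.2, -0.5, 0.4, 0.0, 0.9]
ref = sequential_rollout(0.0, inputs)
par, sweeps = jacobi_rollout(0.0, inputs)
```

Here `par` matches `ref` up to floating-point tolerance. The trade-off is more total work (up to `T` sweeps of `T` updates) in exchange for updates that can be batched across time.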

Michael Y. Li retweeted

Michael Hu @michahu8 · 5d

What is the right data mix, and how do we find it as the data keeps changing? This is a core, unsolved problem in continual learning. To tackle it, we built a data mixing algo that works everywhere: pretraining, midtraining, and instruction tuning. Introducing: On-Policy Mix 🧵 1/6

Michael Y. Li @michaelyli_ · 13 May

Check out this awesome demo from Omar!

Omar Shaikh @oshaikh13

We upgraded Tabracadabra 🎉 to bring an entire context-aware assistant (not just tab to autocomplete!) to any textbox. It's pretty great if you hate switching between the chat interface and what you're working on. We're also open-sourcing it, so you can try it out! 🧵

Michael Y. Li retweeted

Gordon Wetzstein @GordonWetzstein · 12 May

AlphaFold-based models like Boltz-2 and BioEmu train on atomic conformational structures in order to predict protein dynamics. But is it possible to train these models directly on cryo-EM map ensembles, harnessing conformational data that is typically not deposited in the PDB? Introducing CryoSampler: a new approach for fine-tuning Boltz-2 with raw supervision on cryo-EM map ensembles. 1/6🧵

Michael Y. Li @michaelyli_ · 10 May

@probablynotaz9 Awesome work, Anthony! Turns out defining the MDP carefully is important!

Michael Y. Li retweeted

az @probablynotaz9 · 10 May

🚨 Solo-author ICML paper alert 🤫 Ever wanted to post-train your diffusion LLM with good old policy gradients, without having to deal with ELBOs or surrogates? In Simple Policy Gradients for Reasoning with Diffusion Language Models, we show how to make this tractable in a straightforward way. Our framework, Amortized GRPO (AGRPO), lets the model learn from unbiased PG updates via timestep estimation, naturally aligning with dLLM inference while remaining efficient + scalable. Paper: arxiv.org/abs/2510.04019 Code: github.com/probablyabot/a… 1/n
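
As background on the GRPO part of the name, standard GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that group-relative advantage step. It is not AGRPO's timestep estimation, which the tweet does not describe. The use of the population standard deviation and the `eps` stabilizer are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small `eps` so that a
    group with identical rewards yields zero advantages instead of dividing
    by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical outcome rewards for four sampled completions (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate learned value function is needed for the baseline.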

Michael Y. Li @michaelyli_ · 24 Apr

@Infopulsed Thank you! And yes, we're super excited about the future directions.

EDITH @Infopulsed · 23 Apr

@michaelyli__ Very bullish on this line of work. It's really incredible.

Michael Y. Li retweeted

Luke Bailey @LukeBailey181 · 23 Apr

Self-play led to superhuman Go performance, so why hasn't it for LLMs? In practice, long-run self-play plateaus like RL. We study why this happens, and build a self-play algorithm that scales better. It solves as many problems with a 7B model as the pass@4 of a model 100x bigger.

Christopher Manning @chrmanning · 23 Apr

I think more is still needed for a good neural memory, but, nevertheless, this is a pretty cool step one!

Michael Y. Li @michaelyli_

Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

Michael Y. Li retweeted

Chelsea Finn @chelseabfinn · 23 Apr

RL fine-tuning often prematurely collapses LLM entropy. Poly-EPO is a scalable set-RL algorithm that optimizes for a set of accurate solutions with diverse reasoning strategies. Paper: arxiv.org/abs/2604.17654

Ifdita Hasan @ifdita_hasan

Deploying language models in scientific discovery domains requires extraordinary amounts of test-time compute for search algorithms. An ideal training algorithm should be designed with this goal in mind - that we want agents to learn how to not only exploit but also optimistically explore novel strategies. The agent should learn how to synergistically explore and exploit. We propose Poly-EPO, a set RL algorithm that explores and discovers diverse reasoning paths. Work with @jubayer_hamid (co-lead), Shreya, @ShirleyYXWu, @HengyuanH, @noahdgoodman, @DorsaSadigh, and @chelseabfinn.

Michael Y. Li @michaelyli_ · 23 Apr

@FrancoisChauba1 Thanks! And for sure!

Francois Chaubard @FrancoisChauba1 · 23 Apr

Nice, @michaelyli__! Congrats!! Now let's do it with zero-order e2e!

Michael Y. Li @michaelyli_ · 23 Apr

@samchenn_ Not eliminate entirely: use them judiciously and remove them when appropriate? Also, re our discussion earlier, I'm bullish on them more broadly for scientific applications!

Samuel Chen @samchenn_ · 23 Apr

@michaelyli__ This guy's big on eliminating inductive biases 🔥

Michael Y. Li @michaelyli_ · 23 Apr

There's a pretty easy way to relax this design choice. You can introduce some W_e's that map the hidden states to "e"s, instead of using the q's to perform scoring, and you can initialize the W_e's from the W_q's. More broadly, we think there's a lot to explore in the design space of how to richly parameterize this scoring mechanism.
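
A rough sketch of that idea for a single head with toy numbers. Everything here beyond "a separate W_e, initialized from W_q, produces the scoring vector" is an assumption made for illustration: the dot-product scoring against cached keys, the top-`budget` keep rule, and the helper names `scores` and `keep_top` are hypothetical and not from the paper.

```python
import copy

def matvec(W, h):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scores(W, hidden, cached_keys):
    """Score each cached key against a projection of the current hidden state."""
    probe = matvec(W, hidden)
    return [dot(probe, k) for k in cached_keys]

def keep_top(score_list, budget):
    """Indices of the `budget` highest-scoring cache entries, in cache order."""
    ranked = sorted(range(len(score_list)), key=lambda i: score_list[i], reverse=True)
    return sorted(ranked[:budget])

W_q = [[1.0, 0.0, 0.5], [0.0, 2.0, -1.0]]  # toy query projection (2 x 3)
W_e = copy.deepcopy(W_q)                    # separate scoring projection, initialized from W_q

hidden = [0.2, -0.1, 0.4]
cached_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, 0.5]]

attn_scores = scores(W_q, hidden, cached_keys)   # what attention would compute
evict_scores = scores(W_e, hidden, cached_keys)  # identical at initialization
kept = keep_top(evict_scores, budget=2)
```

At initialization the two score lists coincide, so behavior starts from the attention-based scores. Because `W_e` is a separate copy, later training could change the eviction scores without touching the query projection.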

AiDevCraft @AiDevCraft · 23 Apr

Using the LM's own attention as the eviction score is the clever piece — you repurpose a signal pre-training already built, instead of bolting on a new head. The question is whether the RL signal eventually warps those scores away from their reasoning-time function, creating a tension between "what to attend to" and "what to keep."

Michael Y. Li @michaelyli_ · 23 Apr

@chrmanning Thanks Chris! And totally agree, we're excited about a bunch of followup directions!

Michael Y. Li @michaelyli_ · 23 Apr

@xavierjgonzalez Spot-on summary, Xavi. Thanks for your interest in our work!

Xavier Gonzalez @xavierjgonzalez · 22 Apr

Amazing paper. @michaelyli__ casts clearing the context (the most important thing you can do to help your agents do well on long tasks) as a form of reasoning. This reasoning can be learned via RL. No messy heuristics. Just the magic of end-to-end learning.

Michael Y. Li @michaelyli_

Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

Michael Y. Li @michaelyli_ · 23 Apr

@singhh5050 Thanks Harsh!
