az (@probablynotaz9) - Twitter Profile

az@probablynotaz9·13 May

@sheriyuo Would love to chat at ICML :)

0

1

75

Xiuyu Li@sheriyuo·13 May

Nice dude🤣🤣 Exchanging the latest research ideas and progress on X is honestly pretty fascinating haha. Always happy to chat more! People have mentioned me in discussions before too. Feel free to come say hi at ICML (even though I’m socially awkward in real life😭)

az@probablynotaz9

🚨 Solo-author ICML paper alert 🤫 Ever wanted to post-train your diffusion LLM with good old policy gradients, without having to deal with ELBOs or surrogates? In Simple Policy Gradients for Reasoning with Diffusion Language Models, we show how to make this tractable in a straightforward way. Our framework, Amortized GRPO (AGRPO), lets the model learn from unbiased PG updates via timestep estimation, naturally aligning with dLLM inference while remaining efficient + scalable. Paper: arxiv.org/abs/2510.04019 Code: github.com/probablyabot/a… 1/n

1

0

26

2.2K

az@probablynotaz9·10 May

Check out the full paper for more details, like variance reduction with entropy importance sampling and GPU memory optimizations. Shoutout to @jiaqihan99, @aaron_lou, and @michaelyli__ for valuable feedback during the early stages, @therealgabeguo and @StefanoErmon for supporting this project throughout, as well as @modal for sponsoring compute! 6/n

0

9

500

az@probablynotaz9·10 May

Empirically, with only k=24 MC samples, AGRPO surpasses every comparable ELBO-based RL method across four reasoning tasks: GSM8K, MATH, Countdown, and Sudoku. These gains persist even for different context lengths and # of denoising steps (m) than what the model was trained on. A neat dLLM-specific result is that post-training completely changes the inference speed/quality frontier: AGRPO lets you achieve the same quality as the base LLaDA model with 4x fewer steps. In the real world, if you're serving a model to users, this would let you drastically cut inference costs by amortizing that cost into training. 5/n

1

4

693

az@probablynotaz9·10 May

🚨 Solo-author ICML paper alert 🤫 Ever wanted to post-train your diffusion LLM with good old policy gradients, without having to deal with ELBOs or surrogates? In Simple Policy Gradients for Reasoning with Diffusion Language Models, we show how to make this tractable in a straightforward way. Our framework, Amortized GRPO (AGRPO), lets the model learn from unbiased PG updates via timestep estimation, naturally aligning with dLLM inference while remaining efficient + scalable. Paper: arxiv.org/abs/2510.04019 Code: github.com/probablyabot/a… 1/n

11

25

178

15.3K

az retweeted

Michael Y. Li@michaelyli_·22 Apr

Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

30

133

905

163K

az retweeted

Nathan Lambert@natolambert·6 Feb

I want there to be a nanoGPT style speedrunning setup for RL.

29

7

319

61.2K
