az

9 posts

@probablynotaz9

@stanford

Stanford, CA · Joined April 2024
127 Following · 93 Followers

az (@probablynotaz9)
@sheriyuo Would love to chat at ICML :)

az (@probablynotaz9)
Check out the full paper for more details, like variance reduction with entropy importance sampling and GPU memory optimizations. Shoutout to @jiaqihan99, @aaron_lou, and @michaelyli__ for valuable feedback during the early stages, @therealgabeguo and @StefanoErmon for supporting this project throughout, as well as @modal for sponsoring compute! 6/n

az (@probablynotaz9)
Empirically, with only k=24 MC samples, AGRPO surpasses every comparable ELBO-based RL method across four reasoning tasks: GSM8K, MATH, Countdown, and Sudoku. These gains persist even for different context lengths and # of denoising steps (m) than what the model was trained on. A neat dLLM-specific result is that post-training completely changes the inference speed/quality frontier: AGRPO lets you achieve the same quality as the base LLaDA model with 4x fewer steps. In the real world, if you're serving a model to users, this would let you drastically cut inference costs by amortizing that cost into training. 5/n

az (@probablynotaz9)
🚨 Solo-author ICML paper alert 🤫 Ever wanted to post-train your diffusion LLM with good old policy gradients, without having to deal with ELBOs or surrogates? In Simple Policy Gradients for Reasoning with Diffusion Language Models, we show how to make this tractable in a straightforward way. Our framework, Amortized GRPO (AGRPO), lets the model learn from unbiased PG updates via timestep estimation, naturally aligning with dLLM inference while remaining efficient + scalable. Paper: arxiv.org/abs/2510.04019 Code: github.com/probablyabot/a… 1/n
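
The thread does not spell out the estimator, so the following is only a generic toy sketch of the idea behind estimating a per-timestep sum from a few Monte Carlo samples, not the paper's AGRPO implementation. It assumes the quantity of interest is a plain sum of per-step terms over m denoising steps and that timesteps are sampled uniformly (the paper mentions entropy importance sampling instead). The names `full_sum` and `mc_timestep_estimate`, the value m = 128, and the Gaussian stand-in terms are all hypothetical.

```python
import random


def full_sum(per_step_terms):
    """Exact quantity: the sum of a per-timestep term over all m steps."""
    return sum(per_step_terms)


def mc_timestep_estimate(per_step_terms, k, rng):
    """Estimate the full sum from k timesteps sampled uniformly with replacement.

    Scaling the sample mean by m makes the expectation equal the full sum,
    so the estimator is unbiased (assumption: uniform sampling).
    """
    m = len(per_step_terms)
    sampled = [per_step_terms[rng.randrange(m)] for _ in range(k)]
    return m * sum(sampled) / k


rng = random.Random(0)
# Toy stand-ins for per-step terms; real terms would come from the model.
terms = [rng.gauss(0.0, 1.0) for _ in range(128)]
exact = full_sum(terms)

# Averaging many independent k=24 estimates should approach the exact sum.
estimates = [mc_timestep_estimate(terms, k=24, rng=rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
print(f"exact={exact:.3f}  mean of estimates={mean_estimate:.3f}")
```

Each individual estimate is noisy. Only the average over many estimates matches the exact sum, which is why variance reduction matters in practice.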

az retweeted
Michael Y. Li (@michaelyli_)
Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

az retweeted
Nathan Lambert (@natolambert)
I want there to be a nanoGPT style speedrunning setup for RL.