

Luke J. Huang
21 posts

@whatthelukh
physics + cs @MIT | prev @appliedcompute, US IPhO gold medalist



Two months ago, I vaguely posted a number: 0.9 FID, one-step, in pixel space. Now it's 0.75, and it can go even lower. Many have wondered how. I thought it might end up as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.
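What makes "optimize FID directly" plausible at all: FID is the Fréchet distance between Gaussian fits of encoder features, FD = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}), and every term in that formula is differentiable. Here is a minimal PyTorch sketch of the distance itself (my own illustration, not the FD-loss from the thread; the function names are mine, and the symmetrized square root is one standard way to keep gradients through `eigh` well behaved):

```python
import torch

def sqrtm_psd(A):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    # Differentiable in PyTorch (unlike scipy.linalg.sqrtm), though gradients
    # can be fragile near repeated eigenvalues.
    A = 0.5 * (A + A.T)  # symmetrize against numerical drift
    vals, vecs = torch.linalg.eigh(A)
    vals = vals.clamp(min=0)
    return vecs @ torch.diag(vals.sqrt()) @ vecs.T

def frechet_distance(feats_fake, feats_real):
    # feats_*: (N, D) feature activations from some fixed encoder.
    mu1, mu2 = feats_fake.mean(0), feats_real.mean(0)
    c1 = torch.cov(feats_fake.T)
    c2 = torch.cov(feats_real.T)
    s2_half = sqrtm_psd(c2)
    # Tr((S1 S2)^{1/2}) == Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) for PSD S1, S2,
    # and the inner matrix is symmetric PSD, so eigh applies.
    covmean = sqrtm_psd(s2_half @ c1 @ s2_half)
    return (mu1 - mu2).pow(2).sum() + torch.trace(c1 + c2 - 2 * covmean)
```

Whether a batch-level estimate of this quantity trains well as a loss is exactly the question the thread is about; the sketch only shows that the metric itself admits gradients.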






One of the failure modes in async RL is that stale rollouts make the importance weights highly uneven, so the effective sample size (ESS) collapses and a few bad trajectories start dominating the update. I like the algorithmic solution in this paper (it also preserves the essence of GRPO): it doesn't overreact by inventing a totally new RL paradigm. It identifies ESS collapse and variance blow-up as the real failure signature, then builds a targeted control mechanism around them.
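To make the collapse concrete, here's a toy NumPy illustration (mine, not from the paper) of Kish's effective sample size: near-uniform importance weights give ESS ≈ N, while staler rollouts spread the log importance ratios and a handful of trajectories soak up all the mass:

```python
import numpy as np

def effective_sample_size(logp_new, logp_old):
    # Normalized importance weights from per-trajectory log-probs under
    # the current policy vs. the (stale) behavior policy.
    w = np.exp(logp_new - logp_old)
    w = w / w.sum()
    # Kish ESS: equals N for uniform weights, collapses toward 1
    # when a few trajectories dominate.
    return 1.0 / np.sum(w ** 2)

# Model staleness as widening spread in the log importance ratios.
rng = np.random.default_rng(0)
for spread in [0.01, 0.5, 2.0]:
    log_ratio = rng.normal(0.0, spread, size=1024)
    print(f"spread={spread}: ESS ~ {effective_sample_size(log_ratio, np.zeros(1024)):.0f}")
# spread=0.01 -> ESS ~ 1024 (fresh), spread=2.0 -> ESS ~ tens (collapsed)
```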

We introduce Variance Controlled Policy Optimization (VCPO), a method that adds explicit variance-targeted controls to policy-gradient objectives in off-policy RL, enabling stable, scalable Async RL training.
✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
🚀 2.5x faster Async RL training while matching Synchronous RL performance
🧠 Robust training stability under heavily off-policy settings (at least 128 steps off-policy)
📄 Paper: arxiv.org/abs/2602.17616
🔗 Code: github.com/mit-han-lab/vc…
🧵👇
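For intuition only, one way a variance-targeted control *could* look mechanically is tempering the importance weights until a Kish-ESS target is met. This is a hypothetical sketch of mine, not VCPO's actual objective (see the paper for that):

```python
import numpy as np

def kish_ess(w):
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def control_weights(log_ratio, ess_target_frac, iters=30):
    # Tempered importance weights w = exp(alpha * log_ratio); binary-search
    # alpha in [0, 1] so the Kish ESS stays above a target fraction of N.
    # alpha = 1 recovers plain importance sampling; alpha = 0 is uniform.
    n = len(log_ratio)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        w = np.exp(alpha * (log_ratio - log_ratio.max()))  # stable exp
        if kish_ess(w) < ess_target_frac * n:
            hi = alpha  # variance too high at this temperature; temper harder
        else:
            lo = alpha
    return np.exp(lo * (log_ratio - log_ratio.max()))
```

The appeal of framing it as a variance/ESS constraint rather than per-sample clipping is that the knob directly targets the failure signature instead of a proxy.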

Very impressive work, Luke! Contributions like these push the boundary of high-performance RL, especially in the open-source world, where asynchrony lags behind the labs (no pun intended).







