

Luke J. Huang
21 posts

@whatthelukh
physics + cs @MIT | prev @appliedcompute, US IPhO gold medalist



Two months ago, I vaguely posted a number: 0.9 FID, one-step, in pixel space. Now it's 0.75, and it can go even lower. Many have wondered how. I thought it might end up as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.
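What makes "optimize FID directly" plausible at all: FID is the Fréchet distance between Gaussian fits of encoder features, FD = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}), and every term in that formula is differentiable. Here is a minimal PyTorch sketch of the distance itself (my own illustration, not the FD-loss from the thread; the function names are mine, and the symmetrized square root is one standard way to keep gradients through `eigh` well behaved):

```python
import torch

def sqrtm_psd(A):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    # Differentiable in PyTorch (unlike scipy.linalg.sqrtm), though gradients
    # can be fragile near repeated eigenvalues.
    A = 0.5 * (A + A.T)  # symmetrize against numerical drift
    vals, vecs = torch.linalg.eigh(A)
    vals = vals.clamp(min=0)
    return vecs @ torch.diag(vals.sqrt()) @ vecs.T

def frechet_distance(feats_fake, feats_real):
    # feats_*: (N, D) feature activations from some fixed encoder.
    mu1, mu2 = feats_fake.mean(0), feats_real.mean(0)
    c1 = torch.cov(feats_fake.T)
    c2 = torch.cov(feats_real.T)
    s2_half = sqrtm_psd(c2)
    # Tr((S1 S2)^{1/2}) == Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) for PSD S1, S2,
    # and the inner matrix is symmetric PSD, so eigh applies.
    covmean = sqrtm_psd(s2_half @ c1 @ s2_half)
    return (mu1 - mu2).pow(2).sum() + torch.trace(c1 + c2 - 2 * covmean)
```

Whether a batch-level estimate of this quantity trains well as a loss is exactly the question the thread is about; the sketch only shows that the metric itself admits gradients.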






One of the failure modes in async RL is that stale rollouts make the importance weights highly uneven, so the effective sample size (ESS) collapses and a few bad trajectories start dominating the update. I like the algorithmic solution in this paper (it also preserves the essence of GRPO): it doesn't overreact by inventing a totally new RL paradigm. It identifies ESS collapse and variance blow-up as the real failure signature, then builds a targeted control mechanism around them.
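To make the collapse concrete, here's a toy NumPy illustration (mine, not from the paper) of Kish's effective sample size: near-uniform importance weights give ESS ≈ N, while staler rollouts spread the log importance ratios and a handful of trajectories soak up all the mass:

```python
import numpy as np

def effective_sample_size(logp_new, logp_old):
    # Normalized importance weights from per-trajectory log-probs under
    # the current policy vs. the (stale) behavior policy.
    w = np.exp(logp_new - logp_old)
    w = w / w.sum()
    # Kish ESS: equals N for uniform weights, collapses toward 1
    # when a few trajectories dominate.
    return 1.0 / np.sum(w ** 2)

# Model staleness as widening spread in the log importance ratios.
rng = np.random.default_rng(0)
for spread in [0.01, 0.5, 2.0]:
    log_ratio = rng.normal(0.0, spread, size=1024)
    print(f"spread={spread}: ESS ~ {effective_sample_size(log_ratio, np.zeros(1024)):.0f}")
# spread=0.01 -> ESS ~ 1024 (fresh), spread=2.0 -> ESS ~ tens (collapsed)
```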

We introduce Variance Controlled Policy Optimization (VCPO), a method that adds explicit variance-targeted controls to policy-gradient objectives in off-policy RL, enabling stable, scalable Async RL training.
✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
🚀 2.5x faster Async RL training while matching Synchronous RL performance
🧠 Robust training stability under heavily off-policy settings (at least 128 steps off-policy)
📄 Paper: arxiv.org/abs/2602.17616
🔗 Code: github.com/mit-han-lab/vc…
🧵👇
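For intuition only, one way a variance-targeted control *could* look mechanically is tempering the importance weights until a Kish-ESS target is met. This is a hypothetical sketch of mine, not VCPO's actual objective (see the paper for that):

```python
import numpy as np

def kish_ess(w):
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def control_weights(log_ratio, ess_target_frac, iters=30):
    # Tempered importance weights w = exp(alpha * log_ratio); binary-search
    # alpha in [0, 1] so the Kish ESS stays above a target fraction of N.
    # alpha = 1 recovers plain importance sampling; alpha = 0 is uniform.
    n = len(log_ratio)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        w = np.exp(alpha * (log_ratio - log_ratio.max()))  # stable exp
        if kish_ess(w) < ess_target_frac * n:
            hi = alpha  # variance too high at this temperature; temper harder
        else:
            lo = alpha
    return np.exp(lo * (log_ratio - log_ratio.max()))
```

The appeal of framing it as a variance/ESS constraint rather than per-sample clipping is that the knob directly targets the failure signature instead of a proxy.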

Very impressive work, Luke! Contributions like these push the boundary of high-performance RL, especially in the open-source world, where asynchrony lags behind the labs (no pun intended).







