Shang Yang

46 posts

@Shang_mit

Ph.D. Student @MITEECS. Undergrad @Tsinghua_Uni.

Cambridge, MA, US · Joined October 2023
389 Following · 390 Followers
Pinned Tweet
Shang Yang@Shang_mit·
🥳 Thrilled to see our work on TLT featured on the MIT homepage (mit.edu) and the cover of MIT News today! 🏛️✨ 🚀 2x faster RL training without losing accuracy. News: news.mit.edu/2026/new-metho… Paper: arxiv.org/abs/2511.16665 Code: github.com/mit-han-lab/fa…
Shang Yang@Shang_mit

🚀 Introducing TLT (Taming the Long-Tail), an efficient, lossless system that boosts reasoning RL training by mitigating the rollout bottleneck! 🏆 Accepted by #ASPLOS2026 ✨ What’s new? ⚡ Enjoy 2× faster end-to-end reasoning RL training 🔒 Lossless on-policy RL — training quality preserved theoretically and empirically 🎁 Get a free, high-quality draft model for efficient deployment 🔗 Github: github.com/mit-han-lab/fa… 🔗 Paper: arxiv.org/pdf/2511.16665 👇 More in the thread (1/7) #ASPLOS #EfficientAI #OpenSource #LLMs #Reasoning #ReinforcementLearning

1 reply · 21 reposts · 79 likes · 13.6K views
Shang Yang reposted
Song Han@songhan_mit·
Honored to join the MIT President’s podcast. We discussed how Efficient AI and model compression techniques can help break the bottlenecks of AI computing: m.youtube.com/watch?v=XMphi6…
1 reply · 3 reposts · 32 likes · 7.6K views
Shang Yang reposted
Luke J. Huang@whatthelukh·
We introduce Variance Controlled Policy Optimization (VCPO), a method that adds explicit variance-targeted controls to policy-gradient objectives in off-policy RL, enabling stable, scalable async RL training. ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO 🚀 2.5x faster async RL training while matching synchronous RL performance 🧠 Robust training stability under heavily off-policy settings (at least 128 steps off-policy) 📄 Paper: arxiv.org/abs/2602.17616 🔗 Code: github.com/mit-han-lab/vc… 🧵👇
3 replies · 12 reposts · 68 likes · 11.6K views
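The "variance-targeted control" idea in the VCPO announcement above can be made concrete with a toy: shrink an importance-weight cap until the empirical variance of the off-policy objective falls below a target. This is only a sketch of the concept; VCPO's actual control law, function names, and thresholds are my assumptions, not taken from the paper.

```python
import math

def capped_offpolicy_mean(rewards, logp_new, logp_old, var_target, cap=16.0):
    """Toy variance control: halve the importance-weight cap until the
    empirical variance of the per-sample objective drops below var_target
    (or the cap bottoms out at 1)."""
    while True:
        terms = [min(math.exp(n - o), cap) * r
                 for r, n, o in zip(rewards, logp_new, logp_old)]
        mean = sum(terms) / len(terms)
        var = sum((t - mean) ** 2 for t in terms) / len(terms)
        if var <= var_target or cap <= 1.0:
            return mean, cap
        cap = max(1.0, cap * 0.5)

# One stale sample (log-ratio 3, weight ~20) dominates the variance,
# so the cap gets driven all the way down.
print(capped_offpolicy_mean([1, 1, -1, 1], [0, 0, 0, 3], [0, 0, 0, 0], 0.5))
```

The real method derives the control analytically rather than searching a cap, but the effect is the same: the estimator trades a little bias for a bounded, targeted variance, which is what keeps heavily off-policy training stable.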
Shang Yang reposted
Zhijian Liu@zhijianliu_·
Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬 ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning accuracy with minimal overhead — so 4-bit models stay strong at reasoning. 💪💪
31 replies · 143 reposts · 1.4K likes · 170.3K views
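A toy illustration of why the rotation pairs mentioned above help low-bit quantization (this is not ParoQuant's kernel or its rotation selection, just the underlying intuition): spreading a per-channel outlier across a pair of channels with a Givens rotation shrinks the shared quantization scale, which cuts the error on all the other channels.

```python
import math

def quant_error(v, bits=4):
    """Squared error of symmetric uniform quantization with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / qmax
    deq = [round(x / scale) * scale for x in v]
    return sum((a - b) ** 2 for a, b in zip(v, deq))

def rotate_pair(v, i, j, theta):
    """Apply a Givens rotation to channels i and j."""
    out = list(v)
    c, s = math.cos(theta), math.sin(theta)
    out[i], out[j] = c * v[i] - s * v[j], s * v[i] + c * v[j]
    return out

v = [10.0, 0.1, 0.6, 0.65]              # channel 0 is an outlier
w = rotate_pair(v, 0, 1, math.pi / 4)   # spread it across channels 0 and 1
print(quant_error(v), quant_error(w))   # rotated error is smaller
```

Because the rotation is orthogonal, it can be undone (or fused into adjacent weights) without changing the network's function, which is why keeping only the critical pairs can recover accuracy with minimal overhead.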
Shang Yang reposted
Hao Kang@GT_HaoKang·
🔥 Modify 2 lines of code and get your agentic serving/rollout up to 3.9x faster, losslessly! ⚡️ Say hello to ThunderAgent, a fast, simple, and program-aware agentic inference system. 🥇 We propose a program abstraction to schedule all GPU and CPU resources, the first principled approach to distributed agentic inference and rollout. 🌐 Blog: thunderagent.ai 💻 Code: github.com/ThunderAgent-o… 📜 Paper: arxiv.org/pdf/2602.13692 #AI #ThunderAgent #LLMAgent #MLSys 1/n
3 replies · 24 reposts · 107 likes · 29.1K views
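The gain from "program-aware" scheduling can be sketched with a two-resource toy (the program shapes, the round-robin policy, and the numbers below are my assumptions, not ThunderAgent's design): once the scheduler sees each agent request as a sequence of GPU and CPU phases, it can overlap one request's tool call (CPU) with another's decode (GPU).

```python
def serial_makespan(programs):
    """Run every phase of every program back to back on one timeline."""
    return sum(t for prog in programs for _, t in prog)

def overlapped_makespan(programs):
    """Toy program-aware schedule: GPU and CPU each run one phase at a
    time, visiting programs round-robin by phase index."""
    free = {"gpu": 0, "cpu": 0}   # when each resource next becomes idle
    ready = [0] * len(programs)   # when each program's next phase may start
    makespan = 0
    for phase in range(max(len(p) for p in programs)):
        for i, prog in enumerate(programs):
            if phase < len(prog):
                res, t = prog[phase]
                start = max(ready[i], free[res])
                free[res] = ready[i] = start + t
                makespan = max(makespan, start + t)
    return makespan

# Two agent requests: decode (GPU), tool call (CPU), decode (GPU).
reqs = [[("gpu", 2), ("cpu", 3), ("gpu", 2)]] * 2
print(serial_makespan(reqs), overlapped_makespan(reqs))
```

Even this toy overlaps the second request's decode with the first one's tool call; with many concurrent requests and real phase durations, that overlap is where the claimed speedup comes from.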
Shang Yang reposted
Zhijian Liu@zhijianliu_·
Holiday cooking finally ready to serve! 🥳 Introducing DFlash — speculative decoding with block diffusion. 🚀 6.2× lossless speedup on Qwen3-8B ⚡ 2.5× faster than EAGLE-3 Diffusion vs AR doesn’t have to be a fight. At today’s stage: • dLLMs = fast, highly parallel, but lossy • AR LLMs = accurate, sequential, but slow DFlash = diffusion drafts, AR verifies.
62 replies · 233 reposts · 1.8K likes · 217.6K views
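The draft-then-verify split behind DFlash-style speedups can be shown with a generic greedy speculative-decoding loop. This is a token-level toy; the block-diffusion drafter and the real batched acceptance rule are not modeled here. The point it demonstrates is "lossless": the output is identical to plain greedy decoding with the target model alone.

```python
def greedy_decode(next_token, prompt, n):
    """Plain autoregressive greedy decoding: n tokens, one at a time."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, k, n):
    """Draft proposes k tokens; target keeps the longest agreeing prefix,
    then emits its own token at the first disagreement."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        ctx, draft = list(out), []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        ctx = list(out)
        for t in draft:
            if target_next(ctx) != t:
                out.append(target_next(ctx))  # target's correction
                break
            out.append(t)
            ctx.append(t)
    return out[len(prompt):][:n]

# Toy deterministic "models": the draft agrees with the target except
# when the context length is a multiple of 4.
target = lambda ctx: (sum(ctx) + len(ctx)) % 5
draft = lambda ctx: target(ctx) if len(ctx) % 4 else (target(ctx) + 1) % 5
print(speculative_decode(target, draft, [1, 2, 3], k=4, n=12)
      == greedy_decode(target, [1, 2, 3], 12))
```

Every emitted token equals the target's choice at its own prefix, so correctness never depends on the draft; the draft only decides how many tokens the target can commit per verification pass.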
Shang Yang reposted
Jack Cook@jackcookjack·
Training LLMs with NVFP4 is hard because FP4 has so few values that I can fit them all in this post: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. But what if I told you that reducing this range even further could actually unlock better training + quantization performance? Introducing Four Over Six, a new method for improving the accuracy of NVFP4 quantization with Adaptive Block Scaling. 🧵
6 replies · 41 reposts · 255 likes · 69.3K views
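The post's point about FP4's tiny value set is easy to make concrete. Below is a plain nearest-value FP4 quantizer over exactly the grid listed in the tweet, with naive absmax block scaling as the baseline; Four Over Six's Adaptive Block Scaling is the paper's contribution and is not reproduced here, though the tweet hints it amounts to sometimes mapping the block maximum to a smaller grid value than 6.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * v for v in FP4_GRID for s in (-1.0, 1.0)})  # 15 values

def quantize_fp4(x, scale):
    """Round x/scale to the nearest representable FP4 value, then rescale."""
    return min(FP4_VALUES, key=lambda v: abs(v - x / scale)) * scale

def quantize_block(block):
    """Naive absmax scaling: map the block's largest magnitude onto 6.0."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0
    return [quantize_fp4(x, scale) for x in block]

print(quantize_block([0.1, -0.7, 2.3, 5.9]))
```

With only 15 distinct values, the choice of scale dominates the error of every element in the block, which is why per-block scale selection is worth optimizing at all.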
Changyu Chen@Cameron_Chann·
Makes a lot of sense to me, thank you! This was actually something I wanted to work on with a systems colleague back in early 2024, but we felt the ecosystem wasn't mature enough to support it, so we shelved the idea. Really excited to see it realized in TLT, and pushed even further with joint system optimization. It's amazing. Congrats!
1 reply · 0 reposts · 0 likes · 28 views
Shang Yang@Shang_mit·
Thank you, Changyu. The blog by Jiajun and Chenyang is excellent! The idea of applying speculative decoding to RL is gaining traction in the community, and our works approach the problem from complementary perspectives. For example, TLT updates the draft model by exploiting rollout bubbles, while the blog's approach updates the drafter jointly with the target model during the backward pass. We also introduce optimizations such as an adaptive rollout engine that auto-tunes SD configurations to handle the highly dynamic workloads of reasoning RL for better system efficiency. :)
1 reply · 0 reposts · 1 like · 104 views
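The "rollout bubbles" in the reply above come directly from the long-tail: in a synchronous batched rollout, slots whose sequences finish early sit idle until the longest sequence ends. A back-of-the-envelope sketch (my own illustration of the bottleneck, not TLT's scheduler):

```python
def bubble_steps(lengths):
    """Idle slot-steps in a batch that waits for its longest rollout."""
    horizon = max(lengths)
    return sum(horizon - l for l in lengths)

lengths = [512, 600, 2048, 700]        # one long-tail rollout dominates
idle = bubble_steps(lengths)
total = max(lengths) * len(lengths)
print(idle, total, round(idle / total, 2))
```

In this example more than half of the batch's slot-time is idle; using that otherwise-wasted capacity to update the draft model is how the speedup can come for free.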
Shang Yang@Shang_mit·
TLT delivers prominent end-to-end gains: 🔥 1.7×–2.1× training speedup across models at different scales 🔒 Lossless on-policy training — reward curves closely match the baseline A fast, reliable upgrade for reasoning LLM RL training! (7/7)
1 reply · 1 repost · 2 likes · 246 views
Shang Yang@Shang_mit·
Rollout workloads are dynamic. The best SD config may change over time — no single setting works well! 👉 TLT auto-tunes SD configurations, always selecting the optimal strategy with zero extra user burden. (6/7)
1 reply · 1 repost · 2 likes · 291 views
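Why the best SD config drifts over time: under a simple cost model, the optimal draft length depends on the current acceptance rate, which changes as the policy trains. The formula below is the standard geometric-acceptance accounting for speculative decoding, and the cost model is my own toy, not TLT's actual tuner.

```python
def expected_tokens(a, k):
    """Expected tokens committed per verify step with draft length k and
    per-token acceptance probability a (geometric acceptance model)."""
    return (1 - a ** (k + 1)) / (1 - a)

def best_draft_length(a, c, k_max=16):
    """Pick k maximizing tokens per unit time, where one verify pass
    costs 1 and each draft token costs c."""
    return max(range(1, k_max + 1),
               key=lambda k: expected_tokens(a, k) / (1.0 + c * k))

print(best_draft_length(a=0.8, c=0.05))   # high acceptance: long drafts pay off
print(best_draft_length(a=0.3, c=0.05))   # low acceptance: keep drafts short
```

As rollout workloads shift between easy and hard prompts, the acceptance rate `a` moves, and so does the argmax, which is why a fixed SD setting cannot stay optimal.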
Shang Yang reposted
Tianyuan Zhang@tianyuanzhang99·
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large-Chunk Test-Time Training), a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch (no custom kernels) 🚀 10× the GPU FLOPs utilization of previous nonlinear test-time training (TTT) methods 🧠 Huge memory size (up to 40% of model params) Project page with code: tianyuanzhang.com/projects/ttt-d… (videos generated with our AR video diffusion) 1/9
7 replies · 80 reposts · 429 likes · 101.5K views
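The "large chunk" idea can be sketched in one dimension. This is a cartoon of chunked test-time training, not LaCT's nonlinear memory or its kernels: a fast weight is updated with one averaged gradient step per chunk of key/value pairs, so each update becomes one big parallel reduction instead of a long chain of per-token steps.

```python
def chunked_ttt(keys, values, chunk, lr=0.1):
    """Fit a scalar fast weight W online, minimizing (W*k - v)^2, with
    one averaged gradient step per chunk of (key, value) pairs."""
    W = 0.0
    for start in range(0, len(keys), chunk):
        ks = keys[start:start + chunk]
        vs = values[start:start + chunk]
        grad = sum(2 * (W * k - v) * k for k, v in zip(ks, vs)) / len(ks)
        W -= lr * grad
    return W

# Larger chunks mean fewer (but more hardware-friendly) sequential updates.
print(chunked_ttt([1.0] * 8, [1.0] * 8, chunk=1),
      chunked_ttt([1.0] * 8, [1.0] * 8, chunk=4))
```

The toy also shows the tradeoff: per-token updates (chunk=1) adapt faster on this stream, while chunked updates trade some adaptation speed for the large matmuls that make high GPU utilization possible in the real method.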