Shang Yang

46 posts

@Shang_mit

Ph.D. Student @MITEECS. Undergrad @Tsinghua_Uni.

Cambridge, MA, US · Joined October 2023
389 Following · 390 Followers
Pinned Tweet
Shang Yang@Shang_mit·
🥳 Thrilled to see our work on TLT featured on the MIT homepage (mit.edu) and the cover of MIT News today! 🏛️✨ 🚀 2x faster RL training without losing accuracy. News: news.mit.edu/2026/new-metho… Paper: arxiv.org/abs/2511.16665 Code: github.com/mit-han-lab/fa…
Shang Yang@Shang_mit

🚀 Introducing TLT (Taming the Long-Tail), an efficient, lossless system that boosts reasoning RL training by mitigating the rollout bottleneck! 🏆 Accepted by #ASPLOS2026 ✨ What’s new? ⚡ Enjoy 2× faster end-to-end reasoning RL training 🔒 Lossless on-policy RL — training quality preserved theoretically and empirically 🎁 Get a free, high-quality draft model for efficient deployment 🔗 Github: github.com/mit-han-lab/fa… 🔗 Paper: arxiv.org/pdf/2511.16665 👇 More in the thread (1/7) #ASPLOS #EfficientAI #OpenSource #LLMs #Reasoning #ReinforcementLearning

1 reply · 21 reposts · 79 likes · 13.6K views
Shang Yang reposted
Song Han@songhan_mit·
Honored to join the MIT President’s podcast. We discussed how Efficient AI and model compression techniques can help break the bottlenecks of AI computing: m.youtube.com/watch?v=XMphi6…
1 reply · 3 reposts · 32 likes · 7.6K views
Shang Yang reposted
Luke J. Huang@whatthelukh·
We introduce Variance Controlled Policy Optimization (VCPO), a method that adds explicit variance-targeted controls to policy-gradient objectives in off-policy RL, enabling stable, scalable async RL training. ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO 🚀 2.5x faster async RL training while matching synchronous RL performance 🧠 Robust training stability under heavily off-policy settings (at least 128 steps off-policy) 📄 Paper: arxiv.org/abs/2602.17616 🔗 Code: github.com/mit-han-lab/vc… 🧵👇
3 replies · 12 reposts · 68 likes · 11.6K views
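The "variance-targeted control" idea in the VCPO announcement above can be made concrete with a toy: shrink an importance-weight cap until the empirical variance of the off-policy objective falls below a target. This is only a sketch of the concept; VCPO's actual control law, function names, and thresholds are my assumptions, not taken from the paper.

```python
import math

def capped_offpolicy_mean(rewards, logp_new, logp_old, var_target, cap=16.0):
    """Toy variance control: halve the importance-weight cap until the
    empirical variance of the per-sample objective drops below var_target
    (or the cap bottoms out at 1)."""
    while True:
        terms = [min(math.exp(n - o), cap) * r
                 for r, n, o in zip(rewards, logp_new, logp_old)]
        mean = sum(terms) / len(terms)
        var = sum((t - mean) ** 2 for t in terms) / len(terms)
        if var <= var_target or cap <= 1.0:
            return mean, cap
        cap = max(1.0, cap * 0.5)

# One stale sample (log-ratio 3, weight ~20) dominates the variance,
# so the cap gets driven all the way down.
print(capped_offpolicy_mean([1, 1, -1, 1], [0, 0, 0, 3], [0, 0, 0, 0], 0.5))
```

The real method derives the control analytically rather than searching a cap, but the effect is the same: the estimator trades a little bias for a bounded, targeted variance, which is what keeps heavily off-policy training stable.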
Shang Yang reposted
Zhijian Liu@zhijianliu_·
Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬 ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning accuracy with minimal overhead — so 4-bit models stay strong at reasoning. 💪💪
31 replies · 143 reposts · 1.4K likes · 170.3K views
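A toy illustration of why the rotation pairs mentioned above help low-bit quantization (this is not ParoQuant's kernel or its rotation selection, just the underlying intuition): spreading a per-channel outlier across a pair of channels with a Givens rotation shrinks the shared quantization scale, which cuts the error on all the other channels.

```python
import math

def quant_error(v, bits=4):
    """Squared error of symmetric uniform quantization with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / qmax
    deq = [round(x / scale) * scale for x in v]
    return sum((a - b) ** 2 for a, b in zip(v, deq))

def rotate_pair(v, i, j, theta):
    """Apply a Givens rotation to channels i and j."""
    out = list(v)
    c, s = math.cos(theta), math.sin(theta)
    out[i], out[j] = c * v[i] - s * v[j], s * v[i] + c * v[j]
    return out

v = [10.0, 0.1, 0.6, 0.65]              # channel 0 is an outlier
w = rotate_pair(v, 0, 1, math.pi / 4)   # spread it across channels 0 and 1
print(quant_error(v), quant_error(w))   # rotated error is smaller
```

Because the rotation is orthogonal, it can be undone (or fused into adjacent weights) without changing the network's function, which is why keeping only the critical pairs can recover accuracy with minimal overhead.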
Shang Yang reposted
Hao Kang@GT_HaoKang·
🔥 Modify 2 lines of code and get your agentic serving/rollout up to 3.9x faster, losslessly! ⚡️ Say hello to ThunderAgent, a fast, simple, and program-aware agentic inference system. 🥇 We propose a program abstraction to schedule all GPU and CPU resources, the first principled approach to distributed agentic inference and rollout. 🌐 Blog: thunderagent.ai 💻 Code: github.com/ThunderAgent-o… 📜 Paper: arxiv.org/pdf/2602.13692 #AI #ThunderAgent #LLMAgent #MLSys 1/n
3 replies · 24 reposts · 107 likes · 29.1K views
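The gain from "program-aware" scheduling can be sketched with a two-resource toy (the program shapes, the round-robin policy, and the numbers below are my assumptions, not ThunderAgent's design): once the scheduler sees each agent request as a sequence of GPU and CPU phases, it can overlap one request's tool call (CPU) with another's decode (GPU).

```python
def serial_makespan(programs):
    """Run every phase of every program back to back on one timeline."""
    return sum(t for prog in programs for _, t in prog)

def overlapped_makespan(programs):
    """Toy program-aware schedule: GPU and CPU each run one phase at a
    time, visiting programs round-robin by phase index."""
    free = {"gpu": 0, "cpu": 0}   # when each resource next becomes idle
    ready = [0] * len(programs)   # when each program's next phase may start
    makespan = 0
    for phase in range(max(len(p) for p in programs)):
        for i, prog in enumerate(programs):
            if phase < len(prog):
                res, t = prog[phase]
                start = max(ready[i], free[res])
                free[res] = ready[i] = start + t
                makespan = max(makespan, start + t)
    return makespan

# Two agent requests: decode (GPU), tool call (CPU), decode (GPU).
reqs = [[("gpu", 2), ("cpu", 3), ("gpu", 2)]] * 2
print(serial_makespan(reqs), overlapped_makespan(reqs))
```

Even this toy overlaps the second request's decode with the first one's tool call; with many concurrent requests and real phase durations, that overlap is where the claimed speedup comes from.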
Shang Yang reposted
Zhijian Liu@zhijianliu_·
Holiday cooking finally ready to serve! 🥳 Introducing DFlash — speculative decoding with block diffusion. 🚀 6.2× lossless speedup on Qwen3-8B ⚡ 2.5× faster than EAGLE-3 Diffusion vs AR doesn’t have to be a fight. At today’s stage: • dLLMs = fast, highly parallel, but lossy • AR LLMs = accurate, sequential, but slow DFlash = diffusion drafts, AR verifies.
62 replies · 233 reposts · 1.8K likes · 217.6K views
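The draft-then-verify split behind DFlash-style speedups can be shown with a generic greedy speculative-decoding loop. This is a token-level toy; the block-diffusion drafter and the real batched acceptance rule are not modeled here. The point it demonstrates is "lossless": the output is identical to plain greedy decoding with the target model alone.

```python
def greedy_decode(next_token, prompt, n):
    """Plain autoregressive greedy decoding: n tokens, one at a time."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, k, n):
    """Draft proposes k tokens; target keeps the longest agreeing prefix,
    then emits its own token at the first disagreement."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        ctx, draft = list(out), []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        ctx = list(out)
        for t in draft:
            if target_next(ctx) != t:
                out.append(target_next(ctx))  # target's correction
                break
            out.append(t)
            ctx.append(t)
    return out[len(prompt):][:n]

# Toy deterministic "models": the draft agrees with the target except
# when the context length is a multiple of 4.
target = lambda ctx: (sum(ctx) + len(ctx)) % 5
draft = lambda ctx: target(ctx) if len(ctx) % 4 else (target(ctx) + 1) % 5
print(speculative_decode(target, draft, [1, 2, 3], k=4, n=12)
      == greedy_decode(target, [1, 2, 3], 12))
```

Every emitted token equals the target's choice at its own prefix, so correctness never depends on the draft; the draft only decides how many tokens the target can commit per verification pass.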
Shang Yang reposted
Jack Cook@jackcookjack·
Training LLMs with NVFP4 is hard because FP4 has so few values that I can fit them all in this post: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. But what if I told you that reducing this range even further could actually unlock better training + quantization performance? Introducing Four Over Six, a new method for improving the accuracy of NVFP4 quantization with Adaptive Block Scaling. 🧵
6 replies · 41 reposts · 255 likes · 69.3K views
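The post's point about FP4's tiny value set is easy to make concrete. Below is a plain nearest-value FP4 quantizer over exactly the grid listed in the tweet, with naive absmax block scaling as the baseline; Four Over Six's Adaptive Block Scaling is the paper's contribution and is not reproduced here, though the tweet hints it amounts to sometimes mapping the block maximum to a smaller grid value than 6.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * v for v in FP4_GRID for s in (-1.0, 1.0)})  # 15 values

def quantize_fp4(x, scale):
    """Round x/scale to the nearest representable FP4 value, then rescale."""
    return min(FP4_VALUES, key=lambda v: abs(v - x / scale)) * scale

def quantize_block(block):
    """Naive absmax scaling: map the block's largest magnitude onto 6.0."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0
    return [quantize_fp4(x, scale) for x in block]

print(quantize_block([0.1, -0.7, 2.3, 5.9]))
```

With only 15 distinct values, the choice of scale dominates the error of every element in the block, which is why per-block scale selection is worth optimizing at all.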
Changyu Chen@Cameron_Chann·
Makes a lot of sense to me, thank you! This was actually something I wanted to work on with a systems colleague back in early 2024, but we felt the ecosystem wasn't mature enough to support it, so we shelved the idea. Really excited to see it realized in TLT, and pushed even further with joint system optimization. It's amazing. Congrats!
1 reply · 0 reposts · 0 likes · 28 views
Shang Yang@Shang_mit·
Thank you, Changyu. The blog by Jiajun and Chenyang is excellent! The idea of applying speculative decoding to RL is gaining traction in the community, and our works approach the problem from complementary perspectives. For example, TLT updates the draft model by exploiting rollout bubbles, while the blog's approach updates the drafter jointly with the target model during the backward pass. We also introduce optimizations such as an adaptive rollout engine that auto-tunes SD configurations to handle the highly dynamic workloads of reasoning RL for better system efficiency. :)
1 reply · 0 reposts · 1 like · 104 views
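The "rollout bubbles" in the reply above come directly from the long-tail: in a synchronous batched rollout, slots whose sequences finish early sit idle until the longest sequence ends. A back-of-the-envelope sketch (my own illustration of the bottleneck, not TLT's scheduler):

```python
def bubble_steps(lengths):
    """Idle slot-steps in a batch that waits for its longest rollout."""
    horizon = max(lengths)
    return sum(horizon - l for l in lengths)

lengths = [512, 600, 2048, 700]        # one long-tail rollout dominates
idle = bubble_steps(lengths)
total = max(lengths) * len(lengths)
print(idle, total, round(idle / total, 2))
```

In this example more than half of the batch's slot-time is idle; using that otherwise-wasted capacity to update the draft model is how the speedup can come for free.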
Shang Yang@Shang_mit·
TLT delivers prominent end-to-end gains: 🔥 1.7×–2.1× training speedup across models at different scales 🔒 Lossless on-policy training — reward curves closely match the baseline A fast, reliable upgrade for reasoning LLM RL training! (7/7)
1 reply · 1 repost · 2 likes · 246 views
Shang Yang@Shang_mit·
Rollout workloads are dynamic. The best SD config may change over time — no single setting works well! 👉 TLT auto-tunes SD configurations, always selecting the optimal strategy with zero extra user burden. (6/7)
1 reply · 1 repost · 2 likes · 291 views
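Why the best SD config drifts over time: under a simple cost model, the optimal draft length depends on the current acceptance rate, which changes as the policy trains. The formula below is the standard geometric-acceptance accounting for speculative decoding, and the cost model is my own toy, not TLT's actual tuner.

```python
def expected_tokens(a, k):
    """Expected tokens committed per verify step with draft length k and
    per-token acceptance probability a (geometric acceptance model)."""
    return (1 - a ** (k + 1)) / (1 - a)

def best_draft_length(a, c, k_max=16):
    """Pick k maximizing tokens per unit time, where one verify pass
    costs 1 and each draft token costs c."""
    return max(range(1, k_max + 1),
               key=lambda k: expected_tokens(a, k) / (1.0 + c * k))

print(best_draft_length(a=0.8, c=0.05))   # high acceptance: long drafts pay off
print(best_draft_length(a=0.3, c=0.05))   # low acceptance: keep drafts short
```

As rollout workloads shift between easy and hard prompts, the acceptance rate `a` moves, and so does the argmax, which is why a fixed SD setting cannot stay optimal.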
Shang Yang reposted
Tianyuan Zhang@tianyuanzhang99·
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large-Chunk Test-Time Training), a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch (no custom kernels) 🚀 10× the GPU FLOPs utilization of previous nonlinear test-time training (TTT) methods 🧠 Huge memory size (up to 40% of model params) Project page with code: tianyuanzhang.com/projects/ttt-d… (videos generated with our AR video diffusion) 1/9
7 replies · 80 reposts · 429 likes · 101.5K views
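The "large chunk" idea can be sketched in one dimension. This is a cartoon of chunked test-time training, not LaCT's nonlinear memory or its kernels: a fast weight is updated with one averaged gradient step per chunk of key/value pairs, so each update becomes one big parallel reduction instead of a long chain of per-token steps.

```python
def chunked_ttt(keys, values, chunk, lr=0.1):
    """Fit a scalar fast weight W online, minimizing (W*k - v)^2, with
    one averaged gradient step per chunk of (key, value) pairs."""
    W = 0.0
    for start in range(0, len(keys), chunk):
        ks = keys[start:start + chunk]
        vs = values[start:start + chunk]
        grad = sum(2 * (W * k - v) * k for k, v in zip(ks, vs)) / len(ks)
        W -= lr * grad
    return W

# Larger chunks mean fewer (but more hardware-friendly) sequential updates.
print(chunked_ttt([1.0] * 8, [1.0] * 8, chunk=1),
      chunked_ttt([1.0] * 8, [1.0] * 8, chunk=4))
```

The toy also shows the tradeoff: per-token updates (chunk=1) adapt faster on this stream, while chunked updates trade some adaptation speed for the large matmuls that make high GPU utilization possible in the real method.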