Wentao Guo

51 posts

@WentaoGuo7

CS PhD student @PrincetonCS, Previously CS MEng + BS @CornellCIS

Joined November 2021

197 Following · 885 Followers

Wentao Guo reposted
Albert Gu @_albertgu
The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student-led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
[image]
36 replies · 311 reposts · 1.6K likes · 406.3K views
Wentao Guo reposted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
[image]
6 replies · 132 reposts · 780 likes · 219.4K views
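The exp2 bottleneck mentioned above comes from the standard trick of computing softmax with base-2 exponentials: since exp(x) = 2^(x·log₂e), the hot loop can use the GPU's fast exp2 unit. A minimal numpy sketch of that equivalence (function names are mine, not from the FlashAttention-4 code):

```python
import numpy as np

LOG2_E = 1.4426950408889634  # log2(e)

def softmax_ref(scores):
    # Standard numerically-stable softmax with the natural exponential.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_exp2(scores):
    # Fold log2(e) into the scores once, then use only exp2 afterwards --
    # exp2 maps to a single fast hardware instruction on NVIDIA GPUs.
    s = scores * LOG2_E
    m = s.max(axis=-1, keepdims=True)
    e = np.exp2(s - m)
    return e / e.sum(axis=-1, keepdims=True)
```

On Blackwell the tweet's point is that even this cheap exp2 path becomes the bottleneck relative to tensor-core matmuls, hence the algorithm/pipeline changes.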
Wentao Guo reposted
Mayank Mishra @MayankMish98
We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). The bug has two main parts:
1. The init is incorrect (torch.ones) if Mamba-2 layers are used in isolation, without the Mamba2ForCausalLM model class (this has already been fixed: github.com/fla-org/flash-…).
2. Initialization is skipped due to meta-device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging).
The difference is substantial; Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma…
Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping merge the PR.
17 replies · 73 reposts · 747 likes · 367.9K views
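For context on what the correct init looks like: Mamba-style initialization samples a timestep dt log-uniformly and stores the inverse-softplus of it in dt_bias, so that softplus(dt_bias) recovers dt at the start of training; torch.ones gives every head the same, much larger effective timestep. A hedged numpy sketch of the scheme (function name and default bounds are illustrative, not the exact repository code):

```python
import numpy as np

def init_dt_bias(n_heads, dt_min=1e-3, dt_max=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Sample dt log-uniformly in [dt_min, dt_max] ...
    dt = np.exp(rng.uniform(size=n_heads)
                * (np.log(dt_max) - np.log(dt_min)) + np.log(dt_min))
    # ... then invert softplus: softplus(dt + log(1 - exp(-dt))) == dt,
    # so the parameter starts exactly at the sampled timestep.
    return dt + np.log(-np.expm1(-dt))
```

The identity softplus(dt + log(1 − e^(−dt))) = log(1 + e^dt − 1) = dt is what makes the round-trip exact.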
Wentao Guo reposted
Woosuk Kwon @woosuk_k
Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge
Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities.

And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data. We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us
vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

Open Source
vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us
Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us. We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.
- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team
[image]
177 replies · 126 reposts · 1.1K likes · 464.3K views
Wentao Guo @WentaoGuo7
@dmsobol @MayankMish98 @XinleC295 @istoica05 @tri_dao Token rounding will always produce a per-expert token count that is a multiple of the tile size, and the maximum deviation for token-choice is one tile (128 tokens) per expert. We observe no significant quality degradation as long as average received tokens per expert / tile size >= 2.
0 replies · 0 reposts · 2 likes · 737 views
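To make the invariant concrete, here is a toy sketch of the rounding property. The exact SonicMoE rounding rule is in the paper; rounding each count to the nearest tile multiple here is only for illustration of the two stated guarantees (tile-aligned counts, deviation bounded by one tile):

```python
TILE = 128  # grouped-GEMM tile size along the token dimension

def round_counts(counts, tile=TILE):
    # Round each expert's token count to a multiple of the tile size,
    # so every grouped-GEMM problem is tile-aligned with no padding.
    return [tile * round(c / tile) for c in counts]

raw = [37, 200, 391, 512]
rounded = round_counts(raw)
# Every count is tile-aligned, and no expert moves by a full tile or more.
assert all(r % TILE == 0 for r in rounded)
assert all(abs(r - c) < TILE for c, r in zip(raw, rounded))
```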
Wentao Guo @WentaoGuo7
@BrendanBurkeX @tri_dao The improvement will be more significant. Blackwell offers more asynchrony features, where SonicMoE's design will have a greater win.
1 reply · 0 reposts · 2 likes · 183 views
Tri Dao @tri_dao
This is what we've been cooking for the last 9 months: make MoE training go ~2x faster with ~2x less memory! Highlights:
- MoE typically takes the most time and memory in modern models. It turns out one can mathematically rewrite the MoE backward pass to reduce the activation memory you need to store in the fwd by ~2x, yielding the same gradients with no extra matmul recomputation. I really like this result, as it combines both algorithmic and systems insights.
- Analyzing bottlenecks in the MoE layer leads to a natural optimization strategy: reduce memory reads/writes as much as possible! Gathering the input for fwd and the output grad for bwd can sometimes take as much time as the grouped GEMMs. We fuse gather with grouped GEMM + overlap memory access and compute to make the whole layer go ~2x faster.
- Computing top-k for expert routing can take surprisingly long, ~15-20% of the whole MoE layer! The standard top-k impl uses a radix top-k algo, great for large k but suboptimal for small k. We rewrote top-k using a bitonic top-k algo, and it's sometimes 20-30x faster than pytorch's top-k!
All the main kernels are written in CuTe-DSL so they should be easy to extend (and install :D). Hopper kernels are out, Blackwell kernels are just about ready. MoE models used to be 2x less hardware-efficient to train; hopefully SonicMoE will change that.
Wentao Guo @WentaoGuo7

🚀SonicMoE🚀: a blazingly-fast MoE implementation optimized for NVIDIA Hopper GPUs. SonicMoE reduces activation memory by 45% and is 1.86x faster on H100 than previous SOTA😃 Paper: arxiv.org/abs/2512.14080 Work with @MayankMish98, @XinleC295, @istoica05, @tri_dao

30 replies · 167 reposts · 1.5K likes · 157.2K views
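The bitonic top-k idea can be sketched at a high level: keep descending-sorted buffers of size k and merge pairs with a bitonic half-cleaner, which isolates the k largest of 2k sorted elements in one branch-free elementwise max. A toy numpy version under that assumption (a real GPU kernel keeps buffers in registers and uses bitonic networks for the sorts as well; np.sort here is just for brevity):

```python
import numpy as np

def merge_topk(a, b):
    # a, b: descending-sorted, length k. Concatenating a with reversed b
    # forms a bitonic sequence, so one compare-exchange pass (an
    # elementwise max) isolates the k largest of the 2k elements.
    keep = np.maximum(a, b[::-1])
    return np.sort(keep)[::-1]  # a kernel would bitonic-merge instead

def bitonic_topk(x, k):
    # Pad to a multiple of k, sort each k-chunk descending, then
    # tree-reduce the chunks with merge_topk. Branch-free and
    # data-parallel -- the structure that suits small k on GPUs.
    n = -(-len(x) // k) * k
    xs = np.full(n, -np.inf)
    xs[:len(x)] = x
    chunks = [np.sort(c)[::-1] for c in xs.reshape(-1, k)]
    while len(chunks) > 1:
        chunks = [merge_topk(chunks[i], chunks[i + 1])
                  if i + 1 < len(chunks) else chunks[i]
                  for i in range(0, len(chunks), 2)]
    return chunks[0]
```

Radix top-k instead makes multiple passes over value bits, which amortizes well for large k but wastes work when k is tiny (e.g. top-8 expert routing), which is consistent with the speedups claimed above.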
Wentao Guo @WentaoGuo7
@MayankMish98 @XinleC295 @istoica05 @tri_dao [5/N] The token rounding routing eliminates padding waste in grouped GEMM and can achieve a 16% relative speedup over TC top-K in kernel computation time, while delivering robust token-choice accuracy even under highly sparse MoE training regimes.
[image]
0 replies · 2 reposts · 19 likes · 1.8K views
Wentao Guo @WentaoGuo7
@MayankMish98 @XinleC295 @istoica05 @tri_dao [4/N] For highly sparse MoEs, the FLOPs wasted by grouped GEMM's tile-based computation (due to padding) scale linearly w.r.t. the expert activation ratio. We introduce a token rounding routing algorithm that rounds the per-expert token count to avoid wasted FLOPs.
0 replies · 4 reposts · 22 likes · 1.9K views
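Back-of-envelope version of the padding argument: grouped GEMM rounds each expert's token count up to a full tile, so on average roughly half a tile per expert is zero-padding, and the wasted fraction grows as per-expert token counts shrink (i.e. as the MoE gets sparser). A quick sketch:

```python
TILE = 128  # tile size along the token dimension

def wasted_fraction(tokens_per_expert, tile=TILE):
    # Tokens are processed in tiles; the last partial tile per expert is
    # zero-padded, and FLOPs spent on those pad rows are wasted.
    padded = -(-tokens_per_expert // tile) * tile  # ceil to tile multiple
    return (padded - tokens_per_expert) / padded

# The fewer tokens each expert receives, the larger the wasted share.
for n in (100, 300, 1000, 10000):
    print(n, round(wasted_fraction(n), 4))
```

For 100 tokens per expert, 28 of 128 padded rows (~22%) are wasted; at 10000 tokens the waste is about 1%, which is why the problem bites hardest in highly sparse, fine-grained-expert regimes.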
Wentao Guo @WentaoGuo7
@MayankMish98 @XinleC295 @istoica05 @tri_dao [3/N] SonicMoE achieves a 50%+ speedup over existing baselines for a single MoE layer with open-source MoE configs from 7B to 685B. For a 7B MoE model with FSDP-2, SonicMoE on 64 H100s gets 213B tokens/day, while ScatterMoE (the previous SOTA) on 96 H100s gets 225B tokens/day.
[image]
0 replies · 3 reposts · 22 likes · 2.1K views
Wentao Guo @WentaoGuo7
@MayankMish98 @XinleC295 @istoica05 @tri_dao [2/N] The activation memory usage is the same as a dense model with an equal number of activated parameters (the minimum activation memory required for the backward computation without activation recomputation in the GEMMs). SonicMoE's activation memory usage is also independent of expert granularity.
[image]
0 replies · 3 reposts · 22 likes · 3.5K views