Wentao Guo

65 posts

@WentaoGuo7

CS PhD student @PrincetonCS, Previously CS MEng + BS @CornellCIS

Joined November 2021
199 Following · 1K Followers
Pinned Tweet
Wentao Guo @WentaoGuo7
🚀SonicMoE🚀 now runs at peak throughput on NVIDIA Blackwell GPUs 😃 54% & 35% higher fwd/bwd TFLOPS than the DeepGEMM baseline and 21% higher fwd TFLOPS than the official Triton example. SonicMoE still maintains its minimum activation memory footprint: the same as a dense model with equal activated parameters and independent of expert granularity. We wrote a blogpost on how we leveraged Blackwell features and the software abstraction on QuACK: Work with @MayankMish98, @XinleC295, @istoica05, @tri_dao
14
59
329
55.8K
Wentao Guo retweeted
Lijie(Derrick) Yang @LijieyYang
Excited to share that LessIsMore has been accepted to ICML 2026! 🚀 LessIsMore is a training-free sparse attention for efficient long-horizon reasoning. By enforcing cross-head unified token selection, it brings up to 1.6x E2E speedup while preserving reasoning accuracy under practical workloads. Huge thanks to my amazing co-authors and mentors @Jackfram2, @JiaZhihao, Ravi! Paper: arxiv.org/abs/2508.07101 Code: github.com/DerrickYLJ/Les… #ICML2026 #LLM #EfficientAI
8
20
72
8.6K
Wentao Guo @WentaoGuo7
@IlysMoutawwakil I ran the benchmark on DGX B300 GPUs (Blackwell Ultra, technically SM103).
0
0
1
81
Ilyas @IlysMoutawwakil
@WentaoGuo7 Very awesome work! Which Blackwell GPU did you run the benchmarks on?
1
0
0
107
Wentao Guo @WentaoGuo7
@GoonGarrett Could you try it sometime? I guess it was the metadata computation that caused the hang before, but if it still appears I will take a look.
0
0
1
126
Garrett Goon @GoonGarrett
@WentaoGuo7 Awesome, congrats! Have you gotten to test if the fully_shard hangs still occur with the update? I could try tomorrow if not.
1
0
0
130
Wentao Guo @WentaoGuo7
[7/N] A detailed blogpost of our approach describing how we use Blackwell's TMEM double-buffering, 2CTA MMA, CLC tile scheduling, and gather fusion to hide IO costs behind MMA compute. We also walk through QuACK's software abstraction, and a few ablation studies on Blackwell GPUs. Blogpost: dao-lab.ai/blog/2026/soni…
0
1
9
637
Wentao Guo @WentaoGuo7
[6/N] All SonicMoE Grouped GEMM kernels are built on QuACK. Each kernel overrides a single function. The most complex kernel (dH backward) only adds ~200 LoC for SonicMoE on top of QuACK, and the same code runs on Hopper and Blackwell GPUs.
0
1
9
559
Wentao Guo @WentaoGuo7
[5/N] On Hopper, we leverage Ping-Pong warpgroup scheduling to overlap the heavy epilogue IO with the tiled GEMM computation. On Blackwell, the hardware does it differently but in the same spirit: we have a dedicated on-chip tensor memory (TMEM) split into two accumulator stages. The MMA warp fills one stage while epilogue warps handle the other stage, then they swap. The heavy epilogue IO is overlapped with the tiled GEMM again, which we elaborate on in the blogpost.
0
1
9
607
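To illustrate the two-stage overlap described in [5/N], here is a minimal timing sketch. It is not SonicMoE or QuACK code: the function name `makespan`, the per-tile costs, and the idea of modeling the MMA warp and the epilogue warps as two sequential workers sharing `stages` accumulator buffers are all assumptions made for illustration.

```python
def makespan(num_tiles, t_mma, t_epilogue, stages):
    """Finish time of a toy MMA -> epilogue pipeline (hypothetical model).

    The MMA worker may start tile i only once an accumulator stage is free,
    i.e. once the epilogue of tile i - stages has finished. Assumes
    num_tiles >= 1 and stages >= 1.
    """
    mma_done = []
    epi_done = []
    for i in range(num_tiles):
        prev_mma = mma_done[i - 1] if i >= 1 else 0
        stage_free = epi_done[i - stages] if i >= stages else 0
        # MMA waits for its previous tile and for a free accumulator stage.
        mma_done.append(max(prev_mma, stage_free) + t_mma)
        prev_epi = epi_done[i - 1] if i >= 1 else 0
        # Epilogue waits for this tile's MMA and for its own previous tile.
        epi_done.append(max(mma_done[i], prev_epi) + t_epilogue)
    return epi_done[-1]


# One accumulator stage: MMA and epilogue serialize.
print(makespan(10, t_mma=4, t_epilogue=3, stages=1))  # 70
# Two stages: the epilogue of tile i overlaps the MMA of tile i + 1.
print(makespan(10, t_mma=4, t_epilogue=3, stages=2))  # 43
```

With two stages, the total time in this toy model approaches `num_tiles * max(t_mma, t_epilogue)`, which is the sense in which the epilogue IO is hidden behind the GEMM.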
Wentao Guo @WentaoGuo7
[4/N] SonicMoE is designed for fine-grained MoEs. It achieves greater relative speedup over existing MoE baselines when we increase the expert granularity on B300 GPUs.
1
2
11
649
Wentao Guo @WentaoGuo7
[3/N] On B300 GPUs, SonicMoE achieves 54% & 35% higher fwd/bwd TFLOPS than the DeepGEMM baseline and 21% higher fwd TFLOPS than the official Triton example across 6 open-source MoE configs (7B to 685B). SonicMoE often doubles the achieved TFLOPS over ScatterMoE and MoMoE.
0
1
11
864
Wentao Guo @WentaoGuo7
[2/N] Modern MoEs are scaling towards the fine-grained regime, with more, smaller experts to activate. However, the activation memory footprint of existing MoE kernels grows linearly with expert granularity and drains VRAM. We instead compute the MoE backward pass in a different but mathematically equivalent way. Activation memory usage is now the same as a dense model with equal activated parameters, a 45% reduction from community MoE kernels. SonicMoE’s activation memory usage is also independent of expert granularity.
0
2
12
1.2K
Wentao Guo retweeted
Jack Zhang @jcz42
We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
17
164
1K
216.6K
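The Gram-matrix rewrite described in the post above can be sketched in NumPy. This is only one reading of the tweet, not the authors' implementation: it uses the classic cubic Newton-Schulz step rather than Muon's tuned coefficients, runs in float64, and the function names are hypothetical. The idea is that each step multiplies X by a polynomial Q in the Gram matrix G = XXᵀ, so one can iterate on the small square G, accumulate the product of the Q factors, and touch the rectangular X only once at the end.

```python
import numpy as np


def newton_schulz(X, steps=5):
    """Standard cubic Newton-Schulz iteration on the rectangular matrix X."""
    X = X / np.linalg.norm(X)  # Frobenius norm bounds the spectral norm by 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


def gram_newton_schulz(X, steps=5):
    """Same iterates, computed on the n x n Gram matrix (illustrative sketch)."""
    X = X / np.linalg.norm(X)
    n = X.shape[0]
    I = np.eye(n)
    G = X @ X.T          # small, square, symmetric
    P = np.eye(n)        # accumulated polynomial, so that X_k = P @ X_0
    for _ in range(steps):
        Q = 1.5 * I - 0.5 * G   # X_{k+1} = Q @ X_k
        P = Q @ P
        G = Q @ G @ Q           # Gram matrix of X_{k+1}; Q is symmetric
    return P @ X


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64))  # wide matrix: n = 8 rows, 64 columns
print(np.allclose(newton_schulz(X), gram_newton_schulz(X)))
```

In exact arithmetic the two functions return the same matrix. The post notes that keeping the Gram form stable took considerable work, which this float64 toy does not exercise.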
Wentao Guo retweeted
Albert Gu @_albertgu
The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
39
313
1.6K
443.9K
Wentao Guo retweeted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/
7
131
781
228.5K
Wentao Guo retweeted
Mayank Mishra @MayankMish98
We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). This bug is related to 2 main issues: 1. init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has already been fixed: github.com/fla-org/flash-…). 2. Skipping initialization due to meta device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging). The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma… Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping in merging the PR.
17
73
746
371.3K
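For context on why an all-ones `dt_bias` matters, here is a small pure-Python sketch of the kind of initialization the post contrasts it with. It assumes the scheme used in the reference Mamba code as I understand it: sample the step size dt log-uniformly in a range, then store its inverse softplus so that softplus(dt_bias) recovers dt. The helper names and the defaults (`dt_min=1e-3`, `dt_max=1e-1`, `dt_floor=1e-4`) are assumptions for illustration, not taken from the post.

```python
import math
import random


def softplus(x):
    return math.log1p(math.exp(x))


def init_dt_bias(num_heads, dt_min=1e-3, dt_max=1e-1, dt_floor=1e-4, seed=0):
    """Return one bias per head such that softplus(bias) lies in [dt_min, dt_max]."""
    rng = random.Random(seed)
    biases = []
    for _ in range(num_heads):
        # Sample dt log-uniformly between dt_min and dt_max.
        log_dt = rng.random() * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
        dt = max(math.exp(log_dt), dt_floor)
        # Inverse softplus: bias = dt + log(1 - exp(-dt)).
        biases.append(dt + math.log(-math.expm1(-dt)))
    return biases


biases = init_dt_bias(num_heads=4)
print([round(softplus(b), 4) for b in biases])  # step sizes within [0.001, 0.1]
print(round(softplus(1.0), 4))                  # all-ones bias gives a much larger step
```

Under these assumed defaults, an all-ones bias yields a step size above 1, more than ten times the upper end of the intended range, which is consistent with the post's observation that the model is sensitive to this initialization.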