Wenxuan Tan

15 posts

@EdenTan20

Joined May 2022
4 Following · 1 Follower
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@dogacel0 @ezyang I wonder how much the API cost per task? (Claude should return that in the session JSON?) Just trying to compare cost efficiency with 4.7
0
0
0
15
Doğaç
Doğaç@dogacel0·
I let Fable optimize GPU kernels autonomously using the "auto-gpu-kernel" harness. If it had joined NVIDIA's competition today, it would have won 🥇 in 4/5 kernels against humans. Fable can write Gluon kernels, do warp specialization, use TMA, tcgen05, etc. (Speedup vs Opus 4.8)
Doğaç tweet media
Doğaç@dogacel0

Testing Mythos for GPU kernel generation. I will test it on 3 kernels: DSA, GDN and MoE routing. Let's see how it performs against Opus 4.7, which previously won the contest against humans on the DSA track.

11
20
258
27.3K
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@mcray69 you are retarded and stubborn, stop lying to yourself🤣🤣 it's the official api
0
0
0
2
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@Andy_ShuoYang Thanks for the great work! I wonder if you used any custom agent loop for writing the cutedsl kernels? It seems like a lot of work😃
1
0
1
61
Shuo Yang
Shuo Yang@Andy_ShuoYang·
Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…
47
236
1.6K
866K
Hanchen Li
Hanchen Li@lihanc02·
Opus 4.8 claims itself to be Qwen if you ask in Chinese. I guess the anti-distillation is working well for them 🤣 Photo from Wenxuan Tan’s LinkedIn
Hanchen Li tweet media
36
63
1K
116.2K
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@minhhai2209 @lihanc02 the full command is there; you are welcome to reproduce it if you can put some money into the api😘
0
0
0
10
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@CliffLattner @haoailab @ye_combinator Yes, for B200 attention it does give the same speedup, though for other hardware FP4 should be faster. Our contribution is to show that FP4 attn works across language and video diffusion.
0
0
1
50
Wenxuan Tan retweeted
Hao AI Lab
Hao AI Lab@haoailab·
(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and 1.39x speedup over FlashAttention-4 on a B200. Blog: haoailab.com/blogs/attn-qat/ Code: github.com/hao-ai-lab/Fas… Checkpoints: huggingface.co/FastVideo/14B_…
9
31
244
37.3K
Wenxuan Tan retweeted
Hao Zhang
Hao Zhang@haozhangml·
We (FastVideo Team) just keep shipping! For the first time, we push FP4 QAT into attention, and it actually works very well. Let me explain why this is a bigger deal than you might think, as I think the field is underestimating how fast the FP4 era is arriving and how important it will be.

Start with the hardware. On Vera Rubin, FP4 is not 2× FP8. If you check the spec, R200 delivers ~50 PFLOPS NVFP4 inference vs ~16 PFLOPS FP8: over 3× the throughput. NVIDIA is betting VR silicon on FP4. The compute is already sitting on the die, waiting. The only question has ever been whether we can make FP4 work end-to-end.

Now look at where that stack stood until this week:
✅ FP8 training: now the norm
✅ FP4 MLP inference: already usable in many production settings
❌ FP4 attention: "too sensitive, too many outliers, won't work"

Attention was the missing piece. And without it, there is no such thing as an end-to-end FP4 model: you're always paying the FP8 tax on the part of the network that scales worst with context length. The one component we most want in low precision was the one component nobody had a working recipe for.

This work basically fills that last gap. FP4 MLP + FP4 attention is now a complete FP4 inference path. That's what I mean when I say this closes the loop on FP4. And if FP4 attention is already here in 2026, FP2 is closer than the field thinks. 🫡🫡 Go read the thread, try the code, tell us where it breaks
Hao AI Lab@haoailab

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and 1.39x speedup over FlashAttention-4 on a B200. Blog: haoailab.com/blogs/attn-qat/ Code: github.com/hao-ai-lab/Fas… Checkpoints: huggingface.co/FastVideo/14B_…

1
3
34
4.5K
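For readers unfamiliar with what "FP4" implies numerically, here is a minimal sketch of block-wise fake quantization onto the FP4 (E2M1) value grid, the kind of rounding step that quantization-aware training typically simulates in the forward pass. It is not the Attn-QAT recipe from the posts above: the function name `fake_quant_fp4`, the absmax scaling rule, and the plain Python float scale are all assumptions for illustration (hardware formats such as NVFP4 store block scales in reduced precision).

```python
import math

# Magnitudes representable by the FP4 E2M1 format (sign handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_fp4(block):
    """Round each value in `block` to the nearest scaled FP4 value.

    Hypothetical helper: one scale per block, chosen so the largest
    magnitude in the block maps to 6.0 (the largest E2M1 magnitude).
    The result stays in ordinary floats ("fake" quantization).
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for x in block:
        # Nearest grid magnitude to |x| / scale (ties go to the smaller value).
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


if __name__ == "__main__":
    print(fake_quant_fp4([6.0, 3.0, -1.0, 0.2]))
```

With only eight magnitudes per sign, small values next to a large outlier in the same block collapse to zero, which is one way to see why attention was considered sensitive to 4-bit precision.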
Wenxuan Tan retweeted
LMSYS Org
LMSYS Org@lmsysorg·
🚀 SGLang-Diffusion Update: Two Months In! Since launch, we've optimized SGLang-Diffusion to be 1.5x faster, achieving state-of-the-art inference speeds (up to 5x vs others). Key Updates: 🔥 New Models: Day-0 support for Flux.2, Qwen-Image series, Z-Image-Turbo, GLM-Image and more ⚡️ Performance: Integrated Cache-DiT (up to +169% speedup) & Layerwise Offload 🛠️ Features: Full LoRA HTTP API + ComfyUI custom node support ⚙️ Hardware: Optimized for NVIDIA (4090/5090), AMD, and MUSA Huge thanks to our open-source community and partners for the support! 🙌
LMSYS Org tweet media (2 images)
3
8
50
13.1K
Wenxuan Tan
Wenxuan Tan@EdenTan20·
@tri_dao Great work! Any thoughts on how to extend this to multi-node RDMA?
0
0
0
552
Tri Dao
Tri Dao@tri_dao·
This is what we've been cooking for the last 9 months: making MoE training go ~2x faster with ~2x less memory! Highlights:
- MoE typically takes the most time and memory in modern models. Turns out one can mathematically rewrite the MoE backward pass to reduce the activation mem you need to store in the fwd by ~2x, resulting in the same gradients with no extra matmul recomputation. I really like this result, as it combines both algorithmic and systems insights.
- Analyzing bottlenecks in the MoE layer leads to a natural optimization strategy: reduce mem reads/writes as much as possible! Gathering the input for fwd and output grad for bwd can sometimes take as much time as the grouped GEMMs. We fuse gather with grouped GEMM + overlap mem access and compute to make the whole layer go ~2x faster.
- Computing top-k for expert routing can take surprisingly long, ~15-20% of the whole MoE layer! The standard top-k impl uses the radix top-k algo, great for large k but suboptimal for small k. We rewrote top-k using the bitonic top-k algo, and it's sometimes 20-30x faster than PyTorch's top-k!
All the main kernels are written in Cute-DSL so they should be easy to extend (and install :D). Hopper kernels are out, Blackwell kernels are just about ready. MoE models used to be 2x less hardware-efficient to train, hopefully Sonic-MOE will change that.
Wentao Guo@WentaoGuo7

🚀SonicMoE🚀: a blazingly-fast MoE implementation optimized for NVIDIA Hopper GPUs. SonicMoE reduces activation memory by 45% and is 1.86x faster on H100 than previous SOTA😃 Paper: arxiv.org/abs/2512.14080 Work with @MayankMish98, @XinleC295, @istoica05, @tri_dao

30
167
1.5K
158.7K
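The bitonic top-k idea mentioned in the post above can be sketched in plain Python. This is a sequential illustration of the general algorithm, not the SonicMoE kernel: the function names are made up here, k is assumed to be a power of two, and the input is padded with negative infinity. Each chunk of k values is sorted with a bitonic network, then pairs of sorted chunks are reduced by an elementwise max against the reversed partner (which keeps the k largest of the 2k values as a bitonic sequence) followed by a bitonic merge.

```python
def bitonic_merge_desc(seq):
    """Sort a bitonic sequence (power-of-two length) into descending order."""
    a = list(seq)
    n = len(a)
    stride = n // 2
    while stride >= 1:
        for i in range(n):
            if (i & stride) == 0:
                j = i | stride
                if a[i] < a[j]:
                    a[i], a[j] = a[j], a[i]
        stride //= 2
    return a


def bitonic_sort_desc(seq):
    """Bitonic sort (power-of-two length) into descending order."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    first = bitonic_sort_desc(seq[: n // 2])         # descending
    second = bitonic_sort_desc(seq[n // 2:])[::-1]   # ascending
    return bitonic_merge_desc(first + second)        # desc + asc is bitonic


def bitonic_topk(values, k):
    """Return the k largest values in descending order (k a power of two)."""
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of two in this sketch")
    num_chunks = max(1, -(-len(values) // k))  # ceiling division
    padded_chunks = 1
    while padded_chunks < num_chunks:
        padded_chunks *= 2
    pad = padded_chunks * k - len(values)
    padded = list(values) + [float("-inf")] * pad
    chunks = [bitonic_sort_desc(padded[i:i + k]) for i in range(0, len(padded), k)]
    while len(chunks) > 1:
        merged = []
        for a, b in zip(chunks[0::2], chunks[1::2]):
            # Elementwise max of a and reversed b keeps the top k of both.
            top = [max(a[i], b[k - 1 - i]) for i in range(k)]
            merged.append(bitonic_merge_desc(top))
        chunks = merged
    return chunks[0]


if __name__ == "__main__":
    print(bitonic_topk([5, 1, 9, 3, 7, 2, 8, 6, 4], 2))
```

Every compare-and-swap position is fixed in advance and independent of the data, which is why this style of top-k maps well onto GPU threads when k is small.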
Wenxuan Tan
Wenxuan Tan@EdenTan20·
RT @haozhangml: One of the most interesting things I’ve been working on recently: Jacobi Forcing -- a recipe that turns any autoregressive…
0
3
0
11
Super Dario
Super Dario@inductionheads·
@apathium65906 @Tim_Dettmers Yes this is by far the biggest glaring issue with the piece and it’s almost treated as conventional wisdom while having no basis whatsoever
1
0
0
60
Wenxuan Tan retweeted
LMSYS Org
LMSYS Org@lmsysorg·
SGLang-Diffusion now supports the most popular cache framework for DiTs: Cache-DiT! With full support for its cache features, SGLang-Diffusion delivers speedups of 20% to 165%. By setting just the few env variables shown in the picture, inference is accelerated by 46%; the results with and without cache are shown in the images. For more guidance on how to use SGLang with Cache-DiT, see below👇
LMSYS Org tweet media (3 images)
1
11
47
6.3K
Wenxuan Tan retweeted
LMSYS Org
LMSYS Org@lmsysorg·
Blog link: lmsys.org/blog/2025-05-0… Our implementation, shown in the figure below, runs on 12 nodes in the Atlas Cloud, each equipped with 8 H100 GPUs. It uses prefill-decode disaggregation and large-scale expert parallelism (EP). Deployed locally, this implementation translates to a cost of $0.20/1M output tokens, which is about one-fifth the cost of the official DeepSeek Chat API. (2/4)
LMSYS Org tweet media
1
6
39
8.6K
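The per-token cost figure in the post above comes from dividing cluster cost by output throughput. A rough sketch of that arithmetic follows; only the cluster shape (12 nodes of 8 H100s) is taken from the post, while the hourly GPU price and the cluster-wide output throughput are hypothetical placeholders, not numbers from the blog.

```python
def cost_per_million_output_tokens(num_nodes, gpus_per_node,
                                   gpu_hourly_usd, cluster_tokens_per_sec):
    """Serving cost in USD per 1M output tokens for a whole cluster."""
    total_gpus = num_nodes * gpus_per_node
    cluster_usd_per_hour = total_gpus * gpu_hourly_usd
    million_tokens_per_hour = cluster_tokens_per_sec * 3600 / 1_000_000
    return cluster_usd_per_hour / million_tokens_per_hour


if __name__ == "__main__":
    # 12 nodes x 8 GPUs is from the post; the other two inputs are assumed.
    cost = cost_per_million_output_tokens(
        num_nodes=12,
        gpus_per_node=8,
        gpu_hourly_usd=2.00,             # hypothetical rental price
        cluster_tokens_per_sec=100_000,  # hypothetical output throughput
    )
    print(f"${cost:.2f} per 1M output tokens")
```

Plugging in a real rental price and measured throughput is what would reproduce, or contradict, a quoted figure such as $0.20/1M.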