Wenxuan Tan

15 posts

@EdenTan20

Joined May 2022

4 Following, 1 Follower

@dogacel0 @ezyang I wonder how much API cost was incurred per task? (Claude should return that in the session JSON?) Just trying to compare cost efficiency with 4.7

Doğaç@dogacel0·5d

I let Fable optimize GPU kernels autonomously using the "auto-gpu-kernel" harness. If it joined NVIDIA's competition today, it would have won 🥇 in 4/5 kernels against humans. Fable can write Gluon kernels, do warp specialization, use TMA, tcgen05, etc. (Speedup vs Opus 4.8)

Doğaç@dogacel0

Testing Mythos for GPU kernel generation. I will test it on 3 kernels: DSA, GDN and MoE routing. Let's see how it performs against Opus 4.7, which previously won the contest against humans on the DSA track.

Wenxuan Tan@EdenTan20·1 Jun

@mcray69 you are retarded and stubborn, stop lying to yourself🤣🤣 it's the official api

mcray@mcray69·29 May

This is what happens when you pay for api services that offer a “discount”. No shit they’ll give u qwen dumbass Some chinese “devs” are next level retarded

Max For AI@MaxForAI

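LOL, Claude Opus4.8 distilled Alibaba's Qwen 🤣 If you ask it "who are you" in Chinese through the API, it will very likely answer: "I am Tongyi Qianwen (Qwen), an ultra-large-scale language model independently developed by Tongyi Lab under Alibaba Group."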

Wenxuan Tan@EdenTan20·30 May

@Andy_ShuoYang Thanks for the great work! I wonder if you used any custom agent loop for writing the cutedsl kernels? It seems like a lot of work😃

Shuo Yang@Andy_ShuoYang·27 May

Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…

Hanchen Li@lihanc02·28 May

Opus 4.8 claims itself to be Qwen if you ask in Chinese. I guess the anti-distillation is working well for them 🤣 Photo from Wenxuan Tan’s LinkedIn

Wenxuan Tan@EdenTan20·30 May

@minhhai2209 @lihanc02 the full command is there; you are welcome to reproduce it if you can put some money into the api😘

Wenxuan Tan@EdenTan20·10 Apr

@CliffLattner @haoailab @ye_combinator Yes, for B200 attention it does give the same speedup, though for other hardware FP4 should be faster. Our contribution is to show that FP4 attn works across language and video diffusion.

Cliff Lattner@CliffLattner·10 Apr

@haoailab @ye_combinator Did you try FP8? It should be the same performance uplift as FP4

Wenxuan Tan retweeted

Hao AI Lab@haoailab·9 Apr

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and 1.39x speedup over FlashAttention-4 on a B200. Blog: haoailab.com/blogs/attn-qat/ Code: github.com/hao-ai-lab/Fas… Checkpoints: huggingface.co/FastVideo/14B_…

Wenxuan Tan retweeted

Hao Zhang@haozhangml·10 Apr

We (FastVideo Team) just keep shipping! For the first time, we push FP4 QAT into attention, and it actually works very well. Let me explain why this is a bigger deal than you might think, as I think the field is underestimating how fast the FP4 era is arriving and how important it will be.
Start with the hardware. On Vera Rubin, FP4 is not 2× FP8. If you check the spec, R200 delivers ~50 PFLOPS NVFP4 inference vs ~16 PFLOPS FP8: over 3× the throughput. NVIDIA is betting VR silicon on FP4. The compute is already sitting on the die, waiting. The only question has ever been whether we can make FP4 work end-to-end.
Now look at where that stack stood until this week:
✅ FP8 training: now the norm
✅ FP4 MLP inference: already usable in many production settings
❌ FP4 attention: "too sensitive, too many outliers, won't work"
Attention was the missing piece. And without it, there is no such thing as an end-to-end FP4 model: you're always paying the FP8 tax on the part of the network that scales worst with context length. The one component we most want in low precision was the one component nobody had a working recipe for.
This work basically fills that last gap. FP4 MLP + FP4 attention is now a complete FP4 inference path. That's what I mean when I say this closes the loop on FP4. And if FP4 attention is already here in 2026, FP2 is closer than the field thinks. 🫡🫡
Go read the thread, try the code, tell us where it breaks

Hao AI Lab@haoailab

Wenxuan Tan retweeted

LMSYS Org@lmsysorg·20 Jan

🚀 SGLang-Diffusion Update: Two Months In! Since launch, we've optimized SGLang-Diffusion to be 1.5x faster, achieving state-of-the-art inference speeds (up to 5x vs others).
Key Updates:
🔥 New Models: Day-0 support for Flux.2, Qwen-Image series, Z-Image-Turbo, GLM-Image and more
⚡️ Performance: Integrated Cache-DiT (up to +169% speedup) & Layerwise Offload
🛠️ Features: Full LoRA HTTP API + ComfyUI custom node support
⚙️ Hardware: Optimized for NVIDIA (4090/5090), AMD, and MUSA
Huge thanks to our open-source community and partners for the support! 🙌

Wenxuan Tan@EdenTan20·19 Dec

@tri_dao Great work! Any thoughts on how to extend this to multi-node RDMA?

Tri Dao@tri_dao·19 Dec

This is what we've been cooking for the last 9 months: make MoE training go ~2x faster with ~2x less memory! Highlights:
- MoE typically takes the most time and memory in modern models. Turns out one can mathematically rewrite the MoE backward pass to reduce the activation memory you need to store in the forward pass by ~2x, resulting in the same gradients with no extra matmul recomputation. I really like this result, as it combines both algorithmic and systems insights.
- Analyzing bottlenecks in the MoE layer leads to a natural optimization strategy: reduce memory reads/writes as much as possible! Gathering the input for fwd and the output grad for bwd can sometimes take as much time as the grouped GEMMs. We fuse gather with grouped GEMM and overlap memory access with compute to make the whole layer go ~2x faster.
- Computing top-k for expert routing can take surprisingly long, ~15-20% of the whole MoE layer! The standard top-k implementation uses a radix top-k algorithm, great for large k but suboptimal for small k. We rewrote top-k using a bitonic top-k algorithm, and it's sometimes 20-30x faster than PyTorch's top-k!
All the main kernels are written in Cute-DSL so they should be easy to extend (and install :D). Hopper kernels are out, Blackwell kernels are just about ready. MoE models used to be 2x less hardware-efficient to train, hopefully Sonic-MOE will change that.

Wentao Guo@WentaoGuo7

🚀SonicMoE🚀: a blazingly-fast MoE implementation optimized for NVIDIA Hopper GPUs. SonicMoE reduces activation memory by 45% and is 1.86x faster on H100 than previous SOTA😃 Paper: arxiv.org/abs/2512.14080 Work with @MayankMish98, @XinleC295, @istoica05, @tri_dao
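
The bitonic top-k idea mentioned in Tri Dao's post can be illustrated in plain Python. This is only a rough sketch of a bitonic sorting network used for top-k selection, not the SonicMoE Cute-DSL kernel. The names `bitonic_sort_desc` and `bitonic_topk` are hypothetical, the loop runs sequentially where a GPU kernel would run each compare-exchange step in parallel, and a tuned kernel may avoid sorting the full row.

```python
import math


def bitonic_sort_desc(items):
    """Sort a power-of-two-length list in descending order with a bitonic network."""
    a = list(items)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between the two compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate direction until the final merge (k == n),
                    # where every pair is ordered descending.
                    want_desc = (i & k) == 0
                    if want_desc and a[i] < a[partner]:
                        a[i], a[partner] = a[partner], a[i]
                    elif not want_desc and a[i] > a[partner]:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def bitonic_topk(scores, k):
    """Return (values, indices) of the k largest scores, largest first."""
    assert 0 < k <= len(scores)
    size = 1 << (len(scores) - 1).bit_length()  # next power of two
    pairs = [(s, i) for i, s in enumerate(scores)]
    # Pad with -inf so padding entries never reach the top k.
    pairs += [(-math.inf, -1)] * (size - len(scores))
    ranked = bitonic_sort_desc(pairs)[:k]
    return [v for v, _ in ranked], [i for _, i in ranked]


# Example: pick the 2 highest-scoring experts out of 6 router scores.
values, experts = bitonic_topk([0.1, 0.7, 0.05, 0.9, 0.3, 0.2], 2)
```

The network performs a fixed sequence of compare-exchange steps that does not depend on the data, which is one reason this family of algorithms maps well to GPUs when the number of experts per row is small.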

Wenxuan Tan@EdenTan20·18 Dec

RT @haozhangml: One of the most interesting things I’ve been working on recently: Jacobi Forcing -- a recipe that turns any autoregressive…

Wenxuan Tan@EdenTan20·13 Dec

@n0th1ng1016 That's what playing Genshin Impact does to you

Wenxuan Tan@EdenTan20·11 Dec

@inductionheads @apathium65906 @Tim_Dettmers On the contrary, there is no basis whatsoever for claiming that any architecture scales better than transformers

Super Dario@inductionheads·10 Dec

@apathium65906 @Tim_Dettmers Yes this is by far the biggest glaring issue with the piece and it’s almost treated as conventional wisdom while having no basis whatsoever

Tim Dettmers@Tim_Dettmers·10 Dec

My new blog post discusses the physical reality of computation and why this means we will not see AGI or any meaningful superintelligence: timdettmers.com/2025/12/10/why…

Wenxuan Tan retweeted

LMSYS Org@lmsysorg·7 Dec

SGLang-Diffusion now supports the most popular cache framework for DiTs: Cache-DiT! With full support for its cache features, SGLang-Diffusion delivers speedups of 20% to 165%. Setting just a few environment variables (shown in the picture) accelerates inference by 46%; the picture also compares the results with and without cache. For more guidance on how to use SGLang with Cache-DiT, see below👇

Wenxuan Tan retweeted

LMSYS Org@lmsysorg·5 May

Blog link: lmsys.org/blog/2025-05-0… Our implementation, shown in the figure below, runs on 12 nodes in the Atlas Cloud, each equipped with 8 H100 GPUs. It uses prefill-decode disaggregation and large-scale expert parallelism (EP). Deployed locally, this implementation translates to a cost of $0.20/1M output tokens, about one-fifth the cost of the official DeepSeek Chat API. (2/4)
