Vijay

1.5K posts

@__tensorcore__

Systems and GPU Performance Mechanic - TBD Ex. CUTLASS 3.x / 4.x etc

Joined July 2015
612 Following · 2.5K Followers
Pinned Tweet
Vijay (@__tensorcore__)
As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚
16
18
370
27.1K
Vijay retweeted
Perplexity (@perplexity_ai)
We’ve developed our own inference engine, Runtime-Optimized Serving Engine (ROSE), to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.
75
120
1.1K
157.5K
Vijay retweeted
resham ☻ (@Reshusaur)
new walk of shame: agent still working, but the cafe closed
263
189
5.5K
599.1K
Vijay (@__tensorcore__)
@PatrickToulme @AlpinDale This ain’t true. You have the nvvm dialect for native PTX authoring too without “escape hatches”
0
0
6
134
Patrick C Toulme (@PatrickToulme)
Exactly. inline_asm is a text escape hatch — same mechanism as asm() in C++. String of PTX, LLVM constraint bridging, no validation, no composition. CuTe-DSL itself is a DSL (over CuTe atoms); inline_asm is the hole in it. pyptx is an actual DSL at the PTX layer: typed calls, validator, parser, transpiler.
1
1
11
1.3K
Patrick C Toulme (@PatrickToulme)
Launching pyptx, a Python DSL for writing NVIDIA PTX kernels. One PTX instruction = one Python call. Write pure PTX in Python.
Direct Hopper + Blackwell support: wgmma, TMA, tcgen05, mbarriers. JAX + PyTorch integration. Includes GEMM, grouped GEMM, RMSNorm, SwiGLU, and a PTX→Python transpiler.
pip install pyptx[torch]
pip install pyptx[jax]
github.com/patrick-toulme…
34
135
1.1K
179.6K
Vijay retweeted
tender (mlsys 5/18-21) (@tenderizzation)
[ENG SUB] how it feels to use eager pytorch in 2025
28
60
465
84.2K
Vijay retweeted
Alex Zhurkevich (@cudagdb)
Tomorrow: Blackwell Programming lecture by yours truly at Stanford CME213, Gates B3, 1:30–2:50 PM. Bring sharp questions.
6
9
136
7.5K
Vijay retweeted
Kimi.ai (@Kimi_Moonshot)
We're open-sourcing FlashKDA, our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. It achieves a 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20 and works as a drop-in backend for flash-linear-attention. Explore on GitHub: github.com/MoonshotAI/Fla…
46
185
1.8K
211.8K
Vijay retweeted
Yuchen Jin (@Yuchenj_UW)
Meta released Avocado; they call it Muse Spark. It's not open source (a bit sad). Meta TBD lab rebuilt the entire pretraining stack in 9 months and reached similar capability with >10x less compute than Llama 4 Maverick. I still think infra is the real moat in AI labs. You can train models much faster with good infra, and it lets researchers experiment with many more ideas much more quickly.
41
31
649
53.9K
Vijay retweeted
Shengjia Zhao (@shengjia_zhao)
Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. ai.meta.com/blog/introduci…
75
171
1.7K
232.5K
Vijay retweeted
Alexandr Wang (@alexandr_wang)
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
728
1.2K
10.4K
4.5M
Vijay retweeted
Alex Zhurkevich (@cudagdb)
Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/…
13
50
333
147.5K
Vijay retweeted
Anne Ouyang (@anneouyang)
Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.
47
46
517
133.3K
Vijay retweeted
Rupanshu Soi (@rupanshusoi)
The release of the FA4 paper is a good opportunity to highlight our paper (link below) on automatically finding optimal pipelines and warp specialization (WS) for these kernels. Twill uses SMT solvers to derive the FA3 and 4 fwd pass pipelining and WS strategies mechanically 1/n
1
12
96
4.6K
Tri Dao (@tri_dao)
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2CTA MMA instructions that let two thread blocks share operands to reduce smem traffic.
Ted Zadouri (@tedzadouri)

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

31
229
1.8K
188.8K
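The FlashAttention-4 posts above mention emulating the exponential with polynomials. As a rough illustration only, and not the FA4 kernel's actual method, coefficients, or precision, one common approach is to split x into a nearest integer n and a fraction f, approximate 2^f with a low-degree polynomial, and apply n as an exponent shift. The helper name `exp2_poly` and the degree-5 Taylor coefficients are assumptions made for this sketch.

```python
import math

LN2 = math.log(2.0)

# Taylor coefficients of 2**f = exp(f * ln 2) around f = 0, up to degree 5.
# Illustrative choice only; a tuned kernel would more likely use minimax
# coefficients fitted to its target precision.
COEFFS = [LN2**k / math.factorial(k) for k in range(6)]


def exp2_poly(x: float) -> float:
    """Approximate 2**x with a polynomial instead of an exp routine (hypothetical helper)."""
    n = math.floor(x + 0.5)  # nearest integer, so f = x - n lies in [-0.5, 0.5]
    f = x - n
    # Horner evaluation of the polynomial in f.
    p = 0.0
    for c in reversed(COEFFS):
        p = p * f + c
    # Multiply by 2**n; in floating point this only adjusts the exponent.
    return math.ldexp(p, n)
```

With this degree the relative error stays below about 1e-5 over the reduced range. Because exp(x) = 2^(x · log2 e), the same routine covers the softmax exponential after a constant rescale of the input.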
Vijay retweeted
PyTorch (@PyTorch)
FlexAttention now has a FlashAttention-4 backend.
FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000+ repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now.
We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads.
🖇️ Read our latest blog here: hubs.la/Q045FHPh0
No more choosing between flexibility and performance.
#PyTorch #FlexAttention #FlashAttention #OpenSourceAI
12
98
732
100.9K