Vijay

1.5K posts

@__tensorcore__

Systems and GPU Performance Mechanic - TBD Ex. CUTLASS 3.x / 4.x etc

Joined July 2015

612 Following · 2.5K Followers

Pinned Tweet

Vijay@__tensorcore__·23 Jan

As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚

370

27.1K

Vijay retweeted

Perplexity@perplexity_ai·6 May

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.

120

1.1K

157.5K

Vijay@__tensorcore__·1 May

@cHHillee @tenderizzation One could say you’re still a junior celebrity

Vijay@__tensorcore__·1 May

@cHHillee @tenderizzation Learning to be a celebrity I see!

449

Horace He@cHHillee·1 May

??? Is this some kind of infinite engagement glitch?

Allen Braden@allen_explains

🚨 A junior at Jane Street reportedly landed a $220K–$600K role because he used AI to analyze trillions of data points faster than most teams ever could. In this 1-hour lecture, he breaks down the exact system behind it: • how he researches massive datasets • how AI finds patterns humans miss • how his machine turns raw data into decisions • how you can apply the same thinking yourself Skip Netflix tonight. Watch this instead. One hour could completely change how you think about research, AI, and opportunity.

922

134.4K

Vijay retweeted

resham ☻@Reshusaur·28 Apr

new walk of shame: agent still working, but the cafe closed

263

189

5.5K

599.1K

Vijay@__tensorcore__·26 Apr

@PatrickToulme @AlpinDale This ain’t true. You have the nvvm dialect for native PTX authoring too without “escape hatches”

134

Patrick C Toulme@PatrickToulme·25 Apr

Exactly. inline_asm is a text escape hatch — same mechanism as asm() in C++. String of PTX, LLVM constraint bridging, no validation, no composition. CuTe-DSL itself is a DSL (over CuTe atoms); inline_asm is the hole in it. pyptx is an actual DSL at the PTX layer: typed calls, validator, parser, transpiler.

1.3K

Patrick C Toulme@PatrickToulme·25 Apr

Launching pyptx — a Python DSL for writing NVIDIA PTX kernels. One PTX instruction = one Python call. Write pure PTX in Python. Direct Hopper + Blackwell support: wgmma, TMA, tcgen05, mbarriers. JAX + PyTorch integration. Includes GEMM, grouped GEMM, RMSNorm, SwiGLU, and a PTX→Python transpiler pip install pyptx[torch] pip install pyptx[jax] github.com/patrick-toulme…

135

1.1K

179.6K

Vijay retweeted

tender (mlsys 5/18-21)@tenderizzation·7 Nov

[ENG SUB] how it feels to use eager pytorch in 2025

465

84.2K

Vijay retweeted

Alex Zhurkevich@cudagdb·23 Apr

Tomorrow: Blackwell Programming lecture by yours truly at Stanford CME213, Gates B3, 1:30–2:50 PM. Bring sharp questions.

136

7.5K

Vijay retweeted

Kimi.ai@Kimi_Moonshot·21 Apr

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/Fla…

185

1.8K

211.8K

Vijay retweeted

Yuchen Jin@Yuchenj_UW·8 Apr

Meta released Avocado, they call it Muse Spark. It's not open source (a bit sad). Meta TBD lab rebuilt the entire pretraining stack in 9 months and reached similar capability with >10x less compute than Llama 4 Maverick. I still think infra is the real moat in AI labs. You can train models much faster with a good infra, and it allows researchers to experiment with many more ideas much more quickly.

649

53.9K

Vijay retweeted

Shengjia Zhao@shengjia_zhao·8 Apr

Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. ai.meta.com/blog/introduci…

171

1.7K

232.5K

Vijay retweeted

Alexandr Wang@alexandr_wang·8 Apr

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

728

1.2K

10.4K

4.5M

Vijay retweeted

Ji-Ha@Ji_Ha_Kim·31 Mar

Very cool! I worked on this recently, and I actually used an identical approach early on. But I believe there is a significantly better approach - a **single** minimax rational iteration can beat 5 polynomial steps!

Jack Zhang@jcz42

We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵

140

15.7K

Vijay retweeted

Alex Zhurkevich@cudagdb·4 Apr

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/…

333

147.5K

Vijay retweeted

Edward Z. Yang@ezyang·27 Mar

In my opinion, here are the most important ideas of CuTe Layouts (arxiv.org/pdf/2603.02298) 🧵

251

15.6K

Vijay retweeted

Tri Dao@tri_dao·17 Mar

The frontier has increasingly shifted to hybrid models - from Qwen to Kimi-Linear and now with NVIDIA's Nemotron-3 Super - that rely on a strong linear sequence model. Today we release Mamba-3, the most powerful linear model to date. x.com/_albertgu/stat…

Albert Gu@_albertgu

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

112

845

77.5K

Vijay@__tensorcore__·12 Mar

ai.meta.com/blog/meta-mtia…

937

Vijay retweeted

Anne Ouyang@anneouyang·11 Mar

Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.

517

133.3K

Vijay retweeted

Rupanshu Soi@rupanshusoi·6 Mar

The release of the FA4 paper is a good opportunity to highlight our paper (link below) on automatically finding optimal pipelines and warp specialization (WS) for these kernels. Twill uses SMT solvers to derive the FA3 and 4 fwd pass pipelining and WS strategies mechanically 1/n

4.6K

Vijay@__tensorcore__·5 Mar

@bingxu_ @tri_dao Not sure if that’s the right way (for now)

Bing Xu@bingxu_·5 Mar

@__tensorcore__ @tri_dao I am doing “human-out-of-the-loop”, so Claude doesn’t work :)

250

Tri Dao@tri_dao·5 Mar

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now crazy fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, new online softmax to avoid 90% of softmax rescaling, 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.

Ted Zadouri@tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/

229

1.8K

188.8K

Vijay retweeted

PyTorch@PyTorch·5 Mar

FlexAttention now has a FlashAttention-4 backend. FlexAttention has enabled researchers to rapidly prototype custom attention variants—with 1000+ repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now. We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads. 🖇️ Read our latest blog here: hubs.la/Q045FHPh0 No more choosing between flexibility and performance. #PyTorch #FlexAttention #FlashAttention #OpenSourceAI

732

100.9K
