Vijay

1.5K posts

Vijay
@__tensorcore__

Joined July 2015
583 Following · 2.5K Followers

Pinned Tweet
Vijay @__tensorcore__ ·
As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission of creating the world's best accelerator programming model w/ hardware-software codesign 💚
Vijay retweeted
Anne Ouyang @anneouyang ·
Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.
Vijay retweeted
Rupanshu Soi @rupanshusoi ·
The release of the FA4 paper is a good opportunity to highlight our paper (link below) on automatically finding optimal pipelines and warp specialization (WS) for these kernels. Twill uses SMT solvers to derive the FA3 and 4 fwd pass pipelining and WS strategies mechanically 1/n
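The tweet doesn't show Twill's actual encoding, but the general approach is easy to sketch: express stage assignment as integer constraints and let an SMT solver find a legal, minimal pipeline. Everything below (the op list, the dependencies, the objective) is an illustrative toy, not Twill's model.

```python
# Toy sketch of SMT-based pipelining (illustrative only, not Twill's encoding):
# assign each kernel op to a pipeline stage so producers precede consumers,
# then minimize the total number of stages.
from z3 import Int, Optimize, sat

ops = ["load_K", "load_V", "mma_QK", "softmax", "mma_PV"]
deps = [("load_K", "mma_QK"), ("mma_QK", "softmax"),
        ("load_V", "mma_PV"), ("softmax", "mma_PV")]

stage = {op: Int(f"stage_{op}") for op in ops}
opt = Optimize()
for op in ops:
    opt.add(stage[op] >= 0)
for a, b in deps:
    opt.add(stage[a] < stage[b])   # producer strictly before consumer
span = Int("span")
for op in ops:
    opt.add(stage[op] <= span)
opt.minimize(span)                  # shortest legal pipeline

if opt.check() == sat:
    m = opt.model()
    for op in ops:
        print(op, "-> stage", m[stage[op]])
```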
Vijay retweeted
Tri Dao @tri_dao ·
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax to avoid 90% of softmax rescaling, and 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPS, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

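For readers who want the baseline being improved on: standard online softmax rescales the running accumulator every time the block max moves. A minimal NumPy sketch of that bookkeeping is below; the `if new_max > running_max` guard is a crude stand-in for FA4's rescaling avoidance (the paper's criterion and kernel-level bookkeeping are different).

```python
# Schematic online softmax over K/V blocks for a single query row (NumPy).
import numpy as np

def online_softmax_attention(q, K, V, block=128):
    acc = np.zeros(V.shape[1])     # running weighted sum of values
    running_max, denom = -np.inf, 0.0
    for s0 in range(0, len(K), block):
        s = K[s0:s0+block] @ q     # attention scores for this block
        new_max = max(running_max, s.max())
        if new_max > running_max:  # rescale only when the max actually moves
            scale = np.exp(running_max - new_max)
            acc, denom, running_max = acc * scale, denom * scale, new_max
        p = np.exp(s - running_max)
        denom += p.sum()
        acc += p @ V[s0:s0+block]
    return acc / denom

# Sanity check against direct softmax attention
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(512, 64)), rng.normal(size=(512, 8))
w = np.exp(K @ q - (K @ q).max()); ref = (w / w.sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```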
Vijay retweeted
PyTorch @PyTorch ·
FlexAttention now has a FlashAttention-4 backend. FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000+ repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now. We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads. 🖇️ Read our latest blog here: hubs.la/Q045FHPh0 No more choosing between flexibility and performance. #PyTorch #FlexAttention #FlashAttention #OpenSourceAI
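To make the "custom attention variant" part concrete, here is the shape of the FlexAttention API (PyTorch 2.5+); the ALiBi-style score_mod is just one example variant, and backend selection (including the new FA4 path) happens inside PyTorch:

```python
# Minimal FlexAttention usage: a score_mod adds an ALiBi-style relative bias.
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))
slopes = torch.exp2(-torch.arange(1, H + 1, device="cuda").float())

def alibi(score, b, h, q_idx, kv_idx):
    # Penalize attention to distant keys with a per-head slope.
    return score + slopes[h] * (kv_idx - q_idx)

out = flex_attention(q, k, v, score_mod=alibi)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```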
Vijay @__tensorcore__ ·
@tri_dao @bingxu_ If you direct Claude to debug using methods you’d use as a human, it’s reasonably good at finding race conditions or hangs. Pointing it to things like compute sanitizer etc is also good
Vijay retweeted
Dylan Patel @dylan522p ·
SemiAnalysis x Fluidstack are kicking off GTC with A Full-Stack AI Infra GPU Hackathon: Power to Prefill, Dirt to Decode. Build with the best, win prizes, and hear @marksaroufim (GPU MODE), @cHHillee (Thinking Machines), Thomas Raoux (OpenAI), @garywu. Apply below: luma.com/SAxFSHack
Vijay retweeted
Tri Dao @tri_dao ·
Claude / Codex also have an easier time writing some components of FA4 thanks to the fast compile time. I got Claude to debug a deadlock when we first implemented 2CTA fwd. It ran autonomously overnight for 6 hours, figured out part of the fix, but then went down a rabbit hole convincing itself that the compiler is broken (so very human 😂). After 6 hours, from Claude’s partial fix, I was able to fix the hang in 10 mins. More details here: github.com/Dao-AILab/flas… I’m hoping FA5 will be written completely by AI
Vijay retweeted
Tri Dao @tri_dao ·
I’m unreasonably excited about the fact that we wrote everything in Cute-DSL, embedded in Python. Installing / “compiling” now takes seconds instead of minutes / hours (looking at you, C++ templates). Try pip install fa4!
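Only `pip install fa4` is confirmed by the tweet; assuming the package mirrors flash-attn's existing `flash_attn_func` interface, usage would look roughly like this (the import path, function name, and layout convention are all assumptions):

```python
# HYPOTHETICAL usage sketch: assumes fa4 exposes a flash-attn-style
# flash_attn_func(q, k, v, causal=...) in (batch, seq, heads, dim) layout.
import torch
from fa4 import flash_attn_func  # assumed interface, modeled on flash-attn

B, S, H, D = 1, 4096, 16, 128
q, k, v = (torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (1, 4096, 16, 128)
```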
Vijay retweeted
Ted Zadouri @tedzadouri ·
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPS, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
Vijay retweeted
xjdr @_xjdr ·
i love jax and tpus, i think they are elegant and fit my mental model of how compilers and systems should work. that said, you'd have a _very_ hard time getting me to go back to using jax and tpu instead of pytorch (really cuda and CuTeDSL) and GB300 NVL72s
Vijay retweeted
Kion @OKfallah ·
In-context learning is a hack to remind your model. CLaaS uses self-distillation to move that knowledge into weights, freeing up context.
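What Kion is describing is often called context distillation: run the model with the context as the teacher, run it without the context as the student, and train the student to match. A toy sketch of that loop (my illustration of the general idea, not CLaaS's method; assumes an HF-style model returning `.logits`):

```python
# Toy context-distillation step (illustrative, not CLaaS's implementation).
import torch
import torch.nn.functional as F

def distill_step(model, ctx_ids, query_ids, optimizer, T=2.0):
    with torch.no_grad():  # teacher: same weights (a frozen copy in practice)
        t = model(torch.cat([ctx_ids, query_ids], dim=1)).logits  # sees context
        t = t[:, -query_ids.size(1):]          # align teacher on the query part
    s = model(query_ids).logits                # student: context-free
    loss = F.kl_div(F.log_softmax(s / T, dim=-1),
                    F.log_softmax(t / T, dim=-1),
                    log_target=True, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```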
Vijay @__tensorcore__ ·
@henrylhtsang Print it all out. Take 5 days of PTO. Go to Puerto Vallarta. Read cover to cover sipping a piña colada.
henry tsang @henrylhtsang ·
any tips for reading ptx docs? just skim through it to get an idea?
Vijay @__tensorcore__ ·
@blelbach What’s the event? Is it open to the peanut gallery?
Bryce Adelstein Lelbach @blelbach ·
Attempting to get to Boston for some CUDA events at Harvard. I bought 2 refundable plane tickets and 4 refundable train tickets.
Flight #1: Cancelled
Flight #2: Cancelled
Train #1 (today): Cancelled
Train #2 (tmrw): Delayed
Train #3 (tmrw): On time
Train #4 (tmrw): On time
Vijay retweeted
Igor Babuschkin @ibab ·
Building great AI products requires excellence in both creativity and technical execution. You need to create the right culture and enough space for good ideas to emerge and grow naturally, then fuel the best ideas with strong execution. The reason you see most good products start out as personal projects is that we are most in tune with what matters when building for ourselves. Products that are built for a fictitious user almost always end up bad because you don't get a good handle on what actually matters and you build things that don't resonate with users. It's not that different from creating great art.
Pedro Domingos @pmddomingos

Anthropic has no strategy. Claude Code started as someone's side project, and so did Cowork and MCP.

Vijay retweeted
Infini-AI-Lab @InfiniAILab ·
Video generation models are improving fast: real-time autoregressive models now deliver high quality at low latency, and they're quickly being adopted for world models and robotics applications. So what's the problem? They're still too slow on consumer hardware. 🚀 What if we told you that we can get true real-time 16 FPS video generation on a single RTX 5090? (1.5-12x over FA 2/3/4 on 5090, H100, B200) Today we release MonarchRT 🦋, an efficient video attention that parameterizes attention maps as (tiled) Monarch matrices and delivers real E2E gains.
📄 Paper: arxiv.org/abs/2602.12271
🌐 Website: infini-ai-lab.github.io/MonarchRT
🔗 GitHub: github.com/Infini-AI-Lab/…
🧵1/n
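Background on the parameterization: a Monarch matrix (Dao et al., "Monarch: Expressive Structured Matrices") is a product of two block-diagonal factors interleaved with a fixed transpose permutation, so a matvec costs O(n^1.5) instead of O(n^2). A small NumPy sketch of the untiled square case; MonarchRT's tiled attention-map version is more involved:

```python
# Dense-equivalent Monarch matvec: M = P @ blockdiag(B2) @ P @ blockdiag(B1),
# where P is the (m, m)-transpose permutation and n = m * m.
import numpy as np

def monarch_matvec(B1, B2, x):
    m = B1.shape[0]                        # B1, B2: (m, m, m) stacks of blocks
    y = np.einsum("bij,bj->bi", B1, x.reshape(m, m))  # blockdiag(B1) @ x
    z = np.einsum("bij,bj->bi", B2, y.T.copy())       # permute, blockdiag(B2)
    return z.T.reshape(-1)                            # permute back, flatten

def dense(B):                              # expand a block stack to full size
    m = B.shape[0]
    D = np.zeros((m * m, m * m))
    for b in range(m):
        D[b*m:(b+1)*m, b*m:(b+1)*m] = B[b]
    return D

m = 4
rng = np.random.default_rng(0)
B1, B2 = rng.normal(size=(2, m, m, m))
x = rng.normal(size=m * m)
P = np.eye(m * m)[np.arange(m * m).reshape(m, m).T.reshape(-1)]
assert np.allclose(monarch_matvec(B1, B2, x), P @ dense(B2) @ P @ dense(B1) @ x)
```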
Vijay retweeted
mike64_t @mike64_t ·
Vibe recoded Nsight (it wasn't low level enough) [CC @SemiAnalysis_]
Vijay @__tensorcore__ ·
@SubhoGhosh02 It's not the thread's view, it's the SM's view. TMA is fundamentally an SM-level operation and is programmed as such
subho ghosh @SubhoGhosh02 ·
On the device side we need to do tma_partition to get the thread view of the global and shared memory tensors according to the TMA descriptors, create the TMA multicast mask along mode 2 (the N dimension) for A and along mode 1 for B, and issue the TMA with a single thread.
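The multicast mask mentioned here is just a bitmask over CTA ranks in the cluster: each set bit names a CTA that receives the same TMA copy, so CTAs in one cluster row can share the A tile and CTAs in one column the B tile. A plain-Python illustration of what the mask encodes (the row-major rank ordering is an assumption; the hardware has its own cluster-rank numbering):

```python
# Conceptual TMA multicast mask: bit r set => cluster CTA rank r gets the copy.
def multicast_mask(cluster_shape, cta_coord, mcast_mode):
    cm, cn = cluster_shape
    m, n = cta_coord
    mask = 0
    if mcast_mode == 2:                 # peers along N share the same A tile
        for j in range(cn):
            mask |= 1 << (m * cn + j)   # assumed row-major rank = m * cn + j
    else:                               # mode 1: peers along M share B
        for i in range(cm):
            mask |= 1 << (i * cn + n)
    return mask

print(bin(multicast_mask((2, 4), (1, 2), mcast_mode=2)))  # 0b11110000
print(bin(multicast_mask((2, 4), (1, 2), mcast_mode=1)))  # 0b1000100
```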
subho ghosh @SubhoGhosh02 ·
Tried out TMA multicast with CuTe, let me try to break this down as simply as possible in this thread.