Vijay

1.5K posts

Vijay
@__tensorcore__

Joined July 2015
583 Following · 2.5K Followers

Pinned Tweet
Vijay @__tensorcore__ ·
As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission of creating the world's best accelerator programming model w/ hardware-software codesign 💚
Vijay retweeted
Anne Ouyang @anneouyang ·
Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.
Vijay retweeted
Rupanshu Soi @rupanshusoi ·
The release of the FA4 paper is a good opportunity to highlight our paper (link below) on automatically finding optimal pipelines and warp specialization (WS) for these kernels. Twill uses SMT solvers to derive the FA3 and 4 fwd pass pipelining and WS strategies mechanically 1/n
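The tweet doesn't show Twill's actual encoding, but the general approach is easy to sketch: express stage assignment as integer constraints and let an SMT solver find a legal, minimal pipeline. Everything below (the op list, the dependencies, the objective) is an illustrative toy, not Twill's model.

```python
# Toy sketch of SMT-based pipelining (illustrative only, not Twill's encoding):
# assign each kernel op to a pipeline stage so producers precede consumers,
# then minimize the total number of stages.
from z3 import Int, Optimize, sat

ops = ["load_K", "load_V", "mma_QK", "softmax", "mma_PV"]
deps = [("load_K", "mma_QK"), ("mma_QK", "softmax"),
        ("load_V", "mma_PV"), ("softmax", "mma_PV")]

stage = {op: Int(f"stage_{op}") for op in ops}
opt = Optimize()
for op in ops:
    opt.add(stage[op] >= 0)
for a, b in deps:
    opt.add(stage[a] < stage[b])   # producer strictly before consumer
span = Int("span")
for op in ops:
    opt.add(stage[op] <= span)
opt.minimize(span)                  # shortest legal pipeline

if opt.check() == sat:
    m = opt.model()
    for op in ops:
        print(op, "-> stage", m[stage[op]])
```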
Vijay retweeted
Tri Dao @tri_dao ·
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax to avoid 90% of softmax rescaling, and 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPS, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

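For readers who want the baseline being improved on: standard online softmax rescales the running accumulator every time the block max moves. A minimal NumPy sketch of that bookkeeping is below; the `if new_max > running_max` guard is a crude stand-in for FA4's rescaling avoidance (the paper's criterion and kernel-level bookkeeping are different).

```python
# Schematic online softmax over K/V blocks for a single query row (NumPy).
import numpy as np

def online_softmax_attention(q, K, V, block=128):
    acc = np.zeros(V.shape[1])     # running weighted sum of values
    running_max, denom = -np.inf, 0.0
    for s0 in range(0, len(K), block):
        s = K[s0:s0+block] @ q     # attention scores for this block
        new_max = max(running_max, s.max())
        if new_max > running_max:  # rescale only when the max actually moves
            scale = np.exp(running_max - new_max)
            acc, denom, running_max = acc * scale, denom * scale, new_max
        p = np.exp(s - running_max)
        denom += p.sum()
        acc += p @ V[s0:s0+block]
    return acc / denom

# Sanity check against direct softmax attention
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(512, 64)), rng.normal(size=(512, 8))
w = np.exp(K @ q - (K @ q).max()); ref = (w / w.sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```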
Vijay retweeted
PyTorch @PyTorch ·
FlexAttention now has a FlashAttention-4 backend. FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000+ repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now. We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads. 🖇️ Read our latest blog here: hubs.la/Q045FHPh0 No more choosing between flexibility and performance. #PyTorch #FlexAttention #FlashAttention #OpenSourceAI
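To make the "custom attention variant" part concrete, here is the shape of the FlexAttention API (PyTorch 2.5+); the ALiBi-style score_mod is just one example variant, and backend selection (including the new FA4 path) happens inside PyTorch:

```python
# Minimal FlexAttention usage: a score_mod adds an ALiBi-style relative bias.
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))
slopes = torch.exp2(-torch.arange(1, H + 1, device="cuda").float())

def alibi(score, b, h, q_idx, kv_idx):
    # Penalize attention to distant keys with a per-head slope.
    return score + slopes[h] * (kv_idx - q_idx)

out = flex_attention(q, k, v, score_mod=alibi)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```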
Vijay @__tensorcore__ ·
@tri_dao @bingxu_ If you direct Claude to debug using methods you’d use as a human, it’s reasonably good at finding race conditions or hangs. Pointing it to things like compute sanitizer etc is also good
Vijay retweeted
Dylan Patel @dylan522p ·
SemiAnalysis x Fluidstack are kicking off GTC with A Full-Stack AI Infra GPU Hackathon: Power to Prefill, Dirt to Decode. Build with the best, win prizes, and hear @marksaroufim (GPU MODE), @cHHillee (Thinking Machines), Thomas Raoux (OpenAI), @garywu. Apply below: luma.com/SAxFSHack
Vijay retweeted
Tri Dao @tri_dao ·
Claude / Codex also have an easier time writing some components of FA4 thanks to the fast compile time. I got Claude to debug a deadlock when we first implemented 2CTA fwd. It ran autonomously overnight for 6 hours, figured out part of the fix, but then went down a rabbit hole convincing itself that the compiler is broken (so very human 😂). After 6 hours, from Claude’s partial fix, I was able to fix the hang in 10 mins. More details here: github.com/Dao-AILab/flas… I’m hoping FA5 will be written completely by AI
Vijay retweeted
Tri Dao @tri_dao ·
I’m unreasonably excited about the fact that we wrote everything in Cute-DSL, embedded in Python. Installing / “compiling” now takes seconds instead of minutes / hours (looking at you, C++ templates). Try pip install fa4!
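Only `pip install fa4` is confirmed by the tweet; assuming the package mirrors flash-attn's existing `flash_attn_func` interface, usage would look roughly like this (the import path, function name, and layout convention are all assumptions):

```python
# HYPOTHETICAL usage sketch: assumes fa4 exposes a flash-attn-style
# flash_attn_func(q, k, v, causal=...) in (batch, seq, heads, dim) layout.
import torch
from fa4 import flash_attn_func  # assumed interface, modeled on flash-attn

B, S, H, D = 1, 4096, 16, 128
q, k, v = (torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (1, 4096, 16, 128)
```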
Vijay retweeted
Ted Zadouri @tedzadouri ·
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPS, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
Vijay retweeted
xjdr @_xjdr ·
i love jax and tpus, i think they are elegant and fit my mental model of how compilers and systems should work. that said, you'd have a _very_ hard time getting me to go back to using jax and tpu instead of pytorch (really cuda and CuTeDSL) and GB300 NVL72s
Vijay retweeted
Kion @OKfallah ·
In-context learning is a hack to remind your model. CLaaS uses self-distillation to move that knowledge into weights, freeing up context.
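What Kion is describing is often called context distillation: run the model with the context as the teacher, run it without the context as the student, and train the student to match. A toy sketch of that loop (my illustration of the general idea, not CLaaS's method; assumes an HF-style model returning `.logits`):

```python
# Toy context-distillation step (illustrative, not CLaaS's implementation).
import torch
import torch.nn.functional as F

def distill_step(model, ctx_ids, query_ids, optimizer, T=2.0):
    with torch.no_grad():  # teacher: same weights (a frozen copy in practice)
        t = model(torch.cat([ctx_ids, query_ids], dim=1)).logits  # sees context
        t = t[:, -query_ids.size(1):]          # align teacher on the query part
    s = model(query_ids).logits                # student: context-free
    loss = F.kl_div(F.log_softmax(s / T, dim=-1),
                    F.log_softmax(t / T, dim=-1),
                    log_target=True, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```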
Vijay @__tensorcore__ ·
@henrylhtsang Print it all out. Take 5 days of PTO. Go to Puerto Vallarta. Read cover to cover sipping a piña colada.
henry tsang @henrylhtsang ·
any tips for reading ptx docs? just skim through it to get an idea?
Vijay @__tensorcore__ ·
@blelbach What’s the event? Is it open to the peanut gallery?
Bryce Adelstein Lelbach @blelbach ·
Attempting to get to Boston for some CUDA events at Harvard. I bought 2 refundable plane tickets and 4 refundable train tickets.
Flight #1: Cancelled
Flight #2: Cancelled
Train #1 (today): Cancelled
Train #2 (tmrw): Delayed
Train #3 (tmrw): On time
Train #4 (tmrw): On time
Vijay retweeted
Igor Babuschkin @ibab ·
Building great AI products requires excellence in both creativity and technical execution. You need to create the right culture and enough space for good ideas to emerge and grow naturally, then fuel the best ideas with strong execution. The reason you see most good products start out as personal projects is that we are most in tune with what matters when building for ourselves. Products that are built for a fictitious user almost always end up bad because you don't get a good handle on what actually matters and you build things that don't resonate with users. It's not that different from creating great art.
Pedro Domingos @pmddomingos

Anthropic has no strategy. Claude Code started as someone's side project, and so did Cowork and MCP.

Vijay retweeted
Infini-AI-Lab @InfiniAILab ·
Video generation models are improving fast: real-time autoregressive models now deliver high quality at low latency, and they're quickly being adopted for world models and robotics applications. So what's the problem? They're still too slow on consumer hardware. 🚀 What if we told you that we can get true real-time 16 FPS video generation on a single RTX 5090? (1.5-12x over FA 2/3/4 on 5090, H100, B200) Today we release MonarchRT 🦋, an efficient video attention that parameterizes attention maps as (tiled) Monarch matrices and delivers real E2E gains.
📄 Paper: arxiv.org/abs/2602.12271
🌐 Website: infini-ai-lab.github.io/MonarchRT
🔗 GitHub: github.com/Infini-AI-Lab/…
🧵1/n
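Background on the parameterization: a Monarch matrix (Dao et al., "Monarch: Expressive Structured Matrices") is a product of two block-diagonal factors interleaved with a fixed transpose permutation, so a matvec costs O(n^1.5) instead of O(n^2). A small NumPy sketch of the untiled square case; MonarchRT's tiled attention-map version is more involved:

```python
# Dense-equivalent Monarch matvec: M = P @ blockdiag(B2) @ P @ blockdiag(B1),
# where P is the (m, m)-transpose permutation and n = m * m.
import numpy as np

def monarch_matvec(B1, B2, x):
    m = B1.shape[0]                        # B1, B2: (m, m, m) stacks of blocks
    y = np.einsum("bij,bj->bi", B1, x.reshape(m, m))  # blockdiag(B1) @ x
    z = np.einsum("bij,bj->bi", B2, y.T.copy())       # permute, blockdiag(B2)
    return z.T.reshape(-1)                            # permute back, flatten

def dense(B):                              # expand a block stack to full size
    m = B.shape[0]
    D = np.zeros((m * m, m * m))
    for b in range(m):
        D[b*m:(b+1)*m, b*m:(b+1)*m] = B[b]
    return D

m = 4
rng = np.random.default_rng(0)
B1, B2 = rng.normal(size=(2, m, m, m))
x = rng.normal(size=m * m)
P = np.eye(m * m)[np.arange(m * m).reshape(m, m).T.reshape(-1)]
assert np.allclose(monarch_matvec(B1, B2, x), P @ dense(B2) @ P @ dense(B1) @ x)
```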
Vijay retweeted
mike64_t @mike64_t ·
Vibe recoded Nsight (it wasn't low level enough) [CC @SemiAnalysis_]
Vijay @__tensorcore__ ·
@SubhoGhosh02 It's not the thread's view, it's the SM's view. TMA is fundamentally an SM-level operation and is programmed as such
subho ghosh @SubhoGhosh02 ·
On the device side we need to do tma_partition to get the thread view of the global and shared memory tensors according to the TMA descriptors, create the TMA multicast mask along mode 2 (the N dimension) for A and along mode 1 for B, and issue the TMA with a single thread.
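The multicast mask mentioned here is just a bitmask over CTA ranks in the cluster: each set bit names a CTA that receives the same TMA copy, so CTAs in one cluster row can share the A tile and CTAs in one column the B tile. A plain-Python illustration of what the mask encodes (the row-major rank ordering is an assumption; the hardware has its own cluster-rank numbering):

```python
# Conceptual TMA multicast mask: bit r set => cluster CTA rank r gets the copy.
def multicast_mask(cluster_shape, cta_coord, mcast_mode):
    cm, cn = cluster_shape
    m, n = cta_coord
    mask = 0
    if mcast_mode == 2:                 # peers along N share the same A tile
        for j in range(cn):
            mask |= 1 << (m * cn + j)   # assumed row-major rank = m * cn + j
    else:                               # mode 1: peers along M share B
        for i in range(cm):
            mask |= 1 << (i * cn + n)
    return mask

print(bin(multicast_mask((2, 4), (1, 2), mcast_mode=2)))  # 0b11110000
print(bin(multicast_mask((2, 4), (1, 2), mcast_mode=1)))  # 0b1000100
```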
subho ghosh @SubhoGhosh02 ·
Tried out TMA multicast with CuTe, let me try to break this down as simply as possible in this thread.