Pradeep Ramani

92 posts

@_prrama

15 Trillion Human Cells + 100 Trillion Bacterial cells + 1 consciousness. Opinions are my own. Sr. Architect @NVIDIA | CUTLASS | CUDA | GPGPU

Joined January 2012
161 Following · 236 Followers
Pradeep Ramani retweeted
Wentao Guo @WentaoGuo7
🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
14
70
341
82.7K
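For readers unfamiliar with "SOL" (speed of light) for a memory-bound kernel: a rough lower bound on runtime is the bytes moved divided by memory bandwidth. The sketch below is illustrative only. It uses the 3 TB/s figure quoted in the post; the function names and the read-once, write-once traffic model are assumptions for illustration and are not part of QuACK or CuTe-DSL.

```python
def sol_time_us(bytes_moved: int, bandwidth_bytes_per_s: float = 3e12) -> float:
    """Lower-bound runtime in microseconds for a purely memory-bound kernel."""
    return bytes_moved / bandwidth_bytes_per_s * 1e6


def elementwise_bytes(rows: int, cols: int, dtype_bytes: int = 2) -> int:
    """Bytes moved by a hypothetical kernel that reads one tensor and writes
    one tensor of the same shape (assumed traffic model)."""
    return 2 * rows * cols * dtype_bytes


def fraction_of_sol(measured_us: float, bytes_moved: int,
                    bandwidth_bytes_per_s: float = 3e12) -> float:
    """Ratio of the bandwidth-bound lower bound to a measured runtime (1.0 = at SOL)."""
    return sol_time_us(bytes_moved, bandwidth_bytes_per_s) / measured_us
```

A kernel whose measured time is twice the bound returned by `sol_time_us` would report `fraction_of_sol` of about 0.5.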
Pradeep Ramani retweeted
Vijay @__tensorcore__
[Image]
0
4
47
5.8K
Pradeep Ramani retweeted
NVIDIA HPC Developer @NVIDIAHPCDev
🎉 CUTLASS 4.0 is here, bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and performance as CUTLASS C++, with no compromises. The learning curve for writing optimized kernels is flattened: no more wrestling with C++ templates or long compile times.
CUTLASS 4.0’s Python support delivers: 👀
🏎️ Performance on par with C++ kernels
⏱️ 100x+ faster compile times
🤔 Intuitive, Python-native syntax
⚒️ No need for NVCC installs: just pip install nvidia-cutlass-dsl and go
🤝 Seamless integration with PyTorch and the broader Python ecosystem
📚 Improved documentation and a better debugging experience: docs.nvidia.com/cutlass/
Key features in #CUTLASS 4.0:
✅ CuTe DSL: Python-native, low-level programming model mirroring CuTe C++ abstractions (layouts, tensors, thread/data hierarchy)
✅ Support for NVIDIA Ampere, Ada, Hopper, and Blackwell Tensor Cores
✅ Examples and Jupyter notebooks for rapid onboarding
✅ Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
Whether you’re a researcher, student, or ML engineer, CUTLASS 4.0 with Python lowers the barrier to high-performance GPU programming and accelerates the path from prototype to production.
📝 Examples: github.com/NVIDIA/cutlass…
📗 Jupyter notebooks: github.com/NVIDIA/cutlass…
We’re excited to see what you build. Feedback and contributions welcome. 🙌
(Note: CuTe DSL is currently in public beta and will evolve with community feedback. C++ APIs remain fully supported for existing workflows.)
1
31
122
8.2K
Pradeep Ramani retweeted
Vijay @__tensorcore__
🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python slidehelloworld.png docs.nvidia.com/cutlass/media/…
16
85
422
78.5K
Pradeep Ramani retweeted
Haicheng Wu @asdf1234_0
CUTLASS is at the center of the CUDA Blackwell release blog. As always, we work hand in hand with the CUDA team to deliver next-level performance. developer.nvidia.com/blog/cuda-tool…
1
25
122
20.1K
Pradeep Ramani retweeted
Vijay @__tensorcore__
🔥🚨 CUTLASS Blackwell is here 🚨🔥 The 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few months and years 💚
6
29
123
12.4K
Pradeep Ramani @_prrama
@llllvvuu @hyhieu226 The goals here are:
1. Don't materialize intermediates in HBM
2. Optimally load / store tensors == ~1 time each of A, B, C from HBM
3. Ensure you can keep the GPU compute bound via efficient fusion
Dual GEMM attempts to do all 3.
1
0
1
92
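The three goals above can be made concrete with a back-of-the-envelope cost model for computing A x (B + C), with A of size m x k and B, C of size k x n. This is a minimal sketch under stated assumptions: every tensor is counted as crossing HBM exactly once per kernel that touches it (no tiling re-reads, no cache reuse), and the "fused" variant is a hypothetical kernel that accumulates A x B and A x C into the same output tile. The function names are illustrative and do not describe the CUTLASS dual GEMM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    flops: int         # floating-point operations
    hbm_elements: int  # elements read from plus written to HBM


def add_then_gemm(m: int, n: int, k: int) -> Cost:
    # Kernel 1: D = B + C reads B and C and writes D (k*n elements each).
    # Kernel 2: A @ D reads A and D and writes the m x n result.
    flops = k * n + 2 * m * n * k
    traffic = (2 * k * n + k * n) + (m * k + k * n + m * n)
    return Cost(flops, traffic)


def two_gemms_then_add(m: int, n: int, k: int) -> Cost:
    # Two GEMMs each read A and one k x n operand and write an m x n intermediate.
    # The final add reads both intermediates and writes the result.
    flops = 2 * (2 * m * n * k) + m * n
    traffic = 2 * (m * k + k * n + m * n) + 3 * m * n
    return Cost(flops, traffic)


def fused_ideal(m: int, n: int, k: int) -> Cost:
    # Assumed ideal: A, B, C each read once, result written once,
    # no intermediate ever materialized in HBM.
    flops = 2 * (2 * m * n * k)
    traffic = m * k + 2 * k * n + m * n
    return Cost(flops, traffic)
```

For m = n = k = 4 this model gives 96, 144, and 64 elements of HBM traffic respectively. The fused variant moves the least data but performs roughly twice the FLOPs of add-then-GEMM, which is why goal 3 (staying compute bound) matters.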
L @llllvvuu
@_prrama @hyhieu226 Could even be fewer GMEM loads in total, since IIUC, GEMM GMEM loads are not N * K + K * M but rather 2 * N * K * M / C for some constant C? So the fused kernel would have 3 * N * K * M / C loads, vs. 2 * K * M + 2 * N * K * M / C for non-fused?
1
0
0
84
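The load counts in the reply above can be checked with a small counting sketch. It assumes a simple square-tiled GEMM of an (N x K) operand with a (K x M) operand, in which each tile x tile output block streams its rows and columns of the inputs from global memory once, with no cache reuse between blocks. Under that assumption the constant C equals the tile edge. All names here are hypothetical.

```python
def tiled_gemm_loads(N: int, M: int, K: int, tile: int) -> int:
    """Count global-memory element loads by walking every output tile."""
    assert N % tile == 0 and M % tile == 0, "sketch assumes exact tiling"
    loads = 0
    for _ in range(N // tile):
        for _ in range(M // tile):
            loads += tile * K  # tile rows of the N x K operand
            loads += K * tile  # tile columns of the K x M operand
    return loads


def fused_loads(N: int, M: int, K: int, tile: int) -> int:
    # Each output tile streams three operand slices (A, B, C): 3 * N * K * M / tile.
    return 3 * N * M * K // tile


def unfused_loads(N: int, M: int, K: int, tile: int) -> int:
    # The add kernel reads B and C once (2 * K * M), then one tiled GEMM follows.
    return 2 * K * M + tiled_gemm_loads(N, M, K, tile)
```

With N = M = 8, K = 4, and tile = 2, `tiled_gemm_loads` returns 256, matching 2 * N * K * M / tile. In this model the fused kernel issues fewer loads only when N < 2 * tile. The model ignores L2 reuse and the write and re-read of the intermediate, so treat the comparison as a rough guide.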
Hieu Pham @hyhieu226
This simple question surprisingly requires so much knowledge of modern CUDA and GPU architecture to get right. Given 3 matrices: A of size mxk, and B, C both of size kxn. You want to compute: Ax(B+C). For most values of m, n, k, which way is faster? Bonus: why?
80
28
337
122.9K
Pradeep Ramani retweeted
Haicheng Wu @asdf1234_0
CUTLASS reached 5K stars this summer with 3.5M downloads per month. Thank you for your support! github.com/NVIDIA/cutlass/
8
36
326
57.8K
Pradeep Ramani retweeted
Tri Dao @tri_dao
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
30
338
2.2K
328.2K
Pradeep Ramani retweeted
Dylan Patel @dylan522p
If you work in AI this is the highest alpha channel out there. What are you doing anon? Binge these videos now. youtube.com/@cudamode?si=M…
SeaTac, WA 🇺🇸
18
154
1.9K
246.7K
Pradeep Ramani retweeted
Jason Turner @lefticus
Find Carbon interesting? Want a modern approach to language design? WITH a compiler you can play with today? AND is prioritizing safety? AND has C++ interop? WHY haven't you looked at github.com/SerenityOS/jakt from @jntrnr and @awesomekling ?
11
13
139
0
Pradeep Ramani retweeted
Greg Siskind @gsiskind
I'm part of the pro bono litigation effort planning to quickly file a lawsuit challenging the onerous DOL wage rule impacting H-1Bs and PERMs. We're needing employers, employees and membership organizations to volunteer as plaintiffs. If interested, go to docs.google.com/forms/d/e/1FAI….
52
422
618
0
Pradeep Ramani retweeted
PyTorch @PyTorch
v1.6: native mixed-precision support from NVIDIA (~2x perf improvement), distributed perf improvements, new profiling tool for memory consumption, Microsoft commits to developing and maintaining Windows PyTorch. Release Notes: github.com/pytorch/pytorc… Blog: pytorch.org/blog/pytorch-1…
4
228
757
0
Pradeep Ramani retweeted
Andrew Ng @AndrewYNg
New @ICEgov policy regarding F-1 visa international students is horrible & will hurt the US, students, and universities. Pushes universities to offer in-person classes even if unsafe or no pedagogical benefit, or students to leave US amidst pandemic and risk inability to return.
45
747
3.2K
0
Pradeep Ramani retweeted
Andrea Ventura @aventura71
A very sad day for US science and innovation. We will pay a hefty price for this demagogic insanity. 90% of my lab, myself included, is made of immigrants.
6
68
493
0
Pradeep Ramani retweeted
Sundar Pichai @sundarpichai
Immigration has contributed immensely to America’s economic success, making it a global leader in tech, and also Google the company it is today. Disappointed by today’s proclamation - we’ll continue to stand with immigrants and work to expand opportunity for all.
1.2K
9.5K
63.3K
0
Pradeep Ramani @_prrama
People are already so stressed out, stranded in the US with no visa and no medical insurance, and booking evac flights via @airindiain is a nightmare! No clarity, horrible customer service, dead website links and phone numbers! FIX IT! @PMOIndia @airindiain #AllowPvt
1
1
1
0
Pradeep Ramani @_prrama
Trying to book evacuation flights via Air India is probably the worst experience one can ever have dealing with any business! If you are incapable of providing ANY level of service, don't do it! Zero leadership, zero service, zero transparency! #AirIndiaSucks
0
0
1
0