Pradeep Ramani

92 posts

@_prrama

15 Trillion Human Cells + 100 Trillion Bacterial cells + 1 consciousness. Opinions are my own. Sr. Architect @NVIDIA | CUTLASS | CUDA | GPGPU

Joined January 2012

161 Following, 236 Followers

Pradeep Ramani retweeted

Wentao Guo@WentaoGuo7·10 Jul

🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao

341

82.7K

Pradeep Ramani retweeted

Vijay@__tensorcore__·19 May

5.8K

Pradeep Ramani retweeted

NVIDIA HPC Developer@NVIDIAHPCDev·14 May

🎉 CUTLASS 4.0 is here, bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and performance as CUTLASS C++, no compromises. The learning curve for writing optimized kernels is flattened: no more wrestling with C++ templates or long compile times.
CUTLASS 4.0’s Python support delivers: 👀
🏎️ Performance on par with C++ kernels
⏱️ 100x+ faster compile times
🤔 Intuitive, Python-native syntax
⚒️ No need for NVCC installs: just pip install nvidia-cutlass-dsl and go
🤝 Seamless integration with PyTorch and the broader Python ecosystem
📚 Improved documentation and a better debugging experience: docs.nvidia.com/cutlass/
Key features in #CUTLASS 4.0:
✅ CuTe DSL: Python-native, low-level programming model mirroring CuTe C++ abstractions (layouts, tensors, thread/data hierarchy)
✅ Support for NVIDIA Ampere, Ada, Hopper, and Blackwell Tensor Cores
✅ Examples and Jupyter notebooks for rapid onboarding
✅ Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
Whether you’re a researcher, student, or ML engineer, CUTLASS 4.0 with Python lowers the barrier to high-performance GPU programming and accelerates the path from prototype to production.
📝 Examples: github.com/NVIDIA/cutlass…
📗 Jupyter notebooks: github.com/NVIDIA/cutlass…
We’re excited to see what you build; feedback and contributions welcome. 🙌
(Note: CuTe DSL is currently in public beta and will evolve with community feedback. C++ APIs remain fully supported for existing workflows.)

122

8.2K

Pradeep Ramani retweeted

Vijay@__tensorcore__·13 May

🚨🔥 CUTLASS 4.0 is released 🔥🚨
pip install nvidia-cutlass-dsl
4.0 marks a major shift for CUTLASS: towards native GPU programming in Python
slidehelloworld.png docs.nvidia.com/cutlass/media/…

422

78.5K

Pradeep Ramani retweeted

Haicheng Wu@asdf1234_0·1 Feb

CUTLASS is at the center of the CUDA Blackwell release blog. As always, we work hand in hand with the CUDA team to deliver next-level performance. developer.nvidia.com/blog/cuda-tool…

122

20.1K

Pradeep Ramani retweeted

Vijay@__tensorcore__·25 Jan

🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few months and years 💚

123

12.4K

Pradeep Ramani@_prrama·3 Sep

@llllvvuu @hyhieu226 The goals here are:
1. Don't materialize intermediates in HBM.
2. Load and store tensors optimally, i.e. read each of A, B, C from HBM about once.
3. Ensure you can keep the GPU compute bound via efficient fusion.
Dual GEMM attempts to do all 3.

L@llllvvuu·3 Sep

@_prrama @hyhieu226 Could even be fewer GMEM loads in total, since IIUC, the number of GEMM GMEM loads is not N * K + K * M but rather 2 * N * K * M / C for some constant C? So the fused kernel would have 3 * N * K * M / C loads, vs. 2 * K * M + 2 * N * K * M / C for the non-fused one?

Hieu Pham@hyhieu226·2 Sep

This simple question surprisingly requires so much knowledge of modern CUDA and GPU architecture to get right. Given 3 matrices: A of size mxk, and B, C both of size kxn. You want to compute: Ax(B+C). For most values of m, n, k, which way is faster? Bonus: why?

337

122.9K
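A rough way to compare the options discussed in this thread is to count idealized HBM traffic and FLOPs for each. The sketch below is a simplified model, not a measurement: it assumes every kernel reads each input and writes each output exactly once (no tile re-reads and no cache effects, which the reply about GMEM loads suggests is optimistic), and the function names are hypothetical.

```python
def ideal_hbm_elements(m, n, k):
    """Idealized element counts moved to/from HBM for computing A x (B + C).

    Assumption: each kernel touches each operand exactly once.
    A is m x k; B and C are k x n; the result is m x n.
    """
    a, b, d = m * k, k * n, m * n
    return {
        # Kernel 1: read B and C, write T = B + C. Kernel 2: read A and T, write D.
        "add_then_gemm": (2 * b + b) + (a + b + d),
        # Two GEMMs (A x B and A x C), then an add that reads both results and writes D.
        "two_gemms_then_add": 2 * (a + b + d) + 3 * d,
        # Fused kernel: read A, B, C once and write D once, no intermediate in HBM.
        "fused": a + 2 * b + d,
    }


def flops(m, n, k):
    """Floating-point operation counts for the two unfused orderings."""
    return {
        "add_then_gemm": k * n + 2 * m * n * k,       # one add pass, one GEMM
        "two_gemms_then_add": 4 * m * n * k + m * n,  # two GEMMs, one add pass
    }


if __name__ == "__main__":
    m = n = k = 4096
    for name, count in ideal_hbm_elements(m, n, k).items():
        print(f"{name:>20}: {count / (m * k):.0f} matrices' worth of HBM traffic")
    print(flops(m, n, k))
```

Under this model, with m = n = k, the fused variant moves 4 matrices' worth of data, add-then-GEMM moves 6, and two GEMMs followed by an add move 9, while the second ordering also does roughly twice the FLOPs.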

Pradeep Ramani retweeted

Haicheng Wu@asdf1234_0·19 Aug

CUTLASS reached 5K stars this summer with 3.5M downloads per month. Thank you for your support! github.com/NVIDIA/cutlass/

326

57.8K

Pradeep Ramani retweeted

Tri Dao@tri_dao·11 Jul

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/

338

2.2K

328.2K
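The utilization figure in the post above can be sanity-checked with one line of arithmetic: 740 TFLOPS at 75% utilization implies a peak of roughly 987 TFLOPS. A minimal sketch, where the helper name is hypothetical and the inputs are only the numbers quoted in the post:

```python
def implied_peak_tflops(achieved_tflops, utilization):
    """Peak throughput implied by an achieved rate and a utilization fraction."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return achieved_tflops / utilization


# Numbers quoted in the post: 740 TFLOPS at 75% utilization (FP16 on H100).
peak = implied_peak_tflops(740, 0.75)
print(f"{peak:.0f} TFLOPS")
```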

Pradeep Ramani retweeted

Dylan Patel@dylan522p·10 Jun

If you work in AI this is the highest alpha channel out there. What are you doing anon? Binge these videos now. youtube.com/@cudamode?si=M…

SeaTac, WA 🇺🇸

154

1.9K

246.7K

Pradeep Ramani retweeted

Vijay@__tensorcore__·9 Jun

We gave our first in depth publicly available talk on CUTLASS 3.x and it’s up on YouTube now!

Andreas Köpf@neurosp1ke

The CUTLASS/TensorCores/Hopper lecture covered quite advanced CUDA programming. I guess we need further ramp-up lectures to make these topics more accessible. Recording: youtu.be/hQ9GPnV0-50?si… Slides: drive.google.com/file/d/18sthk6…

7.3K

Pradeep Ramani retweeted

Jason Turner@lefticus·22 Jul

Find Carbon interesting? Want a modern approach to language design? WITH a compiler you can play with today? AND is prioritizing safety? AND has C++ interop? WHY haven't you looked at github.com/SerenityOS/jakt from @jntrnr and @awesomekling ?

139

Pradeep Ramani retweeted

Greg Siskind@gsiskind·7 Oct

I'm part of the pro bono litigation effort planning to quickly file a lawsuit challenging the onerous DOL wage rule impacting H-1Bs and PERMs. We're needing employers, employees and membership organizations to volunteer as plaintiffs. If interested, go to docs.google.com/forms/d/e/1FAI….

422

618

Pradeep Ramani retweeted

PyTorch@PyTorch·28 Jul

v1.6: native mixed-precision support from NVIDIA (~2x perf improvement), distributed perf improvements, new profiling tool for memory consumption, Microsoft commits to developing and maintaining Windows PyTorch. Release Notes: github.com/pytorch/pytorc… Blog: pytorch.org/blog/pytorch-1…

228

757

Pradeep Ramani retweeted

Andrew Ng@AndrewYNg·7 Jul

New @ICEgov policy regarding F-1 visa international students is horrible & will hurt the US, students, and universities. Pushes universities to offer in-person classes even if unsafe or no pedagogical benefit, or students to leave US amidst pandemic and risk inability to return.

747

3.2K

Pradeep Ramani retweeted

Andrea Ventura@aventura71·23 Jun

A very sad day for US science and innovation. We will pay a hefty price for this demagogic insanity. 90% of my lab, myself included, is made of immigrants.

493

Pradeep Ramani retweeted

Sundar Pichai@sundarpichai·23 Jun

Immigration has contributed immensely to America’s economic success, making it a global leader in tech, and also Google the company it is today. Disappointed by today’s proclamation - we’ll continue to stand with immigrants and work to expand opportunity for all.

1.2K

9.5K

63.3K

Pradeep Ramani@_prrama·5 Jun

People are already so stressed out, stranded in the US with no visa and no medical insurance, and booking evac flights via @airindiain is a nightmare! No clarity, horrible customer service, dead website links and phone numbers! FIX IT! @PMOIndia @airindiain #AllowPvt

Pradeep Ramani@_prrama·5 Jun

Trying to book evacuation flights via Air India is probably the worst experience one can ever have dealing with any business! If you are incapable of providing ANY level of service, don't do it! Zero leadership, zero service, zero transparency! #AirIndiaSucks
