Matthew Nicely

41 posts

@manicely6005

NVIDIA AI SW PM for CUTLASS and cuDNN

Joined August 2024
71 Following · 185 Followers
Matthew Nicely retweeted
SemiAnalysis @SemiAnalysis_
For the past 12 years, cuDNN has been completely closed source (besides the .h files), until this week! Over 20 MoE kernels & NSA sparse attention kernels from cuDNN have been open sourced! Great work by @manicely6005 & the rest of the team; great to see parts of NVIDIA moving towards open kernels! Open source kernels drive innovation! (1/3) 🧵
7 replies · 66 retweets · 558 likes · 46.9K views
maharshi @maharshii
@manicely6005 it would be nice to have a python api for defining a gemm op with custom EVTs and generating C++ / directly compiling from that (for quick prototyping, instead of writing everything by hand in cutedsl)
1 reply · 0 retweets · 0 likes · 107 views
maharshi @maharshii
does nvidia cutlass have a python API for generating GEMM kernels (with collective epilogue) like we do in C++? they do have something called PyCutlass but that seems to be deprecated. note: i'm not talking about CuTeDSL where the user needs to write the whole kernel.
8 replies · 0 retweets · 80 likes · 6.3K views
Matthew Nicely @manicely6005
@dylan522p This has already been approved and several are in the process of being rolled out 🙂. We’ll do a better job of communicating soon.
0 replies · 0 retweets · 0 likes · 100 views
Dylan Patel @dylan522p
NVIDIA has now open sourced their trtllmgen MoE kernels! Great to see parts of NVIDIA moving towards open kernels! Open source kernels drive innovation. It is now time for SemiAnalysis to nag NVIDIA to open source their trtllmgen attention kernels too!
Alex Zhurkevich @cudagdb

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/…

9 replies · 28 retweets · 365 likes · 55.4K views
Matthew Nicely retweeted
GPU MODE @GPU_MODE
This Saturday at 10:00 AM PST: the last talk of the year before we resume on Jan 3. NVIDIA has made a profound change to its programming model with cuTile and TileIR. They've given some shorter talks online, but this will be the first deep dive: youtube.com/watch?v=sjkEUh…
1 reply · 14 retweets · 115 likes · 21.1K views
Matthew Nicely retweeted
alex zhang @a1zhang
btw today at 3pm PST (in ~4 hours) we're having Vicki Wang from NVIDIA give a @GPU_MODE talk on CuTe DSL, its features, and how to make the most of it. If you're currently competing in the NVFP4 Blackwell competition this will be very helpful, but it's open to anyone!
3 replies · 20 retweets · 201 likes · 19.1K views
Matthew Nicely retweeted
Tianqi Chen @tqchenml
CuteDSL 4.3.1 is here 🚀 Major host overhead optimization (10-40µs down to 2µs in hot loops), streamlined PyTorch interop (pass torch.Tensors directly, no more conversions needed), and export and use in more languages and envs. All powered by the Apache tvm-ffi ABI.
9 replies · 63 retweets · 334 likes · 54.1K views
Matthew Nicely retweeted
Tony Mongkolsmai @tonymongkolsmai
Today we are releasing our first public beta of Nsight Python! The goal is to simplify the life of a Python developer by providing a Pythonic way to analyze your kernel code! Check it out and provide feedback! Nsight Python docs: docs.nvidia.com/nsight-python/
10 replies · 48 retweets · 340 likes · 29.7K views
Matthew Nicely retweeted
NVIDIA AI Developer @NVIDIAAIDev
Ready, Set, Go! 🏎️ Create something amazing at our Blackwell NVFP4 Kernel Hackathon with @GPU_MODE. 🎊 🏆 Compete in a 4-part performance challenge to optimize low-level kernels on NVIDIA Blackwell hardware. 🥇 3 winners per challenge will receive top-tier NVIDIA hardware. The fastest of all earns the grand prize - a @Dell Pro Max with GB300. And some will earn a pass to #NVIDIAGTC in San Jose in 2026. Register now 👉 luma.com/9n27uem4
11 replies · 28 retweets · 160 likes · 36K views
Matthew Nicely retweeted
GPU MODE @GPU_MODE
1,000 registrations so far!
Quoting NVIDIA AI Developer @NVIDIAAIDev: the Blackwell NVFP4 Kernel Hackathon announcement (quoted in full above).

1 reply · 8 retweets · 167 likes · 20K views
Elliot Arledge @elliotarledge
timelapse #89 (12.5 hrs):
- got single gpu nvfp4 gemm @ 5.2 PFLOPS working reliably (sm100)
- solved ampere/hopper gemm kernel from scratch issues
- split kernel optimization chapter into:
  - gemv, softmax, layernorm, topK, gemm (fp32 only cuda cores)
  - gemm (tf32, fp16, bf16, fp8, fp4)
- cutting sugar made me feel great in the morning but killed me later in the day so went to bed super early
- more hyperengineering tomorrow (ordered 96 diet cokes)
27 replies · 23 retweets · 703 likes · 60.6K views
Anne Ouyang @anneouyang
i asked gpt to rewrite a matmul kernel in cute
50 replies · 89 retweets · 1.9K likes · 128.3K views