Matthew Nicely (@manicely6005) - Twitter Profile

Matthew Nicely retweeted

NVIDIA AI@NVIDIAAI·8 May

Perplexity runs on NVIDIA. Nice breakdown from the team on how they’re using the CUTLASS Python stack to optimize their models for inference 👇

Perplexity@perplexity_ai

We’ve developed our own inference engine, Runtime-Optimized Serving Engine (ROSE), to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.

Matthew Nicely@manicely6005·6 May

@SemiAnalysis_ Thanks for the shout out :) Folks can also check out the cuDNN GPU MODE presentation on FP8 & MXFP8 attention optimizations - youtube.com/watch?v=HcnybH…

SemiAnalysis@SemiAnalysis_·6 May

For the past 12 years, cuDNN has been completely closed source (besides the .h files), until this week! Over 20 MoE kernels & NSA sparse attention kernels from cuDNN have been open sourced! Great work to @manicely6005 & the rest of the team. Great to see that parts of NVIDIA are moving towards open kernels! Open source kernels drive innovation! (1/3) 🧵

Matthew Nicely@manicely6005·2 May

NVIDIA cuDNN engineers are diving deep into high-performance computing — sharing insights on peak performance with FP8 & MXFP8 attention. 📅 May 5th, 2026 | 🕛 12pm PST 🎥 Watch live on YouTube: lnkd.in/ey6YCZV3 #GPUMODE #cuDNN #AI #FP8 @cudagdb @marksaroufim @blelbach

Matthew Nicely@manicely6005·27 Apr

@maharshii Does it need to output C++ or would Python work as well?

maharshi@maharshii·27 Apr

@manicely6005 it would be nice to have a python api for defining a gemm op with custom EVTs and generating C++ / directly compiling from that (for quick prototyping, instead of writing everything by hand in cutedsl)

maharshi@maharshii·27 Apr

does nvidia cutlass have a python API for generating GEMM kernels (with collective epilogue) like we do in C++? they do have something called PyCutlass but that seems to be deprecated. note: i'm not talking about CuTeDSL where the user needs to write the whole kernel.

Matthew Nicely@manicely6005·4 Apr

@GPU_MODE More kernels will be rolling out soon.

GPU MODE@GPU_MODE·4 Apr

Incredible release

Alex Zhurkevich@cudagdb

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/…

Matthew Nicely@manicely6005·4 Apr

@dylan522p This has already been approved and several are in the process of being rolled out 🙂. We’ll do a better job of communicating soon.

Dylan Patel@dylan522p·4 Apr

NVIDIA has now open sourced their trtllmgen MoE kernels! Great to see that parts of NVIDIA are moving towards open kernels! Open source kernels drive innovation. It is now time for SemiAnalysis to nag NVIDIA to open source their trtllmgen attention kernels too!

Alex Zhurkevich@cudagdb

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/…

Matthew Nicely retweeted

GPU MODE@GPU_MODE·17 Dec

This Saturday 10:00 AM PST, the last talk of the year before we resume again on Jan 3. NVIDIA has made a profound change to its programming model with cuTile and TileIR. They've given some shorter talks online but this will be the first deep dive youtube.com/watch?v=sjkEUh…

Matthew Nicely retweeted

alex zhang@a1zhang·6 Dec

btw today at 3pm PST (in ~4 hours) we're having Vicki Wang from NVIDIA giving a @GPU_MODE talk on CuTe DSL, its features, and how to make the most of it. if you're currently competing in the NVFP4 Blackwell competition this will be very helpful, but it's open to anyone!

Matthew Nicely retweeted

Tianqi Chen@tqchenml·28 Nov

CuteDSL 4.3.1 is here 🚀 Major host overhead optimization (10-40µs down to 2µs in hot loops), streamlined PyTorch interop (pass torch.Tensors directly, no more conversions needed), and export and use in more languages and envs. All powered by apache tvm-ffi ABI

Matthew Nicely retweeted

Tony Mongkolsmai@tonymongkolsmai·21 Nov

Today we are releasing our first public beta of Nsight Python! The goal is to simplify the life of a Python developer by providing a Pythonic way to analyze your kernel code! Check it out, provide feedback! Nsight Python docs: docs.nvidia.com/nsight-python/

Matthew Nicely retweeted

NVIDIA AI Developer@NVIDIAAIDev·3 Nov

Ready, Set, Go! 🏎️ Create something amazing at our Blackwell NVFP4 Kernel Hackathon with @GPU_MODE. 🎊 🏆 Compete in a 4-part performance challenge to optimize low-level kernels on NVIDIA Blackwell hardware. 🥇 3 winners per challenge will receive top-tier NVIDIA hardware. The fastest of all earns the grand prize - a @Dell Pro Max with GB300. And some will earn a pass to #NVIDIAGTC in San Jose in 2026. Register now 👉 luma.com/9n27uem4

Matthew Nicely retweeted

GPU MODE@GPU_MODE·10 Nov

1,000 registrations so far!

NVIDIA AI Developer@NVIDIAAIDev

Ready, Set, Go! 🏎️ Create something amazing at our Blackwell NVFP4 Kernel Hackathon with @GPU_MODE. 🎊 🏆 Compete in a 4-part performance challenge to optimize low-level kernels on NVIDIA Blackwell hardware. 🥇 3 winners per challenge will receive top-tier NVIDIA hardware. The fastest of all earns the grand prize - a @Dell Pro Max with GB300. And some will earn a pass to #NVIDIAGTC in San Jose in 2026. Register now 👉 luma.com/9n27uem4

Matthew Nicely@manicely6005·9 Oct

@elliotarledge Just curious, what tool did you use to write the nvfp4 gemm?

Elliot Arledge@elliotarledge·3 Oct

timelapse #89 (12.5 hrs):
- got single gpu nvfp4 gemm @ 5.2 PFLOPS working reliably (sm100)
- solved ampere/hopper gemm kernel from scratch issues
- split kernel optimization chapter into:
  - gemv, softmax, layernorm, topK, gemm (fp32 only cuda cores)
  - gemm (tf32, fp16, bf16, fp8, fp4)
- cutting sugar made me feel great in the morning but killed me later in the day so went to bed super early
- more hyperengineering tomorrow (ordered 96 diet cokes)

Matthew Nicely@manicely6005·1 Oct

@anneouyang I wonder how it would do on the cutedsl??

Anne Ouyang@anneouyang·30 Sep

i asked gpt to rewrite a matmul kernel in cute

Matthew Nicely@manicely6005·25 Sep

Shoutout to @colfaxintl for formalizing the math behind CuTe layouts. Their research connects CUTLASS layout operations to solid category theory foundations. Valuable work for anyone building custom kernels. Blog: research.colfax-intl.com/categorical-fo… #CUTLASS #GPU #CUDA

Matthew Nicely retweeted

Vijay@__tensorcore__·21 Jul

CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs

Vijay@__tensorcore__

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl. 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python. docs.nvidia.com/cutlass/media/…

Matthew Nicely retweeted

Alex Zhurkevich@cudagdb·19 Jul

Speed-of-light kernels for Blackwell decode now in flashinfer. Give em a spin.

zhyncs@zhyncs42

PR: github.com/flashinfer-ai/…

Matthew Nicely retweeted

Gautam Shah@GautamShah·17 Jul

CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels | NVIDIA Technical Blog developer.nvidia.com/blog/cutlass-p…

Matthew Nicely retweeted

Gautam Shah@GautamShah·17 Jul

CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design | NVIDIA Technical Blog developer.nvidia.com/blog/cutlass-3…
