levi (@levidiamode) - Twitter Profili | Zamantika Mersobahis Locabet

Sabitlenmiş Tweet

levi@levidiamode·28 Mar

Day 83/365 of GPU Programming Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation. A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post. @jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA. Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

levi@levidiamode

Day 82/365 of GPU Programming Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic. I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations. Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd. Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

English

21

147

1.4K

113.9K

levi@levidiamode·1h

Day 132/365 of GPU Programming Continuing my Qwen inference experiments on my local GPU. A few more things I learned today while working on inference latency: - If you want to keep a certain param size, the drafter model has to come from somewhere (eg If your deployment has a fixed param budget, the drafter model has to fit inside it). Native multi token prediction heads that ship with the model (already counted in the param budget) seem to be a better baseline than adding on an external drafter when budgets are tight. If you do need an external drafter, pruning compensating params from the main model first to keep totals constant seems to be something that's used in practice. - Naive layer-pruning rankings from FP16 don't transfer to AWQ? Tried dropping low BlockInfluence score attention layers (which seems to work on raw FP16). The redundantlooking capacity in FP16 seems to be compressed away by AWQ's calibration, so the layer that looked droppable isn't droppable anymore. Lots more to try over the next few days!

levi@levidiamode

Day 131/365 of GPU Programming I've been spending time today working on inference for Qwen3.5 (24 GatedDeltaNet layers and 8 GatedAttention layers in a 3:1 pattern) with the goal of reducing latency on my local Nvidia machine without too much of a hit on benchmark quality. Some notes to self from optimizing inference for a hybrid mamba+attention model: - I'm learning that K/V head counts can differ inside the linear-attention block. For example, this model has 16 K heads but 32 V heads (GQA2 inside GDN). From what I can tell, a lot of kernels out there assume k_heads == v_heads, so requires modifications before they can be adopted on such a setting. - Also noticed moving AWQ from g32 to g128 can change quality benchmarks by quite a few percentage points. The g128 recipe is less aggressive but recoverable with the right calibration data. - Learning that calibration data itself is a decision point. Switching from raw web text to an instruction blended corpus seems to preserve instruction following accuracy better at the same bit width (idk, maybe that's obvious to others). A great resource on the Qwen 3.5 model family is @rasbt's amazing Qwen3.5 0.8B From Scratch. Really recommend going through the jupyter notebook to get a better feel for the model architecture.

English

1

0

20

484

levi@levidiamode·1d

@NVIDIAAI Link to the repo github.com/rasbt/LLMs-fro…

English

0

2

5

347

levi@levidiamode·1d

Day 131/365 of GPU Programming I've been spending time today working on inference for Qwen3.5 (24 GatedDeltaNet layers and 8 GatedAttention layers in a 3:1 pattern) with the goal of reducing latency on my local Nvidia machine without too much of a hit on benchmark quality. Some notes to self from optimizing inference for a hybrid mamba+attention model: - I'm learning that K/V head counts can differ inside the linear-attention block. For example, this model has 16 K heads but 32 V heads (GQA2 inside GDN). From what I can tell, a lot of kernels out there assume k_heads == v_heads, so requires modifications before they can be adopted on such a setting. - Also noticed moving AWQ from g32 to g128 can change quality benchmarks by quite a few percentage points. The g128 recipe is less aggressive but recoverable with the right calibration data. - Learning that calibration data itself is a decision point. Switching from raw web text to an instruction blended corpus seems to preserve instruction following accuracy better at the same bit width (idk, maybe that's obvious to others). A great resource on the Qwen 3.5 model family is @rasbt's amazing Qwen3.5 0.8B From Scratch. Really recommend going through the jupyter notebook to get a better feel for the model architecture.

levi@levidiamode

Day 130/365 of GPU Programming I really want to improve my understanding and intuition of research in the field (so both comprehension and practice), which I think starts with just reading more papers and getting a better feel for different topics that interest me. One particular area I've been interested in recently has been long context and attention (partially motivated by my studies of attention kernels), so spending some time reading and rereading papers I've bookmarked. Some papers I'm taking a look at today: - openreview.net/pdf?id=cFu7ze7… - arxiv.org/pdf/2505.20276 - aclanthology.org/2025.emnlp-mai… - aclanthology.org/2025.emnlp-mai… If you've come across any papers recently that you thought were really worth reading, please send them my way!

English

4

10

97

4.5K

levi@levidiamode·2d

@reprompting just whatever interests me in a given week tbh

English

1

0

2

77

light@reprompting·2d

@levidiamode actually curious, how do you decide what direction to focus on? are you following some kind of long-term timeline or roadmap with the CUDA learning?

English

1

0

1

90

levi@levidiamode·2d

Day 130/365 of GPU Programming I really want to improve my understanding and intuition of research in the field (so both comprehension and practice), which I think starts with just reading more papers and getting a better feel for different topics that interest me. One particular area I've been interested in recently has been long context and attention (partially motivated by my studies of attention kernels), so spending some time reading and rereading papers I've bookmarked. Some papers I'm taking a look at today: - openreview.net/pdf?id=cFu7ze7… - arxiv.org/pdf/2505.20276 - aclanthology.org/2025.emnlp-mai… - aclanthology.org/2025.emnlp-mai… If you've come across any papers recently that you thought were really worth reading, please send them my way!

levi@levidiamode

Day 129/365 of GPU Programming Lecture 6 of CS336 was a nice review of memory hierarchy in GPUs (bank conflicts on shared memory vs memory coalescing on HBM), occupancy calculations, PyTorch profiling, Triton, and PTX. Always helps going through material again and again, especially when there are topics like profiling or Triton that I'm still quite new to.

English

2

7

80

6.8K

levi@levidiamode·2d

Link to lecture: youtu.be/xnDHaNUvHBg?si…

YouTube

English

0

1

3

316

levi@levidiamode·2d

Day 129/365 of GPU Programming Lecture 6 of CS336 was a nice review of memory hierarchy in GPUs (bank conflicts on shared memory vs memory coalescing on HBM), occupancy calculations, PyTorch profiling, Triton, and PTX. Always helps going through material again and again, especially when there are topics like profiling or Triton that I'm still quite new to.

levi@levidiamode

Day 128/365 of GPU Programming Continuing Stanford's CS336 class (Language Modeling from Scratch). Onto lecture 5 on GPUs and TPUs today! Quite similar to last year's lecture which I watched during the AMD challenge but provided a bit more detail on the TPU this time around drawing parallels between Nvidia's GPU architecture and Google's TPUs, touches on prefill/decode disaggregation and also dives into MXFP8/MXFP4 (+ associated issues with transposing) And speaking of softmax, good reminder in the lecture on how lower precision depends on the operations you're optimizing for.

English

3

6

102

6.6K

levi retweetledi

Edward Z. Yang@ezyang·3d

A thread about the history and internal implementation details of activation checkpointing APIs in PyTorch. 🧵

English

5

30

241

18.4K

levi@levidiamode·3d

@deliprao Thank you Delip 🙏

English

0

1

26

Delip Rao e/σ@deliprao·3d

@levidiamode Glad it was useful. I am watching your GPU learning journey with interest. Keep at it!

English

1

0

1

97

levi retweetledi

Delip Rao e/σ@deliprao·4d

Why softmax? This is great question and I explain it in the following way in my deep learning course: While there are historical uses of this exponential form (Boltzmann, Gibbs, Jaynes, Luce & McFadden), its use in neural networks with backprop was first by Bridle*. He essentially closed how to solve the classification head question in neural networks. (Hinton’s Boltzmann machine paper did not use backprop and he didn’t refer to this function as softmax). We also have Bridle responsible for gifting us the term “softmax” (although in reality it is softargmax). After Bridle, softmax became defacto standard for classification heads, because Chris Bishop popularized it in his textbook drawing connections to GLMs. Now as to the question why softmax and not anything else: It’s not because there is a legacy lock in effect that we continue to use softmax. There are technical reasons: - softmax was *derived* (not arbitrarily picked) from information theory (maximum entropy principle) so it has well motivated theoretical foundations - derivatives of exp were easy to compute (this was especially important in the era before autodiff when gradient functions were hand computed) - it’s strictly positive everywhere, which means every class will receive a non-zero gradient. - it is C^∞ smooth making it gradient descent friendly so it continues getting used - translation invariant, clean logprob function - cross entropy loss along with softmax produces a simple gradient form of type (a-b). So no exponentials to compute and no exploding gradients - same with jacobians - all this made softmax sticky even before the hardware appeared to support it Overall, the community stumbled on a gem, quickly realized its value, and locked in. That’s why softmax is everywhere. *Bridle paper which many do not know about: link.springer.com/chapter/10.100…

levi@levidiamode

Day 125/365 of GPU Programming One thing I'm still struggling to understand is why softmax? What is it about the softmax function that made it survive/thrive for this long? What is it about exp() compared to another positive, monotonic, differentiable function that is so sticky? So studying softmax functions in a bit more depth today, taking a look at optimizations via SFUs on Nvidia GPUs and listening to the GOATs (Andrew Ng, Hinton, etc) explain the reasoning behind softmax as a primary choice. If anyone has good resources that dive into softmax and softmax alternatives, please share!

English

10

71

711

61.7K

levi@levidiamode·4d

Day 128/365 of GPU Programming Continuing Stanford's CS336 class (Language Modeling from Scratch). Onto lecture 5 on GPUs and TPUs today! Quite similar to last year's lecture which I watched during the AMD challenge but provided a bit more detail on the TPU this time around drawing parallels between Nvidia's GPU architecture and Google's TPUs, touches on prefill/decode disaggregation and also dives into MXFP8/MXFP4 (+ associated issues with transposing) And speaking of softmax, good reminder in the lecture on how lower precision depends on the operations you're optimizing for.

levi@levidiamode

Day 127/365 of GPU Programming Since I always wanted to tie this back to my understanding of GPUs, I've been trying to learn more about Special Function Units (SFUs) today but it's been more difficult than expected to find good public resources around the topic. One good read on the topic was this paper titled Design and Verification of an open-source SFU model for GPGPUs, which specifically looks at sin x, cos x, log2 x, 2 ^x and 1/√x. The Modal glossary provides is a nice definition but is quite light on the mechanics of the actual unit. If anyone has some resources that dive deeper into SFUs, please share!

English

2

7

90

7.4K

levi@levidiamode·4d

@yaroslavvb that makes a lot of sense, thanks Yaroslav!

English

0

317

Yaroslav Bulatov@yaroslavvb·4d

@levidiamode Many things are sticky for legacy reasons. Exp likely made its way in because of familiarity (boltzmann machines, gibbs distributions), and then we adjusted everything around it to make it work

English

2

1

16

1.6K

levi@levidiamode·9 May

Day 125/365 of GPU Programming One thing I'm still struggling to understand is why softmax? What is it about the softmax function that made it survive/thrive for this long? What is it about exp() compared to another positive, monotonic, differentiable function that is so sticky? So studying softmax functions in a bit more depth today, taking a look at optimizations via SFUs on Nvidia GPUs and listening to the GOATs (Andrew Ng, Hinton, etc) explain the reasoning behind softmax as a primary choice. If anyone has good resources that dive into softmax and softmax alternatives, please share!

levi@levidiamode

Day 124/365 of GPU Programming Since I've been studying state space models and hybrid attention architectures recently, I'm spending today using some of these models and seeing what kind of inference optimizations I can play around with. Mainly trying out Qwen 3.5 and looking at Qwen 3.5 with its gated attention layers. Also first time working with Hugging Face, which has been fun. Really nice how easy they make comparing models and downloading weights. Will probably spend the next few days looking into different ways to minimize inference latency, reading through existing solutions and trying out different things on my local machine with whatever models I can fit.

English

11

12

139

78.8K

levi@levidiamode·4d

@charles_irl ah nice, will have to give these two a read! thanks for sharing Charles 🙏

English

0

2

52

Charles 🎉 Frye@charles_irl·5d

@levidiamode which they eventually explained themselves together.ai/blog/flashatte…

English

1

6

126

levi@levidiamode·5d

Day 127/365 of GPU Programming Since I always wanted to tie this back to my understanding of GPUs, I've been trying to learn more about Special Function Units (SFUs) today but it's been more difficult than expected to find good public resources around the topic. One good read on the topic was this paper titled Design and Verification of an open-source SFU model for GPGPUs, which specifically looks at sin x, cos x, log2 x, 2 ^x and 1/√x. The Modal glossary provides is a nice definition but is quite light on the mechanics of the actual unit. If anyone has some resources that dive deeper into SFUs, please share!

levi@levidiamode

Day 126/365 of GPU Programming Continuing down the softmax path and learning some more fundamental optimization stuff today. Some good resources I've come across so far: - youtube.com/watch?v=MlivXh… (CMU Lecture on softmax) - youtube.com/watch?v=p-6wUO… (Why do Neural Networks love the Softmax?) - developer.nvidia.com/blog/making-so… (Softmax on Blackwell) - youtube.com/watch?v=ytbYRI… ( Softmax Function Explained with 3D Visuals) - youtube.com/watch?v=FYpwef… (Softmax in Pytorch) - youtube.com/watch?v=PHP8be… (Geoff Hinton on softmax) - youtube.com/watch?v=LLux1SW (Andrew Ng on softmax)

English

2

20

175

14.3K

levi@levidiamode·6d

Day 126/365 of GPU Programming Continuing down the softmax path and learning some more fundamental optimization stuff today. Some good resources I've come across so far: - youtube.com/watch?v=MlivXh… (CMU Lecture on softmax) - youtube.com/watch?v=p-6wUO… (Why do Neural Networks love the Softmax?) - developer.nvidia.com/blog/making-so… (Softmax on Blackwell) - youtube.com/watch?v=ytbYRI… ( Softmax Function Explained with 3D Visuals) - youtube.com/watch?v=FYpwef… (Softmax in Pytorch) - youtube.com/watch?v=PHP8be… (Geoff Hinton on softmax) - youtube.com/watch?v=LLux1SW (Andrew Ng on softmax)

YouTube

levi@levidiamode

Day 125/365 of GPU Programming One thing I'm still struggling to understand is why softmax? What is it about the softmax function that made it survive/thrive for this long? What is it about exp() compared to another positive, monotonic, differentiable function that is so sticky? So studying softmax functions in a bit more depth today, taking a look at optimizations via SFUs on Nvidia GPUs and listening to the GOATs (Andrew Ng, Hinton, etc) explain the reasoning behind softmax as a primary choice. If anyone has good resources that dive into softmax and softmax alternatives, please share!

English

4

16

199

18.9K

levi retweetledi

Mason Wang@masonwang025·8 May

incredible read and data efficiency results-- new optimizer derived from mechanistic analysis of Muon great interactive diagrams too :)

Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

English

1

2

55

5.4K

levi@levidiamode·8 May

@gournge this is super interesting, thanks for sharing!

English

0

1

37

Filip Morawiec@gournge·7 May

@levidiamode I saw someone said Mamba clicked for them after reading that Jane Street blog on positional encodings: x.com/fleetingbytes/…

fleetingbytes@fleetingbytes

i never quite understood mamba, while reading about a jane street post on positional embedding, it suddenly made sense - and, i think it is part of a general trend that nothing is explained well until it must be explained to a mass audience (in this case myself)

English

1

0

2

141

levi@levidiamode·7 May

Day 123/365 of GPU Programming Another day, another attempt at understanding SSMs from first principles. Gaining a real intuition for them (and their hardware implications) has been harder than expected, so I'm just taking a closer look at the foundational state space model papers (Hippo, H3, Mamba family, S4/5, et cetera) today to see if I can understand their genealogy and the rationale behind their evolutions better. If anyone has specific blog posts or code that helped them get a better sense of the problem space, I'd love to know!

levi@levidiamode

Day 122/365 of GPU Programming Continuing to learn about state space models (SSM), especially the mamba model family. I find them a bit more difficult to understand than Transformers, so trying to build up a clearer picture progressively from earlier related versions like S4 and understanding the motivations (in particular, hardware related) behind their existence. Also insane how one person (Tri Dao) can be behind so many of the interesting systems papers of the last few years. Really inspirational.

English

2

7

88

10.1K

levi@levidiamode·8 May

Day 124/365 of GPU Programming Since I've been studying state space models and hybrid attention architectures recently, I'm spending today using some of these models and seeing what kind of inference optimizations I can play around with. Mainly trying out Qwen 3.5 and looking at Qwen 3.5 with its gated attention layers. Also first time working with Hugging Face, which has been fun. Really nice how easy they make comparing models and downloading weights. Will probably spend the next few days looking into different ways to minimize inference latency, reading through existing solutions and trying out different things on my local machine with whatever models I can fit.

levi@levidiamode

Day 123/365 of GPU Programming Another day, another attempt at understanding SSMs from first principles. Gaining a real intuition for them (and their hardware implications) has been harder than expected, so I'm just taking a closer look at the foundational state space model papers (Hippo, H3, Mamba family, S4/5, et cetera) today to see if I can understand their genealogy and the rationale behind their evolutions better. If anyone has specific blog posts or code that helped them get a better sense of the problem space, I'd love to know!

English

2

12

105

13.9K

levi

Keşfet