bikram_

1.3K posts

@nearlypi

🧠 Turning thoughts into threads ⚙️

Joined August 2021
5.2K Following · 130 Followers
Pinned Tweet
bikram_ @nearlypi
This repository is a public log of my learning, experiments, and projects as I dive deep into:
- GPU architecture
- CUDA programming
- Memory hierarchies
- Parallelism
- Acceleration for deep learning and scientific computing
github.com/bikrammajhi/10…
1 reply · 0 reposts · 2 likes · 209 views
bikram_ retweeted
SemiAnalysis @SemiAnalysis_
Dissecting Nvidia Blackwell - Tensor Cores, PTX Instructions, SASS, Floorsweep, Yield Microbenchmarking, tcgen05, 2SM MMA, UMMA, TMA, LDGSTS, UBLKCP, Speed of Light, Distributed Shared Memory, GPC Floorsweeps, SM Yield newsletter.semianalysis.com/p/dissecting-n…
4 replies · 32 reposts · 192 likes · 31.5K views
Jędrzej Maczan @jedmaczan
I built a tiny-vllm in C++ and CUDA:
- paged attention
- continuous batching
- educational
- 100% human-written™
And now I'm writing a course where you will build your own vLLM yourself. Still work in progress; I'll finish by the end of April. All for free ofc, just a GitHub repo.
15 replies · 30 reposts · 593 likes · 17.9K views
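The two mechanisms named above can be sketched in a few lines. This is a toy model of paged-attention bookkeeping only, with names of my own invention (nothing here is taken from the tiny-vllm repo): a block table maps each sequence's logical token positions to fixed-size physical blocks, so KV memory is claimed on demand instead of reserved at maximum sequence length.

```python
BLOCK_SIZE = 4  # tokens per physical block (production systems use e.g. 16)

class PagedKVCache:
    """Toy block-table bookkeeping behind paged attention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.tables = {}  # seq_id -> list of physical block ids (block table)
        self.lens = {}    # seq_id -> tokens stored so far
        self.slots = {}   # (block_id, slot) -> cached (key, value) entry

    def append(self, seq_id, kv):
        n = self.lens.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:          # current block full: claim a new one
            table.append(self.free.pop())
        self.slots[(table[n // BLOCK_SIZE], n % BLOCK_SIZE)] = kv
        self.lens[seq_id] = n + 1

    def gather(self, seq_id):
        # Reassemble the sequence's KV entries in logical order.
        table = self.tables[seq_id]
        return [self.slots[(table[i // BLOCK_SIZE], i % BLOCK_SIZE)]
                for i in range(self.lens[seq_id])]
```

Continuous batching then amounts to interleaving `append` calls from whichever sequences are active in a given decoding step.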
bikram_ retweeted
saksham @sakshambatraa
For my next adventure, @michael_trbo and I will be working together to build a tinyLPU! For our first checkpoint, we reinvented the MXM: the language processing unit's matrix multiplication engine. Here's how we did it.
9 replies · 18 reposts · 93 likes · 4.3K views
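The thread doesn't include the tinyLPU design itself, but the dataflow a matrix-multiplication engine implements can be modeled functionally: split the output into tiles and accumulate partial products tile by tile, output-stationary. A minimal sketch (tile size and names are illustrative, not the actual MXM):

```python
TILE = 2  # hardware tiles are larger, e.g. 16x16

def matmul_tiled(A, B):
    """C = A @ B computed tile-by-tile, accumulating into each output tile."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):            # output tile rows
        for j0 in range(0, m, TILE):        # output tile cols
            for p0 in range(0, k, TILE):    # reduction dimension, in tiles
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for p in range(p0, min(p0 + TILE, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

The point of the tiling is that each (i0, j0) output tile stays resident in local storage while partial products stream through it, which is the behavior a hardware MXM bakes into its datapath.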
bikram_ retweeted
Daniel Vega-Myhre @vega_myhre
New blog post: "MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX": danielvegamyhre.github.io/2026/03/29/mxf… As someone who works on MXFP8 training, I was interested in deeply understanding GEMM design for this numerical format. In this post, we write an MXFP8 GEMM with CUDA + PTX and iteratively optimize it to reach cuBLAS-like performance (for some shapes!). It includes technical deep dives into all the weird constraints and design challenges introduced by MXFP8. My brain is absolutely fried on CUDA+PTX now, so time to move on to other things (CuTeDSL?); in the meantime, time for me to go touch some grass.
10 replies · 54 reposts · 393 likes · 20.7K views
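The core constraint MXFP8 adds to a GEMM is block scaling: each group of 32 elements shares one power-of-two scale, with the elements themselves stored in FP8 (E4M3, max value 448). A rough numerical emulation in Python; the FP8 rounding here is a crude 3-mantissa-bit approximation (no subnormals, not bit-exact), just enough to show the scheme:

```python
import math

FP8_E4M3_MAX = 448.0
BLOCK = 32  # elements sharing one power-of-two scale

def to_fp8_e4m3(x):
    """Crude E4M3 emulation: round mantissa to 3 bits, clamp to max."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    x = abs(x)
    e = math.floor(math.log2(x))
    m = round(x / 2.0 ** e * 8) / 8          # 3 mantissa bits
    return s * min(m * 2.0 ** e, FP8_E4M3_MAX)

def mx_quantize(block):
    """Return (scale, fp8_elems) for one block of up to 32 values."""
    amax = max(abs(v) for v in block)
    # shared exponent chosen so amax lands near the top of E4M3's range
    shared_exp = (math.floor(math.log2(amax)) - 8) if amax else 0
    scale = 2.0 ** shared_exp                # power-of-two, E8M0-style
    return scale, [to_fp8_e4m3(v / scale) for v in block]

def mx_dequantize(scale, elems):
    return [scale * e for e in elems]
```

The GEMM-design headache the post digs into follows from this layout: the scales live in separate tensors with their own tile shapes, so every accumulation step has to rescale partial products per 32-element block.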
bikram_ @nearlypi
PyTorch → CUDA: “bro wrote 500 lines just to rediscover matmul”
CUDA → PyTorch: “you don’t optimize, you just pray torch.compile does something”
0 replies · 0 reposts · 0 likes · 19 views
bikram_ retweeted
Maarten Grootendorst @MaartenGr
Finally, "A Visual Guide to Mixture of Experts" is the last in the reading list. It can be applied to both Transformers and Mamba architectures, so it's best to leave this to last. You can view this as an extension of Chapter 3 and the Mamba guide. newsletter.maartengrootendorst.com/p/a-visual-gui…
1 reply · 5 reposts · 20 likes · 2.8K views
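The central mechanism in a Mixture of Experts layer is the router: a small linear layer scores the experts, only the top-k are executed, and their outputs are mixed by the renormalized gate probabilities. A minimal sketch, assuming a linear router and k=2 (typical choices, not anything specific to the guide):

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_layer(x, experts, router_w, k=2):
    """Route x to the top-k experts and mix their outputs by gate weight."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_w]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)     # renormalize over the chosen k
    out = [0.0] * len(x)
    for i in chosen:                         # only k experts actually run
        y = experts[i](x)
        out = [o + (probs[i] / norm) * y_d for o, y_d in zip(out, y)]
    return out, chosen
```

This is why MoE scales parameters without scaling per-token compute: the layer holds all the experts but each token pays for only k of them.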
bikram_ retweeted
Maarten Grootendorst @MaartenGr
Happy to introduce my video on this alternative LLM architecture, Mamba and State Space Models! I wanted to do it for a while now and finally found the time to work on animating my visual guide. Expect many, many, many visuals! Link in comment 👇
5 replies · 95 reposts · 618 likes · 34.1K views
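The recurrence at the heart of state space models is compact enough to write out: a continuous 1-D system h' = a·h + b·u, y = c·h is discretized (here with zero-order hold) and scanned over the sequence. A scalar sketch, nothing Mamba-specific:

```python
import math

def ssm_scan(u, a, b, c, dt):
    """Discretize h' = a*h + b*u (zero-order hold) and run the recurrence."""
    a_bar = math.exp(dt * a)                 # discrete state transition
    b_bar = (a_bar - 1.0) / a * b            # discrete input coefficient
    h, ys = 0.0, []
    for u_t in u:
        h = a_bar * h + b_bar * u_t          # linear recurrence over time
        ys.append(c * h)
    return ys
```

Mamba's selective twist, as the video covers, is making parts of this system input-dependent, which gives up the pure-convolution view but keeps the recurrence cheap to scan.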
bikram_ retweeted
Haocheng Xi @HaochengXiUCB
Really exciting to see KV-cache compression getting attention. A similar bottleneck shows up beyond LLMs: for world models and autoregressive long-video generation, KV cache can quickly dominate memory and limit long-horizon consistency. Our recent work, Quant VideoGen, explores training-free 2-bit KV-cache quantization for video diffusion models, achieving up to 7.0× KV memory reduction with <4% latency overhead. Link: arxiv.org/abs/2602.02958
Quoting Google Research @GoogleResearch:
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
12 replies · 68 reposts · 486 likes · 52.1K views
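The arithmetic behind 2-bit KV-cache quantization is easy to show in miniature: each small group of values keeps a float zero-point and scale, and every value is stored as a 2-bit code (0 to 3). This sketch is generic round-to-nearest uniform quantization, not the Quant VideoGen or TurboQuant algorithm, and the group size is illustrative:

```python
def quantize_2bit(values, group_size=4):
    """Per-group asymmetric quantization to 2-bit codes (levels 0..3)."""
    groups = []
    for g in range(0, len(values), group_size):
        chunk = values[g:g + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 3 or 1.0         # 3 = number of code steps
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((lo, scale, codes))
    return groups

def dequantize_2bit(groups):
    return [lo + scale * code for lo, scale, codes in groups for code in codes]
```

Memory drops from 16 bits per value to 2 bits plus two floats per group; the price is reconstruction error of up to half a quantization step, which is why papers in this space report both memory reduction and accuracy deltas.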
Krish Shah @KrishRShah
You can turn your laptop into a trumpet
226 replies · 182 reposts · 3.5K likes · 868K views
bikram_ retweeted
Jianzhu Yao @alexbert135
Open-sourced IKP: Intra-Kernel Profiler for CUDA kernels. Most GPU profilers tell you what happened at the kernel level. IKP shows what happened inside the kernel, for developers, and for agents. Repo: github.com/yao-jz/intra-k… #GPU #Profiling #CUDA
8 replies · 40 reposts · 307 likes · 21.8K views
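The idea of intra-kernel profiling, timestamping named regions inside the kernel rather than timing the whole launch, can be mimicked on the host. A CUDA kernel would read a device clock (e.g. `clock64()`) at region boundaries and write deltas to a buffer; this Python stand-in uses `perf_counter_ns`, and the class and names are mine, not IKP's API:

```python
import time
from collections import defaultdict

class RegionProfiler:
    """Accumulate wall time per named region across iterations."""

    def __init__(self):
        self.totals = defaultdict(int)   # region name -> total nanoseconds
        self._open = {}                  # region name -> start timestamp

    def start(self, name):
        self._open[name] = time.perf_counter_ns()

    def stop(self, name):
        self.totals[name] += time.perf_counter_ns() - self._open.pop(name)

prof = RegionProfiler()
acc = 0.0
for _ in range(100):                      # stand-in for kernel iterations
    prof.start("load")
    data = [float(i) for i in range(50)]  # "global load" phase
    prof.stop("load")
    prof.start("compute")
    acc += sum(d * d for d in data)       # "compute" phase
    prof.stop("compute")
```

The payoff is the per-region breakdown: a whole-kernel timer cannot tell you whether the loads or the math dominate, while region totals can.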
bikram_ retweeted
Nash Brown @nash_c_brown
Excited to share new ThunderKittens attention kernels that match or outperform Flash Attention 4 on Blackwell GPUs! Currently only supports QK192/V128 shapes, but more coming soon. Check out the code here: github.com/HazyResearch/T… Shoutout to the FA4 team for the algorithmic innovations and to @stuart_sul for the helpful discussions.
19 replies · 38 reposts · 313 likes · 32.5K views
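QK192/V128 means the query/key head dimension (192) differs from the value dimension (128), which plain scaled-dot-product attention handles naturally since the two dimensions never interact. A reference (unfused) version for orientation, nothing like the actual tiled Blackwell kernel:

```python
import math

def attention(Q, K, V):
    """Single-head softmax(Q K^T / sqrt(d_qk)) V; len(Q[0]) may differ from len(V[0])."""
    d_qk = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_qk)
                  for k in K]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * V[j][d] for j in range(len(V))) / z
                    for d in range(len(V[0]))])
    return out
```

Fused kernels like these compute the same function without ever materializing the full score matrix, tiling Q, K, and V through shared memory and carrying the softmax normalizer along.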
bikram_ retweeted
chuyi shang @chuyishang
Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!
9 replies · 66 reposts · 604 likes · 31.6K views
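The distributed-training half of such a post comes down to one pattern: shard the batch across devices, take gradients per shard, then all-reduce (average) them. In JAX that is expressed with `jax.pmap` or `shard_map` plus a `psum`; here is a pure-Python miniature with a one-parameter linear model (all names are mine):

```python
def grad_loss(w, xs, ys):
    """d/dw of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, xs, ys, n_devices=2, lr=0.1):
    shard = len(xs) // n_devices
    grads = [grad_loss(w, xs[i * shard:(i + 1) * shard],
                          ys[i * shard:(i + 1) * shard])
             for i in range(n_devices)]      # one gradient per "device"
    g = sum(grads) / n_devices               # all-reduce: mean of shard grads
    return w - lr * g
```

With equal shard sizes the averaged shard gradients equal the full-batch gradient, so the data-parallel step matches the single-device step exactly; that equivalence is what makes data parallelism the easiest form of distribution to reason about.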