bikram_

1.3K posts

@nearlypi

🧠 Turning thoughts into threads ⚙️

Joined August 2021
5.2K Following · 130 Followers
Pinned Tweet
bikram_ @nearlypi
This repository is a public log of my learning, experiments, and projects as I dive deep into:
- GPU architecture
- CUDA programming
- Memory hierarchies
- Parallelism
- Acceleration for deep learning and scientific computing
github.com/bikrammajhi/10…
1 reply · 0 reposts · 2 likes · 209 views
bikram_ retweeted
SemiAnalysis @SemiAnalysis_
Dissecting Nvidia Blackwell - Tensor Cores, PTX Instructions, SASS, Floorsweep, Yield Microbenchmarking, tcgen05, 2SM MMA, UMMA, TMA, LDGSTS, UBLKCP, Speed of Light, Distributed Shared Memory, GPC Floorsweeps, SM Yield newsletter.semianalysis.com/p/dissecting-n…
4 replies · 32 reposts · 192 likes · 31.5K views
Jędrzej Maczan @jedmaczan
I built a tiny-vllm in C++ and CUDA:
- paged attention
- continuous batching
- educational
- 100% human-written™
And now I'm writing a course where you will build your own vLLM yourself. Still work in progress; I'll finish by the end of April. All for free ofc, just a GitHub repo.
15 replies · 30 reposts · 593 likes · 17.9K views
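The two mechanisms named above can be sketched in a few lines. This is a toy model of paged-attention bookkeeping only, with names of my own invention (nothing here is taken from the tiny-vllm repo): a block table maps each sequence's logical token positions to fixed-size physical blocks, so KV memory is claimed on demand instead of reserved at maximum sequence length.

```python
BLOCK_SIZE = 4  # tokens per physical block (production systems use e.g. 16)

class PagedKVCache:
    """Toy block-table bookkeeping behind paged attention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.tables = {}  # seq_id -> list of physical block ids (block table)
        self.lens = {}    # seq_id -> tokens stored so far
        self.slots = {}   # (block_id, slot) -> cached (key, value) entry

    def append(self, seq_id, kv):
        n = self.lens.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:          # current block full: claim a new one
            table.append(self.free.pop())
        self.slots[(table[n // BLOCK_SIZE], n % BLOCK_SIZE)] = kv
        self.lens[seq_id] = n + 1

    def gather(self, seq_id):
        # Reassemble the sequence's KV entries in logical order.
        table = self.tables[seq_id]
        return [self.slots[(table[i // BLOCK_SIZE], i % BLOCK_SIZE)]
                for i in range(self.lens[seq_id])]
```

Continuous batching then amounts to interleaving `append` calls from whichever sequences are active in a given decoding step.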
bikram_ retweeted
saksham @sakshambatraa
For my next adventure, @michael_trbo and I will be working together to build a tinyLPU! For our first checkpoint, we reinvented the MXM: the language processing unit's matrix multiplication engine. Here's how we did it.
9 replies · 18 reposts · 93 likes · 4.3K views
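The thread doesn't include the tinyLPU design itself, but the dataflow a matrix-multiplication engine implements can be modeled functionally: split the output into tiles and accumulate partial products tile by tile, output-stationary. A minimal sketch (tile size and names are illustrative, not the actual MXM):

```python
TILE = 2  # hardware tiles are larger, e.g. 16x16

def matmul_tiled(A, B):
    """C = A @ B computed tile-by-tile, accumulating into each output tile."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):            # output tile rows
        for j0 in range(0, m, TILE):        # output tile cols
            for p0 in range(0, k, TILE):    # reduction dimension, in tiles
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for p in range(p0, min(p0 + TILE, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

The point of the tiling is that each (i0, j0) output tile stays resident in local storage while partial products stream through it, which is the behavior a hardware MXM bakes into its datapath.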
bikram_ retweeted
Daniel Vega-Myhre @vega_myhre
New blog post: "MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX": danielvegamyhre.github.io/2026/03/29/mxf… As someone who works on MXFP8 training, I was interested in deeply understanding GEMM design for this numerical format. In this post, we write an MXFP8 GEMM with CUDA + PTX and iteratively optimize it to reach cuBLAS-like performance (for some shapes!). It includes technical deep dives into all the weird constraints and design challenges introduced by MXFP8. My brain is absolutely fried on CUDA+PTX now, so time to move on to other things (CuTeDSL?); in the meantime, time for me to go touch some grass.
10 replies · 54 reposts · 393 likes · 20.7K views
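The core constraint MXFP8 adds to a GEMM is block scaling: each group of 32 elements shares one power-of-two scale, with the elements themselves stored in FP8 (E4M3, max value 448). A rough numerical emulation in Python; the FP8 rounding here is a crude 3-mantissa-bit approximation (no subnormals, not bit-exact), just enough to show the scheme:

```python
import math

FP8_E4M3_MAX = 448.0
BLOCK = 32  # elements sharing one power-of-two scale

def to_fp8_e4m3(x):
    """Crude E4M3 emulation: round mantissa to 3 bits, clamp to max."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    x = abs(x)
    e = math.floor(math.log2(x))
    m = round(x / 2.0 ** e * 8) / 8          # 3 mantissa bits
    return s * min(m * 2.0 ** e, FP8_E4M3_MAX)

def mx_quantize(block):
    """Return (scale, fp8_elems) for one block of up to 32 values."""
    amax = max(abs(v) for v in block)
    # shared exponent chosen so amax lands near the top of E4M3's range
    shared_exp = (math.floor(math.log2(amax)) - 8) if amax else 0
    scale = 2.0 ** shared_exp                # power-of-two, E8M0-style
    return scale, [to_fp8_e4m3(v / scale) for v in block]

def mx_dequantize(scale, elems):
    return [scale * e for e in elems]
```

The GEMM-design headache the post digs into follows from this layout: the scales live in separate tensors with their own tile shapes, so every accumulation step has to rescale partial products per 32-element block.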
bikram_ @nearlypi
PyTorch → CUDA: “bro wrote 500 lines just to rediscover matmul”
CUDA → PyTorch: “you don’t optimize, you just pray torch.compile does something”
0 replies · 0 reposts · 0 likes · 19 views
bikram_ retweeted
Maarten Grootendorst @MaartenGr
Finally, "A Visual Guide to Mixture of Experts" is the last in the reading list. It can be applied to both Transformers and Mamba architectures, so it's best to leave this to last. You can view this as an extension of Chapter 3 and the Mamba guide. newsletter.maartengrootendorst.com/p/a-visual-gui…
1 reply · 5 reposts · 20 likes · 2.8K views
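The central mechanism in a Mixture of Experts layer is the router: a small linear layer scores the experts, only the top-k are executed, and their outputs are mixed by the renormalized gate probabilities. A minimal sketch, assuming a linear router and k=2 (typical choices, not anything specific to the guide):

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_layer(x, experts, router_w, k=2):
    """Route x to the top-k experts and mix their outputs by gate weight."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_w]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)     # renormalize over the chosen k
    out = [0.0] * len(x)
    for i in chosen:                         # only k experts actually run
        y = experts[i](x)
        out = [o + (probs[i] / norm) * y_d for o, y_d in zip(out, y)]
    return out, chosen
```

This is why MoE scales parameters without scaling per-token compute: the layer holds all the experts but each token pays for only k of them.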
bikram_ retweeted
Maarten Grootendorst @MaartenGr
Happy to introduce my video on this alternative LLM architecture, Mamba and State Space Models! I wanted to do it for a while now and finally found the time to work on animating my visual guide. Expect many, many, many visuals! Link in comment 👇
5 replies · 95 reposts · 618 likes · 34.1K views
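The recurrence at the heart of state space models is compact enough to write out: a continuous 1-D system h' = a·h + b·u, y = c·h is discretized (here with zero-order hold) and scanned over the sequence. A scalar sketch, nothing Mamba-specific:

```python
import math

def ssm_scan(u, a, b, c, dt):
    """Discretize h' = a*h + b*u (zero-order hold) and run the recurrence."""
    a_bar = math.exp(dt * a)                 # discrete state transition
    b_bar = (a_bar - 1.0) / a * b            # discrete input coefficient
    h, ys = 0.0, []
    for u_t in u:
        h = a_bar * h + b_bar * u_t          # linear recurrence over time
        ys.append(c * h)
    return ys
```

Mamba's selective twist, as the video covers, is making parts of this system input-dependent, which gives up the pure-convolution view but keeps the recurrence cheap to scan.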
bikram_ retweeted
Haocheng Xi @HaochengXiUCB
Really exciting to see KV-cache compression getting attention. A similar bottleneck shows up beyond LLMs: for world models and autoregressive long-video generation, KV cache can quickly dominate memory and limit long-horizon consistency. Our recent work, Quant VideoGen, explores training-free 2-bit KV-cache quantization for video diffusion models, achieving up to 7.0× KV memory reduction with <4% latency overhead. Link: arxiv.org/abs/2602.02958
Quoting Google Research @GoogleResearch:
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
12 replies · 68 reposts · 486 likes · 52.1K views
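The arithmetic behind 2-bit KV-cache quantization is easy to show in miniature: each small group of values keeps a float zero-point and scale, and every value is stored as a 2-bit code (0 to 3). This sketch is generic round-to-nearest uniform quantization, not the Quant VideoGen or TurboQuant algorithm, and the group size is illustrative:

```python
def quantize_2bit(values, group_size=4):
    """Per-group asymmetric quantization to 2-bit codes (levels 0..3)."""
    groups = []
    for g in range(0, len(values), group_size):
        chunk = values[g:g + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 3 or 1.0         # 3 = number of code steps
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((lo, scale, codes))
    return groups

def dequantize_2bit(groups):
    return [lo + scale * code for lo, scale, codes in groups for code in codes]
```

Memory drops from 16 bits per value to 2 bits plus two floats per group; the price is reconstruction error of up to half a quantization step, which is why papers in this space report both memory reduction and accuracy deltas.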
Krish Shah @KrishRShah
You can turn your laptop into a trumpet
226 replies · 182 reposts · 3.5K likes · 868K views
bikram_ retweeted
Jianzhu Yao @alexbert135
Open-sourced IKP: Intra-Kernel Profiler for CUDA kernels. Most GPU profilers tell you what happened at the kernel level. IKP shows what happened inside the kernel, for developers, and for agents. Repo: github.com/yao-jz/intra-k… #GPU #Profiling #CUDA
8 replies · 40 reposts · 307 likes · 21.8K views
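The idea of intra-kernel profiling, timestamping named regions inside the kernel rather than timing the whole launch, can be mimicked on the host. A CUDA kernel would read a device clock (e.g. `clock64()`) at region boundaries and write deltas to a buffer; this Python stand-in uses `perf_counter_ns`, and the class and names are mine, not IKP's API:

```python
import time
from collections import defaultdict

class RegionProfiler:
    """Accumulate wall time per named region across iterations."""

    def __init__(self):
        self.totals = defaultdict(int)   # region name -> total nanoseconds
        self._open = {}                  # region name -> start timestamp

    def start(self, name):
        self._open[name] = time.perf_counter_ns()

    def stop(self, name):
        self.totals[name] += time.perf_counter_ns() - self._open.pop(name)

prof = RegionProfiler()
acc = 0.0
for _ in range(100):                      # stand-in for kernel iterations
    prof.start("load")
    data = [float(i) for i in range(50)]  # "global load" phase
    prof.stop("load")
    prof.start("compute")
    acc += sum(d * d for d in data)       # "compute" phase
    prof.stop("compute")
```

The payoff is the per-region breakdown: a whole-kernel timer cannot tell you whether the loads or the math dominate, while region totals can.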
bikram_ retweeted
Nash Brown @nash_c_brown
Excited to share new ThunderKittens attention kernels that match or outperform Flash Attention 4 on Blackwell GPUs! Currently only supports QK192/V128 shapes, but more coming soon. Check out the code here: github.com/HazyResearch/T… Shoutout to the FA4 team for the algorithmic innovations and to @stuart_sul for the helpful discussions.
19 replies · 38 reposts · 313 likes · 32.5K views
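QK192/V128 means the query/key head dimension (192) differs from the value dimension (128), which plain scaled-dot-product attention handles naturally since the two dimensions never interact. A reference (unfused) version for orientation, nothing like the actual tiled Blackwell kernel:

```python
import math

def attention(Q, K, V):
    """Single-head softmax(Q K^T / sqrt(d_qk)) V; len(Q[0]) may differ from len(V[0])."""
    d_qk = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_qk)
                  for k in K]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * V[j][d] for j in range(len(V))) / z
                    for d in range(len(V[0]))])
    return out
```

Fused kernels like these compute the same function without ever materializing the full score matrix, tiling Q, K, and V through shared memory and carrying the softmax normalizer along.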
bikram_ retweeted
chuyi shang @chuyishang
Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!
9 replies · 66 reposts · 604 likes · 31.6K views
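The distributed-training half of such a post comes down to one pattern: shard the batch across devices, take gradients per shard, then all-reduce (average) them. In JAX that is expressed with `jax.pmap` or `shard_map` plus a `psum`; here is a pure-Python miniature with a one-parameter linear model (all names are mine):

```python
def grad_loss(w, xs, ys):
    """d/dw of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, xs, ys, n_devices=2, lr=0.1):
    shard = len(xs) // n_devices
    grads = [grad_loss(w, xs[i * shard:(i + 1) * shard],
                          ys[i * shard:(i + 1) * shard])
             for i in range(n_devices)]      # one gradient per "device"
    g = sum(grads) / n_devices               # all-reduce: mean of shard grads
    return w - lr * g
```

With equal shard sizes the averaged shard gradients equal the full-batch gradient, so the data-parallel step matches the single-device step exactly; that equivalence is what makes data parallelism the easiest form of distribution to reason about.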