
Vishal
1.7K posts





Launching pyptx — a Python DSL for writing NVIDIA PTX kernels. One PTX instruction = one Python call. Write pure PTX in Python. Direct Hopper + Blackwell support: wgmma, TMA, tcgen05, mbarriers. JAX + PyTorch integration. Includes GEMM, grouped GEMM, RMSNorm, SwiGLU, and a PTX→Python transpiler pip install pyptx[torch] pip install pyptx[jax] github.com/patrick-toulme…







Me and @s_gowindone tried to derive Flash Attention from scratch, starting from vanilla attention and asking: “why is this so slow?” This blog walks through the full intuition step by step would love to hear your thoughts regarding this


Today we're announcing Core Automation Our objective: systems that optimize and automate work, starting with research itself.















