
Lorenzo Garcia
51 posts





Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sparse Attention and, more recently, @NousResearch's Lighthouse Attention. BLASST by NVIDIA, from the paper Dynamic Blocked Attention Sparsity via Softmax Thresholding, sparsifies attention in a different way, using a rescale-factor threshold similar to the one in Flash Attention 4. We expect to see more interesting sparse attention techniques in the future. arxiv.org/abs/2512.12087 (2/4)
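To make the thresholding idea concrete, here is a minimal Python sketch for a single query row. It is not BLASST's kernel: the skip rule (drop a key block when its largest score is so far below the running max that every weight in it falls under `threshold`) is an assumption made for illustration, and `blocked_attention_row`, `block_size`, and `threshold` are hypothetical names.

```python
import math

def blocked_attention_row(q, keys, values, block_size=2, threshold=None):
    """Attention output for one query, processed block by block.

    If `threshold` is set, a key block is skipped when exp(block_max - m)
    is below it, where m is the running max (illustrative rule, an assumption).
    """
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf                      # running max of the scores seen so far
    l = 0.0                            # running softmax denominator
    acc = [0.0] * len(values[0])       # running weighted sum of values
    skipped = 0
    for start in range(0, len(keys), block_size):
        kb = keys[start:start + block_size]
        vb = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in kb]
        block_max = max(scores)
        # Every weight in this block is at most exp(block_max - m).
        if threshold is not None and block_max - m < math.log(threshold):
            skipped += 1
            continue
        new_m = max(m, block_max)
        alpha = math.exp(m - new_m)    # rescale factor; 0.0 on the first block
        l *= alpha
        acc = [a * alpha for a in acc]
        for s, v in zip(scores, vb):
            p = math.exp(s - new_m)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = new_m
    return [a / l for a in acc], skipped
```

With `threshold=None` this is plain online-softmax attention. With a small threshold, blocks whose scores sit far below the current max are dropped, and the output changes only by the tiny weight those blocks would have carried.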






The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now runs about as fast as matmul, even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
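Two of these ideas are easy to sketch in plain Python. The code below is an illustration under stated assumptions, not the FA4 kernel: `poly_exp2` uses untuned degree-5 Taylor coefficients (the real kernel's coefficients and bit tricks are not reproduced here), and `lazy_online_softmax` assumes rescaling is deferred until the max jumps by more than a hypothetical margin `tau`. All names are made up for the example.

```python
import math

LN2 = math.log(2.0)
LOG2E = 1.0 / LN2
# Taylor coefficients of 2**f = exp(f * ln 2), degree 5 (illustrative, not tuned).
COEFFS = [LN2 ** k / math.factorial(k) for k in range(6)]

def poly_exp2(x):
    """Approximate 2**x: split x = n + f with |f| <= 0.5, then a polynomial in f."""
    n = round(x)
    f = x - n
    p = 0.0
    for c in reversed(COEFFS):         # Horner's rule
        p = p * f + c
    return math.ldexp(p, n)            # p * 2**n by adjusting the exponent

def poly_exp(x):
    """exp(x) computed as 2**(x * log2(e)) with the polynomial above."""
    return poly_exp2(x * LOG2E)

def lazy_online_softmax(scores, tau=8.0):
    """Streaming softmax that rescales only when the max grows by more than tau.

    Numerator and denominator share the same (possibly stale) reference
    max, so the final ratio is still a valid softmax.
    """
    m_ref = -math.inf                  # reference max used inside the exponentials
    weights = []                       # unnormalized weights relative to m_ref
    rescales = 0
    for s in scores:
        if s > m_ref + tau:
            if weights:                # nothing to rescale on the first score
                factor = poly_exp(m_ref - s)
                weights = [w * factor for w in weights]
                rescales += 1
            m_ref = s
        weights.append(poly_exp(s - m_ref))
    total = sum(weights)
    return [w / total for w in weights], rescales
```

Setting `tau=0.0` recovers the usual behavior of rescaling on every new max. A larger `tau` trades a bounded amount of headroom in the exponentials (weights can grow up to about `exp(tau)`) for fewer rescaling passes.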


Sonnet 4.6 beating Opus 4.6 on AIRD Kernels Hard (kernel optimization)





🚀 AI optimizes tensor kernels to run 17x faster than human expert designs! [ADRS Blog] Programming hardware accelerators is notoriously hard. We describe Autocomp, the first LLM-driven optimizer for tensor accelerators, which outperforms hand-tuned expert kernels on AWS Trainium by up to 17x! ✍️ Read the blog: adrs-ucb.notion.site/autocomp 📖 ADRS Blog Series: ucbskyadrs.github.io 📃 Autocomp Paper: arxiv.org/pdf/2505.18574 👩‍💻 Code: github.com/ucb-bar/autoco…
















