rishi

I liked this paper from late last year, which proposes a more compute-efficient and potentially more performant variant of multi-head latent attention, which they call compressed convolutional attention (CCA). A few inspiring takeaways:

- What they did: MLA compresses QKV into a low-rank latent space, but mainly for KV caching; the latents are projected back up to the full dimension before attention is computed. The authors instead do not only the caching but also the computation in the latent space. Concretely, after the low-rank projection, the whole softmax attention runs there, and only the output is projected back to the hidden dim (p1). Intuitively this shouldn't work well, because the low-rank projection loses a lot of information. So the authors apply a token-wise convolution and then a head-dim-wise convolution to Q and K; ablations show these two convolutions contribute a lot to the improved loss and evaluation results. As in my pseudocode (p2), they make a bunch of other interesting choices, such as concatenating the two most recent Vs and adding a "QK mean" to both Q and K to mix them together. This can be made even more efficient by adding grouped-query attention on top, i.e. expanding KV by num_groups so that even fewer parameters are used.

- How it performs: the authors run detailed evaluations against MHA, MLA, and GQA, matching the total parameter count, and for MLA and GQA also matching the KV-cache compression rate. They show that the GQA-enhanced CCA performs on par with lossless MHA on several benchmarks (HellaSwag, Winogrande, the usual stuff) as well as on the final training loss. They note this is especially handy for MoE models, because a smaller attention module leaves room for a bigger MoE module, and in particular for bigger individual experts.

Overall, this is very cool work, and it belongs to the line of research that lets attention stay quadratic but keeps its activation size in check.
It'll be interesting to calculate a FLOP breakdown between this (especially the GQA-enhanced one) and the new Deepseek Sparse Attention, since DSA is by design lossless and potentially also very efficient.
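To make the mechanism concrete, here is a minimal single-head numpy sketch of the core idea as I read it: project into a rank-r latent space, run a token-wise (causal, over the sequence) and a channel-wise (over the latent dim) convolution on Q and K, and do the whole softmax attention at width r before projecting back up. All function names and weights here are mine (random and untrained), and I'm leaving out the V-concat, QK-mean, and GQA pieces.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_conv_seq(z, w):
    """Depthwise causal convolution along the sequence axis of z: (T, r)."""
    T = z.shape[0]
    out = np.zeros_like(z)
    for i, wi in enumerate(w):       # each tap only looks at past tokens
        out[i:] += wi * z[:T - i]
    return out

def conv_channels(z, w):
    """Small convolution mixing neighboring latent channels, per token."""
    pad = len(w) // 2
    zp = np.pad(z, ((0, 0), (pad, pad)))
    return np.stack([np.convolve(row, w, mode="valid") for row in zp])

def cca_attention(x, W_dq, W_dkv, w_seq, w_chan, W_o):
    """Single-head sketch: the entire attention happens at latent width r."""
    T, _ = x.shape
    r = W_dq.shape[1]
    q = x @ W_dq                      # (T, r) low-rank query latent
    kv = x @ W_dkv                    # (T, r) shared KV latent (the cached tensor)
    k, v = kv, kv
    # the two convolutions that recover quality lost to the low-rank projection
    q = conv_channels(causal_conv_seq(q, w_seq), w_chan)
    k = conv_channels(causal_conv_seq(k, w_seq), w_chan)
    scores = q @ k.T / np.sqrt(r)     # scores computed at width r, not d_model
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ v @ W_o  # attend in latent space, then project up

# tiny demo with random weights (shapes only; nothing is trained)
rng = np.random.default_rng(0)
T, d, r = 8, 16, 4
x = rng.normal(size=(T, d))
out = cca_attention(x, rng.normal(size=(d, r)), rng.normal(size=(d, r)),
                    np.array([0.5, 0.3, 0.2]), np.array([0.25, 0.5, 0.25]),
                    rng.normal(size=(r, d)))
```

Note how the only tensor that would need caching is the (T, r) KV latent, and the QK^T matmul itself runs at width r, which is where the compute savings over MLA come from.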

FlexAttention now has a FlashAttention-4 backend.

FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000+ repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now.

We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads.

🖇️ Read our latest blog here: hubs.la/Q045FHPh0

No more choosing between flexibility and performance. #PyTorch #FlexAttention #FlashAttention #OpenSourceAI
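For anyone who hasn't used FlexAttention: the core abstraction is a user-written `score_mod` hook (in the real API it takes `(score, b, h, q_idx, kv_idx)`), which rewrites each attention score before softmax, and the compiler fuses it into the kernel. Here is a hedged numpy illustration of the eager semantics only, with batch/head indices dropped for brevity; the ALiBi-style penalty is just one example of a mod, and `eager_flex_attention` is my name, not the library's.

```python
import numpy as np

def eager_flex_attention(q, k, v, score_mod):
    """Eager-mode semantics of a FlexAttention-style score_mod: the user
    hook rewrites each score given its (q_idx, kv_idx) before softmax.
    The real FlexAttention compiles such hooks into a fused kernel
    (now including the FlashAttention-4 backend); this numpy version
    only shows the math."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    for qi in range(T):
        for ki in range(T):
            scores[qi, ki] = score_mod(scores[qi, ki], qi, ki)
    # causal masking (FlexAttention expresses this with a mask_mod)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# example score_mod: an ALiBi-style linear distance penalty
alibi = lambda s, qi, ki: s - 0.1 * (qi - ki)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
out = eager_flex_attention(q, k, v, alibi)
```

The point of the backend work is exactly that you write the two-line `score_mod` above and never the double loop: the compiler specializes the fused kernel for it.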

Today I’m sharing a new research paper that explores a new idea in Mixture-of-Experts architectures, called “DynaMoE”. DynaMoE is a Mixture-of-Experts framework where:

- the number of active experts per token is dynamic;
- the total number of experts can be scheduled differently across layers.

From my findings, the best model has a descending expert schedule, where the earliest layers have the most experts and the final layer has the fewest (1 expert). This removes the rigid top-k routing used in most MoE models and improves parameter efficiency and training stability.

Paper: arxiv.org/abs/2603.01697
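The post doesn't spell out the routing rule, so here is one hedged numpy sketch of what a dynamic-k router plus a descending per-layer expert schedule could look like: experts are taken in order of gate probability until a cumulative mass threshold p is reached. The function name, the threshold scheme, and the schedule values are all my assumptions; the paper's actual mechanism may differ.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_topk_route(gate_logits, p=0.7, max_k=4):
    """Pick a variable number of experts for one token: take experts in
    descending gate probability until cumulative mass exceeds p, capped
    at max_k. Returns (chosen expert indices, renormalized gate weights)."""
    probs = softmax(gate_logits)
    order = np.argsort(-probs)
    k = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    k = min(k, max_k)
    chosen = order[:k]
    weights = probs[chosen] / probs[chosen].sum()  # renormalize the gates
    return chosen, weights

# descending expert schedule: many experts in early layers, 1 at the end
experts_per_layer = [8, 6, 4, 2, 1]
rng = np.random.default_rng(0)
routes = [dynamic_topk_route(rng.normal(size=n), max_k=min(4, n))
          for n in experts_per_layer]
```

With a schedule like this, the last layer degenerates to a single dense expert, which matches the post's finding that the final layer works best with 1 expert.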


It's still climbing higher: 459B input tokens and 2.6B output tokens, a 176:1 ratio. 🤔
