rishi

2.1K posts

rishi

@rishiiyer01

squeezing water out of stone

Joined January 2017
670 Following · 517 Followers
rishi @rishiiyer01
I be moving on the lime scooters
0 replies · 0 reposts · 1 like · 44 views
rishi @rishiiyer01
im a neuralsurgeon
0 replies · 0 reposts · 5 likes · 126 views
rishi retweeted
Michael Andregg @michaelandregg
We've uploaded a fruit fly. We took the @FlyWireNews connectome of the fruit fly brain, applied a simple neuron model (@Philip_Shiu Nature 2024) and used it to control a MuJoCo physics-simulated body, closing the loop from neural activation to action. A few things I want to say about what this means and where we're going at @eonsys. 🧵
334 replies · 1.3K reposts · 8K likes · 1.7M views
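For a concrete picture of "closing the loop from neural activation to action," here is a toy sketch under heavy assumptions: random sparse weights stand in for the FlyWire connectome, a rectified leaky integrator stands in for the Shiu et al. neuron model, and a one-line fake physics step stands in for the MuJoCo body. It is illustrative only, not @eonsys code.

```python
import numpy as np

# Toy closed loop: connectome -> simple neuron model -> motor drive -> body -> sensation.
# Everything here is a stand-in for the real components named in the thread.
rng = np.random.default_rng(0)

N = 1000                                                         # toy neuron count
W = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.01)    # sparse "connectome"
motor_idx = np.arange(0, 10)                                     # hypothetical motor neurons
sensory_idx = np.arange(10, 20)                                  # hypothetical sensory neurons

v = np.zeros(N)                                                  # neuron state
dt, tau = 1.0, 20.0                                              # timestep and time constant (ms)

def body_step(motor_drive):
    """Stand-in for a physics step: turn motor activity into a sensory reading."""
    return np.tanh(motor_drive.mean()) * np.ones(len(sensory_idx))

for t in range(200):
    rate = np.maximum(v, 0.0)                  # rectified firing rate
    v += dt / tau * (-v + W @ rate)            # leaky integration through the connectome
    sensation = body_step(rate[motor_idx])     # "body" responds to the motor neurons
    v[sensory_idx] += dt / tau * sensation     # feed it back in, closing the loop
```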
rishi @rishiiyer01
@PandaAshwinee no ofc not, we just want fast pretrain (and cheap ctx extension) with our compression on top of fast inference. DSA is great work
0 replies · 0 reposts · 4 likes · 43 views
Ashwinee Panda @PandaAshwinee
@rishiiyer01 yea i mean you're a zyphra guy so ik uk ur stuff, was just noting that DSA lacking in pretrain doesn't make it DOA.
1 reply · 0 reposts · 2 likes · 57 views
rishi @rishiiyer01
I would like to quickly speak on the last point. CCA and DSA are orthogonal forms of compression. Firstly, DSA cannot be fully pretrained, and thus CCA's compression in FLOPs to accelerate training time is not negligible. Secondly, the lightning indexer in DSA can be viewed as an MQA with query-head compression, essentially CCMQA, in which case it is natural to compress the query heads in the lightning indexer further by reusing our lightweight preconditioner.
Quoting Muyu He @HeMuyu0327 (full post below)
4 replies · 1 repost · 26 likes · 3.2K views
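To make the "lightning indexer as MQA with query-head compression" framing concrete, here is a rough sketch under my own assumptions (the shapes, names, and ReLU scoring are illustrative, not DeepSeek's reference implementation): one shared low-dimensional key per token, a handful of small query heads, and a top-k pick of which past tokens each query attends to.

```python
import torch

# Sketch of an indexer read as "MQA with compressed query heads" (illustrative).
def index_topk(x, wq, wk, head_w, k):
    # x:      [T, d_model]             token states
    # wq:     [h_idx, d_model, d_idx]  small ("compressed") query heads
    # wk:     [d_model, d_idx]         single shared key projection (MQA-style)
    # head_w: [h_idx]                  per-head mixing weights
    q = torch.einsum('td,hde->the', x, wq)                    # [T, h_idx, d_idx]
    key = x @ wk                                              # [T, d_idx], one key per token
    s = torch.relu(torch.einsum('the,se->ths', q, key))       # nonnegative per-head scores
    s = torch.einsum('ths,h->ts', s, head_w)                  # combine heads
    causal = torch.triu(torch.ones_like(s, dtype=torch.bool), diagonal=1)
    s = s.masked_fill(causal, float('-inf'))                  # only look backwards
    return s.topk(min(k, s.shape[-1]), dim=-1).indices        # selected token ids per query

# Usage with made-up sizes:
T, d_model, h_idx, d_idx = 128, 256, 4, 32
ids = index_topk(torch.randn(T, d_model),
                 torch.randn(h_idx, d_model, d_idx) / d_model ** 0.5,
                 torch.randn(d_model, d_idx) / d_model ** 0.5,
                 torch.ones(h_idx), k=16)
print(ids.shape)   # torch.Size([128, 16])
```

On this reading, the indexer's query heads are already the compressed part, which is the opening rishi points at for reusing a lightweight preconditioner from CCA.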
rishi @rishiiyer01
@PandaAshwinee My point was that CCA is easily compatible with DSA and you gain benefits from both constructively
1 reply · 0 reposts · 3 likes · 68 views
rishi @rishiiyer01
@kyle_mccleary based on what I have said so far one can speculate how it may fit
0 replies · 0 reposts · 1 like · 25 views
Kyle @kyle_mccleary
@rishiiyer01 Is it overlapping with the OVQ work at all?
1 reply · 0 reposts · 0 likes · 30 views
Muyu He @HeMuyu0327
I like this paper from late last year which proposes a more compute-efficient and potentially more performant variant of multi-head latent attention. They call it compressed convolutional attention (CCA). There are a few inspiring takeaways from it.

- What they did: MLA compresses QKV to a low-rank latent space, but it's mainly for KV caching; QKV is immediately projected back up once storage is done. The authors instead propose to do not only caching but also computation in the latent space. Concretely, this means that after the low-rank projection, the whole softmax attention is done there, then projected back to the hidden dim (p1). Intuitively this will not work well, because low-rank projection loses a lot of information. So the authors apply a token-wise convolution and then a head-dim-wise convolution to QK. Ablations show that the two convolutions contribute a lot to the improved loss and evaluation results. As in my pseudocode (p2), they make a bunch of other interesting configurations, such as concatenating the two most recent Vs and adding a "QK mean" to Q and K to mix them together. And this can be made even more efficient if we add grouped query attention on top, i.e. basically expanding KV by num_groups so that we use even fewer parameters.

- How it performs: the authors did some detailed evaluations against MHA, MLA, and GQA. They make sure to match the total parameter count, and for MLA and GQA they also match the KV cache compression rate. They show that the GQA-enhanced CCA performs on par with the lossless MHA on several benchmarks (HellaSwag, Winogrande, the usual stuff) as well as on the final training loss. They note that this is especially handy for MoE models, because a smaller attention module means a bigger MoE module, and in particular each expert can be bigger.

Overall, this is very cool work, and it belongs to the line of research that lets attention stay quadratic but keeps its activation size in check. It'll be interesting to calculate a FLOP breakdown between this (especially the GQA-enhanced one) and the new DeepSeek Sparse Attention, since DSA is by design lossless and potentially also very efficient.
2 replies · 6 reposts · 77 likes · 8.3K views
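A minimal sketch of the CCA shape as I read the post above (the class name, sizes, and causal depthwise-conv choices are my assumptions; the "QK mean", recent-V concatenation, and GQA variant are omitted): project QKV to a low-rank latent, run a token-wise and a head-dim-wise convolution on Q/K, do the softmax attention entirely in the latent space, then project back up.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of compressed convolutional attention, not the paper's code.
class CCASketch(torch.nn.Module):
    def __init__(self, d_model=512, d_latent=128, n_heads=4, conv_k=3):
        super().__init__()
        self.h, self.dh = n_heads, d_latent // n_heads
        self.qkv_down = torch.nn.Linear(d_model, 3 * d_latent, bias=False)  # low-rank QKV
        self.up = torch.nn.Linear(d_latent, d_model, bias=False)            # back to hidden dim
        # token-wise (along the sequence) depthwise convolution for Q and K
        self.tok_conv = torch.nn.Conv1d(2 * d_latent, 2 * d_latent, conv_k,
                                        padding=conv_k - 1, groups=2 * d_latent)
        # head-dim-wise convolution for Q and K at each position
        self.dim_conv = torch.nn.Conv1d(1, 1, conv_k, padding=conv_k // 2)

    def forward(self, x):                                    # x: [B, T, d_model]
        B, T, _ = x.shape
        q, k, v = self.qkv_down(x).chunk(3, dim=-1)          # each [B, T, d_latent]
        qk = torch.cat([q, k], dim=-1).transpose(1, 2)       # [B, 2*d_latent, T]
        qk = self.tok_conv(qk)[..., :T].transpose(1, 2)      # causal token-wise conv
        q, k = qk.chunk(2, dim=-1)
        q = self.dim_conv(q.reshape(B * T, 1, -1)).reshape(B, T, -1)
        k = self.dim_conv(k.reshape(B * T, 1, -1)).reshape(B, T, -1)
        split = lambda t: t.reshape(B, T, self.h, self.dh).transpose(1, 2)
        o = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        return self.up(o.transpose(1, 2).reshape(B, T, -1))  # attention never leaves the latent

x = torch.randn(2, 64, 512)
print(CCASketch()(x).shape)   # torch.Size([2, 64, 512])
```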
rishi @rishiiyer01
@HeMuyu0327 we use this in all of our models now!!! Soon more people will catch on as we release at scale
0 replies · 0 reposts · 3 likes · 137 views
rishi retweeted
Ji-Ha @Ji_Ha_Kim
How to ("properly") orthogonalize convolutional layers for the Muon optimizer.
Trick: assume circular kernels to allow diagonalization.
Blog post + proof-of-concept CIFAR10 speedrun fork (unoptimized and slow for now, but better convergence per step).
3 replies · 14 reposts · 219 likes · 20.4K views
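The trick being referenced, as I understand it: a circular (circulant) convolution is diagonalized by the DFT, so orthogonalizing it reduces to pushing every Fourier coefficient of the kernel onto the unit circle. A toy 1D, single-channel sketch under my own assumptions (function and variable names are mine, not the blog post's):

```python
import numpy as np

# The DFT diagonalizes a circulant matrix, so the nearest orthogonal map keeps
# each Fourier coefficient's phase and sets its magnitude to 1.
def orthogonalize_circular_kernel(kernel, n):
    """kernel: short filter; n: signal length the kernel is zero-padded to."""
    h = np.zeros(n)
    h[: len(kernel)] = kernel
    H = np.fft.fft(h)                           # eigenvalues of the circulant matrix
    H_unit = H / np.maximum(np.abs(H), 1e-12)   # keep phase, set magnitude to 1
    return np.real(np.fft.ifft(H_unit))         # orthogonalized circular kernel

# Sanity check: the circulant matrix built from the result is orthogonal.
k = orthogonalize_circular_kernel(np.array([0.5, -0.2, 0.1]), n=8)
C = np.stack([np.roll(k, i) for i in range(8)], axis=0)     # circulant matrix
assert np.allclose(C @ C.T, np.eye(8), atol=1e-6)
```

Real conv layers have many channels and 2D kernels, where the same idea becomes a per-frequency SVD, which is presumably what the blog post works out.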
rishi retweeted
rishi @rishiiyer01
@llllvvuu @yifan_zhang_ okay yeah what I said is correct, it seems to be advantageous to do the multi-turn with mha but u still need to compute the up-projections which is annoying. mla is bad I think we can all agree
0 replies · 0 reposts · 2 likes · 102 views
rishi @rishiiyer01
Okay, in multi-turn prefill u are correct. I haven’t investigated this but it is most likely still advantageous to recompute the mha keys and values from the shared cache given a large number of tokens in the query. That being said, we don’t have to deal with any of this irritation in gqa
1 reply · 0 reposts · 1 like · 203 views
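For context on the tradeoff in the two posts above, a hedged sketch under my own assumptions (sizes and names are illustrative): an MLA-style shared latent cache stores a small per-token latent and pays the K/V up-projections again at attention time, while a GQA-style cache stores the grouped keys and values directly and recomputes nothing.

```python
import torch

# Illustrative only: what gets cached per token vs what must be recomputed.
d_model, d_latent, h, dh, groups, T = 1024, 128, 8, 64, 2, 4096

w_down = torch.randn(d_model, d_latent) / d_model ** 0.5    # shared down-projection
w_up_k = torch.randn(d_latent, h * dh) / d_latent ** 0.5    # per-head K up-projection
w_up_v = torch.randn(d_latent, h * dh) / d_latent ** 0.5    # per-head V up-projection
x = torch.randn(T, d_model)

# MLA-style: store only the latent, re-apply the up-projections when attending
# (the recomputation called annoying for multi-turn prefill above).
latent_cache = x @ w_down                                   # [T, d_latent] is what is stored
k = (latent_cache @ w_up_k).view(T, h, dh)                  # recomputed at attention time
v = (latent_cache @ w_up_v).view(T, h, dh)

# GQA-style: store the grouped K/V themselves; bigger, but nothing to recompute.
w_k_g = torch.randn(d_model, groups * dh) / d_model ** 0.5
w_v_g = torch.randn(d_model, groups * dh) / d_model ** 0.5
kv_cache = torch.cat([x @ w_k_g, x @ w_v_g], dim=-1)        # [T, 2*groups*dh] is what is stored

print("floats cached per token (MLA-style latent):", d_latent)          # 128
print("floats cached per token (GQA, 2 KV groups):", 2 * groups * dh)   # 256
```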
rishi @rishiiyer01
if i wanted to be in the nba id be in the nba but i dont want to be in the nba
0 replies · 0 reposts · 4 likes · 144 views