Adam Zweiger

80 posts

@AdamZweiger

Working on learning | @MIT_CSAIL

Joined September 2022
569 Following · 2K Followers
Pinned Tweet
Adam Zweiger @AdamZweiger
We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
23 replies · 148 reposts · 943 likes · 130.5K views
Adam Zweiger retweeted
Han Guo @HanGuo97
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
15 replies · 101 reposts · 676 likes · 189.8K views
Bob McElrath @BobMcElrath
@AdamZweiger FWIW, I implemented an online compaction from your paper using Claude in llama.cpp. Works pretty well, and very fast. Stores Q to score regions for compaction. l2k works better than your rms norm IIRC...
1 reply · 0 reposts · 5 likes · 391 views
Adam Zweiger @AdamZweiger
The biggest reduction in KV cache memory comes not from quantization or MLA, but from latent compaction, along the sequence dimension. More strong results coming soon with Attention Matching.
Quoting Adam Zweiger @AdamZweiger:
We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
5 replies · 44 reposts · 407 likes · 33.6K views
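For a sense of scale on the sequence-dimension point, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128, 100K tokens) is hypothetical and chosen only for illustration. The 50× factor is the figure quoted in the announcement tweet, and a 16-bit to 4-bit change is assumed as the quantization comparison. MLA is not modeled here.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_elem):
    # The factor of 2 covers keys and values; every other term is a plain product.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_elem // 8

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
seq_len = 100_000

baseline = kv_cache_bytes(seq_len, bits_per_elem=16, **shape)
quantized = kv_cache_bytes(seq_len, bits_per_elem=4, **shape)         # assumed 4-bit cache
compacted = kv_cache_bytes(seq_len // 50, bits_per_elem=16, **shape)  # 50x fewer entries

print(f"baseline : {baseline / 1e9:.2f} GB")
print(f"quantized: {quantized / 1e9:.2f} GB ({baseline // quantized}x smaller)")
print(f"compacted: {compacted / 1e9:.2f} GB ({baseline // compacted}x smaller)")
```

Under these assumptions, quantization shrinks the per-element size while compaction shrinks the number of entries, so the two act on different axes of the same product.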
Adam Zweiger retweeted
Xinghong (Shin) Fu @shinfxh
just got claude to explain attention matching and it made this interactive heatmap to show the relative importance of each layer/head! this might just be better than the diagrams in our own paper...
1 reply · 5 reposts · 56 likes · 3.3K views
Evan Kim @evnkimm
How do you train compute-optimal novel view synthesis models? In our CVPR '26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations, and along the way train a new SoTA with 3x less compute. (1/n)
13 replies · 19 reposts · 168 likes · 34.4K views
Adam Zweiger @AdamZweiger
Fun fact: Back in 2014, Demis had a red line condition for any potential acquisition of DeepMind: "no technology coming out of DeepMind will be used for military or intelligence purposes." Google accepting this more eagerly was part of why Demis chose them over Facebook. This red line is even broader than Dario's (no mass surveillance or fully autonomous weapons), though it was quietly removed by Google 1 year ago.
7 replies · 38 reposts · 992 likes · 81.2K views
Adam Zweiger @AdamZweiger
@bendee983 Not yet. Two issues: most inference engines currently have no way of initializing directly from a KV cache, and they don't support disentangling logical cache size from physical size (which is needed for RoPE embeddings). Both are fixable, though.
0 replies · 0 reposts · 17 likes · 1.4K views
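The logical-versus-physical distinction in the reply above can be sketched in a few lines of Python. This is an illustration under assumptions, not any engine's API: `CompactedCache` and `append_token` are hypothetical names, and the sketch assumes that a new token's rotary position should continue from the original (logical) sequence length rather than from the number of entries actually stored.

```python
from dataclasses import dataclass


def rope_angles(position, head_dim, base=10000.0):
    # Standard RoPE rotation angles: position * base^(-2i/d) for each dimension pair i.
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


@dataclass
class CompactedCache:  # hypothetical structure, not an existing engine API
    physical_len: int  # KV entries actually held in memory
    logical_len: int   # tokens the cache stands in for

    def append_token(self):
        # Assumption: the new token is rotated at its logical position,
        # so the position counter must not be derived from storage size.
        position = self.logical_len
        self.physical_len += 1
        self.logical_len += 1
        return position


cache = CompactedCache(physical_len=2_000, logical_len=100_000)
pos = cache.append_token()
angles = rope_angles(pos, head_dim=128)
print(pos, cache.physical_len)
```

An engine that infers positions from the number of stored entries would rotate the new token at position 2,000 here instead of 100,000, which is why the two sizes need to be tracked separately.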
Ben Dickson @bendee983
@AdamZweiger This is impressive! Is it compatible with popular inference engines and current kernels? In other words, how easy is it to use it as a drop-in for whatever engine companies are using right now?
1 reply · 0 reposts · 4 likes · 1.7K views
Adam Zweiger retweeted
Xinghong (Shin) Fu @shinfxh
the solution to infinite context was just linear regression all along
32 replies · 111 reposts · 1.6K likes · 189.1K views
Adam Zweiger @AdamZweiger
@ye_combinator thanks! were you trying gradient descent? I think for practical purposes (i.e. low number of query samples), subsetting keys is hard to beat
1 reply · 0 reposts · 6 likes · 376 views
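Two ideas mentioned in these posts, keeping a subset of keys and fitting with linear regression rather than gradient descent, can be illustrated with a toy NumPy sketch. This is not the paper's algorithm. It is a minimal single-head example under stated assumptions: keys are ranked by total attention mass over a few sample queries, the top `keep` are retained, and replacement values are fit by least squares so that attention over the kept keys approximates the full attention output. The function name `compact_by_key_subset` and the selection rule are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compact_by_key_subset(K, V, Q, keep):
    """Toy single-head compaction: K, V are (n_keys, d); Q is (n_queries, d)."""
    d = K.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d))         # full attention weights
    target = A @ V                            # outputs we want to preserve
    # Assumed selection rule: keep the keys with the most total attention mass.
    idx = np.sort(np.argsort(A.sum(axis=0))[-keep:])
    Kc = K[idx]
    Ac = softmax(Q @ Kc.T / np.sqrt(d))       # attention restricted to kept keys
    # Linear regression: choose Vc to minimize ||Ac @ Vc - target||.
    Vc, *_ = np.linalg.lstsq(Ac, target, rcond=None)
    return Kc, Vc, idx


rng = np.random.default_rng(0)
K, V = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
Q = rng.normal(size=(32, 8))                  # a small number of sample queries
Kc, Vc, idx = compact_by_key_subset(K, V, Q, keep=8)
```

Because the original values `V[idx]` are one candidate solution to the regression, the fitted `Vc` can do no worse than plain subsetting on the sample queries; how well it generalizes to unseen queries is a separate question this sketch does not address.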
Zihao Ye @ye_combinator
Great work! I have explored similar ideas before: tried per-layer per-head fitting for both Ck/v, it works in tasks like needle-in-haystack but training cost is too high to make it practical in production :( would love to see how you plan to make online compaction efficient.
Quoting Adam Zweiger @AdamZweiger:
We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
1 reply · 4 reposts · 45 likes · 4.3K views
Adam Zweiger @AdamZweiger
Future Work:
- Integrating latent compaction into inference engines (e.g. RadixAttention, varlen storage, disaggregated compaction)
- Online compaction: compacting mid-trajectory repeatedly to support arbitrarily long sequences. We show initial results, but more work remains.
1 reply · 2 reposts · 26 likes · 3.4K views