Adam Zweiger

76 posts

@AdamZweiger

Working on learning | @MIT_CSAIL

Joined September 2022
557 Following · 1.6K Followers
Pinned Tweet
Adam Zweiger @AdamZweiger
We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
20 replies · 126 reposts · 805 likes · 68.9K views
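The tweet names the method but not its objective. Reading "attention matching" literally, a compacted cache should reproduce the full cache's attention outputs on sample queries. Below is a minimal NumPy sketch of that discrepancy measure; the shapes, the query sampling, and the naive truncation baseline are my own assumptions, not the paper's setup.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def matching_error(Q, K, V, K_c, V_c):
    """Relative drift of the compacted cache's attention outputs
    from the full cache's, averaged over sample queries."""
    full = attention(Q, K, V)
    compact = attention(Q, K_c, V_c)
    return np.linalg.norm(full - compact) / np.linalg.norm(full)

# Illustrative shapes: 4096 cached tokens compacted ~50x to 82 slots.
rng = np.random.default_rng(0)
d, n, m, s = 64, 4096, 82, 256
Q = rng.standard_normal((s, d))
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
print(matching_error(Q, K, V, K[:m], V[:m]))  # naive truncation baseline
```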
Adam Zweiger reposted
Xinghong (Shin) Fu @shinfxh
just got claude to explain attention matching and it made this interactive heatmap to show the relative importance of each layer/head! this might just be better than the diagrams in our own paper...
1 reply · 5 reposts · 57 likes · 2.9K views
Evan Kim @evnkimm
How do you train compute-optimal novel view synthesis models? In our CVPR '26 paper, Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations, and along the way train a new SoTA with 3× less compute. (1/n)
13 replies · 19 reposts · 167 likes · 33.2K views
Adam Zweiger @AdamZweiger
Fun fact: Back in 2014, Demis had a red-line condition for any potential acquisition of DeepMind: "no technology coming out of DeepMind will be used for military or intelligence purposes." Google's greater willingness to accept this was part of why Demis chose them over Facebook. This red line is even broader than Dario's (no mass surveillance or fully autonomous weapons), though Google quietly removed it a year ago.
7 replies · 39 reposts · 1K likes · 80.9K views
Adam Zweiger @AdamZweiger
@bendee983 Not yet. Two obstacles: most inference engines currently don't have a way to initialize directly with a KV cache, and they don't support disentangling logical cache size from physical size (which is needed for RoPE embeddings). Both are fixable, though.
0 replies · 0 reposts · 15 likes · 952 views
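The logical-vs-physical distinction matters because RoPE rotates each key by its position: after compaction, the surviving entries occupy new physical slots but must keep consistent logical positions. A toy sketch of RoPE applied with an explicit position array (standard rotary formulation; the example values are illustrative):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary embeddings using an explicit per-entry position
    array, so physical cache index and logical position can differ."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = positions[:, None] * inv_freq[None, :]   # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Three surviving keys at physical slots 0..2 keep logical positions
# 1, 4, 7: the engine must rotate by the latter, not the slot index.
keys = np.random.default_rng(0).standard_normal((3, 16))
rotated = rope(keys, np.array([1.0, 4.0, 7.0]))
```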
Ben Dickson @bendee983
@AdamZweiger This is impressive! Is it compatible with popular inference engines and current kernels? In other words, how easy is it to use it as a drop-in for whatever engine companies are using right now?
1 reply · 0 reposts · 4 likes · 1.2K views
Adam Zweiger reposted
Xinghong (Shin) Fu @shinfxh
the solution to infinite context was just linear regression all along
32 replies · 112 reposts · 1.6K likes · 187.7K views
Adam Zweiger @AdamZweiger
@ye_combinator thanks! were you trying gradient descent? I think for practical purposes (i.e., a low number of query samples), subsetting keys is hard to beat
1 reply · 0 reposts · 6 likes · 376 views
Zihao Ye @ye_combinator
Great work! I have explored similar ideas before: I tried per-layer, per-head fitting for both C_k and C_v; it works on tasks like needle-in-a-haystack, but the training cost is too high to make it practical in production :( would love to see how you plan to make online compaction efficient.
Quoting Adam Zweiger @AdamZweiger (the pinned Attention Matching announcement above)
1 reply · 4 reposts · 45 likes · 4.3K views
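Taken together with the "linear regression all along" quip above, this exchange hints at a cheap recipe: keep a subset of the original keys rather than learning new ones by gradient descent, then fit the compacted values in closed form so that attention through the kept keys reproduces the original outputs, which is exactly a linear regression. A hedged NumPy sketch; the attention-mass selection rule and all shapes are my assumptions, not the paper's algorithm:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def compact_by_key_subsetting(Q, K, V, m):
    """Keep the m keys receiving the most attention mass from the
    sample queries, then regress compacted values so that attention
    over the kept keys matches the full attention output."""
    d = K.shape[1]
    W = softmax(Q @ K.T / np.sqrt(d))            # (s, n) full weights
    keep = np.argsort(W.sum(axis=0))[-m:]        # top-m keys by mass
    K_c = K[keep]
    O = W @ V                                    # target outputs (s, d)
    A = softmax(Q @ K_c.T / np.sqrt(d))          # (s, m) compact weights
    V_c, *_ = np.linalg.lstsq(A, O, rcond=None)  # the linear regression
    return K_c, V_c

rng = np.random.default_rng(0)
d, n, m, s = 64, 2000, 40, 512
Q = rng.standard_normal((s, d))
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
K_c, V_c = compact_by_key_subsetting(Q, K, V, m)
```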
Adam Zweiger @AdamZweiger
Future work:
- Integrating latent compaction into inference engines (e.g. RadixAttention, varlen storage, disaggregated compaction)
- Online compaction: compacting mid-trajectory repeatedly to support arbitrarily long sequences. We show initial results, but more work remains.
1 reply · 1 repost · 22 likes · 2.5K views
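Online compaction as described would wrap a compaction call inside the decoding loop: whenever the cache outgrows its budget, compact the prefix and keep going. A schematic sketch; `step`, `compact`, `budget`, and `ratio` are hypothetical stand-ins, not the paper's interface:

```python
def generate_with_online_compaction(step, compact, budget=1024, ratio=50):
    """Decode indefinitely, repeatedly compacting the cached prefix
    whenever it outgrows the budget."""
    cache = []                      # stand-in for per-layer KV entries
    while True:
        token, entry = step(cache)  # one decode step -> (token, KV entry)
        if token is None:           # end of generation
            return
        cache.append(entry)
        if len(cache) > budget:     # compact mid-trajectory, e.g. ~50x
            cache = compact(cache, max(1, len(cache) // ratio))
```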
Adam Zweiger @AdamZweiger
Nice work on GAN-style training with a generator and a discriminator, both trained with RL. This might be the path to improvement in domains without good verifiers, like creative writing.
Quoting Locke Cai @couplefire12:
RL for reasoning often relies on verifiers: great for math, but tricky for creative writing or open-ended research. Meet RARO, a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
0 replies · 0 reposts · 13 likes · 1.7K views
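As described, the adversarial game replaces a verifier: the generator is rewarded for outputs the discriminator mistakes for demonstrations, and the discriminator for telling the two apart, with both sides updated by RL. A toy sketch of just the reward structure; every name and callable here is a hypothetical stand-in, not RARO's code:

```python
import random

def adversarial_rewards(demos, generate, discriminate):
    """One round of the game: score one generated sample against one
    human demonstration. Returns (generator_reward, discriminator_reward)."""
    fake = generate()
    real = random.choice(demos)
    p_fake = discriminate(fake)            # discriminator's P(sample is a demo)
    p_real = discriminate(real)
    gen_reward = p_fake                    # generator: fool the critic
    disc_reward = p_real + (1.0 - p_fake)  # discriminator: classify both
    return gen_reward, disc_reward

# Toy stand-ins; in the described setup both sides are LLM policies
# updated with RL on these rewards.
demos = ["a careful argument", "a worked example"]
g_r, d_r = adversarial_rewards(demos, lambda: "generated text", lambda x: 0.5)
```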
Adam Zweiger @AdamZweiger
Presenting Self-Adapting Language Models on Wednesday at NeurIPS. We equip an LLM with the ability to write training data for itself in response to new inputs. We then meta-learn this ability with RL. Stop by to chat! 11-2 pm, #3415, with @jyo_pari @HanGuo97 @akyurekekin
4 replies · 5 reposts · 60 likes · 4.4K views
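The loop described has two levels: an inner step where the model writes training data for a new input and is fine-tuned on it, and an outer RL step that rewards the written data by how much the fine-tuned model improves. A schematic sketch with hypothetical stand-ins; none of these names come from the paper:

```python
def self_adapt_step(model, new_input, write_self_edit, finetune, evaluate):
    """One inner step: the model writes its own training data for
    new_input, is fine-tuned on it, and the resulting performance
    gain serves as the outer RL reward for writing good data."""
    before = evaluate(model, new_input)
    self_edit = write_self_edit(model, new_input)  # model-authored data
    adapted = finetune(model, self_edit)
    after = evaluate(adapted, new_input)
    return adapted, after - before  # (updated model, RL reward)
```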
Adam Zweiger reposted
Zitong Yang @ZitongYang0
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining. SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training; no teacher needed. Validation: 1T data + a 3B model from scratch. 🧵
10 replies · 46 reposts · 255 likes · 41.2K views
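The one-line recipe, exploiting inter-document correlations with no teacher, suggests a pipeline of: pair each document with related documents from the same corpus, train the model to synthesize one from the other, and then sample new training data. A toy sketch of the pairing step only; nearest-neighbor matching over generic embeddings is my simplification, not the paper's retriever:

```python
import numpy as np

def related_pairs(doc_vecs, k=1):
    """Pair each document with its k nearest neighbors in embedding
    space; the (d1, d2) pairs supervise a synthesizer p(d2 | d1)."""
    X = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = X @ X.T                          # cosine similarities
    np.fill_diagonal(sims, -np.inf)         # exclude self-pairs
    nn = np.argsort(sims, axis=1)[:, -k:]   # top-k neighbors per doc
    return [(i, int(j)) for i in range(len(X)) for j in nn[i]]

# Illustrative: 100 documents with 32-dim embeddings.
vecs = np.random.default_rng(0).standard_normal((100, 32))
pairs = related_pairs(vecs)  # training pairs for the synthesizer
```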