Jacob Springer

126 posts

@jacspringer

PhD student @mldcmu

Joined July 2016
302 Following, 708 Followers
Pinned Tweet
Jacob Springer @jacspringer
Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9
17
175
812
165.1K
Jacob Springer retweeted
Gaurav Ghosal @gaurav_ghosal
Had a great time working on this project exploring how to proactively prevent forgetting of capabilities during subsequent training! All credit goes to @lawrencefeng17 for leading it so skillfully!
Lawrence Feng @lawrencefeng17

1/ To retain post-training capabilities after further fine-tuning, mix that data into pretraining. The effect can be invisible until fine-tuning begins; early exposure may not help post-training performance, but it changes what persists. How a model learns a task matters.

1
3
13
1.6K
Jacob Springer retweeted
Lawrence Feng @lawrencefeng17
1/ To retain post-training capabilities after further fine-tuning, mix that data into pretraining. The effect can be invisible until fine-tuning begins; early exposure may not help post-training performance, but it changes what persists. How a model learns a task matters.
6
24
86
26.5K
Jacob Springer retweeted
Aditi Raghunathan @AdtRaghunathan
It's one of the first lessons in ML: the model with the lowest train loss isn't the one that generalizes best. Pretraining made that easy to forget. You train for one epoch over trillions of tokens, there's no traditional overfitting, and pretrain loss starts to feel like the whole story. Our paper argues it isn't. The lowest-loss model isn't the best starting point for post-training. An old sharp-vs-flat lesson, back in a new regime.
Ishaan Watts @IshaanWatts18

Spending billions to train the "best" base model? You might be optimizing the wrong thing! 🎯 We show that controlling sharpness during mid-training leads to over 35% less forgetting after fine-tuning / quantization... even when the base model itself gets worse. 🧵
Takeaways for pretraining:
- Use SAM (Sharpness-Aware-Minimization) in the final steps (~10%)
- Try much higher learning rates (yes, even ~10× larger)
1/9

2
7
143
21.4K
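The SAM takeaway in the quoted thread can be illustrated with a rough sketch. This is not the paper's implementation: the toy quadratic loss, the function names, and the values of `lr`, `rho`, and `sam_fraction` are all assumptions chosen for illustration. It only shows the generic SAM update (perturb the weights along the normalized gradient, then step using the gradient at the perturbed point) switched on for the last ~10% of steps.

```python
import math

def loss_grad(w):
    """Gradient of a toy loss 0.5 * (10 * w0^2 + w1^2). Hypothetical stand-in for a model."""
    return [10.0 * w[0], 1.0 * w[1]]

def sgd_step(w, lr):
    """Plain gradient descent step."""
    g = loss_grad(w)
    return [wi - lr * gi for wi, gi in zip(w, g)]

def sam_step(w, lr, rho):
    """One Sharpness-Aware Minimization step with L2 perturbation radius rho."""
    # 1) Gradient at the current weights.
    g = loss_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point, nothing to do
    # 2) Move to the approximate worst-case point inside the rho-ball.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 3) Update the ORIGINAL weights using the gradient at the perturbed point.
    g_adv = loss_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def train(w, total_steps=100, sam_fraction=0.1, lr=0.05, rho=0.05):
    """Plain steps first, then SAM for the final sam_fraction of steps (assumed schedule)."""
    sam_steps = round(total_steps * sam_fraction)
    switch = total_steps - sam_steps
    used_sam = 0
    for step in range(total_steps):
        if step >= switch:
            w = sam_step(w, lr, rho)
            used_sam += 1
        else:
            w = sgd_step(w, lr)
    return w, used_sam
```

Each SAM step costs two gradient evaluations instead of one, which is one reason to confine it to a small final portion of training.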
Jacob Springer @jacspringer
I also hope our work helps the open source model development community pre-train better models that are easier to fine-tune; would love to see some of this implemented in Marin @percyliang, OLMo @natolambert, or SmolLM @eliebakouch
0
0
4
133
Jacob Springer @jacspringer
But I'm excited to see if we can do better. I would love to see a nanoGPT speedrun benchmark that evaluates models based on how well they can be post-trained. I suspect we'll learn that a lot of the optimization lessons we think we know end up being (at least subtly) wrong.
1
0
2
124
Jacob Springer @jacspringer
Just released a new pretraining paper with some interesting takeaways:
- sharpness minimization is important but it doesn’t show its benefit until *after* you post-train
- increase your learning rate!! (this is free!)

read Ishaan’s thread but I’ll also add my 2 cents below 1/n
Ishaan Watts @IshaanWatts18

Spending billions to train the "best" base model? You might be optimizing the wrong thing! 🎯 We show that controlling sharpness during mid-training leads to over 35% less forgetting after fine-tuning / quantization... even when the base model itself gets worse. 🧵
Takeaways for pretraining:
- Use SAM (Sharpness-Aware-Minimization) in the final steps (~10%)
- Try much higher learning rates (yes, even ~10× larger)
1/9

2
9
39
5.6K
Jacob Springer @jacspringer
RT @IshaanWatts18: Obrigado Brazil! 🇧🇷 Had an incredible time at @iclr_conf talking about our work on pretraining optimization. I also had…
0
1
0
41
Jacob Springer retweeted
Konwoo Kim @konwookim
for data-constrained pre-training, synth data isn’t just benchmaxxing, it lowers loss on the real data distribution as we generate more tokens
for even better scaling, treat synth gens as forming one long 𝗺𝗲𝗴𝗮𝗱𝗼𝗰: 1.8x data efficiency with larger gains under more compute
8
59
368
100.2K
Jacob Springer retweeted
Christina Baek @_christinabaek
Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
19
80
617
94.2K
Jacob Springer retweeted
Vaibhav Adlakha @vaibhav_adlakha
Your LLM already knows the answer. Why is your embedding model still encoding the question?
🚨 Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass, without ever generating the answer. Not only that, the frozen LLM can decode the embedding back into text.
🏆 SOTA self-supervised embeddings
🛡️ Free transfer of instruction-following, safety, and reasoning
5
37
193
50.4K
Jacob Springer retweeted
Suhas Kotha @kothasuhas
to improve fine-tuning data efficiency, replay generic pre-training data
not only does this reduce forgetting, it actually improves performance on the fine-tuning domain! especially when fine-tuning data is scarce in pre-training
(w/ @percyliang)
15
64
498
73K
Jacob Springer @jacspringer
the rank of llm representations / weights has recently been such a hot topic, with multiple papers arguing that rank is a good predictor of performance
it turns out, our paper shows it's mainly hyperparameters that determine the rank!
read Atharva's thread ↓
Atharva Kulkarni @athrvkk

Is the geometry of language model weights really predictive of performance?🔍 Our new work challenges the popular hypothesis that low rank unembedding matrix hurts LLM performance; and the answer is more complicated than you'd think! arxiv.org/abs/2602.20433 (1/8)

0
1
4
152
Jacob Springer retweeted
Ziqian Zhong @fjzzq2002
🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.
28
155
1.1K
73.8K
Jacob Springer retweeted
Fahim Tajwar @FahimTajwar10
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n
14
161
807
207.5K
Jacob Springer retweeted
Yuda Song @yus167
RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵
14
102
601
107.2K
Jacob Springer retweeted
Vaishnavh Nagarajan @_vaishnavh
1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined. This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."
58
246
1.5K
92K
Jacob Springer @jacspringer
Ironically hard drive space is the limiting factor in a lot of my experiments, not GPUs. (life of a hard-drive-poor phd student)
0
0
1
144
Jacob Springer @jacspringer
Whoever decided tokenizers should have a vocab size of ~128,000 (2^17) was clearly not an engineer because now I have to store tokens as uint32 instead of uint16. One bit less and it would use half the disk space.
1
0
2
261
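The arithmetic behind the tokenizer complaint can be checked with a small sketch. The helper names `min_token_bytes` and `dataset_bytes` are hypothetical, and the sketch assumes token IDs run from 0 to `vocab_size - 1` and are stored in the smallest standard unsigned integer width.

```python
def min_token_bytes(vocab_size):
    """Smallest standard unsigned-int width (in bytes) that can hold every token ID."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    # The largest ID is vocab_size - 1; count the bits needed to represent it.
    bits = max(1, (vocab_size - 1).bit_length())
    for width in (1, 2, 4, 8):  # uint8, uint16, uint32, uint64
        if bits <= 8 * width:
            return width
    raise ValueError("vocab too large for a 64-bit ID")

def dataset_bytes(num_tokens, vocab_size):
    """Raw on-disk size of a token stream stored at the minimal width."""
    return num_tokens * min_token_bytes(vocab_size)

# A ~128,000-entry vocab needs 17 bits, so it spills into 4-byte IDs,
# while a vocab of at most 65,536 entries fits in 2 bytes.
print(min_token_bytes(128_000), min_token_bytes(65_536))
```

Under these assumptions, a billion tokens take about 4 GB at a 128,000-entry vocab versus about 2 GB at 65,536 entries.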