Jacob Springer

126 posts

@jacspringer

PhD student @mldcmu

Joined July 2016

302 Following · 708 Followers

Pinned Tweet

Jacob Springer@jacspringer·26 Mar

Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9

175

812

165.1K

Jacob Springer retweeted

Gaurav Ghosal@gaurav_ghosal·19 May

Had a great time working on this project exploring how to proactively prevent forgetting of capabilities during subsequent training! All credit goes to @lawrencefeng17 for leading it so skillfully!

Lawrence Feng@lawrencefeng17

1/ To retain post-training capabilities after further fine-tuning, mix that data into pretraining. The effect can be invisible until fine-tuning begins; early exposure may not help post-training performance, but it changes what persists. How a model learns a task matters.

1.6K

Jacob Springer retweeted

Lawrence Feng@lawrencefeng17·19 May

26.5K

Jacob Springer retweeted

Aditi Raghunathan@AdtRaghunathan·12 May

It's one of the first lessons in ML: the model with the lowest train loss isn't the one that generalizes best. Pretraining made that easy to forget. You train for one epoch over trillions of tokens, there's no traditional overfitting, and pretrain loss starts to feel like the whole story. Our paper argues it isn't. The lowest-loss model isn't the best starting point for post-training. An old sharp-vs-flat lesson, back in a new regime.

Ishaan Watts@IshaanWatts18

Spending billions to train the "best" base model? You might be optimizing the wrong thing! 🎯 We show that controlling sharpness during mid-training leads to over 35% less forgetting after fine-tuning / quantization... even when the base model itself gets worse. 🧵

Takeaways for pretraining:
- Use SAM (Sharpness-Aware-Minimization) in the final steps (~10%)
- Try much higher learning rates (yes, even ~10× larger)

1/9

143

21.4K
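The SAM takeaway in the quoted thread can be illustrated with a minimal sketch of one generic Sharpness-Aware Minimization update: take the gradient, step a distance `rho` uphill along it, then apply the gradient measured at that perturbed point to the original weights. This is the textbook two-gradient form on a toy quadratic, not the paper's training recipe. The names `sam_step`, `toy_loss`, `toy_grad`, and `CURVATURES`, and the values of `lr` and `rho`, are assumptions made for illustration.

```python
import math

# Hypothetical toy problem: a quadratic bowl with one sharp and one flat direction.
CURVATURES = [10.0, 0.1]


def toy_loss(w):
    """Loss 0.5 * sum(a_i * w_i^2) for the toy quadratic."""
    return 0.5 * sum(a * x * x for a, x in zip(CURVATURES, w))


def toy_grad(w):
    """Analytic gradient of toy_loss."""
    return [a * x for a, x in zip(CURVATURES, w)]


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One generic SAM update (illustrative values for lr and rho)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        # At a stationary point there is no ascent direction; leave w unchanged.
        return list(w)
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the original w.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def run_sam(w0, steps=50):
    """Apply sam_step repeatedly starting from w0."""
    w = list(w0)
    for _ in range(steps):
        w = sam_step(w, toy_grad)
    return w
```

In a real training loop the second gradient costs an extra forward and backward pass per step, which may be one reason the thread suggests restricting SAM to the final ~10% of steps.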

Jacob Springer@jacspringer·8 May

I also hope our work helps the open source model development community pre-train better models that are easier to fine-tune; would love to see some of this implemented in Marin @percyliang, OLMo @natolambert, or SmolLM @eliebakouch

133

Jacob Springer@jacspringer·8 May

But I'm excited to see if we can do better. I would love to see a nanoGPT speedrun benchmark that evaluates models based on how well they can be post-trained. I suspect we'll learn that a lot of the optimization lessons we think we know end up being (at least subtly) wrong.

124

Jacob Springer@jacspringer·8 May

Just released a new pretraining paper with some interesting takeaways:
- sharpness minimization is important but it doesn’t show its benefit until *after* you post-train
- increase your learning rate!! (this is free!)

read Ishaan’s thread but I’ll also add my 2 cents below 1/n

Ishaan Watts@IshaanWatts18

5.6K

Jacob Springer@jacspringer·1 May

RT @IshaanWatts18: Obrigado Brazil! 🇧🇷 Had an incredible time at @iclr_conf talking about our work on pretraining optimization. I also had…

Jacob Springer retweeted

Konwoo Kim@konwookim·20 Mar

for data-constrained pre-training, synth data isn’t just benchmaxxing, it lowers loss on the real data distribution as we generate more tokens

for even better scaling, treat synth gens as forming one long 𝗺𝗲𝗴𝗮𝗱𝗼𝗰: 1.8x data efficiency with larger gains under more compute

368

100.2K

Jacob Springer retweeted

Christina Baek@_christinabaek·18 Mar

Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵

617

94.2K

Jacob Springer retweeted

Vaibhav Adlakha@vaibhav_adlakha·12 Mar

Your LLM already knows the answer. Why is your embedding model still encoding the question? 🚨Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass — without ever generating the answer. Not only that, the frozen LLM can decode the embedding back into text. 🏆 SOTA self-supervised embeddings 🛡️ Free transfer of instruction-following, safety, and reasoning

193

50.4K

Jacob Springer retweeted

Suhas Kotha@kothasuhas·6 Mar

to improve fine-tuning data efficiency, replay generic pre-training data

not only does this reduce forgetting, it actually improves performance on the fine-tuning domain! especially when fine-tuning data is scarce in pre-training (w/ @percyliang)

498

73K

Jacob Springer@jacspringer·4 Mar

the rank of llm representations / weights has recently been such a hot topic, with multiple papers arguing that rank is a good predictor of performance

it turns out, our paper shows it's mainly hyperparameters that determine the rank!

read Atharva's thread ↓

Atharva Kulkarni@athrvkk

Is the geometry of language model weights really predictive of performance?🔍 Our new work challenges the popular hypothesis that low rank unembedding matrix hurts LLM performance; and the answer is more complicated than you'd think! arxiv.org/abs/2602.20433 (1/8)

152

Jacob Springer retweeted

Ziqian Zhong@fjzzq2002·20 Feb

🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.

155

1.1K

73.8K

Jacob Springer retweeted

Fahim Tajwar@FahimTajwar10·5 Feb

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n

161

807

207.5K

Jacob Springer retweeted

Yuda Song@yus167·3 Feb

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

102

601

107.2K

Jacob Springer retweeted

Vaishnavh Nagarajan@_vaishnavh·8 Jan

1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined. This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."

246

1.5K

92K

Jacob Springer@jacspringer·30 Oct

Ironically hard drive space is the limiting factor in a lot of my experiments, not GPUs. (life of a hard-drive-poor phd student)

144

Jacob Springer@jacspringer·30 Oct

Whoever decided tokenizers should have a vocab size of ~128,000 (2^17) was clearly not an engineer because now I have to store tokens as uint32 instead of uint16. One bit less and it would use half the disk space.

261
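The arithmetic behind the uint16 / uint32 remark above can be checked with a short sketch: a 16-bit unsigned integer holds IDs 0 through 65,535, so a vocabulary of 2^16 tokens fits in 2 bytes per token, while anything larger (including ~128,000 or 2^17) needs the next standard width, 4 bytes. The helper names `smallest_uint` and `dataset_bytes` are hypothetical, and the sketch assumes tokens are stored as fixed-width unsigned integers with no compression.

```python
# Standard unsigned integer widths in bytes, narrowest first.
UINT_WIDTHS = {"uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8}


def smallest_uint(vocab_size):
    """Return (name, bytes) of the narrowest type that holds IDs 0..vocab_size-1."""
    max_token_id = vocab_size - 1
    for name, nbytes in UINT_WIDTHS.items():
        if max_token_id < 2 ** (8 * nbytes):
            return name, nbytes
    raise ValueError("vocab_size does not fit in 64 bits")


def dataset_bytes(num_tokens, vocab_size):
    """Disk space for num_tokens stored as fixed-width, uncompressed token IDs."""
    _, nbytes = smallest_uint(vocab_size)
    return num_tokens * nbytes


# Compare a 2^16 vocabulary with a 2^17 vocabulary on one billion tokens.
small = dataset_bytes(10**9, 2**16)
large = dataset_bytes(10**9, 2**17)
print(smallest_uint(2**16)[0], small, "bytes")
print(smallest_uint(2**17)[0], large, "bytes")
```

Packing 17-bit IDs into a custom bit-level format would avoid the doubling, at the cost of slower and more awkward reads than a plain array of a native integer type.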
