Dawid Kopiczko (@dawkopi):
Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT. Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close.
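A rough sketch of what a fixed update budget implies here (my reading, not the authors' code): with batch size held constant, the total number of gradient updates is fixed, so a smaller subset is simply revisited for proportionally more epochs.

    # Hypothetical illustration of the fixed-update-budget trade-off.
    # The budget and subset sizes below are assumptions for illustration;
    # note that 1,600 samples x 32 epochs = 51,200, matching one epoch
    # over ~51K samples.
    TOTAL_SAMPLE_UPDATES = 51_200

    def epochs_for(subset_size, budget=TOTAL_SAMPLE_UPDATES):
        # With a fixed batch size, updates scale with samples seen,
        # so epochs = budget / subset size.
        return budget // subset_size

    for n in (1_600, 12_800, 51_200):
        print(f"{n:>6} unique samples -> {epochs_for(n):>2} epochs")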
Dawid Kopiczko (@dawkopi):
When training Olmo-3 7B on the Dolci SFT dataset, 16 epochs on a random subset of 400 samples yields 80% (pass@16) on AIME'24, 63% on AIME'25, and 62% on GPQA, while one epoch on over 51K samples yields 47% on AIME'24, 50% on AIME'25, and 24% on GPQA.
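For reference, pass@k here is presumably the standard unbiased estimator from Chen et al. (2021); the thread doesn't show the evaluation code, so treat this as an assumption:

    from math import comb

    def pass_at_k(n, c, k):
        """n: generations per problem, c: correct ones, k: sampling budget."""
        # Probability that at least one of k drawn samples is correct.
        if n - c < k:
            return 1.0  # fewer than k incorrect samples exist
        return 1.0 - comb(n - c, k) / comb(n, k)

    # With n = k = 16 this reduces to "any of the 16 generations is correct".
    print(pass_at_k(n=16, c=3, k=16))  # -> 1.0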
Dawid Kopiczko (@dawkopi):
Part of the story here is termination rate. Models trained with more epochs are much better at concluding their reasoning. With 1 epoch on 51K samples and a token limit of 32K, the model only concludes 24% of the time. With 32 epochs on 1600 samples, that jumps to 89%.
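One plausible way to measure this (assumed definition; the thread doesn't spell it out): a generation "concludes" if it emits EOS instead of being cut off at the 32K-token cap.

    def termination_rate(generations, eos_id):
        """generations: lists of token ids, truncated at the 32K cap."""
        done = sum(1 for g in generations if g and g[-1] == eos_id)
        return done / len(generations)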
Dawid Kopiczko (@dawkopi):
So when does repetition stop helping? Turns out token accuracy on the training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion.
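A minimal sketch of that stopping criterion, assuming "token accuracy" means teacher-forced next-token accuracy over the training responses (my reading; the thread doesn't define it precisely):

    import torch

    @torch.no_grad()
    def token_accuracy(model, batches, ignore_id=-100):
        # HF-style causal LM assumed: model(input_ids) exposes .logits.
        correct, total = 0, 0
        for input_ids, labels in batches:
            # Causal LM: logits at position t predict the token at t+1.
            logits = model(input_ids).logits[:, :-1]
            targets = labels[:, 1:]
            mask = targets != ignore_id  # skip prompt/padding positions
            preds = logits.argmax(dim=-1)
            correct += (preds[mask] == targets[mask]).sum().item()
            total += mask.sum().item()
        return correct / total

    # Stop adding epochs once this saturates near 1.0 on the training set.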
Chinmay Kak (@ChinmayKak):
@dawkopi @Sagar_Vaze @TiRune @y_m_asano Interesting to see, though doesn't repeating epochs already help in pretraining as well? Also, with SFT the learning from the dataset is much higher because it shifts the model's distribution, so training for more epochs might just help calibrate it.
Chinmay Kak (@ChinmayKak):
@dawkopi @AlexGDimakis This is super interesting! Thanks :) Even Kimi has similar results for rephrasing, but in terms of pretraining tokens.