Dawid Kopiczko (@dawkopi):
Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT. Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close.
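A rough sketch of what a fixed update budget implies here (my reading, not the authors' code): with batch size held constant, the total number of gradient updates is fixed, so a smaller subset is simply revisited for proportionally more epochs.

    # Hypothetical illustration of the fixed-update-budget trade-off.
    # The budget and subset sizes below are assumptions for illustration;
    # note that 1,600 samples x 32 epochs = 51,200, matching one epoch
    # over ~51K samples.
    TOTAL_SAMPLE_UPDATES = 51_200

    def epochs_for(subset_size, budget=TOTAL_SAMPLE_UPDATES):
        # With a fixed batch size, updates scale with samples seen,
        # so epochs = budget / subset size.
        return budget // subset_size

    for n in (1_600, 12_800, 51_200):
        print(f"{n:>6} unique samples -> {epochs_for(n):>2} epochs")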
Dawid Kopiczko (@dawkopi):
When training Olmo-3 7B on the Dolci SFT dataset, 16 epochs on a random subset of 400 samples yields 80% (pass@16) on AIME'24, 63% on AIME'25, and 62% on GPQA, while one epoch on over 51K samples yields 47% on AIME'24, 50% on AIME'25, and 24% on GPQA.
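For reference, pass@k here is presumably the standard unbiased estimator from Chen et al. (2021); the thread doesn't show the evaluation code, so treat this as an assumption:

    from math import comb

    def pass_at_k(n, c, k):
        """n: generations per problem, c: correct ones, k: sampling budget."""
        # Probability that at least one of k drawn samples is correct.
        if n - c < k:
            return 1.0  # fewer than k incorrect samples exist
        return 1.0 - comb(n - c, k) / comb(n, k)

    # With n = k = 16 this reduces to "any of the 16 generations is correct".
    print(pass_at_k(n=16, c=3, k=16))  # -> 1.0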
Dawid Kopiczko (@dawkopi):
Part of the story here is termination rate. Models trained with more epochs are much better at concluding their reasoning. With 1 epoch on 51K samples and a token limit of 32K, the model only concludes 24% of the time. With 32 epochs on 1600 samples, that jumps to 89%.
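One plausible way to measure this (assumed definition; the thread doesn't spell it out): a generation "concludes" if it emits EOS instead of being cut off at the 32K-token cap.

    def termination_rate(generations, eos_id):
        """generations: lists of token ids, truncated at the 32K cap."""
        done = sum(1 for g in generations if g and g[-1] == eos_id)
        return done / len(generations)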
Dawid Kopiczko (@dawkopi):
So when does repetition stop helping? Turns out token accuracy on the training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion.
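A minimal sketch of that stopping criterion, assuming "token accuracy" means teacher-forced next-token accuracy over the training responses (my reading; the thread doesn't define it precisely):

    import torch

    @torch.no_grad()
    def token_accuracy(model, batches, ignore_id=-100):
        # HF-style causal LM assumed: model(input_ids) exposes .logits.
        correct, total = 0, 0
        for input_ids, labels in batches:
            # Causal LM: logits at position t predict the token at t+1.
            logits = model(input_ids).logits[:, :-1]
            targets = labels[:, 1:]
            mask = targets != ignore_id  # skip prompt/padding positions
            preds = logits.argmax(dim=-1)
            correct += (preds[mask] == targets[mask]).sum().item()
            total += mask.sum().item()
        return correct / total

    # Stop adding epochs once this saturates near 1.0 on the training set.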
Chinmay Kak (@ChinmayKak):
@dawkopi @Sagar_Vaze @TiRune @y_m_asano Interesting to see, though doesn't repeating epochs already help in pretraining as well? Also, with SFT the learning from the dataset is much higher because it shifts the model's distribution, so training for more epochs might just help calibrate it.
Chinmay Kak (@ChinmayKak):
@dawkopi @AlexGDimakis This is super interesting! Thanks :) Even Kimi has similar results for rephrasing, but in terms of pretraining tokens.