Dawid Kopiczko @dawkopi
39 posts · Joined April 2020 · 459 Following, 134 Followers
Dawid Kopiczko @dawkopi:
@DimitrisPapail (I'm uploading all ckpts to HF, so it will be possible to play with models trained with diff epoch-samples ratio)
Dimitris Papailiopoulos @DimitrisPapail:
1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?" Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating! We study why this🔁 happens and why increasing temp is a band-aid
Dawid Kopiczko @dawkopi:
Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT. Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close.
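The fixed-update-budget setup above can be made concrete with a bit of arithmetic: under a fixed number of optimizer steps, shrinking the number of unique samples just means seeing each one more times. This is a hedged illustration; the step budget and batch size here are made-up numbers chosen so the 16-epochs-on-400-samples case from the thread falls out, not values from the paper.

```python
# Hypothetical illustration of the fixed-update-budget trade-off:
# fewer unique samples at the same step budget => more epochs per sample.

def epochs_for_budget(total_steps: int, n_samples: int, batch_size: int) -> float:
    """Number of passes over the data implied by a fixed optimizer-step budget."""
    steps_per_epoch = n_samples / batch_size
    return total_steps / steps_per_epoch

# illustrative budget (NOT from the paper): 200 updates at batch size 32
budget, bs = 200, 32
for n in (400, 3200, 51200):
    print(n, epochs_for_budget(budget, n, bs))
# 400 samples -> 16 epochs; 51.2K samples -> a fraction of one epoch
```

The point of the sketch is only that "more epochs on fewer samples" and "one epoch on more samples" spend the same compute, which is what makes the benchmark comparison fair.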
Chinmay Kak @ChinmayKak:
@dawkopi @Sagar_Vaze @TiRune @y_m_asano Interesting to see, though doesn't this already help in pretraining as well, training on repeated epochs? Also with SFT, the learning out of the dataset is much higher due to it being a shift in the model's distribution, so training for more epochs might just help calibrate it+
Dawid Kopiczko @dawkopi:
@ysu_ChatData total tokens seen during training is more or less the same within each update budget; as we report in the paper -- there are "standard" overfitting signs like train set memorization and rising val loss, but the model generalizes well nevertheless
Yongrui Su @ysu_ChatData:
Interesting. My guess is it is more about optimizing on a stable set of reasoning patterns than covering more surface area. Did you control for total tokens seen versus steps, and do you see any overfitting signals like worse out of distribution or shorter chains of thought? Also curious if curriculum style mixing changes the result.
Yacine Mahdid @yacinelearning:
this fine wintery thursday we are going to review the RAFT algorithm for LLM finetuning, also colloquially referred to as "rejection sampling fine-tuning". now you might be a bit confused and ask "yacine why are we reviewing for like 3h an algorithm from 2023 for model alignment???"
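The "rejection sampling fine-tuning" idea mentioned above can be sketched in a few lines: sample several completions per prompt, score them, and keep only the best one for ordinary SFT. This is a minimal toy, not the paper's implementation; `generate` and `reward` are stand-ins for a real policy and reward model.

```python
# Minimal RAFT-style rejection-sampling sketch.
# `generate` and `reward` are hypothetical stand-ins, not a real model API.
import random

def generate(prompt: str) -> str:
    # stand-in for sampling one completion from the current policy
    return prompt + " -> answer_" + str(random.randint(0, 9))

def reward(prompt: str, completion: str) -> float:
    # stand-in for a reward model / verifier score
    return random.random()

def raft_round(prompts, k=4):
    """One RAFT round: keep the highest-reward of k sampled completions per prompt."""
    kept = []
    for p in prompts:
        candidates = [generate(p) for _ in range(k)]
        best = max(candidates, key=lambda c: reward(p, c))
        kept.append((p, best))
    return kept  # this filtered set is then used for ordinary SFT

dataset = raft_round(["q1", "q2"], k=4)
```

The key design point is that the model is only ever trained on its own highest-reward samples, which is why the method is often framed as a simple alternative to full RL alignment.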
Dawid Kopiczko @dawkopi:
(too late to edit, config matching mentioned results is: "16 epochs on a random subset of 3.2K* samples") 16 epochs on 400 samples yields: 83% -- AIME'24, 63% -- AIME'25, 66% -- GPQA.
Dawid Kopiczko @dawkopi:
When training Olmo-3 7B on Dolci SFT dataset, 16 epochs on a random subset of 400 samples leads to: 80% (pass@16) on AIME'24, 63% on AIME'25, 62% on GPQA; while one epoch on over 51K samples yields: 47% -- AIME'24, 50% -- AIME'25, 24% -- GPQA.
Dawid Kopiczko @dawkopi:
So when does repetition stop helping? Turns out token accuracy on training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion.
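The stopping criterion described above (stop repeating epochs once greedy token accuracy on the training set saturates) is easy to sketch. This is a hedged toy version: the shapes and the 99.9% threshold are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a token-accuracy stopping criterion for repeated-epoch SFT.

def token_accuracy(logits, targets):
    """Fraction of positions where the argmax (greedy) token equals the target."""
    correct = 0
    for step_logits, tgt in zip(logits, targets):
        pred = max(range(len(step_logits)), key=step_logits.__getitem__)
        correct += (pred == tgt)
    return correct / len(targets)

def should_stop(logits, targets, threshold=0.999):
    # threshold is an assumption: "~100% token accuracy" from the tweet
    return token_accuracy(logits, targets) >= threshold

# toy example: 3 positions over a 4-token vocabulary
logits = [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]
targets = [1, 0, 3]
print(token_accuracy(logits, targets))  # 1.0 -> stop adding epochs
```

In practice this would be computed over the whole training set after each epoch, and training halts at the first epoch where the metric saturates.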
Dawid Kopiczko @dawkopi:
@DimitrisPapail there were *signs* that the phenomenon exists tho -- many tech reports mention multiple epochs in SFT stage, or eg. this paper (arxiv.org/abs/2502.03387) training for 15 epochs on 800 samples; but they focus on data quality, and do not ablate epochs
Dimitris Papailiopoulos @DimitrisPapail:
@dawkopi ok reading a bit deeper. this seems kinda crazy :) eg that 16-32 epochs are working so well and are kinda optimal?
Dawid Kopiczko @dawkopi:
@DimitrisPapail yup, for 7-8B models and this dataset it seems optimal; but for example 4B model gets saturated around 4-8; it's either implicitly due to smaller model, or explicitly due to larger optimal learning rate (3e-5 vs 2e-5 for 7-8B models)
Dawid Kopiczko @dawkopi:
@DimitrisPapail yeah, it looks like standard overfitting as val loss goes up, while train loss goes to 0 -- but the model generalizes well; we can train on 200 generic conversation samples which demonstrate reasoning patterns, and the model starts solving ~40% of AIME'25 problems
Dawid Kopiczko @dawkopi:
@AlexGDimakis might be related to our recent findings: arxiv.org/abs/2602.11149 in short, for a fixed compute budget, LMs benefit from repeated exposure to the same data -- reasoning trajectories conclude more often, and benchmark scores are *a lot* better
Alex Dimakis @AlexGDimakis:
The multiple answers mystery is the most surprising thing we stumbled on from OpenThoughts: Sampling multiple answers for the same question is better than having more questions, each answered once.

To explain: Say you are creating a dataset of questions and answers to SFT a reasoning llm. You can take 1000 questions (eg from stackexchange) and answer them with deepseekR1. Or you can take 500 questions (from the same distribution) and answer each question *twice* independently with deepseekR1. Which one is a better dataset?

Surprisingly, if you re-answer the same questions, it's a better dataset for distillation (at the same size) and this was a robust finding from OpenThoughts across models and data sources.

We have no theoretical understanding why, and no way to predict how many times to repeat. Clearly it must stop at some point (take one question and answer it 1000 times won't be a good SFT dataset) but we don't know how to predict this, beyond empirically trying.
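The two dataset constructions compared in the tweet above can be sketched directly: the same total size, built either as 1000 questions answered once each, or 500 questions each answered twice independently. Here `answer` is a stand-in for sampling a completion from a teacher model such as DeepSeek-R1; everything else is illustrative.

```python
# Sketch of the "repeated answers" vs "unique questions" dataset construction.
# `answer` is a hypothetical stand-in for an independently sampled teacher completion.
import random

def answer(question: str) -> str:
    # each call simulates a fresh, independent sample from the teacher
    return f"{question}: trace_{random.randint(0, 999)}"

def build_dataset(questions, answers_per_question):
    """(question, answer) pairs; repeated questions get independent answers."""
    return [(q, answer(q)) for q in questions for _ in range(answers_per_question)]

unique   = build_dataset([f"q{i}" for i in range(1000)], answers_per_question=1)
repeated = build_dataset([f"q{i}" for i in range(500)],  answers_per_question=2)
assert len(unique) == len(repeated) == 1000  # same SFT dataset size
```

The comparison is fair precisely because both datasets have identical size; only the unique-question count differs, which is the variable the OpenThoughts finding isolates.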
Dawid Kopiczko @dawkopi:
@a_weers here's a known example of the "repetition advantage": arxiv.org/abs/2502.03387 they get really good performance by training on 800 samples with *15 epochs*; but the paper focuses on data quality, and doesn't ablate the epoch count -- the main driver of gains
Alex Weers @a_weers:
@dawkopi Good point, and thanks for the suggested paper! That is a surprising result (at least to me), but it is a great finding! Also amazing that you use OLMO (+Qwen), that makes it very trustworthy :)
Alex Weers @a_weers:
Finished reading, here is the summary:

SFT before RL usually improves performance or at least saves time/compute. However, when SFT isn't the last stage, measuring a "good" or "successful" SFT becomes non-obvious. Instead of optimizing for best performance on downstream tasks, SFT should prepare the model for the subsequent RL stage. This paper shows that strong SFT performance is indeed a bad indicator for success in the subsequent RL stage, with stronger SFT checkpoints often underperforming weaker ones after RL.

The authors introduce PEAR, a simple weighting scheme during SFT based on the similarity between the behavior policy (the one that generated the SFT data) and the model policy. The key insight: if the continuation after a token is implausible under the current model, learning from that token provides little useful signal, since RL will sample from its own distribution and rarely visit those trajectories anyway. PEAR down-weights such tokens. This weighting can operate at the sequence, block, or token level, trading granularity for stability. They successfully demonstrate that this weighting helps.

Interestingly, their experiments use a Qwen model for SFT data generation while training other Qwen models. I was surprised by this, since I expected those policies to already be well aligned. The only drawback is that the method requires access to the SFT data generation policy. This is only available for synthetic data, and even then, it incurs a non-negligible cost. However, I like the idea and think optimizing SFT with respect to subsequent RL stages is a good motivation.

Great paper by @dylan_works_ et al.!
[Quoted tweet] Alex Weers @a_weers:
Today's first (of two) read will be on how to optimize the SFT stage for better subsequent RL performance. Apparently strong SFT results are not the optimal starting point for RL, let's find out more.
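The down-weighting idea in the summary above can be sketched as a weighted SFT loss. To be clear about assumptions: the weight form here (a clipped likelihood ratio between the current model and the behavior policy) is my stand-in for the general idea, not the paper's exact PEAR rule, and the log-probabilities are toy numbers.

```python
# Hedged sketch of token-level down-weighting for SFT data:
# tokens the current model finds implausible relative to the behavior
# policy contribute less to the loss. The clipped-ratio weight is an
# assumption for illustration, not the paper's exact scheme.
import math

def token_weights(model_logps, behavior_logps, clip=1.0):
    """Per-token weights from log-prob ratios, clipped for stability."""
    weights = []
    for lp_model, lp_behavior in zip(model_logps, behavior_logps):
        ratio = math.exp(lp_model - lp_behavior)  # ~1 when the policies agree
        weights.append(min(ratio, clip))
    return weights

def weighted_sft_loss(model_logps, behavior_logps):
    """Weighted negative log-likelihood over the SFT tokens."""
    w = token_weights(model_logps, behavior_logps)
    return -sum(wi * lp for wi, lp in zip(w, model_logps)) / len(w)

# first token: plausible under the model; second: very implausible,
# so its (large) NLL is almost entirely masked out of the loss
loss = weighted_sft_loss([-0.1, -8.0], [-0.2, -0.3])
```

The same scheme coarsens naturally: averaging the ratios over a block or a whole sequence before clipping gives the block- and sequence-level variants the summary mentions, trading granularity for stability.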
Dawid Kopiczko @dawkopi:
@a_weers yeah, the results are a bit counterintuitive, but seem to be robust for long-CoT data; I'll upload all the checkpoints to the HF hub soon, so it can be easily verified 👌
Dawid Kopiczko @dawkopi:
@a_weers curious how the performance of this method looks when training for multiple epochs; it seems that standard SFT benefits *a lot* from repetition; e.g. with Qwen3-8B-Base you can get up to ~31% on AIME24/25 solely via SFT, without RL, if you do enough epochs