Dawid Kopiczko @dawkopi
39 posts · Joined April 2020 · 459 Following, 134 Followers
Dawid Kopiczko @dawkopi:
@DimitrisPapail (I'm uploading all ckpts to HF, so it will be possible to play with models trained with diff epoch-samples ratio)
Dimitris Papailiopoulos @DimitrisPapail:
1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?" Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating! We study why this🔁 happens and why increasing temp is a band-aid
Dawid Kopiczko @dawkopi:
Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT. Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close.
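The fixed-update-budget setup above can be made concrete with a bit of arithmetic: under a fixed number of optimizer steps, shrinking the number of unique samples just means seeing each one more times. This is a hedged illustration; the step budget and batch size here are made-up numbers chosen so the 16-epochs-on-400-samples case from the thread falls out, not values from the paper.

```python
# Hypothetical illustration of the fixed-update-budget trade-off:
# fewer unique samples at the same step budget => more epochs per sample.

def epochs_for_budget(total_steps: int, n_samples: int, batch_size: int) -> float:
    """Number of passes over the data implied by a fixed optimizer-step budget."""
    steps_per_epoch = n_samples / batch_size
    return total_steps / steps_per_epoch

# illustrative budget (NOT from the paper): 200 updates at batch size 32
budget, bs = 200, 32
for n in (400, 3200, 51200):
    print(n, epochs_for_budget(budget, n, bs))
# 400 samples -> 16 epochs; 51.2K samples -> a fraction of one epoch
```

The point of the sketch is only that "more epochs on fewer samples" and "one epoch on more samples" spend the same compute, which is what makes the benchmark comparison fair.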
Chinmay Kak @ChinmayKak:
@dawkopi @Sagar_Vaze @TiRune @y_m_asano Interesting to see, though doesn't this already help in pretraining as well, training on repeated epochs? Also with SFT, the learning out of the dataset is much higher due to it being a shift in the model's distribution, so training for more epochs might just help calibrate it+
Dawid Kopiczko @dawkopi:
@ysu_ChatData total tokens seen during training is more or less the same within each update budget; as we report in the paper -- there are "standard" overfitting signs like train set memorization and rising val loss, but the model generalizes well nevertheless
Yongrui Su @ysu_ChatData:
Interesting. My guess is it is more about optimizing on a stable set of reasoning patterns than covering more surface area. Did you control for total tokens seen versus steps, and do you see any overfitting signals like worse out of distribution or shorter chains of thought? Also curious if curriculum style mixing changes the result.
Yacine Mahdid @yacinelearning:
this fine wintery thursday we are going to review the RAFT algorithm for LLM finetuning, also colloquially referred to as "rejection sampling fine-tuning". now you might be a bit confused and ask "yacine why are we reviewing for like 3h an algorithm from 2023 for model alignment???"
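The "rejection sampling fine-tuning" idea mentioned above can be sketched in a few lines: sample several completions per prompt, score them, and keep only the best one for ordinary SFT. This is a minimal toy, not the paper's implementation; `generate` and `reward` are stand-ins for a real policy and reward model.

```python
# Minimal RAFT-style rejection-sampling sketch.
# `generate` and `reward` are hypothetical stand-ins, not a real model API.
import random

def generate(prompt: str) -> str:
    # stand-in for sampling one completion from the current policy
    return prompt + " -> answer_" + str(random.randint(0, 9))

def reward(prompt: str, completion: str) -> float:
    # stand-in for a reward model / verifier score
    return random.random()

def raft_round(prompts, k=4):
    """One RAFT round: keep the highest-reward of k sampled completions per prompt."""
    kept = []
    for p in prompts:
        candidates = [generate(p) for _ in range(k)]
        best = max(candidates, key=lambda c: reward(p, c))
        kept.append((p, best))
    return kept  # this filtered set is then used for ordinary SFT

dataset = raft_round(["q1", "q2"], k=4)
```

The key design point is that the model is only ever trained on its own highest-reward samples, which is why the method is often framed as a simple alternative to full RL alignment.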
Dawid Kopiczko @dawkopi:
(too late to edit, config matching mentioned results is: "16 epochs on a random subset of 3.2K* samples") 16 epochs on 400 samples yields: 83% -- AIME'24, 63% -- AIME'25, 66% -- GPQA.
Dawid Kopiczko @dawkopi:
When training Olmo-3 7B on Dolci SFT dataset, 16 epochs on a random subset of 400 samples leads to: 80% (pass@16) on AIME'24, 63% on AIME'25, 62% on GPQA; while one epoch on over 51K samples yields: 47% -- AIME'24, 50% -- AIME'25, 24% -- GPQA.
Dawid Kopiczko @dawkopi:
So when does repetition stop helping? Turns out token accuracy on training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion.
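The stopping criterion described above (stop repeating epochs once greedy token accuracy on the training set saturates) is easy to sketch. This is a hedged toy version: the shapes and the 99.9% threshold are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a token-accuracy stopping criterion for repeated-epoch SFT.

def token_accuracy(logits, targets):
    """Fraction of positions where the argmax (greedy) token equals the target."""
    correct = 0
    for step_logits, tgt in zip(logits, targets):
        pred = max(range(len(step_logits)), key=step_logits.__getitem__)
        correct += (pred == tgt)
    return correct / len(targets)

def should_stop(logits, targets, threshold=0.999):
    # threshold is an assumption: "~100% token accuracy" from the tweet
    return token_accuracy(logits, targets) >= threshold

# toy example: 3 positions over a 4-token vocabulary
logits = [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]
targets = [1, 0, 3]
print(token_accuracy(logits, targets))  # 1.0 -> stop adding epochs
```

In practice this would be computed over the whole training set after each epoch, and training halts at the first epoch where the metric saturates.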
Dawid Kopiczko @dawkopi:
@DimitrisPapail there were *signs* that the phenomenon exists tho -- many tech reports mention multiple epochs in SFT stage, or eg. this paper (arxiv.org/abs/2502.03387) training for 15 epochs on 800 samples; but they focus on data quality, and do not ablate epochs
Dimitris Papailiopoulos @DimitrisPapail:
@dawkopi ok reading a bit deeper. this seems kinda crazy :) eg that 16-32 epochs are working so well and are kinda optimal?
Dawid Kopiczko @dawkopi:
@DimitrisPapail yup, for 7-8B models and this dataset it seems optimal; but for example 4B model gets saturated around 4-8; it's either implicitly due to smaller model, or explicitly due to larger optimal learning rate (3e-5 vs 2e-5 for 7-8B models)
Dawid Kopiczko @dawkopi:
@DimitrisPapail yeah, it looks like standard overfitting as val loss goes up, while train loss goes to 0 -- but the model generalizes well; we can train on 200 generic conversation samples which demonstrate reasoning patterns, and the model starts solving ~40% of AIME'25 problems
Dawid Kopiczko @dawkopi:
@AlexGDimakis might be related to our recent findings: arxiv.org/abs/2602.11149 in short, for a fixed compute budget, LMs benefit from repeated exposure to the same data -- reasoning trajectories conclude more often, and benchmark scores are *a lot* better
Alex Dimakis @AlexGDimakis:
The multiple answers mystery is the most surprising thing we stumbled on from OpenThoughts: Sampling multiple answers for the same question is better than having more questions, each answered once.

To explain: Say you are creating a dataset of questions and answers to SFT a reasoning llm. You can take 1000 questions (eg from stackexchange) and answer them with deepseekR1. Or you can take 500 questions (from the same distribution) and answer each question *twice* independently with deepseekR1. Which one is a better dataset?

Surprisingly, if you re-answer the same questions, it's a better dataset for distillation (at the same size) and this was a robust finding from OpenThoughts across models and data sources.

We have no theoretical understanding why, and no way to predict how many times to repeat. Clearly it must stop at some point (take one question and answer it 1000 times won't be a good SFT dataset) but we don't know how to predict this, beyond empirically trying.
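The two dataset constructions compared in the tweet above can be sketched directly: the same total size, built either as 1000 questions answered once each, or 500 questions each answered twice independently. Here `answer` is a stand-in for sampling a completion from a teacher model such as DeepSeek-R1; everything else is illustrative.

```python
# Sketch of the "repeated answers" vs "unique questions" dataset construction.
# `answer` is a hypothetical stand-in for an independently sampled teacher completion.
import random

def answer(question: str) -> str:
    # each call simulates a fresh, independent sample from the teacher
    return f"{question}: trace_{random.randint(0, 999)}"

def build_dataset(questions, answers_per_question):
    """(question, answer) pairs; repeated questions get independent answers."""
    return [(q, answer(q)) for q in questions for _ in range(answers_per_question)]

unique   = build_dataset([f"q{i}" for i in range(1000)], answers_per_question=1)
repeated = build_dataset([f"q{i}" for i in range(500)],  answers_per_question=2)
assert len(unique) == len(repeated) == 1000  # same SFT dataset size
```

The comparison is fair precisely because both datasets have identical size; only the unique-question count differs, which is the variable the OpenThoughts finding isolates.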
Dawid Kopiczko @dawkopi:
@a_weers here's a known example of the "repetition advantage": arxiv.org/abs/2502.03387 they get really good performance by training on 800 samples with *15 epochs*; but the paper focuses on data quality, and doesn't ablate the epoch count -- the main driver of gains
Alex Weers @a_weers:
@dawkopi Good point, and thanks for the suggested paper! That is a surprising result (at least to me), but it is a great finding! Also amazing that you use OLMO (+Qwen), that makes it very trustworthy :)
Alex Weers @a_weers:
Finished reading, here is the summary:

SFT before RL usually improves performance or at least saves time/compute. However, when SFT isn't the last stage, measuring a "good" or "successful" SFT becomes non-obvious. Instead of optimizing for best performance on downstream tasks, SFT should prepare the model for the subsequent RL stage. This paper shows that strong SFT performance is indeed a bad indicator for success in the subsequent RL stage, with stronger SFT checkpoints often underperforming weaker ones after RL.

The authors introduce PEAR, a simple weighting scheme during SFT based on the similarity between the behavior policy (the one that generated the SFT data) and the model policy. The key insight: if the continuation after a token is implausible under the current model, learning from that token provides little useful signal, since RL will sample from its own distribution and rarely visit those trajectories anyway. PEAR down-weights such tokens. This weighting can operate at the sequence, block, or token level, trading granularity for stability. They successfully demonstrate that this weighting helps.

Interestingly, their experiments use a Qwen model for SFT data generation while training other Qwen models. I was surprised by this, since I expected those policies to already be well aligned. The only drawback is that the method requires access to the SFT data generation policy. This is only available for synthetic data, and even then, it incurs a non-negligible cost. However, I like the idea and think optimizing SFT with respect to subsequent RL stages is a good motivation.

Great paper by @dylan_works_ et al.!
[Quoted tweet] Alex Weers @a_weers:
Today's first (of two) read will be on how to optimize the SFT stage for better subsequent RL performance. Apparently strong SFT results are not the optimal starting point for RL, let's find out more.
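The down-weighting idea in the summary above can be sketched as a weighted SFT loss. To be clear about assumptions: the weight form here (a clipped likelihood ratio between the current model and the behavior policy) is my stand-in for the general idea, not the paper's exact PEAR rule, and the log-probabilities are toy numbers.

```python
# Hedged sketch of token-level down-weighting for SFT data:
# tokens the current model finds implausible relative to the behavior
# policy contribute less to the loss. The clipped-ratio weight is an
# assumption for illustration, not the paper's exact scheme.
import math

def token_weights(model_logps, behavior_logps, clip=1.0):
    """Per-token weights from log-prob ratios, clipped for stability."""
    weights = []
    for lp_model, lp_behavior in zip(model_logps, behavior_logps):
        ratio = math.exp(lp_model - lp_behavior)  # ~1 when the policies agree
        weights.append(min(ratio, clip))
    return weights

def weighted_sft_loss(model_logps, behavior_logps):
    """Weighted negative log-likelihood over the SFT tokens."""
    w = token_weights(model_logps, behavior_logps)
    return -sum(wi * lp for wi, lp in zip(w, model_logps)) / len(w)

# first token: plausible under the model; second: very implausible,
# so its (large) NLL is almost entirely masked out of the loss
loss = weighted_sft_loss([-0.1, -8.0], [-0.2, -0.3])
```

The same scheme coarsens naturally: averaging the ratios over a block or a whole sequence before clipping gives the block- and sequence-level variants the summary mentions, trading granularity for stability.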
Dawid Kopiczko @dawkopi:
@a_weers yeah, the results are a bit counterintuitive, but seem to be robust for long-CoT data; I'll upload all the checkpoints to the HF hub soon, so it can be easily verified 👌
Dawid Kopiczko @dawkopi:
@a_weers curious how the performance of this method looks when training for multiple epochs; it seems that standard SFT benefits *a lot* from repetition; e.g. with Qwen3-8B-Base you can get up to ~31% on AIME24/25 solely via SFT, without RL, if you do enough epochs