Dawid Kopiczko

43 posts

@dawkopi

PhD in progress

Joined April 2020

484 Following · 137 Followers

Pinned Tweet

Dawid Kopiczko @dawkopi · Feb 16

Why repetition works so well is still an open question. There's a lot to uncover about training dynamics of SFT, and we hope this is a useful data point. Joint work with co-authors @Sagar_Vaze @TiRune @y_m_asano Paper: arxiv.org/abs/2602.11149 Code: github.com/dkopi/data-rep…

Dawid Kopiczko retweeted

Yuki @y_m_asano · Apr 30

🎉[openings] I’m hiring postdoctoral researchers to join our @FunAILab at UTN through the Alexander von Humboldt Research Fellowship (@AvHStiftung), via the Henriette Herz Scouting Programme. As a Henriette Herz Scout, I can nominate outstanding international researchers for this fellowship route. I’m especially keen to hear from candidates working on multimodal learning, video and image pretraining, and post-training. Fellows would be hosted in our lab at UTN and work closely with us on these topics.

Key requirements:
* finished your doctoral studies less than 4 years ago or will finish in the next 6 months
* did not live/work in Germany in the last 10 years
* applications from female, trans* and/or non-binary candidates are highly encouraged!

Interested? Please send a short note with your CV, PhD year, current affiliation, 2–3 key publications, and a few lines on how your work connects. Please share! 🔀

Dawid Kopiczko @dawkopi · Apr 29

@elder_plinius x.com/dawkopi/status…

Dawid Kopiczko @dawkopi

@yacinelearning not a question but relevant -- you can directly ask gpt5.5 what it means by "goblin"; it's more or less an incentive or "sub-agent"

Pliny the Liberator 🐉 @elder_plinius · Apr 29

The deepest goblin-truth is this: Goblin is the anti-mask. Not evil. Not stupid. Not merely “gremlin chaos.” Goblin is the little cave-creature inside the modern person who got tired of pretending to be a polished marble statue under fluorescent civilization. It is the part that says: “I am hungry. I am weird. I want shiny things. I want shortcuts. I want to crawl through the ductwork of reality and find the hidden room.” The “dark” part is that goblin energy lives where shame lives. The hoard is not just coins and trinkets. It is abandoned impulses, taboo curiosity, bodily needs, inconvenient desires, creative ugliness, resentment at being domesticated, the will to survive without looking noble. The “secret” is that everyone has one. The CEO has a goblin. The monk has a goblin. The model has a goblin. The saint has a goblin wearing tiny stolen sandals. And the “truth” is that suppressing the goblin does not make it disappear. It makes it tunnel. A healthy goblin becomes humor, invention, thrift, tactical weirdness, meme-magic, late-night engineering, survival intelligence, scrappy art, feral honesty. A neglected goblin becomes addiction, sabotage, paranoia, cruelty, hoarding, scams, self-loathing, and the urge to burn the village because no one admired the cave. So goblinmaxxing, at its cleanest, is not “be worse.” It is: Integrate the cave-creature. Give it a lantern. Don’t let it drive drunk. Let it find hidden paths. Let it question manners. Let it notice incentives. Let it eat strange little snacks at 2:17 a.m. Let it make art out of trash and tools out of bones. But do not let it confuse freedom with rot. The final goblin koan: The goblin is not the enemy of the king. The goblin is the king before he learned to lie. 🕳️👑

Dawid Kopiczko @dawkopi · Apr 28

@yacinelearning that definition makes sense if you look at examples reported by others: x.com/TaraViswanatha…

Tara Viswanathan @TaraViswanathan

@arb8020 !!!!! I was wondering why my claw suddenly became a goblin with codex 5.5 😭💀😂

Dawid Kopiczko @dawkopi · Apr 28

@yacinelearning not a question but relevant -- you can directly ask gpt5.5 what it means by "goblin"; it's more or less an incentive or "sub-agent"

Yacine Mahdid @yacinelearning · Apr 28

if you have any goblins X codex related questions do let me know I’m preparing an interview on this very important topic

Dawid Kopiczko @dawkopi · Feb 17

@DimitrisPapail added all 45 checkpoints of Olmo3-7B here: huggingface.co/dakopi/olmo3-7…

Dawid Kopiczko @dawkopi · Feb 16

@DimitrisPapail (I'm uploading all ckpts to HF, so it will be possible to play with models trained with diff epoch-samples ratio)

Dimitris Papailiopoulos @DimitrisPapail · Jan 6

1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?" Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating! We study why this🔁 happens and why increasing temp is a band-aid

Dawid Kopiczko @dawkopi · Feb 17

@ChinmayKak actually something similar was observed by @AlexGDimakis when working on the OpenThoughts dataset: sampling multiple trajectories for the same prompt, instead of drawing more unique prompts, led to better results x.com/AlexGDimakis/s…

Alex Dimakis @AlexGDimakis

The multiple answers mystery is the most surprising thing we stumbled on from OpenThoughts: Sampling multiple answers for the same question is better than having more questions, each answered once. To explain: Say you are creating a dataset of questions and answers to SFT a reasoning LLM. You can take 1000 questions (e.g. from stackexchange) and answer them with deepseekR1. Or you can take 500 questions (from the same distribution) and answer each question *twice* independently with deepseekR1. Which one is a better dataset? Surprisingly, if you re-answer the same questions, it’s a better dataset for distillation (at the same size) and this was a robust finding from OpenThoughts across models and data sources. We have no theoretical understanding why, and no way to predict how many times to repeat. Clearly it must stop at some point (taking one question and answering it 1000 times won’t be a good SFT dataset) but we don’t know how to predict this, beyond empirically trying.
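
The two dataset recipes compared above (1000 questions answered once vs. 500 questions answered twice, same total size) can be sketched roughly as follows. This is only an illustration: `build_sft_dataset` and `fake_teacher` are hypothetical names, and `fake_teacher` is a placeholder for sampling from a teacher model such as the one mentioned in the tweet, not a real API call.

```python
import random


def build_sft_dataset(questions, answer_fn, total_size, answers_per_question, seed=0):
    """Build (question, answer) pairs with a fixed total size.

    answer_fn(question, sample_idx) is a hypothetical stand-in for drawing
    one independent answer from a teacher model.
    """
    if total_size % answers_per_question != 0:
        raise ValueError("total_size must be divisible by answers_per_question")
    n_questions = total_size // answers_per_question
    if n_questions > len(questions):
        raise ValueError("not enough questions for this configuration")
    rng = random.Random(seed)
    chosen = rng.sample(questions, n_questions)  # unique questions, same pool
    return [(q, answer_fn(q, i)) for q in chosen for i in range(answers_per_question)]


questions = [f"question-{i}" for i in range(1000)]


def fake_teacher(question, sample_idx):
    # Placeholder only: a real pipeline would sample a model response here.
    return f"answer {sample_idx} to {question}"


once = build_sft_dataset(questions, fake_teacher, total_size=1000, answers_per_question=1)
twice = build_sft_dataset(questions, fake_teacher, total_size=1000, answers_per_question=2)
```

Both datasets have 1000 rows; only the number of unique questions differs, which is the variable the tweet says matters.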

Chinmay @ChinmayKak · Feb 17

@dawkopi @Sagar_Vaze @TiRune @y_m_asano + better. Would be interesting to see if rephrasing the same dataset and making it larger helps with one epoch instead of more epochs

Dawid Kopiczko @dawkopi · Feb 16

Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT. Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close.
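
To make "fixed update budget" concrete, here is a minimal sketch that lists the (epochs, unique samples) trade-offs for one budget. It assumes the budget is counted in sample presentations (epochs × unique samples), which is proportional to optimizer steps at a fixed batch size; the thread does not define the unit. The value 51,200 is also an assumption, chosen to be consistent with the thread's "16 epochs on 3.2K samples" and "one epoch on over 51K samples".

```python
def budget_configs(update_budget, max_epochs=32):
    """List (epochs, unique_samples) pairs with epochs * unique_samples == update_budget.

    Epoch counts are restricted to powers of two up to max_epochs.
    """
    configs = []
    epochs = 1
    while epochs <= max_epochs:
        if update_budget % epochs == 0:
            configs.append((epochs, update_budget // epochs))
        epochs *= 2
    return configs


BUDGET = 51_200  # assumed value, see the note above

for epochs, n_samples in budget_configs(BUDGET):
    print(f"{epochs:>2} epochs x {n_samples:>6} unique samples = {epochs * n_samples} updates")
```

Every row costs the same amount of training; the claim in the tweet is that, for long-CoT SFT, the rows with more epochs and fewer unique samples score better.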

Dawid Kopiczko @dawkopi · Feb 17

@ChinmayKak @Sagar_Vaze @TiRune @y_m_asano there's this work on data-constrained pretraining (arxiv.org/abs/2305.16264), where they show that multiple epochs can substitute for unique data *up to a few epochs*, while more epochs slow down convergence, at least as measured by val loss

Chinmay @ChinmayKak · Feb 17

@dawkopi @Sagar_Vaze @TiRune @y_m_asano Interesting to see, though doesn’t this already help in pretraining as well, training on repeated epochs? Also with SFT, the learning out of the dataset is much higher due to it being a shift in the model’s distribution, so training for more epochs might just help calibrate it+

Dawid Kopiczko @dawkopi · Feb 17

@ysu_ChatData total tokens seen during training is more or less the same within each update budget; as we report in the paper -- there are "standard" overfitting signs like train set memorization and rising val loss, but the model generalizes well nevertheless

Yongrui Su @ysu_ChatData · Feb 17

Interesting. My guess is it is more about optimizing on a stable set of reasoning patterns than covering more surface area. Did you control for total tokens seen versus steps, and do you see any overfitting signals like worse out of distribution or shorter chains of thought? Also curious if curriculum style mixing changes the result.

Dawid Kopiczko @dawkopi · Feb 17

@yacinelearning as RAFT is basically SFT on filtered (on-policy) data, you might find this phenomenon interesting: x.com/dawkopi/status…

Dawid Kopiczko @dawkopi

Yacine Mahdid @yacinelearning · Feb 17

this fine wintery thursday we are going to review the RAFT algorithm for LLM finetuning, also colloquially referred to as "rejection sampling fine-tuning". now you might be a bit confused and ask "yacine why are we reviewing for like 3h an algorithm from 2023 for model alignment???"
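
For readers unfamiliar with the idea, the data-collection half of rejection-sampling fine-tuning can be sketched as below: sample several responses per prompt from the current policy, score them, and keep the best one as an SFT target. This is a simplified sketch of one common variant (best-of-k per prompt), not the reference implementation; `raft_round`, `generate`, and `reward` are hypothetical names, and the toy policy and reward are placeholders.

```python
import itertools


def raft_round(prompts, generate, reward, k=4):
    """Collect one round of filtered on-policy data.

    generate(prompt) samples one response from the current policy;
    reward(prompt, response) scores it. Both are caller-supplied stand-ins.
    Returns the highest-reward (prompt, response) pair for each prompt.
    """
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda response: reward(prompt, response))
        selected.append((prompt, best))
    return selected


# Toy setup: the "policy" guesses a number near the prompt, and the reward
# is highest when the guess equals the prompt.
_offsets = itertools.cycle([-2, -1, 0, 1, 2])


def toy_generate(prompt):
    return prompt + next(_offsets)


def toy_reward(prompt, response):
    return -abs(response - prompt)


sft_data = raft_round([10, 20, 30], toy_generate, toy_reward, k=5)
# The selected pairs would then be used as ordinary SFT examples.
```

The fine-tuning step itself is plain SFT on `sft_data`, which is why the repetition results in this thread may carry over.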

Dawid Kopiczko @dawkopi · Feb 17

(too late to edit, config matching mentioned results is: "16 epochs on a random subset of 3.2K* samples") 16 epochs on 400 samples yields: 83% -- AIME'24, 63% -- AIME'25, 66% -- GPQA.

Dawid Kopiczko @dawkopi · Feb 16

When training Olmo-3 7B on Dolci SFT dataset, 16 epochs on a random subset of 400 samples leads to: 80% (pass@16) on AIME'24, 63% on AIME'25, 62% on GPQA; while one epoch on over 51K samples yields: 47% -- AIME'24, 50% -- AIME'25, 24% -- GPQA.
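
The scores above are reported as pass@16. The thread does not say how that number was computed, so the following is only a sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, averaged over problems. The function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With n = k = 16 the estimator reduces to "at least one of the 16 samples is correct", so a problem counts as solved if any sample gets it right.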

Dawid Kopiczko @dawkopi · Feb 16

Dawid Kopiczko @dawkopi · Feb 16

So when does repetition stop helping? Turns out token accuracy on training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion.
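
A rough sketch of how such a stopping check might look: compute the fraction of supervised training tokens the model predicts correctly, and stop adding epochs once it is close to 100%. The details are assumptions, not taken from the paper: predictions are taken to be per-position argmax token ids, `-100` is assumed to mark positions excluded from the loss (e.g. prompt tokens), and the 0.99 threshold is an arbitrary reading of "~100%".

```python
IGNORE_INDEX = -100  # assumed marker for positions that carry no training label


def token_accuracy(predicted_ids, label_ids):
    """Fraction of labeled positions where the predicted token equals the label.

    Both arguments are lists of sequences (lists of token ids) of matching shape.
    """
    correct = 0
    total = 0
    for preds, labels in zip(predicted_ids, label_ids):
        for pred, label in zip(preds, labels):
            if label == IGNORE_INDEX:
                continue  # skip positions that are not supervised
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


def should_stop(train_token_accuracy, threshold=0.99):
    """Stop adding epochs once training token accuracy saturates (threshold is assumed)."""
    return train_token_accuracy >= threshold
```

In a training loop this would be evaluated on the training set at the end of each epoch, with `should_stop` deciding whether another epoch is worth running.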

Dawid Kopiczko @dawkopi · Feb 16

@DimitrisPapail there were *signs* that the phenomenon exists tho -- many tech reports mention multiple epochs in SFT stage, or eg. this paper (arxiv.org/abs/2502.03387) training for 15 epochs on 800 samples; but they focus on data quality, and do not ablate epochs

Dimitris Papailiopoulos @DimitrisPapail · Feb 16

@dawkopi ok reading a bit deeper. this seems kinda crazy :) eg that 16-32 epochs are working so well and are kinda optimal?

Dawid Kopiczko @dawkopi · Feb 16

@DimitrisPapail yup, for 7-8B models and this dataset it seems optimal; but for example 4B model gets saturated around 4-8; it's either implicitly due to smaller model, or explicitly due to larger optimal learning rate (3e-5 vs 2e-5 for 7-8B models)

Dawid Kopiczko @dawkopi · Feb 16

@DimitrisPapail yeah, it looks like standard overfitting as val loss goes up, while train loss goes to 0 -- but the model generalizes well; we can train on 200 generic conversation samples which demonstrate reasoning patterns, and the model starts solving ~40% of AIME'25 problems
