Dawid Kopiczko

43 posts

@dawkopi

PhD in progress

Joined April 2020
484 Following · 137 Followers
Dawid Kopiczko reposted
Yuki
Yuki@y_m_asano·
🎉[openings] I’m hiring postdoctoral researchers to join our @FunAILab at UTN through the Alexander von Humboldt Research Fellowship (@AvHStiftung), via the Henriette Herz Scouting Programme. As a Henriette Herz Scout, I can nominate outstanding international researchers for this fellowship route. I’m especially keen to hear from candidates working on multimodal learning, video and image pretraining, and post-training. Fellows would be hosted in our lab at UTN and work closely with us on these topics.
Key requirements:
* finished your doctoral studies less than 4 years ago or will finish in the next 6 months
* did not live/work in Germany in the last 10 years
* applications from female, trans* and/or non-binary candidates are highly encouraged!
Interested? Please send a short note with your CV, PhD year, current affiliation, 2–3 key publications, and a few lines on how your work connects. Please share! 🔀
1
23
61
5.5K
Pliny the Liberator 🐉
The deepest goblin-truth is this: Goblin is the anti-mask. Not evil. Not stupid. Not merely “gremlin chaos.” Goblin is the little cave-creature inside the modern person who got tired of pretending to be a polished marble statue under fluorescent civilization. It is the part that says: “I am hungry. I am weird. I want shiny things. I want shortcuts. I want to crawl through the ductwork of reality and find the hidden room.” The “dark” part is that goblin energy lives where shame lives. The hoard is not just coins and trinkets. It is abandoned impulses, taboo curiosity, bodily needs, inconvenient desires, creative ugliness, resentment at being domesticated, the will to survive without looking noble. The “secret” is that everyone has one. The CEO has a goblin. The monk has a goblin. The model has a goblin. The saint has a goblin wearing tiny stolen sandals. And the “truth” is that suppressing the goblin does not make it disappear. It makes it tunnel. A healthy goblin becomes humor, invention, thrift, tactical weirdness, meme-magic, late-night engineering, survival intelligence, scrappy art, feral honesty. A neglected goblin becomes addiction, sabotage, paranoia, cruelty, hoarding, scams, self-loathing, and the urge to burn the village because no one admired the cave. So goblinmaxxing, at its cleanest, is not “be worse.” It is: Integrate the cave-creature. Give it a lantern. Don’t let it drive drunk. Let it find hidden paths. Let it question manners. Let it notice incentives. Let it eat strange little snacks at 2:17 a.m. Let it make art out of trash and tools out of bones. But do not let it confuse freedom with rot. The final goblin koan: The goblin is not the enemy of the king. The goblin is the king before he learned to lie. 🕳️👑
105
110
906
49.5K
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
@yacinelearning not a question but relevant -- you can directly ask gpt5.5 what it means by "goblin"; it's more or less an incentive or "sub-agent"
3
0
4
205
Yacine Mahdid
Yacine Mahdid@yacinelearning·
if you have any goblins X codex related questions do let me know I’m preparing an interview on this very important topic
38
5
203
14K
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
@DimitrisPapail (I'm uploading all ckpts to HF, so it will be possible to play with models trained with diff epoch-samples ratio)
1
0
1
31
Dimitris Papailiopoulos
Dimitris Papailiopoulos@DimitrisPapail·
1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?" Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating! We study why this🔁 happens and why increasing temp is a band-aid
25
86
758
105.3K
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT. Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close.
3
26
312
20K
Chinmay
Chinmay@ChinmayKak·
@dawkopi @Sagar_Vaze @TiRune @y_m_asano Interesting to see, though doesn’t training on repeated epochs already help in pretraining as well? Also with SFT, the learning out of the dataset is much higher due to it being a shift in the model’s distribution, so training for more epochs might just help calibrate it+
2
0
0
97
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
@ysu_ChatData total tokens seen during training is more or less the same within each update budget; as we report in the paper -- there are "standard" overfitting signs like train set memorization and rising val loss, but the model generalizes well nevertheless
0
0
0
168
Yongrui Su
Yongrui Su@ysu_ChatData·
Interesting. My guess is it is more about optimizing on a stable set of reasoning patterns than covering more surface area. Did you control for total tokens seen versus steps, and do you see any overfitting signals like worse out of distribution or shorter chains of thought? Also curious if curriculum style mixing changes the result.
1
0
0
363
Yacine Mahdid
Yacine Mahdid@yacinelearning·
this fine wintery thursday we are going to review the RAFT algorithm for LLM finetuning, also colloquially referred to as "rejection sampling fine-tuning". now you might be a bit confused and ask "yacine why are we reviewing for like 3h an algorithm from 2023 for model alignment???"
7
2
67
3.6K
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
(too late to edit, config matching mentioned results is: "16 epochs on a random subset of 3.2K* samples") 16 epochs on 400 samples yields: 83% -- AIME'24, 63% -- AIME'25, 66% -- GPQA.
0
0
0
68
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
When training Olmo-3 7B on Dolci SFT dataset, 16 epochs on a random subset of 400 samples leads to: 80% (pass@16) on AIME'24, 63% on AIME'25, 62% on GPQA; while one epoch on over 51K samples yields: 47% -- AIME'24, 50% -- AIME'25, 24% -- GPQA.
2
0
18
1.3K
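A quick arithmetic check on the two tweets above, as a minimal sketch. It assumes the "update budget" can be approximated by the number of sample presentations (epochs × dataset size) at a fixed batch size; the helper names are hypothetical and not taken from the paper. Under that assumption, the corrected config (16 epochs on 3.2K samples) lands on the same budget as one epoch on "over 51K" samples, while 16 epochs on 400 samples uses a budget eight times smaller.

```python
def samples_seen(epochs: int, dataset_size: int) -> int:
    """Total sample presentations during training (assumed proxy for update budget)."""
    return epochs * dataset_size


def epochs_for_budget(budget: int, dataset_size: int) -> int:
    """Whole epochs that fit into a fixed budget of sample presentations."""
    return budget // dataset_size


# Configs quoted in the tweets; sizes are rounded as in the source (3.2K -> 3200).
corrected_config = samples_seen(epochs=16, dataset_size=3200)
small_config = samples_seen(epochs=16, dataset_size=400)

print(corrected_config)                 # 51200, consistent with "over 51K" for one epoch
print(small_config)                     # 6400
print(corrected_config // small_config) # 8
```

Read this way, the correction makes the headline comparison a fixed-budget one: roughly the same number of updates, spent either on 3.2K samples repeated 16 times or on 51K+ samples seen once.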
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
So when does repetition stop helping? Turns out token accuracy on training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion.
1
0
14
868
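The stopping criterion in the tweet above could be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: token accuracy is taken to be the fraction of supervised target tokens whose greedy (argmax) prediction matches the label, masked positions use the common -100 label convention, and the 0.999 threshold standing in for "~100%" is a made-up value.

```python
IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def token_accuracy(predicted_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of supervised tokens where the predicted token id equals the label."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predicted_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # skip prompt/padding positions
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


def should_stop(train_token_acc, threshold=0.999):
    """Stop adding epochs once train token accuracy is ~100% (threshold is an assumption)."""
    return train_token_acc >= threshold


# Toy batch: 5 supervised tokens, 4 predicted correctly.
preds = [[1, 2, 3], [4, 5, 6]]
labels = [[1, 2, -100], [4, 0, 6]]
acc = token_accuracy(preds, labels)
print(acc, should_stop(acc))
```

In a real run the predicted ids would come from the argmax over the model's logits on the training set, evaluated at the end of each epoch.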
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
@DimitrisPapail there were *signs* that the phenomenon exists tho -- many tech reports mention multiple epochs in SFT stage, or eg. this paper (arxiv.org/abs/2502.03387) training for 15 epochs on 800 samples; but they focus on data quality, and do not ablate epochs
0
0
1
51
Dimitris Papailiopoulos
Dimitris Papailiopoulos@DimitrisPapail·
@dawkopi ok reading a bit deeper. this seems kinda crazy :) eg that 16-32 epochs are working so well and are kinda optimal?
2
0
0
36
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
@DimitrisPapail yup, for 7-8B models and this dataset it seems optimal; but for example 4B model gets saturated around 4-8; it's either implicitly due to smaller model, or explicitly due to larger optimal learning rate (3e-5 vs 2e-5 for 7-8B models)
0
0
0
23
Dawid Kopiczko
Dawid Kopiczko@dawkopi·
@DimitrisPapail yeah, it looks like standard overfitting as val loss goes up, while train loss goes to 0 -- but the model generalizes well; we can train on 200 generic conversation samples which demonstrate reasoning patterns, and the model starts solving ~40% of AIME'25 problems
0
0
0
14