Ricardo Olmedo
@rdolmedo_

126 posts

PhD student @MPI_IS, working with Moritz Hardt and Bernhard Schölkopf | Currently visiting @Stanford

Stanford · Joined January 2014
321 Following · 700 Followers

Ricardo Olmedo @rdolmedo_ ·
For a rigorous treatment of these topics, see Moritz Hardt's book on the science of machine learning benchmarks: mlbenchmarks.org
Ricardo Olmedo @rdolmedo_ ·
When properly adjusting for task adaptation, model rankings even transfer across benchmarks. This surfaces an important axis of model capability: being adaptable to downstream tasks of interest. arxiv.org/abs/2507.05195
Ricardo Olmedo @rdolmedo_ ·
That being said, 🗣️model comparisons are scientifically uninformative unless we control for test task adaptation 🗣️ Absolute benchmark performance is nearly meaningless. Rate of progress on newly proposed benchmarks should be taken with a grain of salt. arxiv.org/abs/2407.07890
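The adaptation-control point above can be illustrated with a toy sketch (all model names and scores below are hypothetical): rankings computed from zero-shot benchmark scores can flip once every model receives the same fine-tuning budget on the test task.

```python
# Toy illustration (hypothetical scores): model rankings may flip once
# every model gets an equal adaptation budget on the test task.
def rank(scores):
    # Map each model to its rank (1 = best) by descending score.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

raw = {"A": 0.62, "B": 0.58, "C": 0.41}      # zero-shot benchmark scores
adapted = {"A": 0.71, "B": 0.74, "C": 0.69}  # after equal fine-tuning budget

print(rank(raw))      # {'A': 1, 'B': 2, 'C': 3}
print(rank(adapted))  # {'B': 1, 'A': 2, 'C': 3} -- the ranking flips
```

Comparing the two rankings (rather than the raw numbers) is what "controlling for test task adaptation" amounts to in this sketch.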
Ricardo Olmedo @rdolmedo_ ·
Counterintuitively, the more attention the community places on a benchmark, the more we can trust who is at the top. If everyone is trying their best (and has comparably hefty wallets), the only path to the top is genuine technical progress.
Ricardo Olmedo @rdolmedo_ ·
Model comparisons are confounded by benchmark-specific adaptation. But‼️when frontier labs fiercely compete on the hottest benchmark of the year, and billions are on the line, everyone benchmaxes. Skill and ingenuity are then the only differentiators.
Ricardo Olmedo @rdolmedo_ ·
Claude 3 Opus scored 4% on SWE-bench at release. Shockingly, a Pythia-scale model trained **only on pre-1931 data**, with a bit of fine-tuning, outperforms the April 2024 SOTA. Clearly, Opus is the better model. Why should we care about benchmarks, then? 👇🧵
Quoting Ricardo Olmedo @rdolmedo_:

We fine-tuned Alec Radford’s 1930 vintage LLM to solve SWE-bench issues. After just ‼️250‼️ training examples, the model solves its first issue, a simple patch to the xarray library. 🧵👇

Ricardo Olmedo @rdolmedo_ ·
@Zakarth Post-training is all about aligning models with downstream tasks. Here, the question is: what capabilities does pre-training on the internet give you that are not recoverable with a little post-training?
Zakarth @Zakarth ·
@rdolmedo_ Ok but if you’re fine tuning it to do a specific task that defeats the point.
Ricardo Olmedo @rdolmedo_ ·
We fine-tuned Alec Radford’s 1930 vintage LLM to solve SWE-bench issues. After just ‼️250‼️ training examples, the model solves its first issue, a simple patch to the xarray library. 🧵👇
Ricardo Olmedo @rdolmedo_ ·
@BlackHC However, the teacher model is Qwen Coder with 3B active parameters (the older release). With a stronger teacher, one should get even better results.
Ricardo Olmedo @rdolmedo_ ·
@wanon77789 Haha, amazing! I haven't tried it with Claude Code. I'd recommend the mini-swe-agent scaffolding, since that's what it was fine-tuned on.
buge4 @wanon77789 ·
@rdolmedo_ This shit goes crazy in claude code 😭
maxtretikov @max_tretikov ·
@rdolmedo_ Hold on, wait. Didn't Microsoft try "textbooks are all you need" a few years ago and find that curation doesn't replace scale? How is this possible?
Ricardo Olmedo @rdolmedo_ ·
@willdepue I find it surprising that 250 training trajectories are sufficient to produce this level of multi-turn agentic reasoning, especially given that the model was not mid-trained on trillions of reasoning tokens.