Ricardo Olmedo
@rdolmedo_

126 posts

PhD student @MPI_IS, working with Moritz Hardt and Bernhard Schölkopf | Currently visiting @Stanford

Stanford · Joined January 2014
321 Following · 700 Followers

Ricardo Olmedo @rdolmedo_ ·
For a rigorous treatment of these topics, see Moritz Hardt's book on the science of machine learning benchmarks: mlbenchmarks.org
Ricardo Olmedo @rdolmedo_ ·
When properly adjusting for task adaptation, model rankings even transfer across benchmarks. This surfaces an important axis of model capability: being adaptable to downstream tasks of interest. arxiv.org/abs/2507.05195
Ricardo Olmedo @rdolmedo_ ·
That being said, 🗣️model comparisons are scientifically uninformative unless we control for test task adaptation 🗣️ Absolute benchmark performance is nearly meaningless. Rate of progress on newly proposed benchmarks should be taken with a grain of salt. arxiv.org/abs/2407.07890
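The adaptation-control point above can be illustrated with a toy sketch (all model names and scores below are hypothetical): rankings computed from zero-shot benchmark scores can flip once every model receives the same fine-tuning budget on the test task.

```python
# Toy illustration (hypothetical scores): model rankings may flip once
# every model gets an equal adaptation budget on the test task.
def rank(scores):
    # Map each model to its rank (1 = best) by descending score.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

raw = {"A": 0.62, "B": 0.58, "C": 0.41}      # zero-shot benchmark scores
adapted = {"A": 0.71, "B": 0.74, "C": 0.69}  # after equal fine-tuning budget

print(rank(raw))      # {'A': 1, 'B': 2, 'C': 3}
print(rank(adapted))  # {'B': 1, 'A': 2, 'C': 3} -- the ranking flips
```

Comparing the two rankings (rather than the raw numbers) is what "controlling for test task adaptation" amounts to in this sketch.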
Ricardo Olmedo @rdolmedo_ ·
Counterintuitively, the more attention the community places on a benchmark, the more we can trust who is at the top. If everyone is trying their best (and has comparably hefty wallets), the only path to the top is genuine technical progress.
Ricardo Olmedo @rdolmedo_ ·
Model comparisons are confounded by benchmark-specific adaptation. But‼️when frontier labs fiercely compete on the hottest benchmark of the year, and billions are on the line, everyone benchmaxes. Skill and ingenuity are then the only differentiators.
Ricardo Olmedo @rdolmedo_ ·
Claude 3 Opus scored 4% on SWE-bench at release. Shockingly, a Pythia-scale model trained **only on pre-1931 data**, with a bit of fine-tuning, outperforms the April 2024 SOTA. Clearly, Opus is the better model. Why should we care about benchmarks, then? 👇🧵
Quoting Ricardo Olmedo @rdolmedo_:

We fine-tuned Alec Radford’s 1930 vintage LLM to solve SWE-bench issues. After just ‼️250‼️ training examples, the model solves its first issue, a simple patch to the xarray library. 🧵👇

Ricardo Olmedo @rdolmedo_ ·
@Zakarth Post-training is all about aligning models with downstream tasks. Here, the question is: what capabilities does pre-training on the internet give you that are not recoverable with a little post-training?
Zakarth @Zakarth ·
@rdolmedo_ Ok but if you’re fine tuning it to do a specific task that defeats the point.
Ricardo Olmedo @rdolmedo_ ·
We fine-tuned Alec Radford’s 1930 vintage LLM to solve SWE-bench issues. After just ‼️250‼️ training examples, the model solves its first issue, a simple patch to the xarray library. 🧵👇
Ricardo Olmedo @rdolmedo_ ·
@BlackHC However, the teacher model is Qwen Coder with 3B active parameters (the older release). With a stronger teacher, one should get even better results.
Ricardo Olmedo @rdolmedo_ ·
@wanon77789 Haha, amazing! I haven't tried it with Claude Code. I'd recommend the mini-swe-agent scaffolding, since that's what it was fine-tuned on.
buge4 @wanon77789 ·
@rdolmedo_ This shit goes crazy in claude code 😭
maxtretikov @max_tretikov ·
@rdolmedo_ Hold on, wait. Didn't Microsoft try "textbooks are all you need" a few years ago and find that curation doesn't replace scale? How is this possible?
Ricardo Olmedo @rdolmedo_ ·
@willdepue I find it surprising that 250 training trajectories are sufficient to produce this level of multi-turn agentic reasoning, especially given that the model was not mid-trained on trillions of reasoning tokens.