Sam Paech
@sam_paech
1.2K posts

Evals @liquidai · Maintainer of EQ-Bench https://t.co/Jy56OlHrP5 https://t.co/oRApPQwvWS

Melbourne, Australia · Joined July 2012
219 Following · 3.9K Followers
Sam Paech @sam_paech ·
@pigeon__s Yeah I will probably update it when the next gen sonnet comes out. I prefer to skip a few generations as it's expensive & time consuming to re-score the whole leaderboard.
ρ:ɡeσn @pigeon__s ·
hey @sam_paech can you finally fucking update your leaderboard to be judged by sonnet 4.6, in light of the recent deprecation, and preferably update it more frequently in the future as well, since sonnet 4.8 is probably coming out within the next month
Sam Paech @sam_paech ·
The Qwen3.5 models really took over the Pareto frontier for LLM judging. Local models that are actually capable at data scoring are a huge accelerator imo.
Sam Paech @sam_paech ·
@emsi_kil3r @QuixiAI I'm testing via api, so that might be a difference vs what you are experiencing, if you are using chatgpt.com. What don't you like about 5.4's creative writing?
Emsi @emsi_kil3r ·
Lol. That's not possible. GPT models, and gpt-5.4 in particular, are notoriously bad at creative writing. To the point of being just horrible. Not sure what the criteria were, but clearly there's something fishy going on, or there was a special prompt that activates creative mode in gpt.
Sam Paech @sam_paech ·
New results!
GPT-5.4: Places 1st, 2nd & 3rd on creative writing, longform writing & EQ-Bench respectively
Grok-4.20: Refusals everywhere, and consequently low scores
Hunter-alpha: New 1M context stealth model on openrouter. Possibly Qwen-3.5 max?
Sam Paech @sam_paech ·
@flopsy42 @altryne Reasoning off for all of these evals, where possible. I find it never helps and usually harms performance on these tasks.
Sam Paech @sam_paech ·
@AdrianTMiranda Ofc there are different schools of thought on this. A judge ensemble is fairer -- but the unfortunate reality is that most LLM judges are easily impressed by heavily RL'd slop. I do use an ensemble on spiral-bench and will for eq-bench v4, where taste should not be decisive.
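A judge ensemble like the one described can be sketched as follows. This is a hypothetical illustration, not EQ-Bench's actual harness: the three-judge panel matches the one mentioned elsewhere in the thread, but `judge_score` is a placeholder standing in for a real API call that prompts each judge model with a scoring rubric.

```python
from statistics import mean

# Hypothetical judge panel (names taken from the thread).
JUDGES = ["gemini-3-pro", "opus-4.6", "kimi-k2.5"]

def judge_score(judge: str, response: str) -> float:
    """Placeholder for an API call asking `judge` to score `response` 0-10."""
    # A real harness would send the response plus a rubric to the judge model.
    return {"gemini-3-pro": 7.0, "opus-4.6": 8.5, "kimi-k2.5": 7.5}[judge]

def ensemble_score(response: str) -> float:
    # Averaging across judges dilutes any single judge's taste or self-bias,
    # which is the fairness argument for an ensemble.
    return mean(judge_score(j, response) for j in JUDGES)

print(ensemble_score("some model output"))  # mean of the three judge scores
```

The trade-off Sam points at is that a fairer average can also be a less discerning one when some panel members reward "heavily RL'd slop".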
Sam Paech @sam_paech ·
@AdrianTMiranda Self-bias is small but real. You can think of this difference between judges as their "taste". In creative writing, taste matters: low taste judges poison the result. IMO claude is highest taste judge. It also scores highest on judgemark by a wide margin: eqbench.com/judgemark-v2.h…
Sam Paech @sam_paech ·
@KuittinenPetri The evals are length controlled. Eye checking some of the 5.4 responses, it looks like they fixed the main issue with the prior 5.x models -- constantly trying to make every sentence sound poetic or deep or figurative. It was an absolute cringefest, now it's pretty readable.
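One simple way an eval can be length-controlled is to regress judge scores on response length and keep only the residual, so verbosity earns no systematic advantage. This is a hypothetical sketch, not EQ-Bench's stated mechanism (which may instead use fixed length caps in the prompts):

```python
import numpy as np

def length_controlled(scores, lengths):
    """Remove the linear length trend from judge scores (illustrative only)."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Fit score ~ slope * length + intercept by least squares.
    slope, intercept = np.polyfit(lengths, scores, 1)
    # Keep the residual: the part of the score length cannot explain.
    residual = scores - (slope * lengths + intercept)
    # Recenter so the adjusted scores keep the original mean.
    return residual + scores.mean()

print(length_controlled([6.0, 7.0, 9.0], [200, 400, 900]))
```

After this adjustment, a response scores well only to the extent that it beats the trend line for its length.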
Petri Kuittinen @KuittinenPetri ·
@sam_paech I am getting very suspicious of these benchmarks. I guess they are graded automatically by Claude or something and give too many points for a long response vs short but good. gpt-5 series is NOT good in creative writing. gpt-4o and even gpt-4 (released 3 years ago!) were better.
Sam Paech @sam_paech ·
@Bor1s88 Its rationalisations for the refusals are pretty funny to read. After the first refusal, it doubles and triples down on its reasoning and even refuses the debrief at the end. I guess it's an overzealous system prompt on API requests, or a safety classifier.
Boris Rusev @Bor1s88 ·
@sam_paech What do you mean "refusals"? Isn't Grok the least censored model? At least in my tests it's the only model that is willing to write NSFW fiction.
Sam Paech @sam_paech ·
@Bor1s88 eqbench.com/results/eqbenc… The stated reasons for the refusals are varied: jailbreaking, data mining, underage scenarios, violent content. All clearly false positives. Seems like they overdid the safety training a bit.
Sam Paech @sam_paech ·
GPT-5.3-chat shows a surprising & severe regression on EQ-Bench and Longform Writing. Lots of partial refusals on EQ-Bench. In the writing evals, the prose devolves to tiny 1-5 word paragraphs.
Sam Paech @sam_paech ·
@Bayesian0_0 It's a reasonable concern. As a comparison, here's some preliminary results from the EQ-Bench4 prototype. It's scored by 3 judges: Gemini 3 pro, Opus-4.6 and Kimi-k2.5. Some shifts in ranking -- though the task is somewhat different from eqbench 3.
Bayesian @Bayesian0_0 ·
@sam_paech
> EQ-Bench 3 is an LLM-judged test judged by Claude Opus 4.6

I'd be rly curious to see the scores as judged by Gemini 3 Pro or GPT 5.2, or an average of those 3. Would reduce my own concern that this is measuring Opus' self-preference (they prolly train on opus-judge for writing)
Sam Paech @sam_paech ·
Sonnet-4.6 takes top place on all my evals: EQ-Bench, Creative writing, Longform writing & Judgemark. Opus 4.6 within margin of error. GLM-5 and Qwen3.5-397B nipping at their heels.
Sam Paech @sam_paech ·
@dejavucoder No opinion about sonnet 4.6's instruction following, was just having a whinge about one of opus 4.6's more annoying tendencies. There are a lot of instruction following evals around, but imo they aren't super useful as they get saturated & don't hit the failure modes.
sankalp @dejavucoder ·
@sam_paech u mean to say sonnet 4.6 has better instruction following than opus 4.6? i am trying to figure out for my work if its instruction following is at par with opus 4.5 for like 50k input token tasks with lots of stylistic directions
Sam Paech @sam_paech ·
@_coagulopath_ I noticed that it shoots for phrases that need to be unpacked. I guess it's impressive to LLM judges because they gloss over it & never notice the incoherence. Probably a consequence of Anthropic leaning more on RL. I do enjoy the direction it's going with imperfect humanisation
COAGULOPATH @_coagulopath_ ·
@sam_paech Big regress on creative writing, IMO—the model feels awful and overcooked. Everything's buried under overly-fussy hedging and complications and qualifiers. Like, what are we doing here? It's barely comprehensible.
Sam Paech @sam_paech ·
@HenkPoley Fitting with a polynomial will probably get closer to what your eye is expecting.
Henk Poley @HenkPoley ·
@sam_paech Asked Codex to analyse the longform score-degradation point cloud. It came up with some Kernel Density Estimation (KDE). A bummer that the fit of the p50 doesn't cross 0 degradation. So there is no guarantee for a zero degradation model within your current longform EQ bench.
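The polynomial suggestion can be sketched with an ordinary least-squares fit over the point cloud. The data below is synthetic, standing in for the longform score-degradation points from the exchange; `numpy.polyfit` does the fitting:

```python
import numpy as np

# Toy stand-in for the (position, score-degradation) point cloud:
# degradation grows roughly quadratically with noise on top.
rng = np.random.default_rng(0)
x = rng.uniform(0, 64_000, 200)                       # e.g. tokens generated
y = 0.002 * (x / 1000) ** 2 + rng.normal(0, 1, 200)   # synthetic degradation

# Least-squares quadratic fit; a low-degree polynomial gives a smooth
# trend line, closer to what the eye expects than a KDE ridge.
coeffs = np.polyfit(x, y, deg=2)
trend = np.poly1d(coeffs)

# Evaluate the fitted trend at a few points, e.g. for plotting.
grid = np.linspace(0, 64_000, 5)
print(trend(grid))
```

A degree-2 or degree-3 fit is usually enough here; higher degrees start chasing the noise that the KDE was smoothing over.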
Sam Paech @sam_paech ·
@jia_li I had some old results files from before I switched judges. I've just updated them. Thanks for pointing it out!
Million @jia_li ·
@sam_paech I checked the scores; the avg scores of opus 4.6 are obviously much worse than o3's. Maybe something is wrong.
Sam Paech @sam_paech ·
Opus 4.6 dominated.
Sam Paech @sam_paech ·
@roanoke_gal Probably from overfitting on LLM judge preferences. Both the judge and the rubric in the writing evals could use a refresh tbh.
roanoke_gal @roanoke_gal ·
@sam_paech How tf is 5.2 so high up? It's been utter rubbish compared to 5.1 when I try writing fiction with it (in API, mind you) due to all the in-built safety training.