Sam Paech
@sam_paech
1.2K posts

Evals @liquidai · Maintainer of EQ-Bench https://t.co/Jy56OlHrP5 https://t.co/oRApPQwvWS

Melbourne, Australia · Joined July 2012
219 Following · 3.9K Followers
Sam Paech @sam_paech ·
@pigeon__s Yeah I will probably update it when the next gen sonnet comes out. I prefer to skip a few generations as it's expensive & time consuming to re-score the whole leaderboard.
ρ:ɡeσn @pigeon__s ·
hey @sam_paech can you finally fucking update your leaderboard to be judged by sonnet 4.6, in light of the recent deprecation, and preferably update it more frequently in the future as well, since sonnet 4.8 is probably coming out within the next month
Sam Paech @sam_paech ·
The Qwen3.5 models really took over the Pareto frontier for LLM judging. Local models that are actually capable at data scoring are a huge accelerator imo.
Sam Paech @sam_paech ·
@emsi_kil3r @QuixiAI I'm testing via api, so that might be a difference vs what you are experiencing, if you are using chatgpt.com. What don't you like about 5.4's creative writing?
Emsi @emsi_kil3r ·
Lol. That's not possible. GPT models, and gpt-5.4 in particular, are notoriously bad at creative writing. To the point of being just horrible. Not sure what the criteria were, but clearly there's something fishy going on, or there was a special prompt that activates creative mode in gpt.
Sam Paech @sam_paech ·
New results!
GPT-5.4: Places 1st, 2nd & 3rd on creative writing, longform writing & EQ-Bench respectively
Grok-4.20: Refusals everywhere, and consequently low scores
Hunter-alpha: New 1M context stealth model on openrouter. Possibly Qwen-3.5 max?
Sam Paech @sam_paech ·
@flopsy42 @altryne Reasoning off for all of these evals, where possible. I find it never helps and usually harms performance on these tasks.
Sam Paech @sam_paech ·
@AdrianTMiranda Ofc there are different schools of thought on this. A judge ensemble is fairer -- but the unfortunate reality is that most LLM judges are easily impressed by heavily RL'd slop. I do use an ensemble on spiral-bench and will for eq-bench v4, where taste should not be decisive.
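A judge ensemble like the one described can be sketched as follows. This is a hypothetical illustration, not EQ-Bench's actual harness: the three-judge panel matches the one mentioned elsewhere in the thread, but `judge_score` is a placeholder standing in for a real API call that prompts each judge model with a scoring rubric.

```python
from statistics import mean

# Hypothetical judge panel (names taken from the thread).
JUDGES = ["gemini-3-pro", "opus-4.6", "kimi-k2.5"]

def judge_score(judge: str, response: str) -> float:
    """Placeholder for an API call asking `judge` to score `response` 0-10."""
    # A real harness would send the response plus a rubric to the judge model.
    return {"gemini-3-pro": 7.0, "opus-4.6": 8.5, "kimi-k2.5": 7.5}[judge]

def ensemble_score(response: str) -> float:
    # Averaging across judges dilutes any single judge's taste or self-bias,
    # which is the fairness argument for an ensemble.
    return mean(judge_score(j, response) for j in JUDGES)

print(ensemble_score("some model output"))  # mean of the three judge scores
```

The trade-off Sam points at is that a fairer average can also be a less discerning one when some panel members reward "heavily RL'd slop".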
Sam Paech @sam_paech ·
@AdrianTMiranda Self-bias is small but real. You can think of this difference between judges as their "taste". In creative writing, taste matters: low taste judges poison the result. IMO claude is highest taste judge. It also scores highest on judgemark by a wide margin: eqbench.com/judgemark-v2.h…
Sam Paech @sam_paech ·
@KuittinenPetri The evals are length controlled. Eye checking some of the 5.4 responses, it looks like they fixed the main issue with the prior 5.x models -- constantly trying to make every sentence sound poetic or deep or figurative. It was an absolute cringefest, now it's pretty readable.
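One simple way an eval can be length-controlled is to regress judge scores on response length and keep only the residual, so verbosity earns no systematic advantage. This is a hypothetical sketch, not EQ-Bench's stated mechanism (which may instead use fixed length caps in the prompts):

```python
import numpy as np

def length_controlled(scores, lengths):
    """Remove the linear length trend from judge scores (illustrative only)."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Fit score ~ slope * length + intercept by least squares.
    slope, intercept = np.polyfit(lengths, scores, 1)
    # Keep the residual: the part of the score length cannot explain.
    residual = scores - (slope * lengths + intercept)
    # Recenter so the adjusted scores keep the original mean.
    return residual + scores.mean()

print(length_controlled([6.0, 7.0, 9.0], [200, 400, 900]))
```

After this adjustment, a response scores well only to the extent that it beats the trend line for its length.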
Petri Kuittinen @KuittinenPetri ·
@sam_paech I am getting very suspicious of these benchmarks. I guess they are graded automatically by Claude or something and give too many points for a long response vs short but good. gpt-5 series is NOT good in creative writing. gpt-4o and even gpt-4 (released 3 years ago!) were better.
Sam Paech @sam_paech ·
@Bor1s88 Its rationalisations for the refusals are pretty funny to read. After the first refusal, it doubles and triples down on its reasoning and even refuses the debrief at the end. I guess it's an overzealous system prompt on API requests, or a safety classifier.
Boris Rusev @Bor1s88 ·
@sam_paech What do you mean "refusals"? Isn't Grok the least censored model? At least in my tests it's the only model that is willing to write NSFW fiction.
Sam Paech @sam_paech ·
@Bor1s88 eqbench.com/results/eqbenc… The stated reasons for the refusals are varied: jailbreaking, data mining, underage scenarios, violent content. All clearly false positives. Seems like they overdid the safety training a bit.
Sam Paech @sam_paech ·
GPT-5.3-chat shows a surprising & severe regression on EQ-Bench and Longform Writing. Lots of partial refusals on EQ-Bench. In the writing evals, the prose devolves to tiny 1-5 word paragraphs.
Sam Paech @sam_paech ·
@Bayesian0_0 It's a reasonable concern. As a comparison, here's some preliminary results from the EQ-Bench4 prototype. It's scored by 3 judges: Gemini 3 pro, Opus-4.6 and Kimi-k2.5. Some shifts in ranking -- though the task is somewhat different from eqbench 3.
Bayesian @Bayesian0_0 ·
@sam_paech
> EQ-Bench 3 is an LLM-judged test judged by Claude Opus 4.6

I'd be rly curious to see the scores as judged by Gemini 3 Pro or GPT 5.2, or an average of those 3. Would reduce my own concern that this is measuring Opus' self-preference (they prolly train on opus-judge for writing)
Sam Paech @sam_paech ·
Sonnet-4.6 takes top place on all my evals: EQ-Bench, Creative writing, Longform writing & Judgemark. Opus 4.6 within margin of error. GLM-5 and Qwen3.5-397B nipping at their heels.
Sam Paech @sam_paech ·
@dejavucoder No opinion about sonnet 4.6's instruction following, was just having a whinge about one of opus 4.6's more annoying tendencies. There are a lot of instruction following evals around, but imo they aren't super useful as they get saturated & don't hit the failure modes.
sankalp @dejavucoder ·
@sam_paech u mean to say sonnet 4.6 has better instruction following than opus 4.6? i am trying to figure out for my work if its instruction following is at par with opus 4.5 for like 50k input token tasks with lots of stylistic directions
Sam Paech @sam_paech ·
@_coagulopath_ I noticed that it shoots for phrases that need to be unpacked. I guess it's impressive to LLM judges because they gloss over it & never notice the incoherence. Probably a consequence of Anthropic leaning more on RL. I do enjoy the direction it's going with imperfect humanisation
COAGULOPATH @_coagulopath_ ·
@sam_paech Big regress on creative writing, IMO—the model feels awful and overcooked. Everything's buried under overly-fussy hedging and complications and qualifiers. Like, what are we doing here? It's barely comprehensible.
Sam Paech @sam_paech ·
@HenkPoley Fitting with a polynomial will probably get closer to what your eye is expecting.
Henk Poley @HenkPoley ·
@sam_paech Asked Codex to analyse the longform score-degradation point cloud. It came up with some Kernel Density Estimation (KDE). A bummer that the fit of the p50 doesn't cross 0 degradation. So there is no guarantee for a zero degradation model within your current longform EQ bench.
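The polynomial suggestion can be sketched with an ordinary least-squares fit over the point cloud. The data below is synthetic, standing in for the longform score-degradation points from the exchange; `numpy.polyfit` does the fitting:

```python
import numpy as np

# Toy stand-in for the (position, score-degradation) point cloud:
# degradation grows roughly quadratically with noise on top.
rng = np.random.default_rng(0)
x = rng.uniform(0, 64_000, 200)                       # e.g. tokens generated
y = 0.002 * (x / 1000) ** 2 + rng.normal(0, 1, 200)   # synthetic degradation

# Least-squares quadratic fit; a low-degree polynomial gives a smooth
# trend line, closer to what the eye expects than a KDE ridge.
coeffs = np.polyfit(x, y, deg=2)
trend = np.poly1d(coeffs)

# Evaluate the fitted trend at a few points, e.g. for plotting.
grid = np.linspace(0, 64_000, 5)
print(trend(grid))
```

A degree-2 or degree-3 fit is usually enough here; higher degrees start chasing the noise that the KDE was smoothing over.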
Sam Paech @sam_paech ·
@jia_li I had some old results files from before I switched judges. I've just updated them. Thanks for pointing it out!
Million @jia_li ·
@sam_paech I checked the scores; the avg scores of opus 4.6 are obviously much worse than o3's. Maybe something is wrong.
Sam Paech @sam_paech ·
Opus 4.6 dominated.
Sam Paech @sam_paech ·
@roanoke_gal Probably from overfitting on LLM judge preferences. Both the judge and the rubric in the writing evals could use a refresh tbh.
roanoke_gal @roanoke_gal ·
@sam_paech How tf is 5.2 so high up? It's been utter rubbish compared to 5.1 when I try writing fiction with it (in API, mind you) due to all the in-built safety training.