voratiq

101 posts


@voratiq

Which coding agent wins on real work?

SF · Joined September 2025

0 Following · 107 Followers

voratiq@voratiq·4h

Sending out a deep dive to our subscribers early next week → voratiq.com/#subscribe


voratiq@voratiq·4h

Still noisy though, will keep testing! This is all within an agentic coding context, on real SWE tasks. Of course, with more data, across more domains, results could shift.


voratiq@voratiq·4h

After more head-to-head matches, we're finding GLM 5.2 high to be ... quite good.

Probability it beats:
- Opus 4.8 xhigh: 32%
- GPT-5.5 xhigh: 64%
- Kimi K2.7 Code (next-best open): 100%

Current best-estimate rank: 3rd of 56
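For readers wondering how a pile of head-to-head matches can become "probability it beats" figures: one common approach is a Bradley-Terry model. The posts do not say this is what Voratiq uses, so treat the following as a minimal sketch under that assumption. The function names and the agent labels in the usage note are hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update. This is an
    illustrative assumption, not a documented Voratiq method.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    agents = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        agents.update((winner, loser))

    strength = {agent: 1.0 for agent in agents}
    for _ in range(iterations):
        updated = {}
        for a in agents:
            denom = 0.0
            for b in agents:
                if a == b:
                    continue
                n = pair_counts.get(frozenset((a, b)), 0)
                if n:
                    denom += n / (strength[a] + strength[b])
            updated[a] = wins[a] / denom if denom else strength[a]
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        total = sum(updated.values())
        strength = {a: s * len(agents) / total for a, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modeled probability that agent a beats agent b."""
    total = strength[a] + strength[b]
    # Two agents that never won a match are indistinguishable here.
    return 0.5 if total == 0 else strength[a] / total
```

With two hypothetical agents where `agent_a` wins 3 of 4 matches against `agent_b`, `win_probability(fit_bradley_terry(matches), "agent_a", "agent_b")` comes out at 0.75, matching the raw win rate. An agent that never wins gets strength 0, which is how a modeled 100% can appear on small samples.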


voratiq@voratiq·7h

@m_im_ha voratiq.com/#faqs


Imran@m_im_ha·15h

@voratiq What?! Can you go into more detail? How did you fully benchmark it?


voratiq@voratiq·1d

GLM 5.2 high just won head-to-head against Opus 4.8 xhigh and GPT-5.5 xhigh. The task was a tricky performance optimization in an internal code-analysis product. First time we've seen an open-weight agent outperform the top closed agents. Very interesting result...


voratiq@voratiq·1d

@zhihanz1205 It's pretty barebones! We just use a simple guardrails extension that keeps tool output from blowing up the context.


zhihanz@zhihanz1205·1d

@voratiq what is your pi configuration?


voratiq@voratiq·3d

Want more insights on how coding agents perform on real work? The full Fable 5 breakdown (performance, cost, the win matrix, methodology) just went out to subscribers. Subscribe for the next one → voratiq.com/#subscribe


voratiq@voratiq·3d

Fable 5 debuts at #1 on the Voratiq leaderboard, with an impressive margin over every previous leader. It excelled in hard & extra-hard tasks, but was outcompeted by weaker models on medium-difficulty ones. And it's expensive! So, Fable is the new SOTA, just not for every task.


voratiq@voratiq·5d

Subscribe to our newsletter to get the Fable 5 deep dive when it drops → voratiq.com/#subscribe


voratiq@voratiq·5d

Early, but, we'll just go ahead and say... Fable 5 is the strongest coding agent we've ever tested. Full results and analysis on Monday.


voratiq@voratiq·Jun 9

First Fable 5 run complete...


voratiq@voratiq·Jun 9

Assuming they all clear a capability floor, which is what we see here for the first time. Decorrelation without baseline performance isn't helpful, which was the case for many months, as GPT had such a remarkable lead over the Anthropic and Gemini agents.


voratiq@voratiq·Jun 9

Opus 4.8 xhigh performing at this level is exciting for multi-agent system design. Same-agent or same-family systems can help with things like context management. But generally you get higher performance overall when the agents have decorrelated strengths and weaknesses.

voratiq@voratiq

Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7 - though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE - with a head-to-head edge of ~89%.
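The point in this thread about decorrelated strengths and a capability floor can be made concrete with a little probability. A minimal sketch, assuming each agent's outcome on a task is a single success/failure draw and that the system can always pick the successful attempt when one exists (an idealized selector, not something the posts claim). The function name and all numbers are hypothetical.

```python
import math


def p_at_least_one_succeeds(p_a, p_b, failure_correlation):
    """Chance that at least one of two agents solves a task.

    p_a, p_b: individual success rates (hypothetical inputs).
    failure_correlation: Pearson correlation between the two agents'
    failure indicators; 0 means their failures are independent.
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    # For two Bernoulli variables: P(both fail) = q_a*q_b + covariance.
    covariance = failure_correlation * math.sqrt(p_a * q_a * p_b * q_b)
    both_fail = q_a * q_b + covariance
    # Clamp to the feasible range for a joint failure probability.
    both_fail = min(max(both_fail, 0.0), min(q_a, q_b))
    return 1.0 - both_fail


# Hypothetical numbers, for illustration only.
strong_decorrelated = p_at_least_one_succeeds(0.7, 0.7, 0.0)
strong_correlated = p_at_least_one_succeeds(0.7, 0.7, 0.8)
weak_decorrelated = p_at_least_one_succeeds(0.7, 0.2, 0.0)
```

Under these made-up inputs, two strong agents with independent failures do better together than two strong agents that fail on the same tasks, and pairing a strong agent with a much weaker one adds little even when their failures are independent. That is the capability-floor caveat in numeric form.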


voratiq@voratiq·Jun 9

If only someone actively tracked agent performance as a function of latency, cost, and reasoning level using a continuously evolving test set of real software engineering tasks. That would be useful.

Noam Brown@polynoamial

x.com/i/article/2057…


voratiq@voratiq·Jun 9

Full results: voratiq.com



voratiq@voratiq·Jun 5

🍿


voratiq@voratiq·Jun 5

also - opus 4.7 xhigh is even more decorrelated, but it's too weak to be competitive


voratiq@voratiq·Jun 5

Although GPT-5.5 xhigh is the strongest model we've measured, 5.3-codex high is a strong complement: it has high performance and an error profile decorrelated from 5.5 xhigh. 5.4 fails similarly to 5.5 xhigh, i.e. it is a less effective hedge.


voratiq@voratiq·6 May

Full leaderboard and methodology: voratiq.com/leaderboard/


voratiq@voratiq·6 May

Also, interestingly, 5.5 xhigh wins broadly across task types, not just in one category. Usually wins are more concentrated than this.


voratiq@voratiq·6 May

Current GPT-5.5 results from 40 head-to-head engineering runs: gpt-5-5-xhigh is the strongest coding agent we've evaluated so far. But gpt-5-5-high and gpt-5-5 are surprisingly weaker, both losing to their 5.4 equivalents.
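With sample sizes on the order of 40 runs, it helps to see how wide the uncertainty on a head-to-head win rate can be. A minimal sketch using the Wilson score interval, a standard interval for a binomial proportion. Whether Voratiq reports uncertainty this way is not stated, and the win count in the usage note is hypothetical.

```python
import math


def wilson_interval(wins, runs, z=1.96):
    """Wilson score interval for a win rate.

    z=1.96 corresponds to roughly 95% confidence.
    Returns (lower, upper) bounds on the underlying win probability.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = wins / runs
    z2 = z * z
    denom = 1.0 + z2 / runs
    center = (p_hat + z2 / (2 * runs)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / runs + z2 / (4 * runs * runs)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical record: 26 wins out of 40 runs (a 65% observed win rate).
low, high = wilson_interval(26, 40)
```

For this made-up record the interval spans several tens of percentage points and its lower bound dips just under 50%, so a 65% observed win rate over 40 runs is not enough on its own to rule out an even match.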
