voratiq

101 posts


@voratiq

Which coding agent wins on real work?

SF · Joined September 2025
0 Following · 107 Followers
voratiq
voratiq@voratiq·
Sending out a deep dive to our subscribers early next week → voratiq.com/#subscribe
voratiq
voratiq@voratiq·
Still noisy though, will keep testing! This is all within an agentic coding context, on real SWE tasks. Of course, with more data, across more domains, results could shift.
voratiq
voratiq@voratiq·
After more head-to-head matches, we're finding GLM 5.2 high to be ... quite good.
Probability it beats:
- Opus 4.8 xhigh: 32%
- GPT-5.5 xhigh: 64%
- Kimi K2.7 Code (next-best open): 100%
Current best-estimate rank: 3rd of 56
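As a rough illustration of why a figure like "100%" can still be "noisy", here is a minimal sketch that puts a Wilson score interval around a raw head-to-head win rate. This is not Voratiq's method (the feed does not describe how the probabilities are computed), and the match counts below are hypothetical.

```python
from math import sqrt


def win_rate_interval(wins: int, losses: int, z: float = 1.96):
    """Return (observed win rate, lower bound, upper bound).

    Uses the Wilson score interval; z = 1.96 corresponds to ~95% coverage.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one match")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical record: 5 wins, 0 losses against one opponent.
rate, low, high = win_rate_interval(5, 0)
print(f"observed {rate:.0%}, plausible range {low:.0%} to {high:.0%}")
```

With only a handful of matches, an unbeaten record is still compatible with a much lower true win rate, which is consistent with the caveat that results could shift with more data.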
Imran
Imran@m_im_ha·
@voratiq What?! Can you give more details? How did you fully benchmark it?
voratiq
voratiq@voratiq·
GLM 5.2 high just won head-to-head against Opus 4.8 xhigh and GPT 5.5 xhigh. The task was a tricky performance optimization in an internal code-analysis product. First time we've seen an open-weight agent outperform the top closed agents. Very interesting result...
voratiq
voratiq@voratiq·
@zhihanz1205 It's pretty barebones! We just use a simple guardrails extension that keeps tool output from blowing up the context.
zhihanz
zhihanz@zhihanz1205·
@voratiq what is your pi configuration?
voratiq
voratiq@voratiq·
Want more insights on how coding agents perform on real work? The full Fable 5 breakdown (performance, cost, the win matrix, methodology) just went out to subscribers. Subscribe for the next one → voratiq.com/#subscribe
voratiq
voratiq@voratiq·
Fable 5 debuts at #1 on the Voratiq leaderboard, with an impressive margin over every previous leader. It excelled in hard & extra-hard tasks, but was outcompeted by weaker models on medium-difficulty ones. And it's expensive! So, Fable is the new SOTA, just not for every task.
voratiq
voratiq@voratiq·
Subscribe to our newsletter to get the Fable 5 deep dive when it drops → voratiq.com/#subscribe
voratiq
voratiq@voratiq·
Early, but, we'll just go ahead and say... Fable 5 is the strongest coding agent we've ever tested. Full results and analysis on Monday.
voratiq
voratiq@voratiq·
First Fable 5 run complete...
voratiq
voratiq@voratiq·
Assuming they all clear a capability floor, which is what we see here for the first time. Decorrelation without baseline performance isn't helpful, which was the case for many months, as GPT had such a remarkable lead over the Anthropic and Gemini agents.
voratiq
voratiq@voratiq·
Opus 4.8 xhigh performing at this level is exciting for multi-agent system design. Same-agent or same-family systems can help with things like context management. But generally you get higher performance overall when the agents have decorrelated strengths and weaknesses.
voratiq@voratiq

Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7, though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE, with a head-to-head edge of ~89%.

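The decorrelation point can be made concrete with a small sketch. It assumes two agents with known per-task success rates, a given correlation between their success indicators, and an idealized selector that always picks a successful attempt when one exists. All three are simplifying assumptions for illustration, not Voratiq's setup, and the numbers are hypothetical.

```python
from math import sqrt


def best_of_two(p_a: float, p_b: float, rho: float) -> float:
    """Probability that at least one of two agents solves a task.

    p_a, p_b: individual success rates; rho: correlation between the
    two success indicators (0 = independent, 1 = fail on the same tasks).
    """
    cov = rho * sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    both_fail = (1 - p_a) * (1 - p_b) + cov
    # Reject correlations that are impossible for these success rates.
    lower = max(0.0, 1 - p_a - p_b)
    upper = min(1 - p_a, 1 - p_b)
    if not (lower - 1e-12 <= both_fail <= upper + 1e-12):
        raise ValueError("rho is not feasible for these success rates")
    return 1 - both_fail


# Two equally strong agents: lower correlation gives a better combined rate.
print(best_of_two(0.7, 0.7, 0.8))  # highly correlated failures
print(best_of_two(0.7, 0.7, 0.0))  # independent failures

# Capability floor: a weak partner adds little even when uncorrelated.
print(best_of_two(0.7, 0.2, 0.0))
```

Under these assumptions, the combined success rate rises as the correlation falls, but pairing a strong agent with a much weaker one yields only a small gain, which matches the "capability floor" caveat in the follow-up post.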
voratiq
voratiq@voratiq·
If only someone actively tracked agent performance as a function of latency, cost, and reasoning level using a continuously evolving test set of real software engineering tasks. That would be useful.
Noam Brown@polynoamial

x.com/i/article/2057…

voratiq
voratiq@voratiq·
Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7, though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE, with a head-to-head edge of ~89%.
voratiq
voratiq@voratiq·
🍿
voratiq
voratiq@voratiq·
Also, Opus 4.7 xhigh is even more decorrelated, but it's too weak to be competitive.
voratiq
voratiq@voratiq·
Although GPT-5.5 xhigh is the strongest model we've measured, 5.3-codex high is a strong complement: it has high performance and an error profile decorrelated from 5.5 xhigh. 5.4 fails similarly to 5.5 xhigh, i.e. it is a less effective hedge.
voratiq
voratiq@voratiq·
Also, interestingly, 5.5 xhigh wins broadly across task types, not just in one category. Usually wins are more concentrated than this.
voratiq
voratiq@voratiq·
Current GPT-5.5 results from 40 head-to-head engineering runs: gpt-5-5-xhigh is the strongest coding agent we've evaluated so far. But gpt-5-5-high and gpt-5-5 are surprisingly weaker, both losing to their 5.4 equivalents.