voratiq

102 posts

@voratiq

Which coding agent wins on real work?

SF · Joined September 2025
0 Following · 113 Followers
Jeremy Howard
Jeremy Howard@jeremyphoward·
Wow. @Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It's super fast, inexpensive, and not too verbose. It responds with nuance and judgement, & handles long context VERY well. I've never experienced an open weights model like this before.
63
120
1.7K
105.5K
voratiq
voratiq@voratiq·
Sending out a deep dive to our subscribers early next week → voratiq.com/#subscribe
0
0
3
663
voratiq
voratiq@voratiq·
Still noisy though, will keep testing! This is all within an agentic coding context, on real SWE tasks. Of course, with more data, across more domains, results could shift.
1
0
4
715
voratiq
voratiq@voratiq·
After more head-to-head matches, we're finding GLM 5.2 high to be ... quite good.

Probability it beats:
- Opus 4.8 xhigh: 32%
- GPT-5.5 xhigh: 64%
- Kimi K2.7 Code (next-best open): 100%

Current best-estimate rank: 3rd of 56
4
10
126
9K
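One way to read head-to-head win probabilities like the ones above is as an Elo-style rating gap. This is only a sketch: the posts do not say how the probabilities or ranks are computed, so the logistic Elo formula and the `elo_gap` helper are assumptions made for illustration, not Voratiq's method.

```python
import math

def elo_gap(p_win: float) -> float:
    """Elo-style rating gap implied by a head-to-head win probability.

    Assumes the standard logistic Elo model (an illustrative assumption,
    not a method stated in the posts).
    """
    if not 0.0 < p_win < 1.0:
        # At exactly 0% or 100% the implied gap is unbounded.
        raise ValueError("rating gap is unbounded at 0% or 100%")
    return 400.0 * math.log10(p_win / (1.0 - p_win))

# Win probabilities quoted in the post (GLM 5.2 high vs. each opponent).
for opponent, p in [("Opus 4.8 xhigh", 0.32), ("GPT-5.5 xhigh", 0.64)]:
    print(f"vs {opponent}: implied gap {elo_gap(p):+.0f} points")
```

The 100% figure against Kimi K2.7 Code has no finite gap under this model, which is one reason a reported 100% is best read as "no losses observed so far" rather than a settled estimate.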
Imran
Imran@m_im_ha·
@voratiq What?! Can you give more details? How did you benchmark it?
2
0
0
580
voratiq
voratiq@voratiq·
GLM 5.2 high just won head-to-head against Opus 4.8 xhigh and GPT-5.5 xhigh. The task was a tricky performance optimization in an internal code-analysis product. First time we've seen an open-weight agent outperform the top closed agents. Very interesting result...
17
15
419
20.1K
voratiq
voratiq@voratiq·
@zhihanz1205 It's pretty barebones! We just use a simple guardrails extension that keeps tool output from blowing up the context.
0
0
6
1.2K
zhihanz
zhihanz@zhihanz1205·
@voratiq what is your pi configuration?
1
0
1
1.3K
voratiq
voratiq@voratiq·
Want more insights on how coding agents perform on real work? The full Fable 5 breakdown (performance, cost, the win matrix, methodology) just went out to subscribers. Subscribe for the next one → voratiq.com/#subscribe
0
0
2
249
voratiq
voratiq@voratiq·
Fable 5 debuts at #1 on the Voratiq leaderboard, with an impressive margin over every previous leader. It excelled in hard & extra-hard tasks, but was outcompeted by weaker models on medium-difficulty ones. And it's expensive! So, Fable is the new SOTA, just not for every task.
1
0
5
381
voratiq
voratiq@voratiq·
Subscribe to our newsletter to get the Fable 5 deep dive when it drops → voratiq.com/#subscribe
0
0
2
104
voratiq
voratiq@voratiq·
Early, but, we'll just go ahead and say... Fable 5 is the strongest coding agent we've ever tested. Full results and analysis on Monday.
1
1
4
357
voratiq
voratiq@voratiq·
First Fable 5 run complete...
0
0
5
398
voratiq
voratiq@voratiq·
Assuming they all clear a capability floor, which is what we see here for the first time. Decorrelation without baseline performance isn't helpful, which was the case for many months, as GPT had such a remarkable lead over the Anthropic and Gemini agents.
0
0
1
106
voratiq
voratiq@voratiq·
Opus 4.8 xhigh performing at this level is exciting for multi-agent system design. Same-agent or same-family systems can help with things like context management. But generally you get higher performance overall when the agents have decorrelated strengths and weaknesses.
voratiq@voratiq

Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7 - though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE - with a head-to-head edge of ~89%.

1
1
2
262
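A rough illustration of why decorrelated strengths and weaknesses matter, under loudly stated assumptions: a hypothetical two-agent system that succeeds whenever either agent succeeds, with each agent's failure modeled as a Bernoulli event and `rho` as the correlation between the two failure indicators. The failure rates are made-up numbers, not Voratiq measurements, and `both_fail` is a name introduced here for the sketch.

```python
import math

def both_fail(p_a: float, p_b: float, rho: float) -> float:
    """P(both agents fail the same task) for Bernoulli failures with correlation rho."""
    # Covariance of two Bernoulli indicators with correlation rho.
    cov = rho * math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    joint = p_a * p_b + cov
    # Clamp to the range that a valid joint distribution allows.
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return min(max(joint, lower), upper)

# Hypothetical failure rates, chosen only for illustration.
p_a, p_b = 0.30, 0.30
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}: pair fails on {both_fail(p_a, p_b, rho):.1%} of tasks")
```

With fully correlated failures the pair is no better than a single agent, while independent failures multiply the two failure rates together. The capability-floor caveat from the thread still applies: a decorrelated partner with a very high failure rate contributes little.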
voratiq
voratiq@voratiq·
If only someone actively tracked agent performance as a function of latency, cost, and reasoning level using a continuously evolving test set of real software engineering tasks. That would be useful.
Noam Brown@polynoamial

x.com/i/article/2057…

0
0
5
254
voratiq
voratiq@voratiq·
Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7 - though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE - with a head-to-head edge of ~89%.
1
0
5
928
voratiq
voratiq@voratiq·
🍿
1
0
4
302
voratiq
voratiq@voratiq·
also - opus 4.7 xhigh is even more decorrelated, but it's too weak to be competitive
0
0
1
96
voratiq
voratiq@voratiq·
Although GPT-5.5 xhigh is the strongest model we've measured, 5.3-codex high is a strong complement: it has high performance and an error profile decorrelated from 5.5 xhigh. 5.4 fails similarly to 5.5 xhigh, i.e. it is a less effective hedge.
1
1
4
363