voratiq (@voratiq) - Twitter Profile

voratiq@voratiq·1h

@jeremyphoward @Zai_org x.com/voratiq/status…

After more head-to-head matches, we're finding GLM 5.2 high to be ... quite good

Probability it beats:
- Opus 4.8 xhigh: 32%
- GPT-5.5 xhigh: 64%
- Kimi K2.7 Code (next-best open): 100%

Current best-estimate rank: 3rd of 56

Jeremy Howard@jeremyphoward·3h

Wow. @Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It's super fast, inexpensive, and not too verbose. It responds with nuance and judgement, & handles long context VERY well. I've never experienced an open weights model like this before.

voratiq@voratiq·9h

Sending out a deep dive to our subscribers early next week → voratiq.com/#subscribe

voratiq@voratiq·9h

Still noisy though, will keep testing! This is all within an agentic coding context, on real SWE tasks. Of course, with more data, across more domains, results could shift.
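
To illustrate why a head-to-head estimate stays noisy at small sample sizes, here is a minimal Python sketch of a Wilson score interval on an observed win rate. It is not the Voratiq methodology, which the thread does not describe; the function name `wilson_interval` and the match counts are hypothetical.

```python
import math


def wilson_interval(wins: int, matches: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a head-to-head win rate.

    `wins` and `matches` are hypothetical counts, not figures from the thread.
    """
    if matches <= 0:
        raise ValueError("matches must be positive")
    if not 0 <= wins <= matches:
        raise ValueError("wins must be between 0 and matches")
    p = wins / matches
    denom = 1 + z ** 2 / matches
    center = (p + z ** 2 / (2 * matches)) / denom
    half_width = z * math.sqrt(p * (1 - p) / matches + z ** 2 / (4 * matches ** 2)) / denom
    return center - half_width, center + half_width


# Same observed win rate (50%), different amounts of evidence.
for wins, matches in [(5, 10), (50, 100), (500, 1000)]:
    low, high = wilson_interval(wins, matches)
    print(f"{wins}/{matches}: {low:.2f} to {high:.2f}")
```

The interval narrows as matches accumulate, which is one way to read "results could shift" with more data.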

voratiq@voratiq·9h

After more head-to-head matches, we're finding GLM 5.2 high to be ... quite good

Probability it beats:
- Opus 4.8 xhigh: 32%
- GPT-5.5 xhigh: 64%
- Kimi K2.7 Code (next-best open): 100%

Current best-estimate rank: 3rd of 56

voratiq@voratiq·12h

@m_im_ha voratiq.com/#faqs

Imran@m_im_ha·20h

@voratiq What?! Can you give more details? How did you fully benchmark it?

voratiq@voratiq·1d

GLM 5.2 high just won head-to-head against Opus 4.8 xhigh and GPT 5.5 xhigh. The task was a tricky performance optimization in an internal code-analysis product. First time we've seen an open-weight agent outperform the top closed agents. Very interesting result...

voratiq@voratiq·1d

@zhihanz1205 It's pretty barebones! We just use a simple guardrails extension that keeps tool output from blowing up the context.

zhihanz@zhihanz1205·1d

@voratiq what is your pi configuration?

voratiq@voratiq·3d

Want more insights on how coding agents perform on real work? The full Fable 5 breakdown (performance, cost, the win matrix, methodology) just went out to subscribers. Subscribe for the next one → voratiq.com/#subscribe

voratiq@voratiq·3d

Fable 5 debuts at #1 on the Voratiq leaderboard, with an impressive margin over every previous leader. It excelled in hard & extra-hard tasks, but was outcompeted by weaker models on medium-difficulty ones. And it's expensive! So, Fable is the new SOTA, just not for every task.

voratiq@voratiq·5d

Subscribe to our newsletter to get the Fable 5 deep dive when it drops → voratiq.com/#subscribe

voratiq@voratiq·5d

Early, but, we'll just go ahead and say... Fable 5 is the strongest coding agent we've ever tested. Full results and analysis on Monday.

voratiq@voratiq·Jun 9

First Fable 5 run complete...

voratiq@voratiq·Jun 9

Assuming they all clear a capability floor, which is what we see here for the first time. Decorrelation without baseline performance isn't helpful, which was the case for many months, as GPT had such a remarkable lead over the Anthropic and Gemini agents.

voratiq@voratiq·Jun 9

Opus 4.8 xhigh performing at this level is exciting for multi-agent system design. Same-agent or same-family systems can help with things like context management. But generally you get higher performance overall when the agents have decorrelated strengths and weaknesses.

voratiq@voratiq

Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7 - though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE - with a head-to-head edge of ~89%.

voratiq@voratiq·Jun 9

If only someone actively tracked agent performance as a function of latency, cost, and reasoning level using a continuously evolving test set of real software engineering tasks. That would be useful.

Noam Brown@polynoamial

x.com/i/article/2057…

voratiq@voratiq·Jun 9

Full results: voratiq.com

voratiq@voratiq·Jun 9

Leaderboard update! Opus 4.8 xhigh takes #1, a clear step over 4.7 - though its edge on GPT-5.5 xhigh is within noise. For Qwen 3.6, the dense 27B strongly outperforms the 35B-A3B MoE - with a head-to-head edge of ~89%.

voratiq@voratiq·Jun 5

🍿

voratiq@voratiq·Jun 5

also - opus 4.7 xhigh is even more decorrelated, but it's too weak to be competitive

voratiq@voratiq·Jun 5

Although GPT-5.5 xhigh is the strongest model we've measured, 5.3-codex high is a strong complement: it has high performance and an error profile decorrelated from 5.5 xhigh. 5.4 fails similarly to 5.5 xhigh, i.e. it is a less effective hedge.
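
As a rough illustration of why a decorrelated error profile makes a better hedge, the Python sketch below compares how often at least one of two agents solves a task. The per-task pass/fail lists and the names `primary`, `similar`, and `complement` are made up for the example; they are not Voratiq data, and this is only one simple way to quantify complementarity.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks an agent solved."""
    return sum(results) / len(results)


def either_pass_rate(a: list[bool], b: list[bool]) -> float:
    """Fraction of tasks solved by at least one of two agents."""
    if len(a) != len(b):
        raise ValueError("both agents must be scored on the same tasks")
    return sum(bool(x or y) for x, y in zip(a, b)) / len(a)


# Hypothetical per-task outcomes on the same eight tasks.
primary    = [True, True, True, False, False, True, True, False]
similar    = [True, True, True, False, False, True, False, False]  # fails where primary fails
complement = [True, False, True, True, True, True, False, False]   # fails on different tasks

print("primary alone:       ", pass_rate(primary))
print("primary + similar:   ", either_pass_rate(primary, similar))
print("primary + complement:", either_pass_rate(primary, complement))
```

In this toy data the agent that fails on the same tasks adds nothing to the pair, while the one with different failures covers most of the primary agent's misses.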
