WolfBench

7 posts

@WolfBenchAI

https://t.co/IGlGiHLlnE // @WolframRvnwlf's new evaluation framework for models and agents: because one score is not enough! // brought to you by @CoreWeave/@wandb

Joined March 2026
15 Following · 26 Followers
Pinned Tweet
WolfBench @WolfBenchAI
Introducing WolfBench: @WolframRvnwlf's new evaluation framework for models and agents, brought to you by @wandb. Single-score metrics don't adequately describe model performance and capabilities. Here's how the new WolfBench framework solves that problem: wandb.ai/wandb_fc/wolfb…
WolfBench @WolfBenchAI
What we see here is not only GPT 5.4's average score rising from an abysmal 31% to the top score of 61% (thinking: low) or even 71% (thinking: xhigh), but also its solid base, the tasks it always solves, rising from 7% to 45% or even 52%. That means it's not only good on average, but solidly good and consistently reliable. Its ceiling rose to an astounding 85% on xhigh, so in theory it could solve almost all of the tasks. If you have the funds, this looks to be your best choice. But if you want to save some money, the default low thinking is still second only to this.
WolfBench @WolfBenchAI
GPT 5.4 is not just more reliable now with the latest @openclaw version, it's the best model I've tested on @WolfBenchAI, surpassing even Opus 4.6 with just its default settings (low reasoning for GPT, adaptive reasoning for Opus). And with xhigh thinking, it goes even higher! 🚀
Peter Steinberger 🦞 @steipete

New @openclaw beta bits are up! With Hunter🏹 Alpha (1M context!) and Healer🩹 Alpha FREE stealth models from @OpenRouter. Also, GPT 5.4 and @Kimi_Moonshot Coding now are more reliable, and lots of fixes around ACP and message handling. github.com/openclaw/openc…

WolfBench @WolfBenchAI
@WolframRvnwlf @wandb Most benchmarks give you one number, but a single score tells you almost nothing, because performance is a distribution, not a point. WolfBench shows four metrics: the rock-solid base you can always count on, the average, the best a single run achieved, and the ceiling.
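To make the four metrics concrete, here is a minimal sketch of how they could be computed from repeated runs. It is not WolfBench's actual code: the function name `spread_metrics`, the `SpreadMetrics` type, and the pass/fail matrix layout are hypothetical, and the reading of "ceiling" as "tasks solved in at least one run" is an assumption based on the description above.

```python
from dataclasses import dataclass


@dataclass
class SpreadMetrics:
    base: float     # share of tasks solved in every run
    average: float  # mean per-run score
    best: float     # highest score achieved by a single run
    ceiling: float  # share of tasks solved in at least one run (assumed definition)


def spread_metrics(runs: list[list[bool]]) -> SpreadMetrics:
    """Summarize repeated benchmark runs; runs[r][t] is True if run r solved task t."""
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one task")
    n_tasks = len(runs[0])
    if any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must cover the same tasks")

    per_run_scores = [sum(run) / n_tasks for run in runs]
    # zip(*runs) regroups the results by task instead of by run.
    always_solved = sum(all(task_results) for task_results in zip(*runs))
    ever_solved = sum(any(task_results) for task_results in zip(*runs))

    return SpreadMetrics(
        base=always_solved / n_tasks,
        average=sum(per_run_scores) / len(runs),
        best=max(per_run_scores),
        ceiling=ever_solved / n_tasks,
    )


# Hypothetical data: three runs over four tasks.
example = spread_metrics([
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, False],
])
print(example)
```

In this made-up example every run scores 50%, yet only one task in four is solved every time and three in four are solved at least once, which is the spread a single average would hide.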