WolfBench (@WolfBenchAI) - Twitter Profile

Pinned Tweet

WolfBench@WolfBenchAI·10 Mar

Introducing WolfBench: @WolframRvnwlf's new evaluation framework for models and agents, brought to you by @wandb Single score metrics don't adequately describe model performance and capabilities. Here's how the new WolfBench framework solves that problem: wandb.ai/wandb_fc/wolfb…


WolfBench@WolfBenchAI·13 Mar

What we see here is not only GPT 5.4's average score rising from an abysmal 31% to the top score of 61% (thinking: low) or even 71% (thinking: xhigh), but also its solid base (the tasks it always solves) rising from 7% to 45% or even 52%. That means it's not only good on average, but solidly good and consistently reliable. Its ceiling rose to an astounding 85% on xhigh, so in theory it could solve almost all of the tasks. If you have the funds, this looks to be your best choice. But if you want to save some money, the default low thinking is still second only to this.


WolfBench@WolfBenchAI·13 Mar

GPT 5.4 is not just more reliable now with the latest @openclaw version, it's the best model I've tested on @WolfBenchAI, surpassing even Opus 4.6 with just its default settings (low reasoning for GPT, adaptive reasoning for Opus). And with xhigh thinking, it goes even higher! 🚀

Peter Steinberger 🦞@steipete

New @openclaw beta bits are up! With Hunter🏹 Alpha (1M context!) and Healer🩹 Alpha FREE stealth models from @OpenRouter Also, GPT 5.4 and @Kimi_Moonshot Coding now are more reliable, and lots of fixes around ACP and message handling. github.com/openclaw/openc…


WolfBench@WolfBenchAI·12 Mar

x.com/i/article/2031…


WolfBench retweeted

Alex Volkov@altryne·10 Mar

@wandb @zubinaysola Another drop today, first announced on @thursdai_pod: we launched wolfbench.ai! x.com/WolfBenchAI/st…

WolfBench@WolfBenchAI

Introducing WolfBench: @WolframRvnwlf's new evaluation framework for models and agents, brought to you by @wandb Single score metrics don't adequately describe model performance and capabilities. Here's how the new WolfBench framework solves that problem: wandb.ai/wandb_fc/wolfb…


WolfBench@WolfBenchAI·10 Mar

@WolframRvnwlf @wandb We're just getting started! 🚀 Learn all about it here and explore the latest results interactively: wolfbench.ai


WolfBench@WolfBenchAI·10 Mar

@WolframRvnwlf @wandb Most benchmarks give you one number. But: A single score tells you almost nothing. That's because performance is a distribution, not a point. WolfBench shows four metrics: the rock-solid base you can always count on, the average, the best a single run achieved, and the ceiling.
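
To make the four metrics concrete, here is a minimal Python sketch. It is not WolfBench's actual implementation; the definitions are assumptions inferred from the wording in these tweets: "base" is taken as the share of tasks solved in every run, "average" as the mean per-run score, "best" as the highest single-run score, and "ceiling" as the share of tasks solved in at least one run. The function name `wolfbench_style_metrics` and the input layout are hypothetical.

```python
from typing import Dict, List


def wolfbench_style_metrics(runs: List[List[bool]]) -> Dict[str, float]:
    """Summarize repeated benchmark runs with four numbers.

    runs[r][t] is True when run r solved task t.
    All definitions below are assumptions, not the official WolfBench ones.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run and one task")
    n_tasks = len(runs[0])
    if any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must cover the same tasks")

    # Score of each individual run: fraction of tasks it solved.
    per_run = [sum(run) / n_tasks for run in runs]

    # Transpose so each tuple holds one task's outcomes across all runs.
    by_task = list(zip(*runs))

    return {
        # Tasks solved in every run (assumed meaning of "base").
        "base": sum(all(outcomes) for outcomes in by_task) / n_tasks,
        # Mean of the per-run scores.
        "average": sum(per_run) / len(per_run),
        # Highest score any single run achieved.
        "best": max(per_run),
        # Tasks solved in at least one run (assumed meaning of "ceiling").
        "ceiling": sum(any(outcomes) for outcomes in by_task) / n_tasks,
    }


# Made-up data: three runs over four tasks.
example_runs = [
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, True],
]
metrics = wolfbench_style_metrics(example_runs)
print(metrics)
```

With this made-up data, only the first task is solved in every run, so the base is 0.25, while every task is solved at least once, so the ceiling is 1.0. The best single run scores 0.75, which shows how the ceiling can sit above anything one run actually achieved.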


