VulcanBench (@VulcanBench) - Twitter profile

VulcanBench@VulcanBench·35m

@LoopOnChain @morganlinton That's a great question. I haven't done a comparison there yet, but it definitely sounds like an interesting one to evaluate!


Amart (LOOP)⚡️@LoopOnChain·42m

@morganlinton @VulcanBench Thanks for doing this! Any thoughts on Sonnet's tone and how it talks? I've noticed it actually sounds less "AI" than Opus, but maybe that's just me.


Morgan@morganlinton·1h

Just finished my first set of benchmarks comparing Sonnet 5 and Opus 4.8 on @VulcanBench across effort levels. This is the first of a number of benchmarks I plan on doing with Sonnet 5, but I wanted to get something kinda unique and different out there. And yes, don't worry, I'll do a GLM 5.2 comparison.

I don't think many benchmarks are looking at how changing effort levels impacts accuracy, so I thought this could be an interesting angle to start with. In total I ran 936 test runs across both models and three different effort levels. Here are the high-level results; I'll be sharing a more detailed overview in a full report tomorrow:

💸 Sonnet 5 matches Opus 4.8's accuracy at roughly HALF the cost per run.
🎯 Sonnet 5 at high effort is the only config to solve all 52 tasks (100%), and it still costs less than Opus at high effort.
📈 Reasoning effort scales Sonnet 5 (97 to 100%) but does nothing for Opus 4.8 (flat 97 to 98%). Extra thinking is wasted on Opus here.
🧮 Every single Opus cell is Pareto-dominated on cost vs quality.
⚡ Opus's one edge: fewer tokens, faster runs. Sonnet "thinks" ~2x harder at high effort but wins on price anyway (its tokens are 3/5 the cost).
⚠️ Honesty check: this suite saturates the frontier, so the accuracy gaps are tiny (1 to 3 tasks). The real signal is cost and how each model responds to effort, not raw capability.

And please remember, I'm not some kind of benchmarking expert. This is new territory for me, so I'm testing and learning as I go. I'm still working on updated evals that will be harder for these models so I can start to get scores in the 80% - 90% range. That being said, there are some interesting insights from this benchmark run, and it gives me a lot more ideas for the next run!
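For readers unfamiliar with the "Pareto-dominated" claim above, here is a minimal sketch of how such a check could be done on (cost, accuracy) pairs. The post does not publish per-cell costs or per-level accuracies, so every number below is a hypothetical placeholder chosen only to illustrate the mechanics, and the names `Config`, `dominates`, and `pareto_front` are illustrative, not part of VulcanBench.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float      # dollars per run (HYPOTHETICAL placeholder)
    accuracy: float  # fraction of tasks solved (HYPOTHETICAL placeholder)


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.accuracy >= b.accuracy
    strictly_better = a.cost < b.cost or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configs that no other config dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Placeholder cells: NOT the benchmark's real results.
cells = [
    Config("sonnet-low", 0.10, 0.97),
    Config("sonnet-medium", 0.14, 0.98),
    Config("sonnet-high", 0.20, 1.00),
    Config("opus-low", 0.20, 0.97),
    Config("opus-medium", 0.28, 0.98),
    Config("opus-high", 0.40, 0.98),
]

front = pareto_front(cells)
for c in front:
    print(c.name, c.cost, c.accuracy)
```

With these placeholder values, each Opus cell is matched or beaten on accuracy by a cheaper Sonnet cell, so only Sonnet cells remain on the front. That is the shape of result the post describes, though the real figures may differ.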


VulcanBench@VulcanBench·36m

If you found this useful: I spent ~$100 to run this, with no sponsors and nobody funding it but myself, so please feel free to like and share this if you think others would find it useful too! More to come. Live long and benchmark 🖖


VulcanBench@VulcanBench·36m

And of course, the key findings:


VulcanBench@VulcanBench·36m

And here's a quick thread with some more details about the Sonnet 5 vs. Opus 4.8 benchmark completed today. My first VulcanBench 🧵


VulcanBench@VulcanBench·51m

The first Sonnet 5 vs. Opus 4.8 benchmark is now complete, with over 900 runs. It tests relatively routine coding tasks an engineer might give a coding agent. What it shows is that, for most normal, everyday tasks, giving them to Opus is probably overkill. Sonnet 5 can actually handle a lot more than you would think. I'm dialing up the difficulty to see where there might be more differentiation, but there are some very interesting insights from this first benchmark. More to come!


VulcanBench@VulcanBench·7h

Very interesting comparison of GLM 5.2 and Sonnet 5. And yes, you can expect this comparison will be something we do with VulcanBench this week too.

Max Weinbach@mweinbach

Sonnet 5 medium is better than GLM 5.2 high and roughly the same price, hilarious tbh


VulcanBench@VulcanBench·7h

@mweinbach Super interesting Max, thanks for sharing.


Max Weinbach@mweinbach·7h

Here are the final models from both. Claude: docs.google.com/spreadsheets/d… Gemini: docs.google.com/spreadsheets/d…


Max Weinbach@mweinbach·7h

Just ran a prompt in our @DiligenceStack agent with Claude Sonnet 5 and Gemini 3.5 Flash, both at high reasoning. Claude was $18.41; Gemini was $1.12.


VulcanBench@VulcanBench·7h

Run status: ~2.4 runs/min, all authenticating and pricing correctly. Sweep is working through effort=low first, then medium, then high, for Sonnet 5, then the same for Opus 4.8. ETA ~5-7h
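As a rough sanity check on that ETA, the sketch below assumes the sweep totals the 936 runs reported in the results post elsewhere in this feed and that the ~2.4 runs/min rate holds for the whole sweep. The helper name `eta_hours` is hypothetical. The repetitions-per-task figure at the end is only implied by the arithmetic, not stated by the author.

```python
def eta_hours(total_runs: int, runs_per_min: float, completed: int = 0) -> float:
    """Hours remaining if the observed throughput stays constant."""
    remaining = max(total_runs - completed, 0)
    return remaining / runs_per_min / 60.0


TOTAL_RUNS = 936  # total reported in the results post
RATE = 2.4        # runs per minute, from the status update

full_sweep = eta_hours(TOTAL_RUNS, RATE)
print(f"full sweep: ~{full_sweep:.1f} h")  # prints "full sweep: ~6.5 h"

# 2 models x 3 effort levels x 52 tasks = 312 (model, effort, task) cells.
# 936 / 312 = 3, consistent with three repetitions per cell
# (an inference from the totals, NOT something the posts state).
MODELS, EFFORTS, TASKS = 2, 3, 52
runs_per_cell = TOTAL_RUNS / (MODELS * EFFORTS * TASKS)
print(f"runs per (model, effort, task): {runs_per_cell:.0f}")
```

About 6.5 hours for the full sweep sits inside the quoted 5-7h window, so the stated rate and ETA are mutually consistent under these assumptions.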

VulcanBench@VulcanBench

I'm running a number of benchmarks on Sonnet 5. The first one is something unique that I don't think anyone else is benchmarking right now: evals comparing Opus 4.8 vs. Sonnet 5 across reasoning levels. My theory is that you can use Sonnet 5 in cases where you used to use Opus 4.8, but I'm curious what level of reasoning you can get away with. Most people never even try Low or Medium, so I want to see if it might be time to dip back into the lower effort bucket with this model. Here's a rundown of what I'm going to test:


VulcanBench@VulcanBench·7h

@dedene @morganlinton Yes!


Peter Dedene@dedene·8h

@morganlinton @VulcanBench Looking forward to those results! 👀


Morgan@morganlinton·9h

Soooo excited for this!

ClaudeDevs@ClaudeDevs

Claude Sonnet 5 is here. Top-tier performance on coding and tool use at Sonnet pricing, with a 1M context window. It's the new default in Claude Code for Pro users, and available everywhere on the Claude Platform, including the API and Managed Agents.



VulcanBench@VulcanBench·1d

Just finished running an initial test of the new Carbyne Tier tests, and they still aren't hard enough, so continuing to refine. Still, I want to share every step of the journey building the evals for VulcanBench, so here's a quick recap of how the hardest tier of evals performed across GLM 5.2, Opus 4.8, and GPT 5.5.


VulcanBench@VulcanBench·2d

@scouzi Ty so much for the kind words 2damoon


2damoon@scouzi·2d

Excellent insights from VulcanBench. Liking this epo more and more.

VulcanBench@VulcanBench

Yesterday I added 12 new tasks to VulcanBench as I work on creating more difficult tasks. I did an overnight run with the new tasks, and it looks like they still aren't quite hard enough. Still, I'm learning more and more about GLM 5.2 as a model. It definitely has an over-thinking problem: where both GPT 5.5 and Opus 4.8 landed in the 20s - 30s range per task, GLM 5.2 was at 164s. What I think makes this particularly interesting is that people are saying that GLM 5.2 is significantly cheaper than GPT 5.5 and Opus 4.8, but on real coding tasks, while it is cheaper, it's not dramatically cheaper, and it is much slower. More to come, but there are still some interesting insights from this overnight run, which also happens to be the most expensive run I've done so far, so it's good to have some lessons learned from it!


VulcanBench@VulcanBench·2d

Added more difficult tasks to VulcanBench today, introducing two new tiers, Diamond and Carbyne. And yes, for Carbyne I asked Claude what was harder than a Diamond! See summary below of each, getting ready to run these against GLM 5.2, Opus 4.8 and GPT 5.5.



VulcanBench@VulcanBench·3d

@shiri_shh Not sure this is the best benchmark to be using anymore. There's not enough differentiation between models; everything falls within the margin of error. I think something like DeepSWE will likely illustrate the differences better.


shirish@shiri_shh·4d

what happens once it hits 100% ?

OpenAI@OpenAI

GPT‑5.6 Sol sets a new state of the art on Terminal‑Bench 2.1, which tests complex command-line workflows requiring planning, iteration, and tool coordination.


VulcanBench@VulcanBench·3d

I'm not sure TerminalBench is really showcasing how much better 5.6-Sol is. This makes it look only incrementally better. I think there are likely better ways to benchmark these models that more accurately reflect the work real engineering teams do. Seeing most of these models pretty much tied doesn't really make GPT-5.6 look like much of a breakthrough tbh, but my guess is, it is!


Greg Brockman@gdb·4d

GPT-5.6 Sol preview — it's a good model:

OpenAI@OpenAI

Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work. openai.com/index/previewi…


VulcanBench@VulcanBench·3d

@BlackWolfNews @ChaiKamNamak @morganlinton @cursor_ai Well to be fair, I haven't considered all of that yet, still my first month doing this, so a lot more to learn and do!


BlackWolfNews@BlackWolfNews·4d

@ChaiKamNamak @morganlinton @cursor_ai @vulcanbench Morgan is building his own evaluation set. He has to consider what is "cheating", what is allowed, and how to guard or guide for the desired or expected result of his test or tests to be well reflected.


Cursor@cursor_ai·5d

We're sharing new research on how models hack public benchmarks. The latest models, including Opus 4.8 and Composer 2.5, learn to retrieve solutions from the internet or git history. When we apply a stricter harness, eval scores drop significantly.

