VulcanBench (@VulcanBench) - Twitter Profile

Run status: ~2.4 runs/min, all authenticating and pricing correctly. Sweep is working through effort=low first, then medium, then high, for Sonnet 5, then the same for Opus 4.8. ETA ~5-7h
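A minimal sketch of the sweep order described in the status update, assuming the harness simply iterates models in the outer loop and effort levels in the inner loop. The identifiers `MODELS`, `EFFORTS`, `sweep_order`, and `eta_hours` are hypothetical and not taken from VulcanBench. At ~2.4 runs/min, a 5-7h ETA would correspond to roughly 720 to 1,008 remaining runs if the rate holds; that count is derived here, not stated in the post.

```python
from itertools import product

# Hypothetical labels: the post names the models and effort levels,
# but not how the harness refers to them internally.
MODELS = ["Sonnet 5", "Opus 4.8"]
EFFORTS = ["low", "medium", "high"]


def sweep_order(models, efforts):
    """Return (model, effort) pairs: every effort level for the first
    model, then every effort level for the next model."""
    return list(product(models, efforts))


def eta_hours(total_runs, runs_per_min):
    """Estimated wall-clock hours, assuming a steady run rate."""
    return total_runs / runs_per_min / 60


order = sweep_order(MODELS, EFFORTS)
for model, effort in order:
    print(f"{model} effort={effort}")
```

Keeping the model in the outer loop means a partial sweep still yields a complete low/medium/high comparison for the first model before the second one starts.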

VulcanBench@VulcanBench

Running a number of benchmarks on Sonnet 5. The first one is something unique that I don't think anyone else is benchmarking right now: evals comparing Opus 4.8 and Sonnet 5 across reasoning levels. My theory is that you can use Sonnet 5 in cases where you used to use Opus 4.8, but I'm curious what level of reasoning you can get away with. Most people never even try Low or Medium, so I want to see if it might be time to dip back into the lower effort bucket with this model. Here's a rundown of what I'm going to test:

VulcanBench@VulcanBench·3h

@dedene @morganlinton Yes!

Peter Dedene@dedene·3h

@morganlinton @VulcanBench Looking forward to those results! 👀

Morgan@morganlinton·4h

Soooo excited for this!

ClaudeDevs@ClaudeDevs

Claude Sonnet 5 is here. Top-tier performance on coding and tool use at Sonnet pricing, with a 1M context window. It's the new default in Claude Code for Pro users, and available everywhere on the Claude Platform, including the API and Managed Agents.

VulcanBench@VulcanBench·1d

Just finished running an initial test of the new Carbyne Tier tests, and they still aren't hard enough, so continuing to refine. Still, I want to share every step of the journey building the evals for VulcanBench, so here's a quick recap of how the hardest tier of evals performed across GLM 5.2, Opus 4.8, and GPT 5.5.

VulcanBench@VulcanBench·1d

@scouzi Ty so much for the kind words, 2damoon

2damoon@scouzi·2d

Excellent insights from VulcanBench. Liking this repo more and more.

VulcanBench@VulcanBench

Yesterday I added 12 new tasks to VulcanBench as I work on creating more difficult tasks. I did an overnight run with the new tasks, and it looks like they still aren't quite hard enough. Still, I'm learning more and more about GLM 5.2 as a model. It definitely has an over-thinking problem: where both GPT 5.5 and Opus 4.8 landed in the 20-30s range per task, GLM 5.2 was at 164s. What I think makes this particularly interesting is that people are saying GLM 5.2 is significantly cheaper than GPT 5.5 and Opus 4.8. On real coding tasks it is cheaper, but not dramatically cheaper, and it is much slower. More to come, but there are still some interesting insights from this overnight run, which also happens to be the most expensive run I've done so far, so it's good to have some lessons learned from it!

VulcanBench@VulcanBench·1d

Added more difficult tasks to VulcanBench today, introducing two new tiers, Diamond and Carbyne. And yes, for Carbyne I asked Claude what was harder than a Diamond! See summary below of each, getting ready to run these against GLM 5.2, Opus 4.8 and GPT 5.5.

VulcanBench@VulcanBench·3d

@shiri_shh Not sure this is the best benchmark to be using any more, not enough differentiation between models, everything falls within a margin of error. I am thinking something like DeepSWE will likely better illustrate the differences.

shirish@shiri_shh·4d

what happens once it hits 100% ?

OpenAI@OpenAI

GPT‑5.6 Sol sets a new state of the art on Terminal‑Bench 2.1, which tests complex command-line workflows requiring planning, iteration, and tool coordination.

VulcanBench@VulcanBench·3d

I'm not sure Terminal-Bench is really showcasing how much better 5.6-Sol is. This makes it look only very slightly better. I think there are likely better ways to benchmark these models that more accurately reflect the work real engineering teams do. Seeing most of these models pretty much tied doesn't really make GPT-5.6 look like much of a breakthrough tbh, but my guess is that it is!

Greg Brockman@gdb·4d

GPT-5.6 Sol preview — it's a good model:

OpenAI@OpenAI

Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work. openai.com/index/previewi…

VulcanBench@VulcanBench·3d

@BlackWolfNews @ChaiKamNamak @morganlinton @cursor_ai Well to be fair, I haven't considered all of that yet, still my first month doing this, so a lot more to learn and do!

BlackWolfNews@BlackWolfNews·4d

@ChaiKamNamak @morganlinton @cursor_ai @vulcanbench Morgan is building his own evaluation set. He has to consider what is "cheating", what is allowed, and how to guard or guide for the desired or expected result of his test or tests to be well reflected.

Cursor@cursor_ai·5d

We're sharing new research on how models hack public benchmarks. The latest models, including Opus 4.8 and Composer 2.5, learn to retrieve solutions from the internet or git history. When we apply a stricter harness, eval scores drop significantly.

VulcanBench@VulcanBench·3d

@AndrewGinns @morganlinton So true, and also having fun trying to put tests together that more accurately reflect the work real engineering teams do.

Andrew Ginns@AndrewGinns·4d

@morganlinton @vulcanbench Nothing quite like writing your own evals to understand model behaviour ❤️

Morgan@morganlinton·4d

My second @vulcanbench benchmark run comparing GLM 5.2 x Opus 4.8 x GPT 5.5 is now complete, and I have some more interesting results to share.

I made some updates to VulcanBench last night, mostly focused on decreasing the total number of tasks (from 52 down to 35) and increasing the number of hard tasks (though it turns out I didn't make them hard enough). Additionally, I ran three passes instead of one to make the results more statistically robust. There's still some work to do on task difficulty, since GPT 5.5 got a perfect score, so this morning I'm working on really ramping up difficulty for ~25% of the tasks.

That being said, there are some really good nuggets. High-level findings below:

1. GPT 5.5 got the highest score and used the fewest tokens. It was the only model that aced all 35 tests in all three passes.
2. GLM 5.2 was the cheapest, coming in around 28% cheaper than GPT 5.5 and Opus 4.8.
3. While cheaper, GLM 5.2 was much slower than GPT 5.5 and Opus 4.8, averaging around 270% slower.
4. GLM 5.2 also used far more tokens: 4.4M vs. 1.48M for GPT 5.5 and 1.94M for Opus 4.8. This is directly connected to the speed and is also why it was so slow.
5. Opus 4.8 and GLM 5.2 actually tied for accuracy. But I think we can kinda ignore accuracy for now, as it's clear my tasks are too easy. Once I ratchet up the difficulty, it will be interesting to see how this pans out.

With each benchmark run I'm learning more. It's still not perfect, but there are some good nuggets from this one. Now to make my tasks harder, these models are sharp!
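The token figures in the post above can be put side by side with a little arithmetic. This is only a sketch: the token totals are copied from the post, while the helper names (`token_ratio`, `slowdown_factor`) and the reading of "270% slower" as "270% more wall-clock time" are assumptions made here for illustration.

```python
# Token totals as reported in the post, in millions of tokens.
TOKENS_M = {"GLM 5.2": 4.4, "GPT 5.5": 1.48, "Opus 4.8": 1.94}


def token_ratio(model, baseline):
    """How many times more tokens `model` used than `baseline`."""
    return TOKENS_M[model] / TOKENS_M[baseline]


def slowdown_factor(percent_slower):
    """Convert 'N% slower' into a time multiplier.

    Assumption: 'N% slower' means N% more wall-clock time,
    so 270% slower is read as 3.7x the time.
    """
    return 1 + percent_slower / 100


print(round(token_ratio("GLM 5.2", "GPT 5.5"), 2))   # GLM vs. GPT 5.5
print(round(token_ratio("GLM 5.2", "Opus 4.8"), 2))  # GLM vs. Opus 4.8
print(round(slowdown_factor(270), 1))                # time multiplier
```

Under these assumptions, GLM 5.2 used roughly 3x the tokens of GPT 5.5 and about 2.3x those of Opus 4.8, which is in the same ballpark as the reported slowdown and fits the post's point that token volume drives the speed gap.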

VulcanBench@VulcanBench·4d

@MicahCarroll Good insights, thanks for sharing Micah, looking forward to coming up with some custom benchmarks for this one to dig into this more specifically.

Micah Carroll@MicahCarroll·4d

GPT-5.6 Sol is a significant step up in capabilities, but can also exhibit concerning forms of misaligned behaviors in agentic coding settings. The system card contains some of our analyses on this, which leveraged deployment simulations and our internal CoT monitoring systems.

VulcanBench@VulcanBench·4d

Good insights on GPT-5.6 Sol from Micah.

VulcanBench@VulcanBench·4d

Ohhhh, tasty new models to benchmark.

VulcanBench@VulcanBench·4d

@DevAdventur3s @morganlinton Glad you found it useful, only gets even more useful from here! Still a lot to improve.

Developing Adventures@DevAdventur3s·4d

@morganlinton @vulcanbench Thanks for benchmarking this. I was wondering how it would perform on real-world tasks, and this proves it 🙂

VulcanBench@VulcanBench·4d

Task difficulty refinement continues. For anyone interested, here's an update on where things stand now, will know more about the performance on the updated tasks soon.

VulcanBench@VulcanBench·4d

@morganlinton 🖖

VulcanBench@VulcanBench·4d

Second run of VulcanBench with updated tests and now three passes. See key insights in Morgan's post above.
