
VulcanBench

@VulcanBench
Open Source LLM benchmarking tool, focused on real world tests, large codebases, full transparency. An Open Source project by @morganlinton.

Sonnet 5 medium is better than GLM 5.2 high and roughly the same price, hilarious tbh



I'm running a number of benchmarks on Sonnet 5. The first one is something unique that I don't think anyone else is benchmarking right now: evals comparing Opus 4.8 and Sonnet 5 across reasoning levels. My theory is that you can use Sonnet 5 in cases where you used to use Opus 4.8, but I'm curious what level of reasoning you can get away with. Most people never even try Low or Medium, and I want to see if it might be time to dip back into the lower-effort bucket with this model. Here's a rundown of what I'm going to test:

Claude Sonnet 5 is here. Top-tier performance on coding and tool use at Sonnet pricing, with a 1M context window. It's the new default in Claude Code for Pro users, and available everywhere on the Claude Platform, including the API and Managed Agents.

Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work. openai.com/index/previewi…

GPT-5.6 Sol is a significant step up in capabilities, but can also exhibit concerning forms of misaligned behaviors in agentic coding settings. The system card contains some of our analyses on this, which leveraged deployment simulations and our internal CoT monitoring systems.




My second @vulcanbench benchmark run comparing GLM 5.2 x Opus 4.8 x GPT 5.5 is now complete, and I have some more interesting results to share.

I made some updates to VulcanBench last night, mostly focused on decreasing the total number of tasks (from 52 down to 35) and increasing the number of hard tasks (though it turns out I didn't make them hard enough). I also ran three passes instead of one so the results depend less on any single run. There's still some work to do on task difficulty, since GPT 5.5 got a perfect score, so this morning I'm working on really ramping up difficulty for ~25% of the tasks.

That being said, there are some really good nuggets. High-level findings below:

1. GPT 5.5 got the highest score and used the fewest tokens. It was the only model that aced all 35 tests in all three passes.
2. GLM 5.2 was the cheapest, coming in around 28% cheaper than GPT 5.5 and Opus 4.8.
3. While cheaper, GLM 5.2 was much slower than GPT 5.5 and Opus 4.8, averaging around 270% slower.
4. GLM 5.2 also used way more tokens than Opus 4.8 or GPT 5.5: 4.4M vs. 1.48M for GPT 5.5 and 1.94M for Opus 4.8. That token volume is directly connected to speed and is why it was so slow.
5. Opus 4.8 and GLM 5.2 actually tied for accuracy. But I think we can kinda ignore accuracy for now, as it's clear my tasks are too easy. Once I ratchet up the difficulty, it will be interesting to see how this pans out.

With each benchmark run I'm learning more. It's still not perfect, but there are some good nuggets from this one. Now to make my tasks harder, these models are sharp!
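To put the token numbers above in relative terms, here is a rough Python sketch that turns the reported totals into ratios. The dictionary, the `ratio_to` helper, and the variable names are made up for illustration and are not part of VulcanBench. Reading "28% cheaper" as "about 72% of the cost" is also an assumption about how that figure was computed.

```python
# Token totals as reported in the post, in millions of tokens.
tokens_m = {"GLM 5.2": 4.4, "GPT 5.5": 1.48, "Opus 4.8": 1.94}


def ratio_to(baseline: str, model: str = "GLM 5.2") -> float:
    """How many times more tokens `model` used than `baseline`."""
    return tokens_m[model] / tokens_m[baseline]


for baseline in ("GPT 5.5", "Opus 4.8"):
    print(f"GLM 5.2 used {ratio_to(baseline):.2f}x the tokens of {baseline}")

# Assumption: "around 28% cheaper" means GLM 5.2 cost about 72% of the others.
glm_relative_cost = 1 - 0.28
print(f"GLM 5.2 relative cost: {glm_relative_cost:.2f}")
```

On these figures, GLM 5.2 used roughly 3x the tokens of GPT 5.5 and a bit over 2x the tokens of Opus 4.8, while still costing less overall.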