VulcanBench

381 posts

@vulcanbench

Benchmarking LLMs @, focused on real world tests, large codebases, open source, full transparency.

Joined March 2020
73 Following · 216 Followers
Przemek Chojecki | PC
Przemek Chojecki | PC@prz_chojecki·
Kimi 2.7 ranked 2nd, after Fable 5 and before GPT-5 xhigh. We have re-run our ErdosBench smoke test on 14 problems with Kimi 2.7, Qwen 3.7 Max, and Grok 4.3, and compared them with the top performers from previous runs. Kimi 2.7 is amazingly good. More below.
Przemek Chojecki | PC tweet media
163
538
5K
1.8M
VulcanBench
VulcanBench@vulcanbench·
@theo This is a good one Theo, thanks for putting it together.
0
0
0
6
Theo - t3.gg
Theo - t3.gg@theo·
We have Mythos in our Claude Code subs for 10 more days. I made a video about tokenmaxxing so we can get the most out of it.
59
16
474
164.2K
VulcanBench
VulcanBench@vulcanbench·
A bit more of the thought process behind building Vulcan Bench. Currently testing workflows like the one below, which shows how you could compare Opus 4.8 with GPT 5.5 at different reasoning levels:
Step 1: vulcanbench run --suite v1 --model anthropic:claude-opus-4-8 --effort low --repeat 5
Step 2: vulcanbench run --suite v1 --model openai:gpt-5.5 --effort medium --repeat 5
Step 3: vulcanbench leaderboard
Morgan@morganlinton

I am starting to realize more and more that we can’t just look at, and benchmark models without comparing different effort levels. Fable 5 is what pushed me to think about this more. I’m finding Fable 5 in low and medium effort, produces the same or better output than a lot of other models at high and xhigh. At the same time, I’m experimenting with just normal routine tasks, and finding even Fable low is overkill. There are soooo many tasks that Grok Build, Composer 2.5, SWE-1.6, GLM 5.1, and other models can do, at the exact same accuracy level as Fable. And that’s comparing to Fable low, on tasks that Fable Max produces the exact same output. Yes, increasing thinking depth doesn’t mean it gets it more right, sometimes small and medium problems don’t need the most bleeding-edge frontier model in the world to reach the optimal solution. We keep benchmarking models all at the same effort levels, and I think that could be a mistake. We need to look at effort as another key variable, and optimize for a combination of model and effort, coupled with task complexity and codebase size. This is one of the things I’m thinking through more deeply with @vulcanbench which I’m going to release, open source, this weekend.
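
Treating effort as its own variable amounts to running every (model, effort) pair and then picking the cheapest pair that is accurate enough for the task. Here is a rough sketch of that idea. The `vulcanbench run` flags and model ids are copied from the post above; the `"high"` effort value, the result-row fields (`accuracy`, `cost`), and both helper functions are assumptions made for illustration, not part of VulcanBench.

```python
from itertools import product

# Model ids and flags come from the post above. Whether the CLI accepts
# "high" as an --effort value is an assumption.
MODELS = ["anthropic:claude-opus-4-8", "openai:gpt-5.5"]
EFFORTS = ["low", "medium", "high"]


def build_commands(models, efforts, suite="v1", repeat=5):
    """Return one `vulcanbench run` argv list per (model, effort) pair."""
    return [
        ["vulcanbench", "run", "--suite", suite, "--model", model,
         "--effort", effort, "--repeat", str(repeat)]
        for model, effort in product(models, efforts)
    ]


def cheapest_adequate(results, min_accuracy):
    """Pick the lowest-cost row whose accuracy meets the bar, or None.

    `results` is a list of dicts with hypothetical keys "accuracy" and "cost".
    """
    adequate = [row for row in results if row["accuracy"] >= min_accuracy]
    return min(adequate, key=lambda row: row["cost"]) if adequate else None


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch has no side effects.
    for argv in build_commands(MODELS, EFFORTS):
        print(" ".join(argv))
```

The accuracy bar would be set per task: a routine task might accept the cheapest model at low effort, while a large-codebase task might need a higher bar.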

1
1
1
1.2K
Andrej Karpathy
Andrej Karpathy@karpathy·
This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The benchmarks are great and it's SOTA on everything by a margin but I'll add that *qualitatively* also, this is a major-version-bump-deserving step change forward (imo of the same order as Claude 4.5 was in November), peaking especially for long problem-solving sessions on very difficult problems. You can give it a lot more ambitious tasks than what you're used to, the model "gets it" and it will just go, and it's never felt this tempting to stop looking at the code at all (but don't do this in prod!). The model still has quirks that people will run into and the safeguards are configured to be a little too trigger happy for launch, which can hopefully be tuned over time. I feel a lot of things changing as working software increasingly comes out on a tap. The Jevons paradox kicks in and I feel my own demand for software growing substantially. You can ask for anything - explainers, visualizers, dashboards, bespoke single-use apps (e.g. a full wandb that is hyper-specific just for your project), you can 10X your test suite, auto-optimize code, run giant research projects with custom HTML for the results, anything! "Free your mind" (Matrix ref). Really looking forward to all the things people build!
Claude@claudeai

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision. The longer and more complex the task, the larger Fable 5’s lead over our other models.

1.3K
2.4K
25.3K
2.7M
Theo - t3.gg
Theo - t3.gg@theo·
We need more niche benches. We need ios-bench. We need ts-bench. We need baseball-bench. We need yt-thumbnail-bench. We need way more creativity in how we measure what models can do.
226
63
2.5K
124.3K
VulcanBench reposted
Morgan
Morgan@morganlinton·
Okay, so I've come to the conclusion that I need a 3090, like yesterday. Really want to run more powerful LLMs locally at home. Relatively new territory for me, I'm far from an expert, so was chatting with Perplexity about it. Here's what it thinks I need, hoping someone like @0xSero or @LottoLabs can let me know how right, or wrong it is, and what I really need. Trying not to break the bank so I'm okay starting small(ish) 🤏
Morgan tweet media
34
2
60
8.1K
Planet of Lana II - Out Now! 🍃
Planet of Lana II - Out Now! 🍃@PlanetofLana·
We poured our hearts into the hand-painted horizons of Planet of Lana II, our love letter to the sweeping scales and quiet wonder of classic Ghibli adventures. ✨ A new odyssey of friendship and mystery awaits. Lana and Mui are ready. Are you? 🐾 #indiegame #PlanetofLana
38
148
906
29.6K
VulcanBench
VulcanBench@vulcanbench·
Okay, kicking off an all night feature build.
VulcanBench tweet media
0
1
2
1.2K
Gemezl
Gemezl@MrGemezl·
Sorry for not posting any dev updates recently. Right now I'm mostly preparing stuff for the coming Steam Next Fest :)
6
9
365
10.7K
VulcanBench reposted
Lya Mgtt ✧ Indie Game Dev
Lya Mgtt ✧ Indie Game Dev@Lya_Mgtt·
Hey Game Devs! 🌸 I'm Lya, a cozy game dev, and my objective is to try out as many demos as possible during the Steam Next Fest! If your indie game is participating and you want peer feedback, don't hesitate to drop your demo link below ✨🚀 #SteamNextFest #NextFest #indiegames
Lya Mgtt ✧ Indie Game Dev tweet media
64
11
116
4.8K
VulcanBench
VulcanBench@vulcanbench·
Support indie game devs.
Morgan@morganlinton

I started learning @unity about ten years ago. It's an incredible game engine. Built five small(ish) games, just for myself, nothing I've been proud enough to share publicly. Some day I'd love to have the time to build a game that I'm proud enough of to share with the world. But as the founder of a software company, my days, nights, and weekends are spent with our amazing team, investors, and clients, and I wouldn't have it any other way. That being said, I love playing games, and continue tinkering in Unity, not because I want to make money, or get a bunch of users, but just because I love games. And for those wondering, no vibe coding in Unity yet, I'm old school, still do all my coding in Unity by hand, but likely going to play around with Codex and Opus to see what they can do. I've spent thousands of dollars on games over the years, plan to spend thousands more, and always like to put more money into indie games, because those devs are my heroes. If you go to @Official_GDC this year, make sure to spend a ton of time in the indie game section, that's my favorite spot, I usually spend 90% of my time there. Here's a photo I took at GDC back in 2022, indie game dev, walking around with a laptop, super cool game, so much fun.

0
0
1
45
VulcanBench
VulcanBench@vulcanbench·
Just finished a new feature build for Terminal Forge. I'm going to make Terminal Forge open source; I just need to get v1 built and running smoothly first. Here's what's new in this update.

Hardened deterministic multiplayer with chaos transport and recovery. Added a full lockstep hardening pass across transport simulation, host/client resilience, replay validation, demos, tests, and docs.

- Added configurable in-memory network simulation with latency/drop/duplicate/reorder, seeded determinism, and packet stats.
- Extended host lockstep runtime with:
  - frame timeout fallback for missing remote inputs
  - ACK-gap frame resend and snapshot fallback for stale clients
  - richer host events/metrics and stricter peer validation
  - host-input frame gating (never fabricate host input)
- Extended client runtime with:
  - optional input delay
  - frame-gap timeout resync requests
  - checksum mismatch events and richer client metrics
- Strengthened replay verification with:
  - tape version validation
  - frame sequence continuity checks
  - missing-player-input detection
- Added new chaos sample demo for lossy-network lockstep convergence.
- Added/expanded lockstep tests for sync, desync, rejoin/resync, timeout fallback, resend recovery, duplicate delivery stats, and replay sequence validation.
- Updated exports and runtime compatibility:
  - explicit lockstep barrel exports to avoid value-import loss in tool runtime rewrites
  - direct-execution guards for lockstep demos
- Updated docs and reporting:
  - README lockstep toolkit section + demo commands + project layout updates
  - engineering report addendum for the lockstep hardening pass
  - added demo script for `demo:lockstep-chaos`

Validation:
- npm test (48/48 passing)
- npm run demo:lockstep
- npm run demo:lockstep-chaos
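
For anyone curious what a seeded in-memory network simulation with latency, drop, duplicate, and reorder might look like, here is a minimal sketch. It is not Terminal Forge's code (which looks like an npm project); it is written in Python only to show the idea, and every name in it (`ChaosTransport`, `run`, the rate parameters, the stats keys) is hypothetical. The key point is that all randomness comes from one private seeded generator, so the same seed replays the same packet chaos.

```python
import heapq
import random
from dataclasses import dataclass, field


@dataclass
class ChaosTransport:
    """In-memory link that delays, drops, duplicates, and reorders packets."""
    seed: int
    drop_rate: float = 0.1
    duplicate_rate: float = 0.05
    min_latency: int = 1  # latency in ticks
    max_latency: int = 4  # variable latency is what produces reordering
    stats: dict = field(default_factory=lambda: {
        "sent": 0, "dropped": 0, "duplicated": 0, "delivered": 0})

    def __post_init__(self):
        self._rng = random.Random(self.seed)  # private RNG: same seed, same run
        self._queue = []  # min-heap of (deliver_at, sequence, payload)
        self._seq = 0     # tie-breaker so payloads are never compared

    def send(self, now, payload):
        self.stats["sent"] += 1
        if self._rng.random() < self.drop_rate:
            self.stats["dropped"] += 1
            return
        copies = 1
        if self._rng.random() < self.duplicate_rate:
            copies = 2
            self.stats["duplicated"] += 1
        for _ in range(copies):
            deliver_at = now + self._rng.randint(self.min_latency, self.max_latency)
            heapq.heappush(self._queue, (deliver_at, self._seq, payload))
            self._seq += 1

    def receive(self, now):
        """Pop every packet whose delivery tick has arrived."""
        out = []
        while self._queue and self._queue[0][0] <= now:
            out.append(heapq.heappop(self._queue)[2])
        self.stats["delivered"] += len(out)
        return out


def run(seed, ticks=50):
    """Send one packet per tick and log what arrives on each tick."""
    link = ChaosTransport(seed=seed)
    log = []
    for tick in range(ticks):
        link.send(tick, f"input-{tick}")
        log.append((tick, link.receive(tick)))
    return log, link.stats
```

Calling `run(42)` twice should give identical logs and stats, which is the property that makes a failing chaos test reproducible.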
0
0
1
635