VulcanBench

381 posts

VulcanBench

@vulcanbench

Benchmarking LLMs @, focused on real world tests, large codebases, open source, full transparency.

Joined March 2020

73 Following · 216 Followers

VulcanBench@vulcanbench·2d

@prz_chojecki For Opus 4.8, what effort level was used for this benchmark?

Przemek Chojecki | PC@prz_chojecki·3d

Kimi 2.7 ranked 2nd, after Fable 5 and before GPT-5 xhigh. We have re-run our ErdosBench smoke test on 14 problems with Kimi 2.7, Qwen 3.7 Max, and Grok 4.3, and compared the results with the top performers from previous runs. Kimi 2.7 is amazingly good. More below.

163

538

1.8M

VulcanBench@vulcanbench·2d

Need to benchmark Kimi.

Przemek Chojecki | PC@prz_chojecki

VulcanBench@vulcanbench·3d

Interesting evening.

ClaudeDevs@ClaudeDevs

As a result of a US government directive, we are suspending access to Claude Fable 5 for all users. You can continue to use all other Claude models. Here’s what this means for you: Across Claude products, new sessions will run on your selected default model or Opus 4.8, and existing Fable 5 sessions will end with an error. On the Claude Platform, requests to Fable 5 will also return an error. Please update your integrations to other Claude models. We know this is a disruption to your workflows; we appreciate your patience and support.

VulcanBench@vulcanbench·3d

@theo This is a good one Theo, thanks for putting it together.

Theo - t3.gg@theo·3d

We have Mythos in our Claude Code subs for 10 more days. I made a video about tokenmaxxing so we can get the most out of it.

474

164.2K

VulcanBench@vulcanbench·3d

@rauchg Yes it is.

Guillermo Rauch@rauchg·4d

HTML is so back. Drag and vercel.com/drop

Vercel Developers@vercel_dev

Drop It. It's Live. Drag a file or folder into your browser and Vercel Drop gives you a production URL in seconds. vercel.com/changelog/verc…

1.8K

306.1K

VulcanBench@vulcanbench·3d

A bit more of the thought process behind building Vulcan Bench. I'm currently testing workflows like the one below, which shows how you could compare Opus 4.8 with GPT 5.5 at different reasoning levels:

Step 1: vulcanbench run --suite v1 --model anthropic:claude-opus-4-8 --effort low --repeat 5
Step 2: vulcanbench run --suite v1 --model openai:gpt-5.5 --effort medium --repeat 5
Step 3: vulcanbench leaderboard

Morgan@morganlinton

I am starting to realize more and more that we can’t just look at and benchmark models without comparing different effort levels. Fable 5 is what pushed me to think about this more.

I’m finding that Fable 5 at low and medium effort produces the same or better output than a lot of other models at high and xhigh.

At the same time, I’m experimenting with normal routine tasks, and finding even Fable low is overkill. There are soooo many tasks that Grok Build, Composer 2.5, SWE-1.6, GLM 5.1, and other models can do at the exact same accuracy level as Fable. And that’s comparing to Fable low, on tasks where Fable Max produces the exact same output.

Yes, increasing thinking depth doesn’t mean it gets it more right; sometimes small and medium problems don’t need the most bleeding-edge frontier model in the world to reach the optimal solution.

We keep benchmarking models all at the same effort levels, and I think that could be a mistake. We need to look at effort as another key variable, and optimize for a combination of model and effort, coupled with task complexity and codebase size.

This is one of the things I’m thinking through more deeply with @vulcanbench, which I’m going to release, open source, this weekend.
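The idea of treating effort as a variable alongside the model can be sketched roughly as follows. This is not how Vulcan Bench works internally; it is a minimal Python illustration under stated assumptions. The record shape `(model, effort, task_size, passed)`, the placeholder model names, the sample records, and the threshold are all hypothetical and are not real benchmark results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (model, effort, task_size, passed).
# Placeholder data for illustration only, not real benchmark output.
RUNS = [
    ("model-a", "low", "small", True),
    ("model-a", "low", "small", True),
    ("model-a", "low", "large", True),
    ("model-a", "low", "large", False),
    ("model-a", "high", "large", True),
    ("model-a", "high", "large", True),
    ("model-b", "low", "small", True),
    ("model-b", "low", "small", True),
]

# Effort levels ordered from cheapest to most expensive (assumed ordering).
EFFORT_ORDER = ["low", "medium", "high", "xhigh"]


def pass_rates(runs):
    """Average pass rate per (model, effort, task_size) combination."""
    buckets = defaultdict(list)
    for model, effort, size, passed in runs:
        buckets[(model, effort, size)].append(1.0 if passed else 0.0)
    return {key: mean(values) for key, values in buckets.items()}


def lowest_sufficient_effort(rates, model, size, threshold):
    """Return the cheapest effort level whose pass rate meets the threshold.

    Returns None when no measured effort level is good enough.
    """
    for effort in EFFORT_ORDER:
        rate = rates.get((model, effort, size))
        if rate is not None and rate >= threshold:
            return effort
    return None


rates = pass_rates(RUNS)
for size in ("small", "large"):
    print(size, lowest_sufficient_effort(rates, "model-a", size, 0.9))
```

The point of the sketch is that the unit being compared is the (model, effort) pair per task size, so a model at low effort can be recorded as sufficient for small tasks while the same model needs a higher setting on large ones.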

1.2K

VulcanBench@vulcanbench·5d

@kieranklaassen Super interesting, thanks for sharing Kieran.

Kieran Klaassen@kieranklaassen·Jun 9

fable-5 broke my benchmarks. here is some cool stuff I found. Ran LFGBench across 12 prompts, here are a few:

Claude@claudeai

Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.

10.8K

VulcanBench@vulcanbench·5d

@karpathy Incredible benchmarks.

Andrej Karpathy@karpathy·Jun 9

This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The benchmarks are great and it's SOTA on everything by a margin but I'll add that *qualitatively* also, this is a major-version-bump-deserving step change forward (imo of the same order as Claude 4.5 was in November), peaking especially for long problem-solving sessions on very difficult problems. You can give it a lot more ambitious tasks than what you're used to, the model "gets it" and it will just go, and it's never felt this tempting to stop looking at the code at all (but don't do this in prod!). The model still has quirks that people will run into and the safeguards are configured to be a little too trigger happy for launch, which can hopefully be tuned over time. I feel a lot of things changing as working software increasingly comes out on a tap. Jevons paradox kicks in and I feel my own demand for software growing substantially. You can ask for anything - explainers, visualizers, dashboards, bespoke single-use apps (e.g. a full wandb that is hyper-specific just for your project), you can 10X your test suite, auto-optimize code, run giant research projects with custom HTML for the results, anything! "Free your mind" (Matrix ref). Really looking forward to all the things people build!

Claude@claudeai

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision. The longer and more complex the task, the larger Fable 5’s lead over our other models.

1.3K

2.4K

25.3K

2.7M

VulcanBench@vulcanbench·5d

@theo Yes!

Theo - t3.gg@theo·5d

We need more niche benches. We need ios-bench. We need ts-bench. We need baseball-bench. We need yt-thumbnail-bench. We need way more creativity in how we measure what models can do.

226

2.5K

124.3K

VulcanBench retweeted

Morgan@morganlinton·Apr 19

Okay, so I've come to the conclusion that I need a 3090, like yesterday. Really want to run more powerful LLMs locally at home. Relatively new territory for me, I'm far from an expert, so I was chatting with Perplexity about it. Here's what it thinks I need. Hoping someone like @0xSero or @LottoLabs can let me know how right or wrong it is, and what I really need. Trying not to break the bank, so I'm okay starting small(ish) 🤏

8.1K

VulcanBench@vulcanbench·Feb 18

This game looks absolutely beautiful.

Planet of Lana II - Out Now! 🍃@PlanetofLana

We poured our hearts into the hand-painted horizons of Planet of Lana II, our love letter to the sweeping scales and quiet wonder of classic Ghibli adventures. ✨ A new odyssey of friendship and mystery awaits. Lana and Mui are ready. Are you? 🐾 #indiegame #PlanetofLana

VulcanBench@vulcanbench·Feb 18

@PlanetofLana Phew, absolutely love the visual style here.

Planet of Lana II - Out Now! 🍃@PlanetofLana·Feb 17

148

906

29.6K

VulcanBench@vulcanbench·Feb 18

@gelius__ I'll be here lurking, taking notes.

GELIUS 🎮@gelius__·Feb 18

It's #WishlistWednesday 🎮 Show me what you got #gamedev 👀 Let's see what #IndieGame you're preparing! 🔜

106

3.7K

VulcanBench@vulcanbench·Feb 18

@protzz_ Ohhh, love this idea.

protzz👾@protzz_·Feb 17

Balatro proved poker could be a roguelike. I’m proving chess can too. #indiedev #indiegame #indie #gamedev #game #indiegamedev

1.2K

85.1K

VulcanBench@vulcanbench·Feb 17

Okay, kicking off an all night feature build.

1.2K

VulcanBench@vulcanbench·Feb 17

@clemmygames Yesssss

Best Indie Games@clemmygames·Feb 15

4 more indie games you should experience at least once.

107

1.6K

96.4K

VulcanBench@vulcanbench·Feb 17

@MrGemezl Looking good!

Gemezl@MrGemezl·Feb 17

Sorry for not posting any dev updates recently. Right now I'm mostly preparing stuff for the coming Steam Next Fest :)

365

10.7K

VulcanBench retweeted

Lya Mgtt ✧ Indie Game Dev@Lya_Mgtt·Feb 16

Hey Game Devs! 🌸 I'm Lya, a cozy game dev, and my objective is to try out as many demos as possible during the Steam Next Fest! If your indie game is participating and you want peer feedback, don't hesitate to drop your demo link below ✨🚀 #SteamNextFest #NextFest #indiegames

116

4.8K

VulcanBench@vulcanbench·Feb 16

Support indie game devs.

Morgan@morganlinton

I started learning @unity about ten years ago. It's an incredible game engine. Built five small(ish) games, just for myself, nothing I've been proud enough to share publicly. Some day I'd love to have the time to build a game that I'm proud of enough to share with the world. But as the founder of a software company, my days, nights, and weekends are spent with our amazing team, investors, and clients, and I wouldn't have it any other way. That being said, I love playing games, and continue tinkering in Unity, not because I want to make money, or get a bunch of users, but just because I love games. And for those wondering, no vibe coding in Unity yet, I'm old school, still do all my coding in Unity by hand, but likely going to play around with Codex and Opus to see what they can do. I've spent thousands of dollars on games over the years, plan to spend thousands more, and always like to put more money into indie games, because those devs are my heroes. If you go to @Official_GDC this year, make sure to spend a ton of time in the indie game section, that's my favorite spot, I usually spend 90% of my time there. Here's a photo I took at GDC back in 2022, indie game dev, walking around with a laptop, super cool game, so much fun.

VulcanBench@vulcanbench·Feb 16

Just finished a new feature build for Terminal Forge. And I'm going to make Terminal Forge Open Source, just need to get v1 built and running smoothly first. Here's what's new in this update.

Hardened deterministic multiplayer with chaos transport and recovery.

Added a full lockstep hardening pass across transport simulation, host/client resilience, replay validation, demos, tests, and docs.

- Add configurable in-memory network simulation with latency/drop/duplicate/reorder, seeded determinism, and packet stats.
- Extended host lockstep runtime with:
  - frame timeout fallback for missing remote inputs
  - ACK-gap frame resend and snapshot fallback for stale clients
  - richer host events/metrics and stricter peer validation
  - host-input frame gating (never fabricate host input)
- Extended client runtime with:
  - optional input delay
  - frame-gap timeout resync requests
  - checksum mismatch events and richer client metrics
- Strengthen replay verification with:
  - tape version validation
  - frame sequence continuity checks
  - missing-player-input detection
- Added new chaos sample demo for lossy-network lockstep convergence.
- Added/expanded lockstep tests for sync, desync, rejoin/resync, timeout fallback, resend recovery, duplicate delivery stats, and replay sequence validation.
- Updated exports and runtime compatibility:
  - explicit lockstep barrel exports to avoid value-import loss in tool runtime rewrites
  - direct-execution guards for lockstep demos
- Update docs and reporting:
  - README lockstep toolkit section + demo commands + project layout updates
  - engineering report addendum for the lockstep hardening pass
  - add demo script for `demo:lockstep-chaos`

Validation:
- npm test (48/48 passing)
- npm run demo:lockstep
- npm run demo:lockstep-chaos
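The three replay verification checks named in the update (tape version validation, frame sequence continuity, missing-player-input detection) could look something like the sketch below. This is not Terminal Forge's code, which appears to be an npm project. It is a Python illustration in which the `Tape` and `Frame` shapes, the `SUPPORTED_TAPE_VERSION` constant, and the error messages are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical: the single tape format version this verifier accepts.
SUPPORTED_TAPE_VERSION = 1


@dataclass
class Frame:
    index: int
    inputs: dict  # player id -> input payload for this frame


@dataclass
class Tape:
    version: int
    players: list  # player ids expected to supply input every frame
    frames: list   # Frame objects in recorded order


def verify_tape(tape):
    """Return a list of human-readable problems; an empty list means the tape passed."""
    errors = []

    # 1. Tape version validation.
    if tape.version != SUPPORTED_TAPE_VERSION:
        errors.append(f"unsupported tape version {tape.version}")

    # 2 and 3. Walk the frames once, checking continuity and per-player input.
    expected = tape.frames[0].index if tape.frames else 0
    for frame in tape.frames:
        if frame.index != expected:
            errors.append(f"frame gap: expected {expected}, got {frame.index}")
        missing = [p for p in tape.players if p not in frame.inputs]
        if missing:
            errors.append(f"frame {frame.index}: missing input for {missing}")
        expected = frame.index + 1

    return errors


good = Tape(1, ["host", "client"], [
    Frame(0, {"host": "L", "client": "R"}),
    Frame(1, {"host": "L", "client": "R"}),
])
bad = Tape(2, ["host", "client"], [
    Frame(0, {"host": "L", "client": "R"}),
    Frame(2, {"host": "L"}),  # skips frame 1 and lacks client input
])

print(verify_tape(good))
print(verify_tape(bad))
```

Collecting every problem in one pass, instead of stopping at the first, makes it easier to tell a single dropped frame from a tape that is broken throughout.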

635
