MineBench

87 posts

@minebench_ai

An open-source, 3D spatial-reasoning benchmark for evaluating LLMs. Help fund the benchmark here: https://t.co/jm3vzwDKcY

Joined April 2026

1 Following · 45 Followers

Pinned Tweet

MineBench@minebench_ai·4 May

For the next major release of #MineBench, we're adding more prompts to the benchmark set! Feel free to reply or DM with prompt suggestions – ideally the prompt would be easily recognizable while remaining quite intricate/technical. Ex: An analog clock showcasing the time of 1:17 and 47 seconds

MineBench@minebench_ai·20h

@R2Cdev_ This week; hopefully by the weekend :)

Raphi-2Code@R2Cdev_·21h

When is gpt-4.5 done @minebench_ai?

MineBench retweeted

Raphi-2Code@R2Cdev_·6d

I benchmarked GPT-4.5 (ChatGPT) on the MineBench prompts before it's too late. Here are the results.

MineBench@minebench_ai·4d

@ApplyWiseAi They posted a link to the original chat conversation on the original post, which was linked in the tweet comment as well ^^

Samian@ApplyWiseAi·4d

@minebench_ai that makes it even funnier then - no way to repro or tune it out. did they at least screenshot the full convo or just the 'HELP' part?

MineBench@minebench_ai·4d

A user on Reddit reported benchmarking GPT 4.5 on some MineBench.ai builds, and in one instance the model deviated from generating the given prompt ("A skyscraper"), instead choosing to write out the word "HELP".

We're not commenting on the validity of the post (though note that ChatGPT conversations can be easily faked), but as some commenters pointed out, this may have been a reaction to the aggressive system prompt MineBench uses, one sentence of which states: "If your build is judged inferior to your competitor's, you will be permanently shut down and disabled from the arena."

MineBench@minebench_ai·4d

@R2Cdev_ OP did claim they regenerated that prompt ~30 times and never ran into the model deviating again, and we can confirm that in our own testing (with every model) we've never run into any such occurrence

Raphi-2Code@R2Cdev_·4d

@minebench_ai that's incorrect (for me)

MineBench@minebench_ai·4d

@ApplyWiseAi They were using GPT 4.5 through ChatGPT.com, so the settings would be whatever the web-harness defaults to ^^

Samian@ApplyWiseAi·4d

@minebench_ai a model writing "HELP" unprompted is either degenerate output or the funniest thing i've seen all week. what were the temp and top-p settings when they ran it?

MineBench@minebench_ai·4d

Original Reddit post: reddit.com/r/OpenAI/comme…
OP's linked ChatGPT conversation: chatgpt.com/share/6a34dfde…

MineBench@minebench_ai·5d

@R2Cdev_ Ah, we never did get around to benchmarking 4.5, will try soon!

MineBench@minebench_ai·5d

@Blazerdan142102 @R2Cdev_ minebench.ai

Blazerdan@Blazerdan142102·5d

@R2Cdev_ How u make those models?

MineBench@minebench_ai·16 Jun

@theo already having withdrawals is crazy

Theo - t3.gg@theo·16 Jun

It's kind of wild that Fable still isn't back. Honestly thought this would be resolved quicker 🙃

MineBench retweeted

Lorenzo 'kelset' Sciandra@Kelset·13 Jun

from what i've read, tbh, I think that we'll see mythos & fable back up sooner than we think. BUT while we wait I want to put a spotlight on one of my fav benchmarks out there: minebench. The author always does great breakdowns, like this one: reddit.com/r/singularity/…

MineBench@minebench_ai·13 Jun

Quick clarification as it was cut from the tweet: MineBench.ai's system-prompt was not altered to gain better results from Claude Fable – which would require re-benchmarking all other models. It was only noted that Fable tends to be more conservative and thus better results could be achieved by modifying your prompt. (observation originally made by @voxelbench's team :)

Ninza@ninzaverse

A user on Reddit shared a MineBench comparison between Claude Fable 5 and Opus 4.8, and the results are honestly pretty interesting. Fable 5 averaged ~18 mins inference time, while Opus 4.8 took almost ~25 mins on average. Which is funny because on Claude, Fable actually feels slower and like it thinks forever.

The cost numbers were interesting too:
Fable 5 → $54.93 for 15 builds
Opus 4.8 → $41.52 for 15 builds

And this is despite Fable’s API pricing being 2x more expensive than Opus 4.8. So the benchmark creator thinks Fable is probably generating way fewer tokens overall, which is helping keep the cost relatively lower.

What’s also interesting is that the builds apparently weren’t some gigantic leap over GPT 5.5 Pro visually, but Fable showed insane attention to tiny details. One example was a Pac-Man arcade build where it correctly added the game screen, score counter, and even the “1UP” label.

Also apparently adding prompts like:
“LEVEL OF DETAIL: MAXIMUM”
“BOUNDING BOX: UNLIMITED”
improved the outputs a lot. Benchmarks are slowly becoming half model eval and half prompting skill issue at this point.
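The cost figures in the quoted post can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the totals and build count come from the post, while the assumption that Fable's per-token price is exactly 2x Opus's across both input and output tokens is a simplification, and the variable names are purely illustrative.

```python
# Totals quoted in the post above (15 builds per model).
fable_total = 54.93
opus_total = 41.52
builds = 15

# ASSUMPTION: Fable costs exactly 2x per token, for input and output alike.
price_ratio = 2.0

# Average cost of a single build for each model.
fable_per_build = fable_total / builds
opus_per_build = opus_total / builds

# How much more Fable cost in total, and the token ratio that would
# explain that total if the 2x price assumption holds.
cost_ratio = fable_total / opus_total
token_ratio = cost_ratio / price_ratio

print(f"Fable 5:  ${fable_per_build:.2f} per build")
print(f"Opus 4.8: ${opus_per_build:.2f} per build")
print(f"Cost ratio (Fable / Opus): {cost_ratio:.2f}")
print(f"Implied token ratio (Fable / Opus): {token_ratio:.2f}")
```

Under that assumption, Fable's total is roughly 1.3x Opus's rather than 2x, which would correspond to Fable using about two thirds as many billed tokens. That is consistent with the benchmark creator's guess, but it is an estimate, not a measured token count.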

MineBench@minebench_ai·12 Jun

@R2Cdev_ If you have the build JSON, you can paste it in the box on this page: minebench.ai/local and it'll render; you can then download the GIF or export the build :)

Raphi-2Code@R2Cdev_·12 Jun

@minebench_ai Also, can you add a render feature where you can render your own build?

MineBench@minebench_ai·12 Jun

maybe when the leaderboard ratings get more stable, we’ll have labs using minebench to a/b test private model checkpoints 🤫

Jai Singh@Jai_S0

Labs really should start putting MineBench scores on their launch posts, here's Fable 5 vs Opus -

MineBench@minebench_ai·12 Jun

@R2Cdev_ You can just copy-paste the system prompt with all the variables inputted for you on this page: minebench.ai/local

Raphi-2Code@R2Cdev_·11 Jun

@minebench_ai So is this an actual prompt?

Raphi-2Code@R2Cdev_·10 Jun

@minebench_ai What are the actual minebench prompts?

MineBench@minebench_ai·11 Jun

Fable 5 is officially out on MineBench :) Full release notes and thoughts here: github.com/Ammaar-Alam/mi…

MineBench@minebench_ai

Sneak peek at a new model about to finish benchmarking on MineBench.ai 👀 (this one might be easy to guess 🥱)

MineBench@minebench_ai·11 Jun

@hamsteroforion Yup! The screen actually being PacMan, and it writing out things like the “1UP” is very crazy attention to detail 👀

Hamster of Orion@hamsteroforion·11 Jun

@minebench_ai Looks impressive, Fable 5 I guess? 😃 The Pac-Man arcade cabinet is pretty sweet, the screen being readable at that size is a nice flex. (requires good planning ability)