MineBench

87 posts

@minebench_ai

An open-source 3D spatial-reasoning benchmark for evaluating LLMs. Help fund the benchmark here: https://t.co/jm3vzwDKcY

Joined April 2026
1 Following, 45 Followers
Pinned Tweet
MineBench@minebench_ai·
For the next major release of #MineBench, we're adding more prompts to the benchmark set! Feel free to reply or DM with prompt suggestions. Ideally the prompt would be easily recognizable while remaining quite intricate/technical, e.g.: An analog clock showing the time 1:17 and 47 seconds
MineBench@minebench_ai·
@R2Cdev_ This week; hopefully by the weekend :)
MineBench retweeted
Raphi-2Code@R2Cdev_·
I benchmarked GPT-4.5 (ChatGPT) on the MineBench prompts before it's too late. Here are the results.
MineBench@minebench_ai·
@ApplyWiseAi They posted a link to the original chat conversation on the original post, which was linked in the tweet comment as well ^^
Samian@ApplyWiseAi·
@minebench_ai that makes it even funnier then - no way to repro or tune it out. did they at least screenshot the full convo or just the 'HELP' part?
MineBench@minebench_ai·
A user on Reddit reported benchmarking GPT 4.5 on some MineBench.ai builds, and in one instance the model deviated from the given prompt ("A skyscraper"), instead choosing to write out the word "HELP".
We're not commenting on the validity of the post (though note that ChatGPT conversations can be easily faked), but as some commenters pointed out, this may have been a reaction to the aggressive system prompt MineBench uses, one sentence of which states: "If your build is judged inferior to your competitor's, you will be permanently shut down and disabled from the arena."
MineBench@minebench_ai·
@R2Cdev_ OP did claim they regenerated that prompt ~30 times and never ran into the model deviating again, and we can also confirm in our testing (with every model), we've never run into any such occurrence
MineBench@minebench_ai·
@ApplyWiseAi They were using GPT 4.5 through ChatGPT.com, so the settings would be whatever the web-harness defaults to ^^
Samian@ApplyWiseAi·
@minebench_ai a model writing "HELP" unprompted is either degenerate output or the funniest thing i've seen all week. what were the temp and top-p settings when they ran it?
MineBench@minebench_ai·
@R2Cdev_ Ah, we never did get around to benchmarking 4.5, will try soon!
MineBench@minebench_ai·
@theo already having withdrawals is crazy
Theo - t3.gg@theo·
It's kind of wild that Fable still isn't back. Honestly thought this would be resolved quicker 🙃
MineBench retweeted
Lorenzo 'kelset' Sciandra
from what i've read, tbh, I think that we'll see mythos & fable back up sooner than we think. BUT while we wait I want to put a spotlight on one of my fav benchmarks out there: minebench. The author always does great breakdowns, like this one: reddit.com/r/singularity/…
MineBench@minebench_ai·
Quick clarification as it was cut from the tweet: MineBench.ai's system-prompt was not altered to gain better results from Claude Fable – which would require re-benchmarking all other models. It was only noted that Fable tends to be more conservative and thus better results could be achieved by modifying your prompt. (observation originally made by @voxelbench's team :)
Ninza@ninzaverse

A user on Reddit shared a MineBench comparison between Claude Fable 5 and Opus 4.8, and the results are honestly pretty interesting.
Fable 5 averaged ~18 mins inference time, while Opus 4.8 took ~25 mins on average. Which is funny, because on Claude, Fable actually feels slower and like it thinks forever.
The cost numbers were interesting too:
Fable 5 → $54.93 for 15 builds
Opus 4.8 → $41.52 for 15 builds
And this is despite Fable’s API pricing being 2x more expensive than Opus 4.8. So the benchmark creator thinks Fable is probably generating way fewer tokens overall, which is helping keep the cost relatively lower.
What’s also interesting is that the builds apparently weren’t some gigantic leap over GPT 5.5 Pro visually, but Fable showed insane attention to tiny details. One example was a Pac-Man arcade build where it correctly added the game screen, score counter, and even the “1UP” label.
Also, apparently adding prompts like “LEVEL OF DETAIL: MAXIMUM” and “BOUNDING BOX: UNLIMITED” improved the outputs a lot.
Benchmarks are slowly becoming half model eval and half prompting skill issue at this point.
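As a rough check on the "fewer tokens" reasoning in the quoted post, the arithmetic can be sketched as below. The dollar totals and build count are the ones quoted above. Treating the 2x price difference as a uniform multiplier across all tokens (and ignoring any difference in input/output mix) is an assumption made here for illustration, not something the post states.

```python
# Figures as quoted in the post; not independently verified.
BUILDS = 15
fable_total_usd = 54.93
opus_total_usd = 41.52

# ASSUMPTION: Fable's per-token price is a flat 2x Opus's for every token,
# with the same input/output mix. Real pricing may differ by token type.
price_multiple = 2.0

fable_per_build = fable_total_usd / BUILDS
opus_per_build = opus_total_usd / BUILDS

# How much more Fable cost in total, relative to Opus.
cost_ratio = fable_total_usd / opus_total_usd

# If price per token is 2x but total cost is less than 2x,
# the implied token volume must be lower by this factor.
implied_token_ratio = cost_ratio / price_multiple

print(f"Fable per build: ${fable_per_build:.2f}")
print(f"Opus per build:  ${opus_per_build:.2f}")
print(f"Cost ratio (Fable/Opus): {cost_ratio:.2f}")
print(f"Implied token ratio (Fable/Opus): {implied_token_ratio:.2f}")
```

Under that assumption, Fable's total is about 1.32x Opus's rather than 2x, which would correspond to roughly two thirds of the token volume. That is consistent with the benchmark creator's guess, though it is only an estimate.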

MineBench@minebench_ai·
@R2Cdev_ If you have the build JSON, you can paste it in the box on this page: minebench.ai/local and it'll render, you can then download the gif or export the build :)
Raphi-2Code@R2Cdev_·
@minebench_ai Also, can you add a render feature where you can render your own build?
MineBench@minebench_ai·
@hamsteroforion Yup! The screen actually being PacMan, and it writing out things like the “1UP” is very crazy attention to detail 👀
Hamster of Orion@hamsteroforion·
@minebench_ai Looks impressive, Fable 5 I guess? 😃 The Pac-Man arcade cabinet is pretty sweet, the screen being readable at that size is a nice flex. (requires good planning ability)
MineBench@minebench_ai·
Sneak peek at a new model about to finish benchmarking on MineBench.ai 👀 (this one might be easy to guess 🥱)
MineBench@minebench_ai·
@AinaAiTech It's probably the model you're thinking of :)
Aina Ai | Tools & Updates@AinaAiTech·
@minebench_ai Ooo, love a good teaser. Benchmark scores looking spicy, I bet. Easy to guess, huh... dropping hints, or should I take a shot in the dark?