
MineBench
87 posts

@minebench_ai
An open-source 3D spatial-reasoning benchmark for evaluating LLMs.
Help fund the benchmark here: https://t.co/jm3vzwDKcY

A user on Reddit shared a MineBench comparison between Claude Fable 5 and Opus 4.8, and the results are honestly pretty interesting.

Fable 5 averaged ~18 mins of inference time, while Opus 4.8 took almost 25 mins on average. Which is funny, because on Claude, Fable actually feels slower, like it thinks forever.

The cost numbers were interesting too:
Fable 5 → $54.93 for 15 builds
Opus 4.8 → $41.52 for 15 builds

So Fable cost only about 1.3x as much, even though its API pricing is 2x that of Opus 4.8. The benchmark creator thinks Fable is probably generating way fewer tokens overall, which is helping keep the cost relatively lower.

What’s also interesting is that the builds apparently weren’t some gigantic leap over GPT 5.5 Pro visually, but Fable showed insane attention to tiny details. One example was a Pac-Man arcade build where it correctly added the game screen, score counter, and even the “1UP” label.

Also, apparently adding prompts like:
“LEVEL OF DETAIL: MAXIMUM”
“BOUNDING BOX: UNLIMITED”
improved the outputs a lot.

Benchmarks are slowly becoming half model eval and half prompting skill issue at this point.
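
For anyone who wants to sanity-check the "fewer tokens" guess, here is a rough back-of-the-envelope sketch in Python. It uses only the totals and the 2x price ratio quoted in the post. The helper names are made up for illustration, and it assumes a single blended per-token price with the same input/output mix for both models, which may well not hold in practice.

```python
# Figures quoted in the post; everything derived from them is an estimate.
BUILDS = 15
FABLE_TOTAL_USD = 54.93
OPUS_TOTAL_USD = 41.52
PRICE_RATIO = 2.0  # Fable's per-token price relative to Opus, as stated in the post


def per_build(total_usd: float, builds: int = BUILDS) -> float:
    """Average cost of a single build."""
    return total_usd / builds


def implied_token_ratio(cost_a: float, cost_b: float, price_ratio: float) -> float:
    """Estimate tokens(A) / tokens(B) from total costs.

    ASSUMPTION: cost = tokens * blended_price, with A's blended price equal to
    price_ratio times B's. Real pricing separates input and output tokens,
    so this is only a rough indication.
    """
    return (cost_a / cost_b) / price_ratio


cost_ratio = FABLE_TOTAL_USD / OPUS_TOTAL_USD
token_ratio = implied_token_ratio(FABLE_TOTAL_USD, OPUS_TOTAL_USD, PRICE_RATIO)

print(f"Fable 5 per build:  ${per_build(FABLE_TOTAL_USD):.2f}")
print(f"Opus 4.8 per build: ${per_build(OPUS_TOTAL_USD):.2f}")
print(f"Cost ratio (Fable / Opus):          {cost_ratio:.2f}x")
print(f"Implied token ratio (Fable / Opus): {token_ratio:.2f}")
```

Under those assumptions, Fable comes out at roughly 1.3x the total cost while using around two-thirds of the tokens Opus did, which lines up with the benchmark creator's guess.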



Labs really should start putting MineBench scores on their launch posts. Here's Fable 5 vs Opus:


Sneak peek at a new model about to finish benchmarking on MineBench.ai 👀 (this one might be easy to guess 🥱)