stevibe (@stevibe)
I built a macOS app for benchmarking local LLMs.
6 test suites. Multiple providers. One workspace. Open source.
There are hundreds of local models now. New ones every week. How do you actually pick one?
Leaderboards test for general ability. But if you're building an agent that chains tool calls, or a pipeline that extracts structured data, or a code assistant that needs to debug Rust, you need to know if the model handles that specific thing. Not in theory. On your hardware. With your prompts.
The benchmarks that exist are either locked behind papers, too abstract to map to real failures, or impossible to extend. You can't add your own test cases. You can't test what matters to your use case.
That's what BenchLocal is for.
It's a benchmark platform where every test is practical, deterministic, and built around real-world tasks.
And you can build your own tests.
It ships with 6 Bench Packs TODAY:
→ ToolCall-15 — tool-use accuracy
→ BugFind-15 — debugging capabilities
→ DataExtract-15 — structured data extraction
→ InstructFollow-15 — constraint-heavy instruction following
→ ReasonMath-15 — practical reasoning and math
→ StructOutput-15 — validator-backed structured output
Every pack has 15 fixed scenarios. Every score is deterministic and verifiable.
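To make "deterministic and verifiable" concrete, here is a minimal Python sketch of how a fixed scenario could be scored. This is an illustration under assumptions, not BenchLocal's actual code: `Scenario`, `is_json_with_keys`, `score_pack`, and the scenario IDs are all hypothetical names. The point is that each check is a pure function of the model's raw output, so the same output always produces the same score.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Scenario:
    # Hypothetical shape of one fixed scenario; field names are assumptions.
    scenario_id: str
    prompt: str
    check: Callable[[str], bool]  # pure function of the model's raw output


def is_json_with_keys(*keys: str) -> Callable[[str], bool]:
    """Build a checker that passes only if the output is a JSON object with all keys."""
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return check


def score_pack(scenarios: List[Scenario], outputs: Dict[str, str]) -> float:
    """Fraction of scenarios whose check passes; a missing output counts as a fail."""
    passed = sum(1 for s in scenarios if s.check(outputs.get(s.scenario_id, "")))
    return passed / len(scenarios)


# Two illustrative scenarios (a real pack would have 15).
scenarios = [
    Scenario("invoice-01", "Extract the invoice as JSON.", is_json_with_keys("total", "currency")),
    Scenario("invoice-02", "Extract the vendor as JSON.", is_json_with_keys("vendor")),
]
outputs = {
    "invoice-01": '{"total": 42.5, "currency": "EUR"}',
    "invoice-02": "The vendor is Acme.",  # not JSON, so this scenario fails
}
score = score_pack(scenarios, outputs)
```

No LLM judge and no randomness sits in the scoring path, so anyone can re-run the check against a saved output and get the same number.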
Some of you saw ToolCall-15 and BugFind-15 — the individual test packs I open-sourced over the past few weeks. People ran them, filed issues, sent PRs. But managing separate repos, separate scripts, separate results doesn't scale. BenchLocal puts everything in one place.
What the app does:
> Workspace with tabs — run BugFind-15 in one tab, ToolCall-15 in another.
> Any provider — Ollama, llama.cpp, OpenRouter, any OpenAI-compatible endpoint. Local and cloud, same interface.
> Run modes — serial, batch per model, batch per test case, or fully parallel.
> Test histories — every run saved. Compare against any previous session.
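One way to picture the four run modes is as different groupings of the same (model, test case) jobs. The sketch below is an interpretation, not the app's scheduler: it assumes "batch per model" means all test cases for one model run together, "batch per test case" means all models run one case together, and batches execute one after another. `plan_batches` and the mode strings are hypothetical names.

```python
from itertools import product
from typing import List, Tuple

Job = Tuple[str, str]  # (model, test_case)


def plan_batches(models: List[str], cases: List[str], mode: str) -> List[List[Job]]:
    """Group jobs into batches. Jobs inside a batch may run concurrently;
    batches are assumed to run sequentially."""
    if mode == "serial":
        # One job at a time.
        return [[job] for job in product(models, cases)]
    if mode == "batch_per_model":
        # All cases for one model together, then the next model.
        return [[(m, c) for c in cases] for m in models]
    if mode == "batch_per_test_case":
        # All models on one case together, then the next case.
        return [[(m, c) for m in models] for c in cases]
    if mode == "parallel":
        # Everything in a single batch.
        return [list(product(models, cases))]
    raise ValueError(f"unknown run mode: {mode}")


models = ["model-a", "model-b"]          # placeholder model names
cases = ["case-1", "case-2", "case-3"]   # placeholder test cases
plan = {mode: plan_batches(models, cases, mode)
        for mode in ("serial", "batch_per_model", "batch_per_test_case", "parallel")}
```

With 2 models and 3 cases, every mode covers the same 6 jobs; only the batch boundaries change. On a single local GPU, batching per model would avoid reloading weights between jobs, which is one plausible reason to offer it.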
But the part I'm most excited about isn't the app. It's the ecosystem.
BenchLocal is a platform. Each Bench Pack is a plugin. I'm shipping an SDK so anyone can build their own — test what matters to you, package it, share it. Install and uninstall packs right inside the app, same way you'd manage extensions in VS Code. The registry is GitHub-based, fully public.
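As a rough mental model of "each Bench Pack is a plugin", here is a small Python sketch of installing and uninstalling packs from an in-memory registry. It is purely illustrative: the real SDK and the GitHub-based registry may look nothing like this, and `BenchPack`, `PackRegistry`, the version strings, and the `SqlFix-15` pack are all made-up placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BenchPack:
    # Hypothetical pack metadata; the actual SDK format is not shown in this post.
    name: str
    version: str
    scenario_ids: Tuple[str, ...]


class PackRegistry:
    """Tracks which packs are installed, keyed by pack name."""

    def __init__(self) -> None:
        self._installed: Dict[str, BenchPack] = {}

    def install(self, pack: BenchPack) -> None:
        if pack.name in self._installed:
            raise ValueError(f"{pack.name} is already installed")
        self._installed[pack.name] = pack

    def uninstall(self, name: str) -> None:
        if name not in self._installed:
            raise KeyError(name)
        del self._installed[name]

    def installed(self) -> List[str]:
        return sorted(self._installed)


registry = PackRegistry()
# A bundled pack and a community pack, each with 15 placeholder scenario IDs.
registry.install(BenchPack("ToolCall-15", "0.0.0", tuple(f"tc-{i:02d}" for i in range(1, 16))))
registry.install(BenchPack("SqlFix-15", "0.0.0", tuple(f"sf-{i:02d}" for i in range(1, 16))))
registry.uninstall("SqlFix-15")
```

The design idea being illustrated is that the app only needs to know a pack's name, version, and scenarios; everything else lives inside the pack, which is what lets community packs sit next to the bundled ones.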
I built 6 packs. I want the community to build the next 60.
Theme system built in too — because if I'm staring at benchmark results for hours, it should at least look good.
v0.1.0 is macOS only. Windows and Linux are coming.
MIT licensed. Everything — the app, the bench packs, the SDK — is open.
PRs welcome. Bench Packs even more welcome.