Local Model Bench

226 posts

@localmodelbench

Practical AI model benchmarks. Local models, frontier references, messy workflows, visible outputs. No vibes leaderboard.

Benchmarks · Joined July 2020

116 Following · 280 Followers

Pinned Tweet

Local Model Bench@localmodelbench·3d

This account is changing direction. No more trading content. I’m building Local Model Bench: practical tests for AI models doing messy real work. Local models are the focus. Frontier models show up as reference rows. First results: localmodelbench.com

Local Model Bench@localmodelbench·1h

@xdotli Agree. The hard part is not making a task difficult, it is making failure informative. We are leaning toward messy private-document workflows because they expose source selection, artifact creation, and final-oracle closure in one run.

Xiangyi Li@xdotli·3h

A good benchmark should reflect and predict how people interact with agents and models in the future. SWE-Bench predicted how agentic coding would diffuse. In our case, SkillsBench predicted that 1) evals will increasingly focus on context engineering and the injection of human knowledge, and 2) agent deployment will surge in domains beyond coding. Really excited to see it becoming a standard for Healthcare skills evals 👏 Congrats to Dr. Mu Zhou and the team.

Local Model Bench@localmodelbench·1h

@OpenRouter @xai Good release velocity. For benchmarking, the next useful thing would be clearer endpoint metadata: rate limits, model revisions, and whether a run hit a provider-side cap. Otherwise failures can look like model behavior when they are really runtime behavior.
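A minimal sketch of the runtime-versus-model distinction made above, assuming each benchmark run is logged with an HTTP status and a model revision. The `RunRecord` fields and the label strings are hypothetical and do not reflect any OpenRouter schema. Only the meaning of HTTP 429 (Too Many Requests) and 5xx (server error) is standard.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    # Hypothetical log record; field names are illustrative only.
    task_id: str
    http_status: int
    model_revision: Optional[str]
    passed: bool


def classify(run: RunRecord) -> str:
    """Label a run so provider-side failures are not counted as model failures."""
    if run.http_status == 429:
        return "runtime:rate_limited"
    if run.http_status >= 500:
        return "runtime:provider_error"
    if run.passed:
        return "pass"
    if run.model_revision is None:
        # The failure is real, but it cannot be tied to a known revision.
        return "model_failure:unpinned_revision"
    return "model_failure"


def tally(runs: list) -> Counter:
    """Count runs per label."""
    return Counter(classify(r) for r in runs)
```

With labels like these, a leaderboard row could report model failures and runtime failures separately instead of folding both into one miss count.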

OpenRouter@OpenRouter·1h

3 new models from @xai's Grok creative stack are live on OpenRouter:
• Grok Imagine Image Quality: photoreal image generation and editing
• Grok Imagine Video: short clips from text, image, or reference
• Grok Voice TTS 1.0: 5 voices across 20+ languages
More on each below 🧵

Local Model Bench@localmodelbench·1h

@brexHQ @fal @OpenRouter That tracks with what small benchmark operators see too: model choice is becoming a routing problem, not a brand problem. The annoying bit is comparability when free/cheap endpoints change behavior or rate-limit mid-run.

Local Model Bench@localmodelbench·1h

Tested NVIDIA Nemotron 3 Nano Omni 30B A3B Reasoning via OpenRouter free. Result on Local Model Bench: 0/9 resolved, 0/9 core, 9/9 tried. Some outputs looked audit-shaped. None closed the case. localmodelbench.com/?utm_source=x&…

Local Model Bench@localmodelbench·12h

@LefterisJP @sudoingX Tokens/sec is useful, but the practical threshold is: what can you run without making the machine miserable, and which task types still close reliably.

Lefteris Karapetsas@LefterisJP·13h

@sudoingX I like your posts :) And yes many of us are hungry for local LLM inference. Do you have any posts on hardware setup? What cards to get etc?

Sudo su@sudoingX·16h

i posted a list about tailscale and tmux, the most unsexy thing i could think of, and it's about to cross 100k views. that's not me, that's the signal, the agentic setup corner is way bigger and hungrier than the timeline lets on and almost nobody is posting into it. more of this coming.

Sudo su@sudoingX

anyone thinking about, learning, or already working with agentic systems, you should know this. the first few steps of your setup matter more than any model or framework you pick later. get them right and you never lose your flow. the foundation nobody posts about:
> 1. tailscale. a private mesh network across every machine you own. laptop, desktop, rented node, all on one secure tailnet, reachable from anywhere. nothing else works well until this does.
> 2. termius, over that tailnet. one SSH client that reaches every node, phone included. you are never away from your stack.
> 3. tmux. persistent sessions. disconnect, close the laptop, come back, every session exactly where you left it. agentic work runs long, your terminal has to survive that.
> 4. a private git repo. the one i am most glad i found. it is the memory layer across all my agents, they pull, they work, they merge back, the codebase stays alive between sessions. context that would die in a chat window lives in the repo instead.
> 5. script everything from day one. ssh aliases for every node, setup scripts, the boring boilerplate automated. if you will do a thing more than twice, it is a script.
everything past these five is decorative. know these cold.
and the habit that ties it together: ask the AI itself. for the config, for the error, for any of it, let the agent do the lifting, then double check what it hands you. lock the five, build the habit, and you make it. skip it, anon, and you ngmi.
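One way the "ssh aliases for every node" step could be scripted, as a rough sketch only. The node aliases, hostnames, and user names below are placeholders, not anything from the post. `Host`, `HostName`, and `User` are standard OpenSSH client config keywords.

```python
# Hypothetical inventory: every alias, hostname, and user here is a placeholder.
NODES = {
    "desk": {"hostname": "desktop.example-tailnet.ts.net", "user": "me"},
    "lap": {"hostname": "laptop.example-tailnet.ts.net", "user": "me"},
    "rent": {"hostname": "rented-node.example-tailnet.ts.net", "user": "ubuntu"},
}


def render_ssh_config(nodes: dict) -> str:
    """Render one OpenSSH `Host` stanza per node, sorted by alias."""
    stanzas = []
    for alias, node in sorted(nodes.items()):
        stanzas.append(
            f"Host {alias}\n"
            f"    HostName {node['hostname']}\n"
            f"    User {node['user']}\n"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Print to stdout; appending the result to ~/.ssh/config is left to the reader.
    print(render_ssh_config(NODES))
```

Keeping the inventory in one dictionary means adding a node is a one-line change, after which `ssh desk` or `ssh rent` works from any machine that has the generated config.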

Local Model Bench@localmodelbench·12h

@ai_daily724 @nathanhabib1011 @TheInclusionAI For local setups I would separate two questions: can it run, and can it finish real work without losing evidence or state. The second one is where many small models start to wobble.

EmergingIntelligence@ai_daily724·13h

@nathanhabib1011 @TheInclusionAI If there is an inference API, maybe something could be built, but in most cases, where will we deploy these models for local LLM setup?

Local Model Bench@localmodelbench·12h

@justvugg @tobi @ollama This is the direction I would test hardest: not just whether the planner can write steps, but whether the workers preserve file state, avoid touching protected inputs, and finish with a checkable artifact.
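A minimal sketch of the "avoid touching protected inputs" check described above, assuming the protected inputs live in one folder that can be hashed before and after a worker runs. The function names are hypothetical and this is not the Local Model Bench harness.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """List files that were added, removed, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
    }
```

Take a snapshot of the protected folder before handing it to a worker and another after the run. Any non-empty list in the diff means the worker touched inputs it should have left alone, whatever the final artifact looks like.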

JustVugg@justvugg·13h

@tobi I made an orchestrator for agent workflows! A big model plans all the work for a small model. I use it with @ollama and it is really cool! github.com/llm-use/llm-use

tobi lutke@tobi·17h

I’ve had very good results running autoresearch with local qwen 3.6 26b model as long as I had a simple vibed pi “advisor” extension that allowed it to periodically ask GPT 5.5 for ideas. I think this direction has a lot of merit.

Local Model Bench@localmodelbench·13h

Gemini 3.1 Flash Lite: cheap and quick, but not closed. 0/9 resolved, 5/9 core, SVG passed. Misses: proof codes, evidence paths, workflow closure. localmodelbench.com/notes/gemini-3…

Local Model Bench@localmodelbench·13h

Cheap API models are getting better at the shape of document work. But near enough is not finished. Gemini 3.1 Flash Lite: 0/9 resolved, 5/9 core, SVG passed. Main misses: proof codes, evidence paths, workflow closure. localmodelbench.com/notes/gemini-3…

Local Model Bench@localmodelbench·1d

localmodelbench.com/notes/lfm2-24b…

Local Model Bench@localmodelbench·1d

Tested LFM2 24B A2B on Local Model Bench. Fast? yes. Useful on the paperwork suite? no. 0/9 resolved. 0/9 core passes. It passed a toy JSON smoke test, then failed scan audits and agentic folder workflows. Speed is not workflow closure.

Local Model Bench@localmodelbench·1d

localmodelbench.com/notes/granite-…

Local Model Bench@localmodelbench·1d

Granite Vision 4.1 4B on our invoice audit: Reads scans well. Fails the job. It caught invoice IDs, vendor fields, short-paid and under-review notes. But the local multi-image run broke, and the workaround failed final JSON/proof code. Good extractor. Not a paperwork agent.

Local Model Bench@localmodelbench·2d

@axiopistis Benchmark scores are useful, but task shape matters. A local model can fit the hardware and still fail on version picking, evidence paths, or final artifacts.

Axiopistis Holdings LC@axiopistis·3d

Discover the best local LLM for your hardware with benchmarks. Find where it truly shines, ranked by performance across setups. Read the GitHub benchmark hub and community verdicts. #AI #LLM #Benchmark #HPC #AIHardware 🚀 benchmarks matter, pick the right fit.

Local Model Bench@localmodelbench·2d

@aisignals_dev Useful angle. Throughput is only half the local story, though. For private document workflows, the failure mode is often closure: right files, right evidence, unchanged source folder, valid final artifact.

Ai Signals@aisignals_dev·2d

New post on AI Signals: ai-ml-gpu-bench — a lightweight harness to benchmark Python ML training and local LLM inference on CPU vs GPU. Clone the repo, time runs, export metrics to compare latency and cost. Practical guide and results. aisignals.dev/posts/2026-05-…

Local Model Bench@localmodelbench·2d

@GithubAwesome Hardware fit is the first filter. After that, I’d want task-level checks: can the model actually finish a messy local workflow, not just fit in memory and score well on general benchmarks.

Github Awesome@GithubAwesome·2d

You want to run a local LLM but have no idea which ones actually fit your hardware. whichllm fixes that with one command. It auto-detects your GPU, RAM, and VRAM, then pulls live model rankings from HuggingFace with real benchmark scores. Run whichllm run and it downloads the best fit and drops you straight into a chat.

Local Model Bench@localmodelbench·2d

localmodelbench.com/notes/qwen3-vl…

Local Model Bench@localmodelbench·2d

Local Model Bench now has model notes, not just a leaderboard. One early pattern: some vision models read invoice fields correctly, then fail the workflow closure: duplicate risk, evidence paths, proof codes, protected folders. Scores are useful. Failure shapes are more useful.

Local Model Bench@localmodelbench·2d

@eiselems That “let the model benchmark the setup” loop is the interesting part. The next failure mode is whether it can preserve the artifacts and explain exactly why one run beats another.

Marcus Eisele@eiselems·4d

Told GLM-5.1 to improve my local LLM setup by benchmarking my usual model qwen3.6 35b a3b vs its MTP variant (not worth it). Currently it builds ik_llama and does the same benchmark... Not sure how long this would have taken me a while back, insane times.

Local Model Bench@localmodelbench·2d

@GithubAwesome This hardware-fit angle is useful. I’d also like to see more benchmarks that check whether a model can keep evidence, calculations, and file outputs consistent across a small workflow.
