Bleys Goodson
91 posts


if you have a 24gb card for local llms, you absolutely cannot miss this.

Gemma 4 12b vs Gemma 4 26b A4B

I just ran the unsloth Gemma 4 12B UD-Q8_K_XL (dense) on a single RTX 4090 (24GB) with llama.cpp, cuda 12.8. Built the latest llama.cpp from source on Ubuntu 22.

56 tokens per second at 250k full context. 18.2 GB VRAM.

then I ran the gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf MoE with only 4b active parameters to see what you actually give up by not running the bigger model. it used 22.8 GB of my 24 GB, ran at 32 t/s, and barely moved the needle on benchmarks.

# 12B dense flags:
./build/bin/llama-cli -m gemma-4-12b-it-UD-Q8_K_XL.gguf -cnv -c 250000 -ngl 99 -v

- 250k ctx → 56 t/s · 18.2 GB VRAM
- 128k ctx → 56 t/s · 16.2 GB VRAM

decode throughput doesn't change with context length. at all.

# 26B MoE (for comparison):
./build/bin/llama-cli -m gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -cnv -c 250000 -fa on -v

- 250k ctx → 32 t/s · 22.8 GB
- 128k ctx → 40 t/s · 22.7 GB

# what the 26B MoE gets you over the 12B:
- GPQA Diamond: 82.3% vs 78.8%
- AIME 2026: 88.3% vs 77.5%
- LiveCodeBench: 77.1% vs 72.0%
- Codeforces ELO: 1718 vs 1659
- MMLU Pro: 82.6% vs 77.2%
- MATH Vision: 82.4% vs 79.7%
- BigBench: 64.8% vs 53.0%
- Tau2: 68.2% vs 69.0% ← 12B wins

you're spending 4.6 GB more VRAM and losing 24 tokens per second for those margins. on a 24 GB card that's nearly maxing out your memory for a model that runs at 32 t/s.

56 t/s. 250k context. 18.2 GB on a consumer GPU. no API. no cloud. locally.

if you're sitting on 24GB of VRAM (a single RTX 3090, 4090, or RX 7900 XTX), pull the Gemma 12B and drop your numbers in the replies. let's build a real-world community benchmark.
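To make the trade-off easier to check, here is a minimal Python sketch that recomputes the VRAM and throughput gap at 250k context and the per-benchmark margins. Every number is copied from the post; the names `RUNS`, `BENCHMARKS`, `tradeoff`, and `margins` are hypothetical helpers for illustration, not part of llama.cpp or any benchmark tooling.

```python
# Figures copied from the post (250k context); nothing here is newly measured.
RUNS = {
    "12B dense": {"vram_gb": 18.2, "tok_s": 56},
    "26B MoE": {"vram_gb": 22.8, "tok_s": 32},
}

# (26B MoE score, 12B dense score) as listed in the post.
BENCHMARKS = {
    "GPQA Diamond": (82.3, 78.8),
    "AIME 2026": (88.3, 77.5),
    "LiveCodeBench": (77.1, 72.0),
    "Codeforces ELO": (1718, 1659),
    "MMLU Pro": (82.6, 77.2),
    "MATH Vision": (82.4, 79.7),
    "BigBench": (64.8, 53.0),
    "Tau2": (68.2, 69.0),
}


def tradeoff(runs):
    """Return (extra VRAM in GB, tokens/s lost) when moving from dense to MoE."""
    dense, moe = runs["12B dense"], runs["26B MoE"]
    extra_vram = round(moe["vram_gb"] - dense["vram_gb"], 1)
    lost_tok_s = dense["tok_s"] - moe["tok_s"]
    return extra_vram, lost_tok_s


def margins(benchmarks):
    """Return MoE-minus-dense margin per benchmark; negative means the 12B wins."""
    return {name: round(moe - dense, 1) for name, (moe, dense) in benchmarks.items()}


if __name__ == "__main__":
    extra_vram, lost_tok_s = tradeoff(RUNS)
    print(f"extra VRAM: {extra_vram} GB, throughput lost: {lost_tok_s} t/s")
    for name, delta in margins(BENCHMARKS).items():
        print(f"{name}: {delta:+}")
```

Swapping in your own measurements for `RUNS` gives the same comparison for a different card or quantization.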






Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.




What are people actually using agents for?

We analyzed the task distribution in Agent Arena across a 7-day window: 160K real user tasks spanning coding, debugging, research, document creation, frontend development, file analysis, and long multi-step workflows.

The largest categories were:
- Code writing (17.5%)
- Research and lookup (10.8%)
- Planning and brainstorming (10.6%)
- Multimodal image/video work (10.2%)
- Document creation (9.1%)
- Code debugging (8.9%)

Agent usage is broad: it's not just coding, but research, planning, content creation, file work, and complex workflows that combine multiple tools over many turns.
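For a rough sense of scale, the percentages above can be turned into approximate task counts. This is a minimal sketch that treats the rounded "160K" figure as exactly 160,000 tasks, which is an assumption; the names `TOTAL_TASKS`, `SHARES`, `approx_counts`, and `listed_share` are hypothetical and used only for illustration.

```python
# Assumption: "160K" is treated as exactly 160,000 tasks.
TOTAL_TASKS = 160_000

# Category shares (percent) as listed in the post.
SHARES = {
    "Code writing": 17.5,
    "Research and lookup": 10.8,
    "Planning and brainstorming": 10.6,
    "Multimodal image/video work": 10.2,
    "Document creation": 9.1,
    "Code debugging": 8.9,
}


def approx_counts(shares, total):
    """Convert percentage shares into approximate task counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


def listed_share(shares):
    """Total percentage covered by the listed categories."""
    return round(sum(shares.values()), 1)


if __name__ == "__main__":
    for name, count in approx_counts(SHARES, TOTAL_TASKS).items():
        print(f"{name}: ~{count:,} tasks")
    covered = listed_share(SHARES)
    print(f"listed categories: {covered}% (remaining: {round(100 - covered, 1)}%)")
```

The six listed categories cover about two thirds of the tasks, so roughly a third falls into categories the post does not break out.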

I got a lot of follow-up on my DeepSWE testing of MiniMax M3 asking what it means to be fluent in this eval set. I dug into it. The full report covers a breakdown by language, task type, complexity, and more, so you can see how applicable it is to your type of work. entrpi.github.io/misc/deepswe-s…











@teortaxesTex You saw this? github.com/datacurve-ai/d…


MiniMax M3 scores above DeepSeek V4 Pro on DeepSWE, but below other Chinese competitors.
















