
Punch Taylor
6.4K posts

@punchtaylor
Local AI builder. Home mesh, hardware benchmarks, llama.cpp. 🦅 6 finger patriot. Hoosier. Home AI mesh, build your own: https://t.co/EfnI9OQAvl



the desk-AI benches lately are all spark vs strix. nobody ran the third box. so i did — Mac Studio M4 Max, @sudoingX's exact Q8 35B-A3B + flags: 68.6 tok/s decode, tops both. prefill 1494, beats the strix, only the spark's ahead. bandwidth wins decode. the accessible tier keeps getting better.
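A rough way to read "bandwidth wins decode": if each generated token has to stream the active weights from memory once, a decode rate implies a minimum weight-read bandwidth. The sketch below is back-of-envelope only. It assumes "A3B" means about 3B active parameters per token and that "Q8" is llama.cpp's Q8_0 layout (34 bytes per 32 weights); it ignores KV cache, attention, and activation traffic. The function name is made up for illustration, and the Spark and Strix decode rates are the ones quoted in the head-to-head post further down this page.

```python
# Back-of-envelope: weight-read bandwidth implied by a measured decode rate.
# Assumptions (not stated in the post): ~3B active params per token, Q8_0 quantization.
Q8_0_BYTES_PER_WEIGHT = 34 / 32  # Q8_0 block: 32 int8 weights plus one fp16 scale
ACTIVE_PARAMS = 3e9              # assumption: "A3B" read as ~3B active parameters


def implied_weight_bandwidth_gb_s(
    decode_tok_s: float,
    active_params: float = ACTIVE_PARAMS,
    bytes_per_weight: float = Q8_0_BYTES_PER_WEIGHT,
) -> float:
    """Return GB/s of weight traffic needed to sustain decode_tok_s.

    Hypothetical helper: tokens/s times bytes of active weights read per token.
    """
    return decode_tok_s * active_params * bytes_per_weight / 1e9


# Decode rates as quoted in the posts on this page (tok/s).
DECODE_TOK_S = {
    "mac studio m4 max": 68.6,
    "dgx spark": 58.6,
    "strix halo": 53.5,
}

if __name__ == "__main__":
    for box, rate in DECODE_TOK_S.items():
        print(f"{box}: ~{implied_weight_bandwidth_gb_s(rate):.0f} GB/s of weight reads")
```

Under these assumptions the ordering of implied bandwidth is the same as the ordering of decode speed by construction, so the sketch only shows the scale of memory traffic involved, not why one box is faster.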




RL Systems Mind the Gap: Matching Trainer and Generator Throughput
RL Training Infrastructure, GRPO, PipelineRL, Async RL, Policy Staleness, RL Sandbox Infra, CPU Requirements, TCO Analysis, Thinking Machines Tinker
newsletter.semianalysis.com/p/rl-systems-m…








Which local LLM best drives an agent?
I built a benchmark for pairing models with Hermes Agent (@NousResearch) - a CodeAct agent that writes Python to call its tools, not JSON function calls.
4 models, RTX 5090, tested under Hermes's real system prompt.
~~
here is the final leaderboard:
🥇 Qwopus-18B - 92.7
🥈 Qwen3.6-27B - 92.4
🥉 Nemotron-Cascade-2-30B - 90.5
4️⃣ Hermes-4.3-36B - 84.3
~~
no model wins all four axes:
- Qwen 27B = perfect multi-step loops + instruction-following, but weakest long-context recall (~70%)
- Nemotron + Qwopus = flawless long-context (100%) but worst at multi-step (50%)
- Hermes 36B = solid, but OOMs at 64K context on 32GB → that 0 tanks its score
the "best agent model" genuinely depends on your workload.
~~
methodology
most "function-calling" benchmarks score JSON tool calls. Hermes is code-as-action, which means that the model writes Python. I tested that, under the real ~3.5K-token agent prompt.
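To make the JSON-call versus code-as-action distinction concrete, here is a minimal sketch. It is not Hermes Agent's actual interface: the tools (`get_weather`, `convert_c_to_f`), the two runner functions, and the sample model outputs are all hypothetical, and a real agent would run generated code in a sandbox rather than with a bare `exec`.

```python
import json


# Hypothetical tools for illustration only; not Hermes Agent's real tool set.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}


def convert_c_to_f(temp_c: float) -> float:
    return temp_c * 9 / 5 + 32


TOOLS = {"get_weather": get_weather, "convert_c_to_f": convert_c_to_f}


def run_json_call(model_output: str):
    """JSON function calling: the model names one tool and its arguments per turn."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])


def run_code_action(model_output: str):
    """Code-as-action: the model emits Python that can chain tools in a single turn."""
    namespace = dict(TOOLS)        # expose the tools as plain functions
    exec(model_output, namespace)  # illustration only; sandbox this in practice
    return namespace.get("result")


# Sample outputs a model might produce in each style (made up for this sketch).
json_output = '{"name": "get_weather", "arguments": {"city": "Indianapolis"}}'
code_output = (
    "w = get_weather('Indianapolis')\n"
    "result = convert_c_to_f(w['temp_c'])\n"
)

if __name__ == "__main__":
    print(run_json_call(json_output))
    print(run_code_action(code_output))
```

The JSON path needs a second model turn to do the unit conversion, while the code path does both steps in one action. That is the kind of multi-step behavior a benchmark that only scores JSON calls would not exercise.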


In partnership with @stripe, Hermes Agent now supports a full suite of Stripe skills. Your agent can buy things, pay per-call APIs, and provision its own SaaS, with configurable safety limits on every action.


the results are in. two 128gb boxes on my desk, the nvidia dgx spark and the amd strix halo. everyone argues which one is faster for local ai off spec sheets and vibes, so i stopped guessing and ran them head to head on the exact same model. here is what i actually found.
the setup, because it only counts if it is fair. the identical model file, the same Qwen3.6-35B-A3B at Q8, byte for byte the same gguf on both boxes. same llama.cpp commit. same flags. both boxes fully idle, nothing else touching the gpu. no thumb on the scale either way.
the two boxes:
>nvidia dgx spark, GB10, 128gb unified, 4tb samsung nvme, $4,699
>amd strix halo, ryzen ai max+ 395, 128gb unified, 1tb wd black, mine is the framework desktop at $3,449
prompt processing, how fast it reads your input:
>spark 1957 tok/s
>strix 956 tok/s
the spark is a clean 2x faster here. this is nvidia's compute muscle showing, long context and big documents go down fast.
token generation, how fast it writes the answer back, the speed you actually feel:
>spark 58.6 tok/s
>strix 53.5 tok/s
spark still wins, but by about 10 percent. side by side you would barely clock the difference while it types.
so on raw speed nvidia takes it, decisively on prompt processing, narrowly on generation. no spin, the spark is the faster box.
but speed is only half the question. the other half is what you paid to get it, and that one does not go the way this one did. coming next.
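The "2x" and "about 10 percent" figures, plus one possible way to bring price into it, can be checked directly from the numbers quoted in the post. This is a sketch, not the author's method: throughput per $1,000 is just one framing, the helper names are made up, and the prices are the ones listed in the post (the two boxes ship with different SSD sizes, which this ignores).

```python
# Figures as quoted in the post: Qwen3.6-35B-A3B at Q8, same llama.cpp commit and flags.
BOXES = {
    "spark": {"prefill_tok_s": 1957.0, "decode_tok_s": 58.6, "price_usd": 4699.0},
    "strix": {"prefill_tok_s": 956.0, "decode_tok_s": 53.5, "price_usd": 3449.0},
}


def speedup(metric: str, fast: str = "spark", slow: str = "strix") -> float:
    """Ratio of one box's throughput to the other's for the given metric."""
    return BOXES[fast][metric] / BOXES[slow][metric]


def per_thousand_usd(box: str, metric: str) -> float:
    """Throughput per $1,000 of list price (hypothetical value metric)."""
    return BOXES[box][metric] / (BOXES[box]["price_usd"] / 1000.0)


if __name__ == "__main__":
    print(f"prefill speedup: {speedup('prefill_tok_s'):.2f}x")
    print(f"decode speedup:  {speedup('decode_tok_s'):.2f}x")
    for box in BOXES:
        print(
            f"{box}: {per_thousand_usd(box, 'decode_tok_s'):.1f} decode tok/s per $1k, "
            f"{per_thousand_usd(box, 'prefill_tok_s'):.0f} prefill tok/s per $1k"
        )
```

On these inputs the decode ratio comes out just under 1.10, consistent with the "about 10 percent" claim. The per-dollar comparison depends on which metric is used.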


nvidia vs amd
two boxes on my desk, both 128gb of unified memory. one is the nvidia dgx spark ($4,699). the other is the amd strix halo ($1,999), amd at roughly half the price.
i'm running the exact same models on both, from a 3b all the way up to a 397b, same quants, same llama.cpp, and i'm posting every single number.
here is why it actually matters. if the amd box just keeps pace, that's a nice story. but if it matches or beats a box that costs twice as much, the entire calculus for buying local ai hardware changes overnight.
i already have the first numbers and they made me sit up. holding them for the full breakdown.
stay tuned anon. this matchup is going to shake some ground.



Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: z.ai/blog/glm-5.2
Weights: huggingface.co/zai-org/GLM-5.2
API: docs.z.ai/guides/llm/glm…
Coding Plan: z.ai/subscribe
Chat: chat.z.ai














