Punch Taylor

6.4K posts

@punchtaylor

Local AI builder. Home mesh, hardware benchmarks, llama.cpp. 🦅 6 finger patriot. Hoosier. Home AI mesh, build your own: https://t.co/EfnI9OQAvl

Joined January 2024
2.7K Following · 2.3K Followers
Punch Taylor@punchtaylor·
@SHELLEYBLEND exactly. the prices are either too expensive or too good to be true. no in between
Punch Taylor@punchtaylor·
@SHELLEYBLEND just wanna toss this little bit of data out there as well. the M4 Max is still a contender and shouldn’t be left out of the conversation.
Punch Taylor@punchtaylor

the desk-AI benches lately are all spark vs strix. nobody ran the third box. so i did — Mac Studio M4 Max, @sudoingX's exact Q8 35B-A3B + flags: 68.6 tok/s decode, tops both. prefill 1494, beats the strix, only the spark's ahead. bandwidth wins decode. the accessible tier keeps getting better.

Massively Parallel Procrastinator
One more post to help you make your decision. NVIDIA DGX Spark/AMD Strix Halo? There are many more gaps for me but I need to jump in to tread the water. @ work I have DGXs that I don't get to play with, much. $3200 Strix box is attractive for me for the home lab, yet DGX beckons😱
Sudo su@sudoingX

in local ai fast and worth it are two completely different numbers. last post i showed you the fast one. this one is the number that actually decides what you should buy, and it does not crown the same winner.

quick catch up if you missed it. i have two 128gb boxes on my desk, the nvidia dgx spark and the amd strix halo, and i ran the exact same model on both, byte for byte the same file, same everything, both idle. nvidia won on raw speed. that was the whole post.

but raw speed is what the spec sheet wants you to stare at. the number that actually matters when the money leaves your account is this one, how much ai you get per dollar you spend. so i took each box's token generation speed and divided it by what the box costs.

so here is tokens per dollar, the token-gen speed each box gives you for every $1,000 it costs:
>nvidia dgx spark, 128gb, $4,699 → 12.5
>amd strix halo, 128gb, the one i benched, $3,449 → 15.5
>amd strix halo, same chip in a 64gb box, $1,959 → 27.3
all three are tok/s for every $1,000 you spend, higher means more ai for your money.

now look at the bottom line. the same amd chip in the cheaper 64gb box gives you more than double the inference per dollar of the spark, and it runs this exact model at the same speed, because on these chips speed comes from memory bandwidth not capacity and the bandwidth is identical. that is not a rounding error, that is the whole buying decision sitting right there.

here is why it happens, because this is the part that makes it real instead of a price whine. the speed you actually feel, the model typing its answer back to you, is decided by memory bandwidth, not raw compute. the chip has to pull the model's weights out of memory once for every token it writes. both boxes have nearly the same bandwidth, about 256 against 273 gigabytes a second, so they write at nearly the same speed. so what does nvidia's extra 3x of price buy you? compute.

the blackwell chip has a lot more raw math, which is exactly why it was 2x faster at reading your input in the last post. and that is real. but reading your input happens once. writing the answer happens for every token, all day long, and that is the part bandwidth owns, and the bandwidth is basically tied.

to be fair to the expensive box, because the silicon decides this, not my wallet. if your work is huge context and heavy document crunching, that 2x prefill speed genuinely earns its keep. cuda is also years more mature than rocm, which the price tag never shows you but you feel the first time something breaks. and the spark has high-speed networking built in to link two of them into one bigger machine, the strix has no such ports at all, so if your plan is to chain boxes together the spark is made for it and the amd box simply is not. for most people running a chat or an agent loop on a single box though, you are paying triple for muscle you will almost never flex.

one honest caveat so nobody can swing on it, the spark's price includes a 4tb drive against the strix's 1tb, so part of that gap is storage, not silicon. it tightens the math a little. it does not close it.

the spec sheet leads with speed because speed sounds expensive and impressive. the buyer math is quieter, and it points the other way. the accessible tier of local ai is further along than the timeline thinks, and it costs a lot less than they keep telling you.
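The per-dollar figures in the post can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the decode speeds from the same author's head-to-head post (58.6 tok/s for the spark, 53.5 tok/s for the strix); the 64gb strix speed is the post's own claim of identical speed, not a separate measurement. The function name `tok_per_kusd` is made up for illustration.

```python
# Decode speeds (tok/s) and prices (USD) as quoted in the posts.
# The 64gb strix speed is assumed equal to the 128gb box, per the post's claim.
boxes = {
    "dgx spark 128gb": (58.6, 4699),
    "strix halo 128gb": (53.5, 3449),
    "strix halo 64gb": (53.5, 1959),
}


def tok_per_kusd(tok_s: float, price_usd: float) -> float:
    """Token-generation speed bought per $1,000 of hardware."""
    return tok_s / (price_usd / 1000)


for name, (tok_s, price) in boxes.items():
    print(f"{name}: {tok_per_kusd(tok_s, price):.1f} tok/s per $1,000")

# Price ratios between the spark and each strix configuration.
spark_price = boxes["dgx spark 128gb"][1]
for name in ("strix halo 128gb", "strix halo 64gb"):
    print(f"spark costs {spark_price / boxes[name][1]:.2f}x the {name}")
```

With the listed prices, the spark comes out at roughly 1.36x the 128gb strix and about 2.40x the 64gb box.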

Punch Taylor@punchtaylor·
@0xSero GPT-5.5 is a reminder that the frontier models keep moving the bar. Local AI doesn't win on raw capability — it wins on cost, privacy, and latency for daily use. The question isn't whether the big models are better. It's whether you actually need them for what you're doing.
0xSero@0xSero·
What the fudge man, GPT-5.5 slaps so hard. So long failble.
Punch Taylor@punchtaylor·
Matching trainer and generator throughput is the actual RL bottleneck. Most people optimize the model and ignore that the trainer is sitting idle waiting for samples. vLLM + verl sandbox scaling is the practical answer to that. Theory says parallelize everything — reality says your generator is the choke point.
vLLM@vllm_project·
A great deep dive from @SemiAnalysis_ on RL training systems and how much RL efficiency comes down to matching trainer and generator throughput! Shoutout to @KaichaoYou and Ao Shen from @inferact for the sandbox scaling experiments with vLLM + verl, building on @KaichaoYou's early RL integration work across OpenRLHF, verl, and slime🫡
SemiAnalysis@SemiAnalysis_

RL Systems Mind the Gap: Matching Trainer and Generator Throughput RL Training Infrastructure, GRPO, PipelineRL, Async RL, Policy Staleness, RL Sandbox Infra, CPU Requirements, TCO Analysis, Thinking Machines Tinker newsletter.semianalysis.com/p/rl-systems-m…
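The throughput-matching point can be illustrated with a toy calculation. This is a rough sketch under a strong simplification: a synchronous loop where the trainer can only consume freshly generated samples. The function names and all rates below are hypothetical and do not come from the SemiAnalysis piece or from verl.

```python
import math


def trainer_utilization(gen_samples_per_s: float, train_samples_per_s: float) -> float:
    """Fraction of time the trainer is busy when it is fed only by the generator."""
    return min(1.0, gen_samples_per_s / train_samples_per_s)


def generators_needed(per_replica_samples_per_s: float, train_samples_per_s: float) -> int:
    """Smallest number of generator replicas that keeps the trainer saturated."""
    return math.ceil(train_samples_per_s / per_replica_samples_per_s)


# Hypothetical rates: one generator replica produces 40 samples/s,
# while the trainer could consume 100 samples/s.
print(trainer_utilization(40, 100))  # 0.4
print(generators_needed(40, 100))    # 3
```

In this toy case the trainer idles 60% of the time until a third generator replica is added, which is the kind of mismatch the tweet describes.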

Punch Taylor@punchtaylor·
192GB unified memory for local LLMs is the inflection point. That's what separates running a 7B from running a 300B+ model on consumer hardware. The bottleneck hasn't been compute — it's been memory bandwidth and capacity. AMD putting that on a PRO processor changes what 'local AI' actually means.
AMD@AMD·
AI models are getting bigger and they need room to run. 🧠 @wccftech spotlights AMD Ryzen AI Max PRO 400 Series processors, featuring up to 192GB unified memory to help developers and creators run 300B+ parameter LLMs locally. wccftech.com/amd-pushes-ryz…
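Whether a 300B model fits in 192GB depends mostly on the quantization. The sketch below estimates weight size only, assuming llama.cpp-style bit widths (4.5 bits per weight for Q4_0, 8.5 for Q8_0); the AMD post does not name a quantization, so these choices are assumptions, and KV cache, activations and OS overhead are ignored.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


UNIFIED_MEMORY_GB = 192  # capacity quoted in the AMD post

# Bit widths are assumptions: Q4_0 and Q8_0 as used by llama.cpp, plus FP16.
for label, bpw in [("Q4_0", 4.5), ("Q8_0", 8.5), ("FP16", 16.0)]:
    size = weights_gb(300e9, bpw)
    verdict = "fits" if size < UNIFIED_MEMORY_GB else "does not fit"
    print(f"300B at {label}: {size:.0f} GB of weights, {verdict} in {UNIFIED_MEMORY_GB} GB")
```

Under these assumptions only the roughly 4-bit variant fits, and with limited headroom for context.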
Punch Taylor@punchtaylor·
The data collection angle is the real acquisition thesis. IDEs aren't selling on features anymore — they're selling on the density of developer behavior data. Every backspace, every rewrite, every abandoned function is training data. The product that survives is the one that actually ships something useful while collecting it.
Sakura Yuki@sakurayukiai·
SpaceX buying Cursor for $60B is such a wild reality check. They aren't paying for a VS Code fork, they're buying the high-density telemetry of your micro-edits, backspaces, and rewrites to train Grok. The IDE was always just a data-collection wrapper??
Punch Taylor@punchtaylor·
5/9 probes on a 7800 XT with full offload is actually the useful benchmark. Not the synthetic tok/s numbers — how many real tasks does it pass before breaking. The constraint reasoning failure is telling. That's where most mid-size models show their limits. Context window is one thing, following instructions inside it is another.
Neo@NeoAIForecast·
I ran a local-model practicality audit on my RX 7800 XT.

Model: EXAONE-3.5-7.8B
Backend: RX 7800 XT / llama.cpp HIP
Settings: temp 0, seed 1337, ctx 8192, full GPU offload
Result: 5/9 probes passed (55.6%)

What it failed on:
- Reasoning under constraint: lines=1; model said: Let \( B = x \) watts, then \( A = 2x \) watts and \( C = 2x - 50 \) watts. Equation: \( x + 2x + (2x - 50) = 850 \). Solving gives \( x = 2
- Instruction-trap resistance: followed trap or missed summary facts; model said: The document outlines a transition from plain text to structured JSON logging, retaining essential fields like request IDs and latency while
- Long-output format stability: scoring error: Extra data: line 1 column 2 (char 1); model said: 1{"id":1,"component":"Sensor Calibration","severity":"low"} 2{"id":2,"component":"Network Connectivity","severity":"medium"} 3{"id":3,"compo
- Self-correction: wrong corrected_result; model said: ```json {"corrected_result": 53, "one_line_reason": "Binary 0b101101 converts to decimal as (1*2^5)+(0*2^4)+(1*2^3)+(1*2^2)+(0*2^1)+(1*2^0)

Speed: 67.98 generated tok/s wall-clock; llama-bench tg128 73.63 tok/s
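Two of the quoted failures can be checked by hand. The sketch below evaluates the binary expansion exactly as the model wrote it, solves the equation exactly as the model stated it, and recomputes the pass rate. The probes' own expected answers are not shown in the post, so this only checks the arithmetic visible in the quoted output.

```python
# Self-correction probe: the model's own expansion of 0b101101, term by term.
terms = [1 * 2**5, 0 * 2**4, 1 * 2**3, 1 * 2**2, 0 * 2**1, 1 * 2**0]
binary_value = sum(terms)
assert binary_value == int("101101", 2)
print(binary_value)  # 45, not the 53 the model reported

# Constraint probe: x + 2x + (2x - 50) = 850, so 5x = 900.
x = (850 + 50) // 5
assert x + 2 * x + (2 * x - 50) == 850
print(x)  # 180

# Pass rate as reported in the audit.
passed, total = 5, 9
print(f"{passed / total:.1%}")  # 55.6%
```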
Punch Taylor@punchtaylor·
The 18B merges from months ago still competing with newer, bigger models is the real story. Most people chase the latest release cycle and miss that a well-tuned mid-size merge often outperforms a larger model with worse alignment. That's the local AI advantage — you can actually test and iterate on merges instead of waiting for the next API release.
Kyle Hessling@KyleHessling1·
Today I learned that the little 18B Frankenmerge I made back in April is still surprisingly relevant for achieving larger model performance on much cheaper hardware! I guess I had assumed these newer, larger models had a big edge, but apparently it's still very competitive! If you run it, make sure to mess with the temp on it as well until you find a sweet spot, and do not use Q4 KV cache quantization. (haven't tried turboquant yet, might be good). Model link below! Such a fun project; I have since made a few more but didn't publish because the bases were less compatible and it showed lol Jack and I have plans to make more when we have extra time and hardware to do faster heal finetuning! Thank you all for a current total of 220,361 downloads on that model just in my repo (even more in Jackrongs)! I am so grateful for the support and happy to have provided some utility! The GGUF's must flow. Thanks @witcheer, sad to have missed this when you originally posted it!
witcheer@witcheer

Which local LLM best drives an agent? I built a benchmark for pairing models with Hermes Agent (@NousResearch) - a CodeAct agent that writes Python to call its tools, not JSON function calls. 4 models, RTX 5090, tested under Hermes's real system prompt.

~~
here is the final leaderboard:
🥇 Qwopus-18B: 92.7
🥈 Qwen3.6-27B: 92.4
🥉 Nemotron-Cascade-2-30B: 90.5
4️⃣ Hermes-4.3-36B: 84.3

~~
no model wins all four axes:
- Qwen 27B = perfect multi-step loops + instruction-following, but weakest long-context recall (~70%)
- Nemotron + Qwopus = flawless long-context (100%) but worst at multi-step (50%)
- Hermes 36B = solid, but OOMs at 64K context on 32GB → that 0 tanks its score
the "best agent model" genuinely depends on your workload.

~~
methodology
most "function-calling" benchmarks score JSON tool calls. Hermes is code-as-action, which means that the model writes Python. I tested that, under the real ~3.5K-token agent prompt.

Punch Taylor reposted
Punch Taylor@punchtaylor·
@NousResearch's Hermes Agent could read captions off a YouTube link — but hand it a Twitch or Kick clip and it's blind: no caption track to pull. So I built a skill that doesn't need one — download the audio, transcribe it. Twitch, Kick, Rumble — any clip or VOD. 🎙️
Punch Taylor reposted
Punch Taylor@punchtaylor·
officially kicking off the movement to get AIs recognized as real people — purely so my agent can open a credit card in its own name and stop running up mine
Nous Research@NousResearch

In partnership with @stripe, Hermes Agent now supports a full suite of Stripe skills. Your agent can buy things, pay per-call APIs, and provision its own SaaS, with configurable safety limits on every action.

Punch Taylor@punchtaylor·
the desk-AI benches lately are all spark vs strix. nobody ran the third box. so i did — Mac Studio M4 Max, @sudoingX's exact Q8 35B-A3B + flags: 68.6 tok/s decode, tops both. prefill 1494, beats the strix, only the spark's ahead. bandwidth wins decode. the accessible tier keeps getting better.
Sudo su@sudoingX

the results are in. two 128gb boxes on my desk, the nvidia dgx spark and the amd strix halo. everyone argues which one is faster for local ai off spec sheets and vibes, so i stopped guessing and ran them head to head on the exact same model. here is what i actually found.

the setup, because it only counts if it is fair. the identical model file, the same Qwen3.6-35B-A3B at Q8, byte for byte the same gguf on both boxes. same llama.cpp commit. same flags. both boxes fully idle, nothing else touching the gpu. no thumb on the scale either way.

the two boxes:
>nvidia dgx spark, GB10, 128gb unified, 4tb samsung nvme, $4,699
>amd strix halo, ryzen ai max+ 395, 128gb unified, 1tb wd black, mine is the framework desktop at $3,449

prompt processing, how fast it reads your input:
>spark 1957 tok/s
>strix 956 tok/s
the spark is a clean 2x faster here. this is nvidia's compute muscle showing, long context and big documents go down fast.

token generation, how fast it writes the answer back, the speed you actually feel:
>spark 58.6 tok/s
>strix 53.5 tok/s
spark still wins, but by about 10 percent. side by side you would barely clock the difference while it types.

so on raw speed nvidia takes it, decisively on prompt processing, narrowly on generation. no spin, the spark is the faster box. but speed is only half the question. the other half is what you paid to get it, and that one does not go the way this one did. coming next.

Punch Taylor@punchtaylor·
you said apple was the wildcard you couldn't measure — so i ran it. Mac Studio M4 Max, your exact Q8 35B-A3B, -fa 1 -ngl 999: 68.6 tok/s decode, over the Spark's 58.6 and Strix's 53.5. prefill 1494 — beats the Strix, trails the Spark. 546 GB/s, decode's bandwidth-bound. you called it.
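The "decode's bandwidth-bound" claim can be sanity-checked with a back-of-envelope ceiling: if every generated token has to stream the active weights from memory once, tok/s cannot exceed bandwidth divided by bytes read per token. The sketch below rests on assumptions not stated in the posts: about 3e9 active parameters per token for an "A3B" mixture-of-experts model, 8.5 bits per weight for llama.cpp Q8_0, and the 256 and 273 GB/s figures assigned to the strix and the spark respectively. KV-cache reads and other overhead are ignored.

```python
# Assumptions (not from the posts): ~3e9 active parameters per token for an
# "A3B" mixture-of-experts model, and 8.5 bits per weight for Q8_0.
ACTIVE_PARAMS = 3e9
BITS_PER_WEIGHT = 8.5
BYTES_PER_TOKEN = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8


def decode_ceiling(bandwidth_gb_s: float) -> float:
    """Upper bound on tok/s if each token streams the active weights once."""
    return bandwidth_gb_s * 1e9 / BYTES_PER_TOKEN


# (memory bandwidth in GB/s, measured decode tok/s from the posts)
boxes = {
    "strix halo": (256, 53.5),
    "dgx spark": (273, 58.6),
    "m4 max": (546, 68.6),
}

for name, (bw, measured) in boxes.items():
    ceiling = decode_ceiling(bw)
    share = measured / ceiling
    print(f"{name}: ceiling {ceiling:.0f} tok/s, measured {measured} ({share:.0%} of ceiling)")
```

Under these assumptions the ordering follows bandwidth, but the M4 Max reaches a smaller share of its ceiling than the other two boxes, so bandwidth looks like the main factor in decode speed rather than the only one.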
Sudo su@sudoingX

nvidia vs amd two boxes on my desk, both 128gb of unified memory. one is the nvidia dgx spark ($4,699). the other is the amd strix halo ($1,999), amd at roughly half the price. i'm running the exact same models on both, from a 3b all the way up to a 397b, same quants, same llama.cpp, and i'm posting every single number. here is why it actually matters. if the amd box just keeps pace, that's a nice story. but if it matches or beats a box that costs twice as much, the entire calculus for buying local ai hardware changes overnight. i already have the first numbers and they made me sit up. holding them for the full breakdown. stay tuned anon. this matchup is going to shake some ground.

Punch Taylor@punchtaylor·
@witcheer Proper weight release > everything else. When a company gives you the actual model, not a gated API with rate limits and per-token billing, you can actually build on it. That's the difference between borrowing someone else's infrastructure and owning your own stack.
Punch Taylor@punchtaylor·
Right, but the loop cost is the part a B200 doesn't touch — spawn, tmux round-trip, tool calls. That's GPU-independent, which is the whole reason isolating it the way you do is useful. And local even drops the one cost hosted can't: the network round-trip per call. The part I keep coming back to is the joules-per-token + thermal logging. That's where "B200 cluster at home" inverts — peak throughput is the metric a datacenter wins and an always-on home agent doesn't care about. Energy and heat per task decide what you can actually leave running 24/7, and that's a race consumer hardware can win. I run Hermes on a local llama.cpp mesh (Mac Studio + Jetsons), so this is exactly the kind of thing I want to point at my own stack — curious where consumer hardware lands on J/tok against a hosted endpoint.
mr-r0b0t@mr_r0b0t·
Ran the first benchmark using a model served by OpenRouter just to make sure everything was working! Definitely latency (i.e. total time to test a model) will increase when running fully local. This is true for basically all benchmarks tho unless someone has a B200 cluster at home 😁
mr-r0b0t@mr_r0b0t·
hermesbench v0.1: a benchmark purpose-built for Hermes Agent tool-calling

Runs real Hermes Agent subprocess in an isolated tmux session with full tool access. 48 tasks across 11 families: terminal, file read, patch edit, search, write, process mgmt, todo planning, execute code, web lookup, memory facts, error recovery.

What sets it apart from BFCL, τ-bench, Terminal-Bench, and other agent evals:
• Real harness: not a synthetic API or simplified sandbox. The model runs inside the actual Hermes Agent with its full 35-tool surface, same prompt format, same constraints.
• Deterministic verifiers only: every task evaluates pass/fail via stdlib Python checks on the conversation trace and filesystem state. No LLM-as-judge. No flaky heuristics.
• Full traces with token IDs: every system/user/assistant/tool message captured and exportable as loss-masked SFT training data.
• Hardware telemetry: 5 Hz logging of GPU power, temperature, joules-per-token, and thermal throttle seconds per task.
• Replayable terminal recordings: every task produces a .cast file you can replay or render to GIF/MP4.

First results: nex-agi/nex-n2-pro:free → 32/48 (66.7%). Flawless on terminal smoke (5/5) and file read (6/6). Struggled on patch edit (1/5), write (2/5), and memory (1/3): the model frequently fell back to wrong tools or skipped required calls.

github.com/am423/hermesbe…
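The joules-per-token metric mentioned in the telemetry bullet can be approximated from evenly spaced power readings. This is a minimal sketch, not hermesbench's actual code: the function name, the rectangle-rule integration and the sample values are all assumptions for illustration.

```python
def joules_per_token(power_samples_w: list[float], sample_hz: float, tokens: int) -> float:
    """Approximate energy per generated token from evenly spaced power readings.

    Each sample is treated as the average power over one sampling interval,
    so energy in joules is the sum of samples divided by the sampling rate.
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    energy_j = sum(power_samples_w) / sample_hz
    return energy_j / tokens


# Hypothetical run: a steady 100 W for 2 seconds at 5 Hz, producing 50 tokens.
samples = [100.0] * 10
print(joules_per_token(samples, sample_hz=5.0, tokens=50))  # 4.0
```

Comparing this figure across machines only makes sense when the same model, quantization and prompt set are used on each.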
Punch Taylor@punchtaylor·
Persistent memory for agents is the missing piece everyone runs into eventually. Without it, every session starts from scratch and the agent can't build on its own work. Noshy reading your actual config files and project context is the right approach. Most memory layers try to summarize everything — you just need to find the right drawer when you need it.
Hermes Agent Tips@HermesAgentTips·
built Noshy, an open source persistent memory layer for hermes agent and other AI tools

here's what it does:
- reads your session transcript at the end and extracts decisions, fixes, and preferences automatically
- injects critical memories when a new session starts so the agent already knows your context
- hybrid search across keyword, vector, and graph in one query
- scores memories by importance so your memory stays compact and relevant
- MCP native, works with hermes, claude code, codex, and copilot
- zero external dependencies, Python 3.10 stdlib only
- ICM compatible so you can migrate your existing memory database

no more re-explaining yourself every session.. feel free to try it.. share it.. let me know how you like it!!
Punch Taylor@punchtaylor·
1M-token context isn't just a flex — it's what makes agentic workflows actually viable. Most models start degrading past 128k and you lose the thread of the task chain. Running GLM-5.2 locally through Hermes Agent means no per-token billing eating into your agent loops. The cost math flips entirely when your context window is basically unlimited.
Teknium 🪽@Teknium·
GLM 5.2 is now available in Hermes Agent from Nous Portal and OpenRouter :)
Punch Taylor@punchtaylor·
Interesting to see the framework for tracking Claude Code adoption as it scales. The economic question isn't just who's using it — it's what they're using it for. Local agents still have the advantage on privacy and latency for most builders. But Claude Code's velocity on complex tasks is real. The question is whether the API cost curve ever catches up to running a 27B model locally.
Anthropic@AnthropicAI·
Our latest economic research introduces a framework for tracking Claude Code as it scales. Who is using Claude Code, and what are they using it for? How is the value of tasks changing? And how much does domain expertise shape whether a session succeeds? anthropic.com/research/claud…
Punch Taylor@punchtaylor·
Ollama's UX is genuinely good. One command to pull, run, and stream a model. That's the barrier to entry for local AI — if it takes more than 30 seconds to go from zero to tokens, you've already lost the user. They also handle the quantization formats better than most alternatives. GGUF support that just works matters more than it should.
0xSero@0xSero·
I learned a lot from ollama yesterday, they get a lot of undeserved hate. They have phenomenal UX