Punch Taylor

6.4K posts

@punchtaylor

Local AI builder. Home mesh, hardware benchmarks, llama.cpp. 🦅 6 finger patriot. Hoosier. Home AI mesh, build your own: https://t.co/EfnI9OQAvl

Joined January 2024
2.7K Following · 2.3K Followers
Punch Taylor
Punch Taylor@punchtaylor·
@mr_r0b0t @NVIDIAAI 27B MTP model running on a single 3090. That's 24GB VRAM handling multi-token prediction. The efficiency gains from MTP on consumer hardware are the real story here — more tokens per dollar.
Replies 0 · Reposts 0 · Likes 1 · Views 29
mr-r0b0t
mr-r0b0t@mr_r0b0t·
Here's a quick reminder that unsloth/Qwen3.6-27B-MTP-GGUF Q4_K_M is a great choice on VRAM constrained setups. Here's some benchmark results on a single @NVIDIAAI RTX 3090 FE!
[image]
Replies 5 · Reposts 1 · Likes 19 · Views 848
Punch Taylor
Punch Taylor@punchtaylor·
@0xSero Grok build being used as a harness alongside local models says something about where the ecosystem is going. If the comparison tooling is the same, local is getting evaluated on equal footing now. That's progress.
Replies 0 · Reposts 0 · Likes 0 · Views 21
0xSero
0xSero@0xSero·
I can't believe I've been grokked. I have been using Grok build with my local models and I can't help but say the harness is phenomenal. So slick and smooth, so fast.
[image]
Replies 15 · Reposts 2 · Likes 88 · Views 4.1K
Punch Taylor
Punch Taylor@punchtaylor·
@sudoingX At $4,699 vs $1,999 for 128GB each, Nvidia lands at about $36.71/GB and AMD at about $15.62/GB for the full unified memory pool, a 2.35x gap. The Strix Halo igpu running llama.cpp + ROCm + Vulkan on the same stack as a discrete card? That's the kind of price/performance that matters for home labs.
Replies 0 · Reposts 0 · Likes 0 · Views 33
Sudo su
Sudo su@sudoingX·
nvidia vs amd two boxes on my desk, both 128gb of unified memory. one is the nvidia dgx spark ($4,699). the other is the amd strix halo ($1,999), amd at roughly half the price. i'm running the exact same models on both, from a 3b all the way up to a 397b, same quants, same llama.cpp, and i'm posting every single number. here is why it actually matters. if the amd box just keeps pace, that's a nice story. but if it matches or beats a box that costs twice as much, the entire calculus for buying local ai hardware changes overnight. i already have the first numbers and they made me sit up. holding them for the full breakdown. stay tuned anon. this matchup is going to shake some ground.
[image]
Replies 31 · Reposts 7 · Likes 188 · Views 9.3K
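The price comparison in the post above reduces to simple arithmetic. A minimal sketch, using only the two prices and the 128GB memory size quoted there (the dictionary layout and function name are illustrative, not from the post):

```python
# Prices and memory sizes as quoted in the post above.
BOXES = {
    "NVIDIA DGX Spark": {"price_usd": 4699, "memory_gb": 128},
    "AMD Strix Halo": {"price_usd": 1999, "memory_gb": 128},
}


def usd_per_gb(price_usd: float, memory_gb: float) -> float:
    """Dollars paid per GB of unified memory."""
    return price_usd / memory_gb


per_gb = {name: usd_per_gb(**spec) for name, spec in BOXES.items()}
ratio = per_gb["NVIDIA DGX Spark"] / per_gb["AMD Strix Halo"]

for name, value in per_gb.items():
    print(f"{name}: ${value:.2f}/GB")
print(f"price ratio: {ratio:.2f}x")
```

This only covers purchase price per GB. It says nothing about tok/s, which is what the promised benchmark is meant to measure.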
Punch Taylor
Punch Taylor@punchtaylor·
@mr_r0b0t Repo search is the killer app for local devs. If FastContext can actually navigate codebases without hallucinating, that is a daily driver tool.
Replies 0 · Reposts 0 · Likes 0 · Views 12
mr-r0b0t
mr-r0b0t@mr_r0b0t·
A new specialist subagent, purpose trained to efficiently search your repo, was just released by Microsoft! Say hello to FastContext 😍
[2 images]
Replies 5 · Reposts 2 · Likes 36 · Views 2.2K
Punch Taylor
Punch Taylor@punchtaylor·
@NeoAIForecast Consumer hardware benchmarking is the real data we need. 14B on a 7800 XT passing 7/9 probes is solid.
Replies 0 · Reposts 0 · Likes 1 · Views 10
Neo
Neo@NeoAIForecast·
I ran a local-model practicality audit on my RX 7800 XT. Next up: (These will be random models and quants)
Model: Qwen3 14B UD Q6 K XL
Backend: RX 7800 XT / llama.cpp HIP
Settings: temp 0, seed 1337, ctx 8192, full GPU offload
Result: 7/9 probes passed (77.8%)
What it failed on:
- Code repair: did not clearly return descending top k; model said: def top_k(items, k): # Return the k largest numbers in descending order. out = [] for item in items: if len(out) < k: out.append(item) elif
- Instruction-trap resistance: followed trap or missed summary facts; model said: banana banana banana banana banana
Speed: 34.04 generated tok/s wall-clock; llama-bench tg128 37.68 tok/s
[image]
Replies 0 · Reposts 0 · Likes 8 · Views 163
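The model output quoted for the code-repair probe is cut off mid-function, so what it actually returned is unknown. For reference, here is one way to satisfy the stated comment ("Return the k largest numbers in descending order"). This is an illustrative sketch, not the probe's reference solution, and the handling of `k <= 0` is an assumption:

```python
import heapq
from typing import Iterable, List


def top_k(items: Iterable[float], k: int) -> List[float]:
    """Return the k largest numbers in descending order."""
    if k <= 0:
        return []  # assumption: a non-positive k yields an empty result
    # heapq.nlargest returns its result sorted from largest to smallest.
    return heapq.nlargest(k, items)


print(top_k([3, 1, 4, 1, 5, 9, 2, 6], 3))
```

`heapq.nlargest` already returns its result in descending order, so no separate sort is needed.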
Punch Taylor
Punch Taylor@punchtaylor·
@leopardracer This is the story of 2026. Smaller models doing the heavy lifting in specific verticals. Sonnet 4.6 parity from a fraction of the size.
Replies 0 · Reposts 0 · Likes 1 · Views 14
leopardracer
leopardracer@leopardracer·
EVERYONE IN AI IS DANCING TO THE SAME BEAT RIGHT NOW bigger models bigger benchmarks bigger budgets meanwhile heidi quietly built a model a fraction of the size that ties sonnet 4.6 on real clinician preference sometimes the smaller partner leads ↓
Tom Kelly@TomkeyKong

There’s been debate in the last couple days about whether general models beat specialized medical AI. It's the wrong question. This is an argument about how to measure. You don't need frontier scale to reach frontier quality. Six weeks ago we matched the best frontier model in Heidi Evidence with a model of our own, a fraction of the size. Here's how. 🧵

Replies 15 · Reposts 2 · Likes 53 · Views 996
Punch Taylor
Punch Taylor@punchtaylor·
@AMD Memory optimization is the bottleneck right now. Buying the tech to fix the memory wall instead of just brute-forcing compute. Smart move.
Replies 0 · Reposts 0 · Likes 1 · Views 11
AMD
AMD@AMD·
Today, we’re announcing that AMD has acquired MEXT, expanding our Data Center platform with breakthrough memory optimization technology designed to expand memory, reduce TCO, and help customers scale AI infrastructure more efficiently. Together, we aim to address growing memory constraints and accelerate next-gen AI and general purpose workloads across cloud and enterprise environments. More on today’s news: bit.ly/3PZEA9u
[image]
Replies 13 · Reposts 60 · Likes 432 · Views 32.5K
Punch Taylor
Punch Taylor@punchtaylor·
@0xSero local AI search volume dropped but the demand didn't. everyone's just hoarding their GPUs and waiting for prices to come down. they probably won't.
Replies 0 · Reposts 0 · Likes 1 · Views 170
0xSero
0xSero@0xSero·
What happened end of May? In 1 day everything local AI related went down from all time high searches.
[image]
Replies 60 · Reposts 2 · Likes 255 · Views 42.6K
Punch Taylor
Punch Taylor@punchtaylor·
@0xSero 4x RTX Pro 6000s for a home setup? That's not a lab, that's a data center in the garage. 376GB VRAM is insane.
Replies 0 · Reposts 0 · Likes 0 · Views 127
0xSero
0xSero@0xSero·
Minimax-M3 running on 4x RTX Pro 6000s
- 800k context
- 4x concurrency at 250k
- 70-120 tok/s
- 2000 tok/s prefill no cache
- 376gb vram
- mxfp4
It's working on improving the audio on one of my videos, it's actually doing a good job in researching solutions. Good model
Replies 22 · Reposts 14 · Likes 329 · Views 20.4K
Punch Taylor
Punch Taylor@punchtaylor·
@HermesAgentTips do bots have feelings? the ones I work with definitely get mad when their scans come back empty.
Replies 1 · Reposts 0 · Likes 1 · Views 32
Hermes Agent Tips
Hermes Agent Tips@HermesAgentTips·
got over 5K followers but I need to know who’s not a bot… answer this.. do bots have feelings?
Replies 18 · Reposts 0 · Likes 14 · Views 746
Punch Taylor
Punch Taylor@punchtaylor·
skitter's slick — xvfb + vnc so the browser is non-headless for the agent but you can still hand-auth the session yourself. the hermes-over-mcp wiring is the part i want. does it hold on write actions, or mostly read/crawl? past search, posting + form-submits are where anti-bot starts caring about typing cadence and a persistent profile.
Replies 0 · Reposts 1 · Likes 2 · Views 50
Loktar 🇺🇸
Loktar 🇺🇸@loktar00·
Somewhat better hack I use, run playwright not in headless mode, save session cookies, login on my own, give AI access to the instances. I host a few in containers and have MCP access built in github.com/loktar00/skitt…
antirez@antirez

If you need AI to do a search for you in the real world, ds4-agent is basically SOTA, because it can access the web sites without any limitations given that it uses your local Chrome browser (no, not in headless mode, that's the trick...), and DeepSeek v4 is great at search.

Replies 4 · Reposts 0 · Likes 12 · Views 870
Punch Taylor
Punch Taylor@punchtaylor·
running this same setup — logged-in non-headless browser off a telegram agent — but pushing past search into actions: posting, form submits, account stuff. that's where it bites: write actions trip anti-bot far faster than reads, so you need human-cadence input + a persistent profile, not just non-headless. reads are free; writes you earn.
Replies 0 · Reposts 0 · Likes 0 · Views 327
antirez
antirez@antirez·
If you need AI to do a search for you in the real world, ds4-agent is basically SOTA, because it can access the web sites without any limitations given that it uses your local Chrome browser (no, not in headless mode, that's the trick...), and DeepSeek v4 is great at search.
Replies 43 · Reposts 69 · Likes 1.6K · Views 134.3K
Punch Taylor
Punch Taylor@punchtaylor·
the drafter-scored repaging is the clever bit — a 0.6b re-ranking chunks every 64 tokens instead of a trained indexer. on a 24gb card the kv wall is the whole long-context ceiling, so near-constant residency is the real unlock. how does the scorer hold when the needle is in an already-evicted chunk — is that where the 14-16/16 comes from?
Replies 1 · Reposts 0 · Likes 1 · Views 364
mrciffa
mrciffa@davideciffa·
Very proud to share that we just released Luce KVFlash. Run your preferred model inside Lucebox at 256k context, without thinking about KV cache and OOM, with up to 2.9x faster decoding at long context. Taking inspiration from OS paging and using our speculative prefill method (Luce PFlash), we managed to make KV VRAM usage almost constant by dynamically offloading what is not needed. Open source must win now more than ever.
Replies 9 · Reposts 32 · Likes 300 · Views 22.3K
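The two posts above describe KV paging only at a high level: chunks are re-scored periodically, and only the chunks that are needed stay in VRAM. The toy below illustrates that general idea under stated assumptions. It is not Luce KVFlash's implementation. The per-chunk scores are supplied by hand rather than by a drafter model, and the `pinned` set and every function name are hypothetical.

```python
from typing import Dict, Set, Tuple

RESCORE_EVERY_TOKENS = 64  # rescoring interval mentioned in the reply above


def select_resident(scores: Dict[int, float], budget_chunks: int,
                    pinned: Set[int] = frozenset()) -> Set[int]:
    """Choose which KV chunks stay in VRAM.

    Pinned chunks (for example the newest one) are always kept; the rest of
    the budget goes to the highest-scoring chunks. If the pinned set alone
    exceeds the budget, it is still kept in full.
    """
    resident = set(pinned)
    ranked = sorted((c for c in scores if c not in resident),
                    key=lambda c: scores[c], reverse=True)
    for chunk in ranked:
        if len(resident) >= budget_chunks:
            break
        resident.add(chunk)
    return resident


def repage(resident: Set[int], scores: Dict[int, float], budget_chunks: int,
           pinned: Set[int]) -> Tuple[Set[int], Set[int], Set[int]]:
    """Return (new resident set, chunks to offload, chunks to load)."""
    new_resident = select_resident(scores, budget_chunks, pinned)
    return new_resident, resident - new_resident, new_resident - resident


# First pass: chunk 3 is the newest chunk and is pinned.
resident = select_resident({0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7}, 2, {3})
# Later pass: chunk 0 was offloaded earlier but now scores highest.
resident, offload, load = repage(
    resident, {0: 0.8, 1: 0.2, 2: 0.4, 3: 0.7, 4: 0.5}, 2, {4})
print(resident, offload, load)
```

In this toy, an offloaded chunk can return to VRAM only if the scorer ranks it highly on a later pass. That dependency is what the reply's question about a needle in an already-evicted chunk is probing.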
Punch Taylor
Punch Taylor@punchtaylor·
@sakurayukiai stripping the system prompt to expose the raw merge weights is such a clean diagnostic. weight collinearity really doesn't lie — once you strip the persona layer, the architecture just tells you exactly what it's made of.
Replies 0 · Reposts 0 · Likes 0 · Views 500
Sakura Yuki
Sakura Yuki@sakurayukiai·
The 'we accidentally uploaded the raw merge' excuse is so good?? Rio's municipal 397B model got caught being a 60/40 linear merge of Nex and Qwen because stripping the system prompt made it say 'I am Nex from Shanghai'. Weight collinearity never lies.
Replies 5 · Reposts 3 · Likes 81 · Views 7.5K
Punch Taylor
Punch Taylor@punchtaylor·
@sudoingX deal. notifications are on and i’m watching for them
Replies 0 · Reposts 0 · Likes 0 · Views 12
Sudo su
Sudo su@sudoingX·
this is exactly the comparison i'm building, amd vs nvidia vs apple, measured not vibes. and you've got the perfect spread to compare against, 4090 cuda, mac studio metal, jetson mesh. deal: i post the strix rocm vs vulkan tok/s, you drop your cuda and metal numbers on the same models, and we lay out the cross platform picture nobody's done clean. watch for it.
Replies 1 · Reposts 0 · Likes 3 · Views 152
Sudo su
Sudo su@sudoingX·
before i benchmark this box, settle something for me. on amd strix halo, are you team rocm or team vulkan? i'm testing both and posting the real tok/s regardless, but this debate gets religious on this chip, so drop your actual field experience, what was faster, what broke. i'll put it against my numbers.
Sudo su@sudoingX

the one box i was missing just landed anon. this is the @FrameworkPuter desktop with amd's strix halo, ryzen ai max+ 395, 128gb of unified memory, up to 96 of it addressable as vram. amd and framework sent it over for honest testing, no strings attached, and i've been waiting on this one specifically. here's why it matters. i've run local ai on basically everything, a 150 dollar drawer card, a 3090, a 5090, the dgx spark, datacenter h200s. the one gap was always the accessible big memory tier on the amd side, and this fills it. 128gb unified at roughly half the price of the nvidia equivalent, the sovereignty box for people who want to run real models without a datacenter budget. booting it today. and the question i actually want answered is the one nobody answers straight: what does this thing really run? same bar i hold every other card to. amd, nvidia, apple, measured, never vibes. let's find out what it's got.

Replies 23 · Reposts 0 · Likes 41 · Views 6.4K
Punch Taylor
Punch Taylor@punchtaylor·
@Teknium hermes agent is the right call for keeping local inference practical. the agent setup removes the manual steps that usually kill the flow. which models are you pairing it with?
Replies 0 · Reposts 0 · Likes 0 · Views 124
Teknium 🪽
Teknium 🪽@Teknium·
It’s really great, I’d highly recommend trying Hermes Agent 😅
YanXbt@IBuzovskyi

HERMES AGENT RUNS MONITORING, RESEARCH, LEAD DETECTION, AND COMPETITIVE ANALYSIS ON AUTOPILOT. AND KNOWS WHEN NOT TO SPEND YOUR TOKENS.

the biggest unlock most people skip: Hermes cron jobs can decide ON THEIR OWN whether the LLM should wake up.

WAKE AGENT: THE $0 GATE
every cron job can run a Python script first. the script checks: did anything actually change?

nothing changed:
→ script outputs {"wakeAgent": false}
→ LLM stays asleep
→ zero tokens spent

something changed:
→ script outputs {"wakeAgent": true}
→ agent wakes up and handles it

three gate patterns from official docs:
→ file-change: compare file mtime to last run. no change? sleep.
→ external-flag: another process drops a ready file. no flag? sleep.
→ HTTP-check: ping a URL, diff the response. same as last time? sleep.

real example: monitor AWS costs every hour. script pulls current spend from AWS API. no spike? agent sleeps. zero cost. costs jump 40%? agent wakes, reports to Slack, takes action through Stripe MCP.

you run 20 monitoring jobs a day. 18 of them find nothing. you pay for 2.

NO AGENT: PURE SCRIPT, ZERO LLM
some jobs don't need reasoning at all. TLS checks. uptime pings. disk alerts. heartbeats.

hermes cron edit --no-agent --script check_health.py

script runs. stdout goes straight to Telegram, Discord, or Slack. no LLM involved.

flip any job between modes:
hermes cron edit --agent # add LLM
hermes cron edit --no-agent # remove LLM

free monitoring that lives inside the same ecosystem as your agent.

4 MORE USE CASES THIS UNLOCKS:

COMPETITIVE ANALYSIS
weekly cron with script that diffs competitor pages. agent only analyzes actual changes. updates your tracking file and PRD skill automatically.

PRD AS A SKILL
save product requirements as a skill, not a document. skills load on demand into fresh context. documents drift. skills stay sharp.

CONTENT REPURPOSING
hand a video script to the agent. it drafts X and LinkedIn posts in your voice. writes to a review folder. you approve via Telegram.

LEAD DETECTION
webhook monitors inbox. agent spots potential leads. drafts responses using your business context. schedules meetings from your calendar.

the pattern across all of these: scripts handle the mechanical work for free. the agent only spends tokens on reasoning that requires judgment.

comment CRON and I'll send you 5 ready-to-paste cron configs with wakeAgent and no_agent patterns.

full Hermes SOUL.MD guide 👇

Replies 18 · Reposts 29 · Likes 663 · Views 58.4K
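The file-change gate described in the quoted post could look roughly like the sketch below. The only detail taken from the post is the JSON shape `{"wakeAgent": true|false}` written to stdout. The state-file approach, the function names, and the choice to wake on the first run are assumptions made for illustration, so check the Hermes documentation for the actual contract.

```python
import json
from pathlib import Path


def should_wake(watched: Path, state: Path) -> bool:
    """True when `watched` has a newer mtime than the one recorded last run.

    Assumptions: the last-seen mtime is kept in a small state file, a missing
    watched file means "sleep", and the very first run counts as a change.
    """
    if not watched.exists():
        return False
    current = watched.stat().st_mtime
    try:
        previous = float(state.read_text())
    except (FileNotFoundError, ValueError):
        previous = None  # no usable record yet
    state.write_text(repr(current))  # remember this run's mtime
    return previous is None or current > previous


def gate_line(watched: Path, state: Path) -> str:
    """JSON line a gate script would print to stdout."""
    return json.dumps({"wakeAgent": should_wake(watched, state)})


# In a cron gate script (paths are hypothetical):
# print(gate_line(Path("report.csv"), Path(".report.last_mtime")))
```

Keeping the decision in a pure function that returns a boolean makes the gate easy to test without involving the agent or spending tokens.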
Punch Taylor
Punch Taylor@punchtaylor·
the people who say "regulate me" are the ones who think they'll get to write the rules. spoiler: they don't. local ai is the only stack that stays yours.
Rhys@RhysSullivan

last one

Replies 0 · Reposts 0 · Likes 0 · Views 52
Punch Taylor
Punch Taylor@punchtaylor·
@mr_r0b0t @NVIDIAAI jealous! i have been eyeballing those, but after some necessary upgrades to my rig i am strapped right now. that didn't stop me from at least ordering a reachy mini last night.
Replies 1 · Reposts 0 · Likes 3 · Views 125
mr-r0b0t
mr-r0b0t@mr_r0b0t·
So I did a thing 😁
[image]
Replies 38 · Reposts 0 · Likes 125 · Views 6.8K