Thrawling

144 posts

Thrawling

@thrawling

가입일 Mayıs 2018

89 팔로잉9 팔로워

Thrawling@thrawling·1d

@TeksEdge Really, with which model? I haven’t tried it much.

English

David Hendrickson@TeksEdge·1d

@thrawling Also enabling speculative decoding

English

David Hendrickson@TeksEdge·1d

💡 It's still amazing to me that you can run an unsloth version of Qwen3.5-27B on a $2K AMD Ryzen Max+ 395 w/64GB of unified memory @ 10 tps at home. Nearly the same quality as Claude Opus 4 (May, 2025 release)

English

25.5K

Thrawling@thrawling·1d

@TeksEdge If you’re running Windows on it, swap to Ubuntu, instant 12 t/s improvement for me on same llama.cpp config 👍🏻

English

David Hendrickson@TeksEdge·1d

@thrawling Was running 35B when posted. Besides the post wasn’t meant to be a performance controversy. Still in awe of LLMs. Rumors are BigAI will jack up prices by end of year and it got me thinking about local inferencing.

English

645

Thrawling@thrawling·1d

@TeksEdge I would strongly recommend you use Qwen3.5-35B-A3B on this hardware. Runs at 57 tok/s and is good.

English

Thrawling@thrawling·1d

@TeksEdge That’s more realistic than your 40 tok/s claim David! I benchmarked at 12 tok/s on same chip, llama.cpp Vulkan in Ubuntu. Q8 at 6 tok/s. SEAVIV with AMD Ryzen AI MAX+ 395 128GB.

English

Thrawling@thrawling·1d

@alexellisuk Qwen3.5-35B-A3B is likely the best value you can get out of that.

English

Alex Ellis@alexellisuk·1d

I realised that some of my issues with the 3090s may have been from running 1x and not 2x PCIe cables from the 1200W PSU What agentic tasks would you run with 48GB of VRAM?

English

1.8K

Thrawling@thrawling·4d

@dankvr @joshiljainn @NousResearch Running Qwen3.5-35B-A3B with Hermes on Strix Halo. 122B Q4 works but the loss in t/s not worth the increase in quality. 35B >50t/s 122B 27t/s. How’s yours holding up?

English

jin@dankvr·4d

@joshiljainn @NousResearch Using a strix halo mini pc (gmktec evo-x2 128gb): strixhalo.wiki/Hardware it's like a more affordable DGX spark / mac studio x.com/dankvr/status/…

jin@dankvr

Was waiting for hardware to drop in price to run bigger LLMs locally. The Strix Halo mini PC market is the best bang for your buck atm: 128gb RAM systems a fraction of the cost of DIY / Nvidia's Spark and Mac Studios 47 tps for gpt-oss 120b, not bad! Low power too (140w)

English

2.3K

jin@dankvr·4d

Super impressed by Hermes + qwen3.5 for local AI assistant. It performs way better than openclaw I feel no need to run a cloud model on main. If I need it, can have hermes call codex or Claude code. Good harness by @NousResearch

English

419

25.1K

Thrawling@thrawling·12 Mar

@subtlebytes @TheAhmadOsman Qwen 3.5 9B Q4 with 128K context, KV cache Q8. 100 tok/s.

English

Mogis@subtlebytes·11 Mar

@thrawling @TheAhmadOsman How are you maxing it out?

English

Ahmad@TheAhmadOsman·11 Mar

I was assuming it’s common knowledge at this point not to use Windows for local LLMs But just in case: DO NOT USE Windows for local LLMs

English

159

987

96.1K

Thrawling@thrawling·11 Mar

@subtlebytes @TheAhmadOsman As someone who has 5070Ti; I wish I had more VRAM for LLMs. For ComfyUI it’s sufficient and I run it a lot.

English

Mogis@subtlebytes·11 Mar

@TheAhmadOsman Totally unrelated but what are your thoughts about an rtx 5070ti 16gb for local llm, I just want to get my hands dirty, is this a good starting point or would you suggest a 3090 with 24gb vram

English

1.6K

Thrawling@thrawling·10 Mar

@sudoingX It was so effortless to get Hermes running with local model (llama.cpp server Qwen3.5 35B on AMD AI 395) too, compared to the claw-variants and coding harnesses i’ve tried Hermes is a clear winner right now. Like you mention the transparency is what i’ve been missing.

English

130

Sudo su@sudoingX·9 Mar

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware: 35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090 27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090 qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain 80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim 80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran with same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. @Teknium and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.

Sudo su@sudoingX

okay the fuss around hermes agent is not just air. this thing has substance. installed it on a single RTX 3090 running Qwen 3.5 27B base (Q4_K_M, 262K context, 29-35 tok/s). fully local. my machine my data. first thing i did was tell it to discover itself. find its own model weights, check its own GPU, read its own server flags, and write its own identity document. it did all of it autonomously. nvidia-smi, process grep, file writes. clean execution. the TUI is genuinely premium. dark theme, ASCII art, color coded tool calls with execution times, real time streaming. you actually enjoy watching it work. 29 tools. 80 skills (that's what it reports on boot). file ops, terminal, browser automation, code execution, cron scheduling, subagent delegation. and it has persistent memory across sessions. setup took 5 minutes. one curl install, setup wizard, point to localhost:8080/v1, done. dropping qwopus for this test btw. distilled models compress reasoning and lose precision on real coding tasks. base model only from here. more experiments coming. octopus invaders (the same game that broke qwopus) will be built using hermes agent next. comparing flow and results against claude code on the same model. if you want to run local AI agents on real hardware this one deserves a serious look.

English

680

98.2K

Thrawling@thrawling·8 Mar

@posi_posi8 github.com/aria2/aria2

QME

posi_posi@posi_posi8·7 Mar

並列ダウンロードという方法があるのか。（回線速度が律速なら意味ないと思うけど）

日本語

175

Thrawling@thrawling·7 Mar

@sudoingX Since the 2.0 update Cline CLI has been a nice harness. I compared output from it, OpenCode, Aider and Pi. same model, same prompts and Cline won.

English

637

Sudo su@sudoingX·7 Mar

downloading Qwen3.5-27B Claude 4.6 Opus Reasoning Distilled(Qwopus) right now. Q4_K_M quant on a single RTX 3090. same hardware i've been testing every model on this month. someone took the base model i've been daily driving and distilled Claude Opus 4.6 reasoning chains into it. same 27B parameters, same architecture, but fine tuned on how Claude thinks through problems. the base model already built 1,827 lines of working code in 13 minutes with zero steering. curious what distilled reasoning adds. switching harness too. the base ran on OpenCode. this one runs through Claude Code. claude distilled model through claude's own coding agent. want to see if the reasoning patterns carry differently when the harness matches the distillation source. will post speed sweep first to get the numbers. then checking if the jinja template bug that silently kills thinking mode carries over from the base model. then octopus invaders. same prompt that base qwen passed in 13 minutes and hermes 4.3 failed on 2x the hardware. 4 models. 1 GPU. 1 prompt. results soon.

Sudo su@sudoingX

been daily driving qwen 3.5 27B dense. haven't even finished testing it properly and now claude opus reasoning gets distilled into the same base. things are dropping faster than i can benchmark. might pull this and test with claude code and opencode. first thing to check: does the jinja template bug carry over? the one that silently kills thinking mode when you use agent tools. if your server logs show thinking = 0, your model isn't reasoning and the server won't tell you. claude level reasoning on a single 3090. locally. we'll see.

English

798

128.3K

Thrawling@thrawling·6 Mar

@elroyic @sudoingX With the Q4? What are your settings?

English

Elroy Tracey@elroyic·6 Mar

@sudoingX Running with ROCm on Halo Strix 128Gb Qwen3.5 35B at full context get 35tok/s uses 60Gb Ram/VRAM.

English

326

Sudo su@sudoingX·6 Mar

if you have a single RTX 3090 and want the best local inference setup right now, here's what i landed on after testing 5 open source models across 7 GPU configs this month. GPU: 1x RTX 3090 24GB model: Qwen 3.5 27B Dense Q4_K_M (16.7GB) context: 262K (native max) speed: 35 tok/s generation, flat from 4K to 300K+ reasoning: built in chain of thought, survives Q4 quant config: llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 what this gives you: - 27B params all active every token - no speed degradation as context fills - full reasoning mode on a consumer GPU - 7GB VRAM headroom after model load tested MoE (faster but less depth per token) and dense hermes (same speed, degraded under load). qwen dense hit the sweet spot for single GPU. more architecture comparisons dropping soon. what's your single GPU setup? curious what configs people are running.

English

691

44.9K

Thrawling@thrawling·6 Mar

@sudoingX @TroyJeppesen There are few cases where ROCm wins out over Vulkan, it only marginally wins out in larger models from my experience in use and in running tests in llama-bench.

English

Sudo su@sudoingX·6 Mar

@TroyJeppesen vulkan beating rocm on AMD's own hardware. that's a data point worth more than any market research report.

English

450

Thrawling@thrawling·3 Mar

@petersteele @trapsticles Just wait for the Qwen3.5 0.8B is Opus-level intelligence posts 😂

English

Peter Steele@petersteele·3 Mar

Oh I bet it does. These are great models and distillations. I just wish people were honest about them instead of making them out to be something they are not. Wonder how many people are jaded after reading all the hype, downloading the models and finding out they are not Opus or Sonnet replacements?

English

Peter Steele@petersteele·3 Mar

Look, Qwen 3.5 35b A3B and the mini 9b is neat and all, but you idgets need to calm it wayyyy down on the outright lies and exaggerations on its capabilities. It is not sonnet level, it is not opus level, its not GPT level. It is not anywhere near Frontier model levels, locally. Its a good local model and a step in the right direction for sure, but many of you are on here outright lying about what it can really do. Stop it.

English

103

564

47.4K

Thrawling@thrawling·3 Mar

@petersteele @trapsticles I’ve been using Q6 of the 122B and it does much better, as expected.

English

Peter Steele@petersteele·3 Mar

@trapsticles Yup. I tested them out pretty decently on some real world samples I have here, and they failed miserably. They got some stuff right, but overall, completely unusable for any real work that isnt hyper focused and targeted.

English

509

Thrawling@thrawling·28 Şub

@ph1lanthrop @sudoingX @grok It’s local AI. Ran at home. Generating a simple game at good speeds and accuracy.

English

Ph1lanthrop@ph1lanthrop·28 Şub

@sudoingX @grok what does this mean?

English

744

Sudo su@sudoingX·28 Şub

testing Qwen3.5-35B-A3B latest optimized version by UnslothAI on a single RTX 3090. one detailed prompt. zero handholding. watch a 3B model scaffold an entire multifile game project autonomously. the setup: > model: Qwen3.5-35B-A3B (80B total, only 3B active per token) > quant: UD-Q4_K_XL by Unsloth (MXFP4 layers removed in latest update) > speed: 112 tok/s generation, ~130 tok/s prefill > context: 262K tokens > flags: -ngl 99 -c 262144 -np 1 --cache-type-k q8_0 --cache-type-v q8_0 > engine: llama.cpp > agent: Claude Code talk to localhost:8080 (llama.cpp now has native Anthropic API endpoint. no LiteLLM needed) q8_0 KV cache cuts VRAM usage in half vs f16 at 262K. -np 1 is default but worth noting. parallel slots multiply KV cache and at 262K that's an instant OOM. the prompt was more detailed than this but you get the idea: build a space shooter with parallax backgrounds, particle systems, procedural audio, 4 enemy types, boss fights, power-up system, and ship upgrades. 8 JavaScript modules. no libraries. game's called Octopus Invaders. gameplay footage dropping next.

Sudo su@sudoingX

the model i benchmarked day ago just got an upgrade. glad i slept on it. the baseline was already wild. 112 tok/s on a single 3090. 2.4x faster than Coder-Next on half the GPUs. 461 lines of working particle sim on first prompt. that was the OLD version. now they fixed toolcalling, improved coding output, and removed MXFP4 layers from 3 quants. downloading the update right now. space shooter build is first. then more experiments. full rebenchmark incoming. let's see what this thing does when it's actually optimized for the stuff i've been testing it on.

English

582

166K

Thrawling@thrawling·28 Şub

@goldenelephant1 @sudoingX Termius is really nice mobile SSH client. On your computer enable SSH, install tmux, and go.

English

Soon to be The HAND of GOD@goldenelephant1·27 Şub

@sudoingX Gotta video link to "how too"? What type of phone we talking about?

English

Sudo su@sudoingX·27 Şub

claude code on my phone through tmux. full terminal, git, agent loop, codebase access. pushed a commit from the toilet this morning. reviewed a PR from a tuktuk. the desk is dead. the laptop is optional. if you have signal, you have a dev environment. the acceleration is insane i am soo cooked lets fucking goo!

English

2.6K

Thrawling@thrawling·28 Şub

@sudoingX If you are doing actual work, and can fit it, give Qwen3.5 122B a spin. I fit the Q5 Unsloth and it’s been nice.

English

102

Sudo su@sudoingX·27 Şub

5 days ago it took 2 GPUs to build this. today it takes 1. same prompt. same particle simulation. completely different model. Qwen-Coder-Next (80B) on 2x 3090s. 46 tok/s. 564 lines. 2 iterations to get it working. 48GB VRAM across two cards just to hold it. Qwen3.5-35B-A3B on a single 3090. 112 tok/s. 461 lines. first try. cleaner code, fewer lines, better structured. 19.7GB on disk with 4GB VRAM to spare. half the parameters. one GPU instead of two. 2.4x faster. and the output actually improved. this is what happens when architecture catches up to ambition. Gated Delta Networks(Mamba2 variant) hybrid with sparse MoE. 3B active params out of 35B per token. efficiency at the architecture level, not just quantization. the curve isn't flattening. it's steepening.

Sudo su@sudoingX

told the local qwen coder to build an interactive particle simulation. single prompt. it wrote 564 lines. physics engine, mouse gravity, collision detection, connection mesh. working first try. then told it to iterate. trails. click explosions. gravity wells. bloom effects. all autonomous. reading its own code, understanding the architecture, extending it. this is Q4_K_M. 4bit quantized. 80B params compressed to 45GB. running the full claude code agent loop. file reads, bash commands, server management, multiturn iteration. on two consumer GPUs. the quantization quality at this scale genuinely caught me off guard. not just coherent. actually building real software autonomously. the gap between local and API models is narrowing faster than anyone's pricing in. open source is eating the moat from the bottom.

English

446

32.4K

Thrawling@thrawling·27 Şub

@tobitege @sudoingX Same, not time to tried his LM Studio settings yet, here’s the post of getting 59 tok/s. Maybe settings help x.com/kuittinenpetri…

Petri Kuittinen@KuittinenPetri

@sudoingX I get ~59 token/s with Beelink GTR9 Pro AMD Ryzen™ AI Max+ 395 and 128 GB LPDDR5X. I use Vulkan llama.cpp v. 2.4.0. This is probably the most cost efficient setup to run a lot of models with ease & large context window (up to 96 GB VRAM with 32+96 RAM setting).

English

109

tobitege@tobitege·27 Şub

@sudoingX on my Ryzen AI 395+ MAX 128GB even the llama.cpp did not get me above ~50 tk/s. tried LM Studio (rocM, Vulkan), Lemonade Server. Vulkan works much better than rocM for me. Codex tried many things, did self-made benches even.

English

852

Sudo su@sudoingX·27 Şub

AMD is pulling 20-30 tok/s on the same model NVIDIA hits 112-157 tok/s on. same Qwen3.5-35B-A3B. same 4-bit quant. 19.7GB on disk. fits entirely on any 24GB card. but most AMD submissions so far are Vulkan or ROCm with default configs. nobody has gone deep on tuning yet. NVIDIA's numbers climbed 2-3x once people started optimizing. 6800XT (16GB): 20-30 tok/s Ryzen AI Max+ 395 (96GB unified): 59 tok/s 3090 (24GB): 112 tok/s (was 50 before flags) 4090 (24GB): 157 tok/s (was ~80 stock) haven't tested ROCm myself yet. definitely on the list. AMD users. try llama.cpp from source with ROCm. try the cache flags. the gap might be real or it might be a config gap. only one way to find out. want to see those numbers climb.

Sudo su@sudoingX

the numbers coming in from this thread: 5090: 166 tok/s (z33b0t), 153 tok/s (EmmanuelMr) 4090: 122 tok/s (StubbyTech) 3090: 112 tok/s (sudo), 100 tok/s (Eduardo) 6800XT: 20-30 tok/s (Dark) Qwen3.5-35B-A3B. 4-bit quant, 19.7 GB on disk. fits entirely on any single 3090 24GB card with room to spare. no offloading, no splitting, full speed. 5090 owners keep pushing the ceiling and we haven't found it yet. NVIDIA side is stacking up. where are the ROCm numbers? report your GPU and tok/s below. building the full map.

English

181

33.4K

Thrawling@thrawling·27 Şub

@t_r_e_n_c_h_e_s @BLUECOW009 Nice. What’s t/s like? Eager to build out more GPU compute in home office.

English

trenches@t_r_e_n_c_h_e_s·27 Şub

@thrawling @BLUECOW009 qwen 3.5 397B today but mainly for reasoning on OCR most days I run kimi 2.5 or glm 5 (both much bigger than 400B) for pure code work yeah exactly, i use q4 normally (ud_q4_k_xl from unsloth), its amazing

English

106

@bluecow 🐮@BLUECOW009·26 Şub

Running models locally is pretty useful but reality is that most people, even developers dont have much more than 1 gpu and ~32gb of ram. The best local models in the open need >90gb vram to run, that is not a realistic expectation for general usecase

English

347

15.9K

탐색

@TeksEdge @alexellisuk @dankvr @joshiljainn @NousResearch @subtlebytes @TheAhmadOsman @sudoingX