AboveSpec

287 posts

@above_spec

Love 3D printing, playing with local LLMs, and learning Claude Code

Ontario, Canada · Joined December 2017
174 Following · 1K Followers
AboveSpec
AboveSpec@above_spec·
@witcheer Yeah, I was posting benchmarks and someone advised me to try it. Using Pi is the right direction too, I tested it myself a few days ago and it's so lightweight and responsive, it makes local models feel good.
English
1
0
0
7
witcheer ☯︎
witcheer ☯︎@witcheer·
follow-up on my local agent stack (RTX 4060 Ti 8 GB, Qwen3.6-35B-A3B)

found the hidden VRAM killer: context checkpoints. llama.cpp creates a checkpoint every 8192 tokens during prefill (~63 MiB each). at 26K context that's 5 checkpoints = 315 MiB silently eating GPU memory. on 8 GB VRAM this pushes past the 7 GB cliff.

>with default checkpoints, 26K context → 14.6 tok/s
>with --checkpoint-every-n-tokens -1 → 30.3 tok/s

full sweep without checkpoints:
- 3.5K → 34.2 tok/s
- 20K → 31.5 tok/s
- 26K → 30.3 tok/s
- 35K → 29.6 tok/s
- 50K → 27.7 tok/s

smooth degradation, no cliff anywhere. the ~2 tok/s drop per 10K tokens is from attention scaling on Qwen3.6's 10/40 attention layers.

I also switched from Hermes to Pi as the agent harness. ~1.9K token system prompt vs Hermes's ~13.5K. in this case, Pi is more suitable. Hermes is still my main agent on my mac mini tho.

final config:
```
turboquant fork, -ncmoe 30 -c 65536 -np 1 --cache-type-k q8_0 --cache-type-v turbo3 --no-cache-prompt --checkpoint-every-n-tokens -1
```
witcheer ☯︎ tweet media
witcheer ☯︎@witcheer

update on my local agent stack (RTX 4060 Ti 8 GB, Qwen3.6-35B-A3B Q4_K_M)

my initial problem was that 64K context on standard llama.cpp killed speed. V cache q4_0 pushed graph splits from 62 → 82, and Hermes decode dropped from 31 → 9-11 tok/s. unusable for real agent work.

some people in the comments recommended trying the turboquant fork. turbo2/turbo3 KV cache types keep 62 graph splits at 64K context. auto-asymmetric: K stays q8_0, only V gets compressed.

turbo3 wins. same speed as the 32K config but double the context window. usable context in Hermes jumps from ~18.5K to ~50.5K.

new daily-driver config:
-ngl 999 -ncmoe 30 -c 65536 -np 1 -fa on --cache-type-k q8_0 --cache-type-v turbo3

8 GB VRAM is not dead. you need the right fork.

English
10
3
62
3.9K
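For anyone reconstructing this setup: a minimal sketch of the full llama-server call implied by the two configs above. The fork-specific pieces (the turboquant fork, the turbo3 V-cache type, --no-cache-prompt, --checkpoint-every-n-tokens) are taken from the posts as written and are not verified against mainline llama.cpp; the model path and port are placeholder assumptions.

```
# Sketch only: assembling the flags quoted above into one llama-server call.
# Assumes the turboquant fork of llama.cpp accepts these flags as written;
# model path and port are placeholders, not from the original posts.
./llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 999 -ncmoe 30 \
  -c 65536 -np 1 -fa on \
  --cache-type-k q8_0 --cache-type-v turbo3 \
  --no-cache-prompt \
  --checkpoint-every-n-tokens -1 \
  --port 8080
```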
Mir Mujtaba
Mir Mujtaba@notmirmujtaba·
@above_spec can this run on an RTX 3050 (4GB VRAM) and a Ryzen 7 (16GB RAM)?? please tell me!
English
1
0
0
4
AboveSpec
AboveSpec@above_spec·
Quick update on the 35B / 8GB setup. Switched to IQ4_K_R4, a higher quality quant, without losing much speed: getting ~49 tok/s through the model's full native 262k context. And VRAM usage is low enough to keep a browser with multiple tabs open the whole time. 🧵
AboveSpec tweet media
AboveSpec@above_spec

Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵

English
12
14
140
17.1K
都乃健🇯🇵文明航海士©|とのけん3
In other words: drive DeepSeek-V4-Flash-IQ2XXS on an RTXPro6K and keep the KV cache on SSD, and a path has opened up to running a ClaudeCode with roughly Sonnet4.6-class capability locally, at 1M context length and at a decent speed. Bandwidth-spec-wise I think it should do around 40 token/s. No, probably more.
都乃健🇯🇵文明航海士©|とのけん3@Tono_Ken3

DS4 = DwarfStar4. According to Opencode's DS-4VF, single-GPU inference with the IQ2 model should be feasible on an RTXPro6K, so I compiled it right away. Now downloading the model. A great development. github.com/antirez/ds4

Japanese
2
8
80
16.6K
witcheer ☯︎
witcheer ☯︎@witcheer·
I ran Hermes agent (v0.13.0) with qwen3.6-35B-A3B on my RTX 4060 Ti 8GB for the first time today. full local agent stack. my question was: can a local 3B-active MoE model actually drive an agent harness end-to-end?

quickly, my setup:
>WSL2 Ubuntu 26.04 → CUDA 13.2 → llama.cpp (b9049) → llama-server → Hermes Agent
>model: qwen3.6-35B-A3B-UD-Q4_K_M
>config: -ngl 999 -ncmoe 30 -c 32768 --cache-type-k q8_0 --cache-type-v q8_0
>baseline decode: 35.36 tok/s (from prior -ncmoe sweep)

I tested 4 rounds, easy to hard:
1. single tool call (list files) - pass, 31.4 tok/s
2. 5 chained tool calls (mkdir → venv → pip → write script → run) - pass, self-corrected a path error
3. read 10 files from windows via /mnt/c/ - pass when scoped, fail when hermes read full files
4. write a 95-line python CLI with argparse, then run it - pass, genuinely usable code

my biggest issue: the context. hermes system prompt eats ~13.5K tokens. out of 32K, that leaves ~18.5K usable. a multi-step task fills that in 3-4 exchanges. when I pushed it, hermes tried to compress via the same qwen model → slot contention → timeout → retry storm → ctrl+c. and also, hermes has a 64K minimum context gate - needs a config override to run with 32K.

my conclusion: hermes + qwen3.6-35B-A3B is a capable local agent for short automated tasks, code gen, file ops, cron jobs. 4-5 tool calls per session, but not viable for long multi-turn sessions. context fills too fast, compression self-destructs, VRAM cliff halves speed before you hit the wall.

----

I am curious if anyone's running hermes agent with a local model on similar hardware (8-12 GB VRAM). what model are you pairing it with? how do you handle the context ceiling? I am especially interested in setups that solve the compression-model problem (separate lightweight model for context compression).
witcheer ☯︎@witcheer

now testing real results with Hermes on WSL2

English
37
11
177
18.5K
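A quick back-of-envelope check of the context budget described in the post above. The context size, system-prompt size, and the 3-4-exchange figure come from the post; the per-exchange token cost is simply derived from them here.

```
# Context-budget arithmetic from the post (awk used only as a calculator).
awk 'BEGIN {
  ctx  = 32768       # -c 32768
  sys  = 13500       # Hermes system prompt (~13.5K tokens)
  left = ctx - sys   # the post rounds this to ~18.5K usable
  printf "usable after system prompt: ~%d tokens\n", left
  printf "at 3-4 exchanges per task, each exchange burns ~%d-%d tokens\n", left / 4, left / 3
}'
```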
AboveSpec
AboveSpec@above_spec·
@FStrongpaw So nice! Which ProArt motherboard is that? I have a B650 ProArt as my daily driver right now, and it looks like I could do something similar.
English
1
0
0
75
Fatherfox Strongpaw
Fatherfox Strongpaw@FStrongpaw·
holy crap! the symmetry..the symmetry.. what a difference matched GPUs make! AAAHHH!!!! now i have to redownload all the models i deleted and retest everything i've rejected for the last 4 months! oh shit... what if all my ai's actually... *gasp* work? 😱 i'm screwed 😅
Fatherfox Strongpaw tweet media
English
1
0
1
113
AboveSpec
AboveSpec@above_spec·
@aliez_ren They need to make a workstation version with fans.
English
0
0
1
711
Sudo su
Sudo su@sudoingX·
it's so easy to get started in local ai actually. the only real wall is vram math.

practical heuristic for a single gpu:
> 24gb = 27B Q4_K_M at 262k context (qwen 3.6, carnice-v2)
> 16gb = 13B Q5_K_M at 32k or 9B Q8_0 at 64k
> 12gb = 8B Q5_K_M at 16k
> 8gb = 4B Q4_K_M at 8k

quantization rule of thumb: Q4_K_M ≈ 0.6 gb per billion params. kv cache scales with context. add 1 gb activation buffer. that's the math.

every other piece (llama.cpp build, hermes agent setup, prompt config) is one good day of setup. the math is the only ongoing constraint. once you can eyeball this for your gpu, you can pick any model + context combo with confidence. stop being intimidated by the stack.
English
32
44
576
27K
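The rule of thumb in this post is easy to turn into a quick estimator. The 0.6 GB per billion params (Q4_K_M) and the 1 GB activation buffer are from the post; the KV-cache coefficient below is an assumed placeholder, since the post only says the cache "scales with context" and the real figure depends on the model architecture and cache quant type.

```
# Rough single-GPU VRAM estimate from the rule of thumb above.
# ASSUMPTION: kv_gb_per_32k is a placeholder, not a number from the post.
params_b=13        # model size in billions of parameters
ctx=32768          # target context length in tokens
kv_gb_per_32k=2.0  # assumed GB of KV cache per 32K tokens of context
awk -v p="$params_b" -v c="$ctx" -v kv32="$kv_gb_per_32k" 'BEGIN {
  weights = p * 0.6             # Q4_K_M ~= 0.6 GB per billion params
  kv      = (c / 32768) * kv32  # KV cache scales linearly with context
  total   = weights + kv + 1.0  # plus ~1 GB activation buffer
  printf "~%.1f GB VRAM (%.1f weights + %.1f kv + 1.0 buffer)\n", total, weights, kv
}'
```

For the long-context rows in the table above, the coefficient has to come down (heavier cache quantization or models with fewer KV heads), which is why it is left as an explicit knob here rather than a constant.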
Tomás Crucial
Tomás Crucial@Crucialhunter·
@above_spec I'm on a 2080, and being able to squeeze it a little longer sounds great, as right now upgrading doesn't seem worth it vs cloud models
English
1
0
2
447
AboveSpec
AboveSpec@above_spec·
@witcheer -ncmoe 32 can even get you up to 55 t/s.
English
0
0
0
63
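A tip like this is easiest to verify with a sweep rather than a single run. A hypothetical sketch below, reading decode speed from llama-cli's end-of-run performance summary; the model path, prompt, and sweep range are assumptions, and -ncmoe / -fa on are used exactly as they appear elsewhere in this thread.

```
# Hypothetical -ncmoe sweep to find the decode-speed sweet spot on 8 GB VRAM.
# Model path, prompt, and the value range are placeholders.
for n in 26 28 30 32 34; do
  echo "=== -ncmoe $n ==="
  ./llama-cli -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    -ngl 999 -ncmoe "$n" -fa on \
    -p "Write a haiku about VRAM." -n 128 --no-display-prompt 2>&1 |
    grep -i "eval time"   # tok/s is reported in the perf summary lines
done
```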
witcheer ☯︎
witcheer ☯︎@witcheer·
ran tests with Qwen3.6-35B-A3B-UD-Q4_K_M as main local model for Hermes:
>Hardware: RTX 4060 Ti 8GB VRAM, Ryzen 5 7600X, 32GB DDR5-6000
>Model: unsloth/Qwen3.6-35B-A3B-UD-Q4_K_M (22.1 GB on disk, 20.60 GiB GGUF)
>Runtime: llama.cpp llama-server (build b9049-2496f9c14, ggml 0.11.0)
>OS: WSL2 Ubuntu 26.04 on Windows 11
>CUDA: 13.2.1, compute capability 8.9

for Hermes daily use, the optimal config is:
```
-ngl 999 -ncmoe 30 -fa on --cache-type-k q8_0 --cache-type-v q8_0 -c 32768 -t 6
```
witcheer ☯︎ tweet media
witcheer ☯︎@witcheer

study @Teknium:
>me asking him the best way to host Hermes on windows
>him explaining that WSL2 is the preferred way right now
>him sending previous NousResearch documentation about the setup
>him deciding that it is too sparse and reworking the documentation
>1 hour later him coming back to me with a very comprehensive tutorial on how to run Hermes on WSL2

Hermes agent is #1 and there is no second best. for those who are interested in the documentation: hermes-agent.nousresearch.com/docs/user-guid…

English
13
9
98
10.5K
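For reference, a sketch of that "optimal config" as one complete llama-server command on mainline llama.cpp. All flags are quoted from the post; the model path, host, and port are placeholders. llama-server defaults to a single parallel slot, which is what the -np 1 flag elsewhere in this thread makes explicit.

```
# Sketch: the Hermes daily-use config above as a single llama-server command.
# Model path, host, and port are placeholders; flags are as quoted in the post.
./llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 999 -ncmoe 30 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768 -t 6 \
  --host 127.0.0.1 --port 8080
```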
AboveSpec
AboveSpec@above_spec·
@Frudoheili Yes, need to test qwopus. Not enough hours in a day!
English
0
0
0
329
AboveSpec
AboveSpec@above_spec·
@mindinpanic Tough to get good performance with 4GB, but you can try your best. Use ik_llama.cpp, as it's the best for CPU offload.
English
0
0
1
391
Volodymyr Pavlenko
Volodymyr Pavlenko@mindinpanic·
@above_spec sir, I'm poormaxxing. can I run something similar on an AMD Radeon Pro 5300M (4GB VRAM) with an Intel Core i9 and 32GB RAM?
English
1
0
1
451
AboveSpec
AboveSpec@above_spec·
Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵
AboveSpec tweet media
English
24
54
482
44.2K
AboveSpec
AboveSpec@above_spec·
@chrisdrit Ryzen 9 7900X, B650M ProArt Creator, 96GB DDR5-5600
Czech
0
0
1
44
Chris
Chris@chrisdrit·
@above_spec that's amazing, what are the rest of the specs on your rig? CPU / memory, etc...
English
1
0
1
32
AboveSpec
AboveSpec@above_spec·
RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used!🤯 Special thanks to @MakJoris for sharing ik_llama.cpp with us! Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵
AboveSpec tweet media
English
36
48
525
29.7K
.🫟
.🫟@ab_jpeg·
@above_spec i’m assuming tool calling quality is fine at this quant?
English
1
0
1
148
Imm0rta1
Imm0rta1@DobrinGeorgie10·
@above_spec I'm trying to find the exact setup: llama.cpp repo, settings, etc. Is there any link I can read?
English
1
0
1
89
AboveSpec
AboveSpec@above_spec·
"You need a 24 GB GPU for serious local LLMs in 2026." Everyone repeats this. It's not true anymore. Just ran a 35B-parameter model on an RTX 4060 Ti 8 GB: • 41 tok/s at 16k context • 24 tok/s at 200k context Recipe + benchmarks below 🧵
AboveSpec tweet media
English
135
230
2.8K
279.2K
AboveSpec
AboveSpec@above_spec·
@doktor_DeFi You should get much faster speeds than me, especially if you have DDR5
English
0
0
1
34
Doktor Funk
Doktor Funk@doktor_DeFi·
Thanks for sharing this stuff. I'm really curious to try it on my 4060 Ti 16GB. Only 32GB RAM, and you know I have a zillion tabs open. Speed looks great, but practical assessment and use, how is that holding up? Any trade-offs, hallucinations, loops, sub-par results? Are you testing agentic stuff or tool use? Very interested to know.
English
2
0
1
601