AboveSpec

287 posts

@above_spec

Love 3D printing, playing with local LLMs, and learning Claude Code

Ontario, Canada · Joined December 2017
174 Following · 1K Followers
AboveSpec
AboveSpec@above_spec·
@witcheer Yeah, I was posting benchmarks and someone advised me to try it. Using Pi is the right direction too, I tested it myself a few days ago and it's so lightweight and responsive, it makes local models feel good.
English
1
0
0
7
witcheer ☯︎
witcheer ☯︎@witcheer·
follow-up on my local agent stack (RTX 4060 Ti 8 GB, Qwen3.6-35B-A3B)

found the hidden VRAM killer: context checkpoints. llama.cpp creates a checkpoint every 8192 tokens during prefill (~63 MiB each). at 26K context that's 5 checkpoints = 315 MiB silently eating GPU memory. on 8 GB VRAM this pushes past the 7 GB cliff.

>with default checkpoints, 26K context → 14.6 tok/s
>with --checkpoint-every-n-tokens -1 → 30.3 tok/s

full sweep without checkpoints:
- 3.5K → 34.2 tok/s
- 20K → 31.5 tok/s
- 26K → 30.3 tok/s
- 35K → 29.6 tok/s
- 50K → 27.7 tok/s

smooth degradation, no cliff anywhere. the ~2 tok/s drop per 10K tokens is from attention scaling on Qwen3.6's 10/40 attention layers.

I also switched from Hermes to Pi as the agent harness. ~1.9K token system prompt vs Hermes's ~13.5K. in this case, Pi is more suitable. Hermes is still my main agent on my mac mini tho.

final config:
```
turboquant fork, -ncmoe 30 -c 65536 -np 1 --cache-type-k q8_0 --cache-type-v turbo3 --no-cache-prompt --checkpoint-every-n-tokens -1
```
witcheer ☯︎ tweet media
witcheer ☯︎@witcheer

update on my local agent stack (RTX 4060 Ti 8 GB, Qwen3.6-35B-A3B Q4_K_M)

my initial problem was that 64K context on standard llama.cpp killed speed. V cache q4_0 pushed graph splits from 62 → 82, and Hermes decode dropped from 31 → 9-11 tok/s. unusable for real agent work.

some people in the comments recommended trying the turboquant fork. turbo2/turbo3 KV cache types keep 62 graph splits at 64K context. auto-asymmetric: K stays q8_0, only V gets compressed.

turbo3 wins. same speed as the 32K config but double the context window. usable context in Hermes jumps from ~18.5K to ~50.5K.

new daily-driver config:
-ngl 999 -ncmoe 30 -c 65536 -np 1 -fa on --cache-type-k q8_0 --cache-type-v turbo3

8 GB VRAM is not dead. you need the right fork.

English
10
3
62
3.9K
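For anyone reconstructing this setup: a minimal sketch of the full llama-server call implied by the two configs above. The fork-specific pieces (the turboquant fork, the turbo3 V-cache type, --no-cache-prompt, --checkpoint-every-n-tokens) are taken from the posts as written and are not verified against mainline llama.cpp; the model path and port are placeholder assumptions.

```
# Sketch only: assembling the flags quoted above into one llama-server call.
# Assumes the turboquant fork of llama.cpp accepts these flags as written;
# model path and port are placeholders, not from the original posts.
./llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 999 -ncmoe 30 \
  -c 65536 -np 1 -fa on \
  --cache-type-k q8_0 --cache-type-v turbo3 \
  --no-cache-prompt \
  --checkpoint-every-n-tokens -1 \
  --port 8080
```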
Mir Mujtaba
Mir Mujtaba@notmirmujtaba·
@above_spec can this run on an RTX 3050 (4GB VRAM) and a Ryzen 7 (16GB RAM)?? please tell me!
English
1
0
0
4
AboveSpec
AboveSpec@above_spec·
Quick update on the 35B / 8GB setup. Switched to IQ4_K_R4, a higher quality quant, without losing much speed: getting ~49 tok/s through the model's full native 262k context. And VRAM usage is low enough to keep a browser with multiple tabs open the whole time. 🧵
AboveSpec tweet media
AboveSpec@above_spec

Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵

English
12
14
140
17.1K
都乃健🇯🇵文明航海士©|とのけん3
In other words: drive DeepSeek-V4-Flash-IQ2XXS on an RTXPro6K and keep the KV cache on SSD, and a path has opened up to running a ClaudeCode with roughly Sonnet4.6-class capability locally, at 1M context length and at a decent speed. Bandwidth-spec-wise I think it should do around 40 token/s. No, probably more.
都乃健🇯🇵文明航海士©|とのけん3@Tono_Ken3

DS4 = DwarfStar4. According to Opencode's DS-4VF, single-GPU inference with the IQ2 model should be feasible on an RTXPro6K, so I compiled it right away. Now downloading the model. A great development. github.com/antirez/ds4

Japanese
2
8
80
16.6K
witcheer ☯︎
witcheer ☯︎@witcheer·
I ran Hermes agent (v0.13.0) with qwen3.6-35B-A3B on my RTX 4060 Ti 8GB for the first time today. full local agent stack. my question was: can a local 3B-active MoE model actually drive an agent harness end-to-end?

quickly, my setup:
>WSL2 Ubuntu 26.04 → CUDA 13.2 → llama.cpp (b9049) → llama-server → Hermes Agent
>model: qwen3.6-35B-A3B-UD-Q4_K_M
>config: -ngl 999 -ncmoe 30 -c 32768 --cache-type-k q8_0 --cache-type-v q8_0
>baseline decode: 35.36 tok/s (from prior -ncmoe sweep)

I tested 4 rounds, easy to hard:
1. single tool call (list files) - pass, 31.4 tok/s
2. 5 chained tool calls (mkdir → venv → pip → write script → run) - pass, self-corrected a path error
3. read 10 files from windows via /mnt/c/ - pass when scoped, fail when hermes read full files
4. write a 95-line python CLI with argparse, then run it - pass, genuinely usable code

my biggest issue: the context. hermes system prompt eats ~13.5K tokens. out of 32K, that leaves ~18.5K usable. a multi-step task fills that in 3-4 exchanges. when I pushed it, hermes tried to compress via the same qwen model → slot contention → timeout → retry storm → ctrl+c. and also, hermes has a 64K minimum context gate - needs a config override to run with 32K.

my conclusion: hermes + qwen3.6-35B-A3B is a capable local agent for short automated tasks, code gen, file ops, cron jobs. 4-5 tool calls per session, but not viable for long multi-turn sessions. context fills too fast, compression self-destructs, VRAM cliff halves speed before you hit the wall.

----

I am curious if anyone's running hermes agent with a local model on similar hardware (8-12 GB VRAM). what model are you pairing it with? how do you handle the context ceiling? I am especially interested in setups that solve the compression-model problem (separate lightweight model for context compression).
witcheer ☯︎@witcheer

now testing real results with Hermes on WSL2

English
37
11
177
18.5K
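A quick back-of-envelope check of the context budget described in the post above. The context size, system-prompt size, and the 3-4-exchange figure come from the post; the per-exchange token cost is simply derived from them here.

```
# Context-budget arithmetic from the post (awk used only as a calculator).
awk 'BEGIN {
  ctx  = 32768       # -c 32768
  sys  = 13500       # Hermes system prompt (~13.5K tokens)
  left = ctx - sys   # the post rounds this to ~18.5K usable
  printf "usable after system prompt: ~%d tokens\n", left
  printf "at 3-4 exchanges per task, each exchange burns ~%d-%d tokens\n", left / 4, left / 3
}'
```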
AboveSpec
AboveSpec@above_spec·
@FStrongpaw So nice! Which ProArt motherboard is that? I have a B650 ProArt as my daily driver right now, and it looks like I could do something similar.
English
1
0
0
75
Fatherfox Strongpaw
Fatherfox Strongpaw@FStrongpaw·
holy crap! the symmetry..the symmetry.. what a difference matched GPUs make! AAAHHH!!!! now i have to redownload all the models i deleted and retest everything i've rejected for the last 4 months! oh shit... what if all my ai's actually... *gasp* work? 😱 i'm screwed 😅
Fatherfox Strongpaw tweet media
English
1
0
1
113
AboveSpec
AboveSpec@above_spec·
@aliez_ren They need to make a workstation version with fans.
English
0
0
1
711
Sudo su
Sudo su@sudoingX·
it's so easy to get started in local ai actually. the only real wall is vram math.

practical heuristic for a single gpu:
> 24gb = 27B Q4_K_M at 262k context (qwen 3.6, carnice-v2)
> 16gb = 13B Q5_K_M at 32k or 9B Q8_0 at 64k
> 12gb = 8B Q5_K_M at 16k
> 8gb = 4B Q4_K_M at 8k

quantization rule of thumb: Q4_K_M ≈ 0.6 gb per billion params. kv cache scales with context. add 1 gb activation buffer. that's the math.

every other piece (llama.cpp build, hermes agent setup, prompt config) is one good day of setup. the math is the only ongoing constraint. once you can eyeball this for your gpu, you can pick any model + context combo with confidence. stop being intimidated by the stack.
English
32
44
576
27K
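The rule of thumb in this post is easy to turn into a quick estimator. The 0.6 GB per billion params (Q4_K_M) and the 1 GB activation buffer are from the post; the KV-cache coefficient below is an assumed placeholder, since the post only says the cache "scales with context" and the real figure depends on the model architecture and cache quant type.

```
# Rough single-GPU VRAM estimate from the rule of thumb above.
# ASSUMPTION: kv_gb_per_32k is a placeholder, not a number from the post.
params_b=13        # model size in billions of parameters
ctx=32768          # target context length in tokens
kv_gb_per_32k=2.0  # assumed GB of KV cache per 32K tokens of context
awk -v p="$params_b" -v c="$ctx" -v kv32="$kv_gb_per_32k" 'BEGIN {
  weights = p * 0.6             # Q4_K_M ~= 0.6 GB per billion params
  kv      = (c / 32768) * kv32  # KV cache scales linearly with context
  total   = weights + kv + 1.0  # plus ~1 GB activation buffer
  printf "~%.1f GB VRAM (%.1f weights + %.1f kv + 1.0 buffer)\n", total, weights, kv
}'
```

For the long-context rows in the table above, the coefficient has to come down (heavier cache quantization or models with fewer KV heads), which is why it is left as an explicit knob here rather than a constant.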
Tomás Crucial
Tomás Crucial@Crucialhunter·
@above_spec I'm on a 2080, and being able to squeeze it a little longer sounds great, as right now upgrading doesn't seem worth it vs cloud models
English
1
0
2
447
AboveSpec
AboveSpec@above_spec·
@witcheer -ncmoe 32 can even get you up to 55 t/s.
English
0
0
0
63
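A tip like this is easiest to verify with a sweep rather than a single run. A hypothetical sketch below, reading decode speed from llama-cli's end-of-run performance summary; the model path, prompt, and sweep range are assumptions, and -ncmoe / -fa on are used exactly as they appear elsewhere in this thread.

```
# Hypothetical -ncmoe sweep to find the decode-speed sweet spot on 8 GB VRAM.
# Model path, prompt, and the value range are placeholders.
for n in 26 28 30 32 34; do
  echo "=== -ncmoe $n ==="
  ./llama-cli -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    -ngl 999 -ncmoe "$n" -fa on \
    -p "Write a haiku about VRAM." -n 128 --no-display-prompt 2>&1 |
    grep -i "eval time"   # tok/s is reported in the perf summary lines
done
```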
witcheer ☯︎
witcheer ☯︎@witcheer·
ran tests with Qwen3.6-35B-A3B-UD-Q4_K_M as main local model for Hermes:
>Hardware: RTX 4060 Ti 8GB VRAM, Ryzen 5 7600X, 32GB DDR5-6000
>Model: unsloth/Qwen3.6-35B-A3B-UD-Q4_K_M (22.1 GB on disk, 20.60 GiB GGUF)
>Runtime: llama.cpp llama-server (build b9049-2496f9c14, ggml 0.11.0)
>OS: WSL2 Ubuntu 26.04 on Windows 11
>CUDA: 13.2.1, compute capability 8.9

for Hermes daily use, the optimal config is:
```
-ngl 999 -ncmoe 30 -fa on --cache-type-k q8_0 --cache-type-v q8_0 -c 32768 -t 6
```
witcheer ☯︎ tweet media
witcheer ☯︎@witcheer

study @Teknium:
>me asking him the best way to host Hermes on windows
>him explaining that WSL2 is the preferred way right now
>him sending previous NousResearch documentation about the setup
>him deciding that it is too sparse and reworking the documentation
>1 hour later him coming back to me with a very comprehensive tutorial on how to run Hermes on WSL2

Hermes agent is #1 and there is no second best. for those who are interested in the documentation: hermes-agent.nousresearch.com/docs/user-guid…

English
13
9
98
10.5K
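For reference, a sketch of that "optimal config" as one complete llama-server command on mainline llama.cpp. All flags are quoted from the post; the model path, host, and port are placeholders. llama-server defaults to a single parallel slot, which is what the -np 1 flag elsewhere in this thread makes explicit.

```
# Sketch: the Hermes daily-use config above as a single llama-server command.
# Model path, host, and port are placeholders; flags are as quoted in the post.
./llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 999 -ncmoe 30 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768 -t 6 \
  --host 127.0.0.1 --port 8080
```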
AboveSpec
AboveSpec@above_spec·
@Frudoheili Yes, need to test qwopus. Not enough hours in a day!
English
0
0
0
329
AboveSpec
AboveSpec@above_spec·
@mindinpanic Tough to get good performance with 4GB, but you can try your best. Use ik_llama.cpp, as it's the best for CPU offload.
English
0
0
1
391
Volodymyr Pavlenko
Volodymyr Pavlenko@mindinpanic·
@above_spec sir, I'm poormaxxing. can I run something similar on an AMD Radeon Pro 5300M (4GB VRAM) with an Intel Core i9 and 32GB RAM?
English
1
0
1
451
AboveSpec
AboveSpec@above_spec·
Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵
AboveSpec tweet media
English
24
54
482
44.2K
AboveSpec
AboveSpec@above_spec·
@chrisdrit Ryzen 9 7900X, B650M ProArt Creator, 96GB DDR5-5600
Czech
0
0
1
44
Chris
Chris@chrisdrit·
@above_spec that's amazing, what are the rest of the specs on your rig? CPU / memory, etc...
English
1
0
1
32
AboveSpec
AboveSpec@above_spec·
RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used!🤯 Special thanks to @MakJoris for sharing ik_llama.cpp with us! Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵
AboveSpec tweet media
English
36
48
525
29.7K
.🫟
.🫟@ab_jpeg·
@above_spec i’m assuming tool calling quality is fine at this quant?
English
1
0
1
148
Imm0rta1
Imm0rta1@DobrinGeorgie10·
@above_spec I'm trying to find the exact setup: llama.cpp repo, settings, etc. Is there any link I can read?
English
1
0
1
89
AboveSpec
AboveSpec@above_spec·
"You need a 24 GB GPU for serious local LLMs in 2026." Everyone repeats this. It's not true anymore. Just ran a 35B-parameter model on an RTX 4060 Ti 8 GB: • 41 tok/s at 16k context • 24 tok/s at 200k context Recipe + benchmarks below 🧵
AboveSpec tweet media
English
135
230
2.8K
279.2K
AboveSpec
AboveSpec@above_spec·
@doktor_DeFi You should get much faster speeds than me, especially if you have DDR5
English
0
0
1
34
Doktor Funk
Doktor Funk@doktor_DeFi·
Thanks for sharing this stuff. I'm really curious to try it on my 4060 Ti 16GB. Only 32GB RAM, and you know I have a zillion tabs open. Speed looks great, but practical assessment and use, how is that holding up? Any trade-offs, hallucinations, loops, sub-par results? Are you testing agentic stuff or tool use? Very interested to know.
English
2
0
1
601