AboveSpec

232 posts


@above_spec

Love 3D printing, playing with local LLMs and learning Claude Code

Ontario, Canada · Joined December 2017
162 Following · 761 Followers
Pinned Tweet
AboveSpec
AboveSpec@above_spec·
RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used!🤯 Special thanks to @MakJoris for sharing ik_llama.cpp with us! Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵
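(For context on how a coding agent can talk to that local model: llama-server, set up later in this thread, exposes an OpenAI-compatible API, so most agents only need the base URL pointed at it. A minimal sketch; the environment variable names are illustrative and differ per agent.)

```bash
# Sketch: point an OpenAI-compatible coding agent at the local llama-server
# started later in this thread (port 8080 is the value used in the setup tweet).
# Variable names are illustrative; check your agent's docs for the exact ones.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="none"   # llama-server doesn't require a real key
# then launch whichever agent/CLI you use against that endpoint
```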
AboveSpec
AboveSpec@above_spec·
@MakJoris Oh, sorry, I had no followers like 4 days ago. People just started following me after I made some GPU local LLM posts.
AboveSpec
AboveSpec@above_spec·
@HealthRanger How many tokens per second do you get on those machines, Mike? Isn't 27B very slow when running on just a CPU? Although I did see someone on X mention they get 24 tps on a 9950X with no GPU.
HealthRanger
HealthRanger@HealthRanger·
If you want to run local inference with Qwen 3.6-27b or other excellent medium-sized models without buying huge, bulky, expensive workstations and NVIDIA GPUs, I've found that the GMKtec EVO-X2 Mini PC (based on AMD Ryzen with 128GB of unified RAM) is very, very good. It's small, quiet and uses very little electricity.

It runs LM Studio, Ollama or other inference software, and it's fast enough with Qwen models to make it practical and usable. I've had one running for about 30 days now, non-stop, with zero issues, running inference 24/7. It has enough RAM to run even 120 billion parameter models. In my mini data center, I have this replacing bulkier, more power-hungry workstations.

Only downside? It doesn't handle the common image generation models, nor video generation. But for text-based inference, it's solid, and it works with all the common text models like Qwen.

Expect to pay around $3300 for this unit right now. That price will probably rise soon due to RAM shortages, resulting from the over-investment bubble into AI data centers.
Vitalii Khomenko
Vitalii Khomenko@VitaliiKhomenk1·
@above_spec Yeah, something big is happening—a complete miss on the initial hardware components… )))))))
AboveSpec
AboveSpec@above_spec·
Something big is coming!
AboveSpec
AboveSpec@above_spec·
@lollipop_stat Incredible! What CPU? So far I have only seen 3090s and higher do speculative decoding. My limited testing on 16GB GPUs hasn't unlocked any gains. Maybe it would work on smaller 9B models, e.g. a 0.8B draft + 9B target.
Lollipop
Lollipop@lollipop_stat·
@above_spec Running at 200 prefill/sec and 24 inference tokens/sec on CPU only, 64 GB DDR5 RAM, same config as you but Q4. I cannot find a way to apply speculative decoding: it is slower when using Qwen 3.x 0.8/1.7 as the draft, since the vocabulary is different, only a 67% hit rate. Do you have a spec decode lead?
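(For anyone trying the same thing, here's a hedged sketch of what a draft-model run looks like on mainline llama.cpp's llama-server. Flag names are from recent mainline builds and may differ in ik_llama.cpp; the draft filename is hypothetical. As discussed above, a vocab mismatch between draft and target drags the accept rate down, so a same-family draft is the first thing to fix.)

```bash
# Sketch of a draft-model run with mainline llama.cpp's llama-server.
# Flags below exist in recent mainline builds; ik_llama.cpp may differ.
# The draft .gguf name is hypothetical; use a draft that shares the target's vocab.
llama-server \
  -m Qwen3.6-27B-IQ3_K_R4.gguf \
  -md Qwen3.6-0.8B-Q4_0.gguf \
  --draft-max 8 --draft-min 1 \
  -ngl 99 -fa 1 -c 32768 --port 8080
```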
AboveSpec
AboveSpec@above_spec·
Hardware: RTX 50-series with 16GB VRAM, Ubuntu 24.04

**Step 1 — Install nvoc**
```bash
git clone github.com/martinstark/nv…
cd nvoc && cargo build --release
sudo cp target/release/nvoc /usr/local/bin/
```

**Step 2 — Build ik_llama.cpp**
```bash
git clone github.com/ikawrakow/ik_l…
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```

**Step 3 — Apply OC + benchmark**
```bash
sudo nvoc -m 5000
llama-bench \
  -m Qwen3.6-27B-IQ3_K_R4.gguf \
  -ngl 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 \
  -p 0 -n 128 -r 3
```

**Step 4 — Run as a server**
```bash
llama-server \
  -m Qwen3.6-27B-IQ3_K_R4.gguf \
  -ngl 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 \
  -c 131072 --temp 0.6 --jinja --port 8080
```

Model on HuggingFace: huggingface.co/abovespec/Qwen…
Full results + replication guide: github.com/abovespec/loca…
Driver version tested: 580.126.20.
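(Once the server is up, a quick smoke test: llama-server exposes an OpenAI-compatible chat API, so a plain curl call should answer.)

```bash
# Smoke test against the local llama-server's OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
        "max_tokens": 64
      }'
```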
AboveSpec
AboveSpec@above_spec·
Quick sanity check: does ik_llama.cpp actually matter for dense 27B, or is mainline just as good? Ran Unsloth's UD-IQ3_XXS (3.06 bpw, 11.2 GiB) on mainline llama.cpp across the same context depths:

Mainline average: **~28.6 t/s**
ik_llama.cpp IQ3_K_R4 average: **~28.3 t/s**

Essentially identical. Flat in both cases, 0 → 139k.

**ik_llama.cpp's advantage is specific to MoE models + R4 quants.** For the 35B MoE we hit 128 t/s — that's because of `--n-cpu-moe` freeing VRAM and the IQ3_K_R4 custom CUDA kernels. On a dense model with standard quants, both engines are running the same CUDA code and the results match. If you're running a dense model, mainline llama.cpp is fine.
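(For reference, the MoE launch is the same shape as Step 4 above plus `--n-cpu-moe`. A sketch only: the 35B filename and the layer count here are illustrative, not the exact values from the run; tune `--n-cpu-moe` until the model fits in 16GB of VRAM.)

```bash
# Rough shape of a 35B MoE launch: keep attention and shared weights on the GPU,
# push some expert layers into CPU RAM with --n-cpu-moe so everything else fits.
llama-server \
  -m Qwen3.6-35B-IQ3_K_R4.gguf \
  -ngl 99 --n-cpu-moe 8 \
  -fa 1 -ctk q4_0 -ctv q4_0 \
  -c 131072 --port 8080
```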
AboveSpec
AboveSpec@above_spec·
RTX 5060 Ti 16GB. Free +19% token speed with one command. Benchmarked Qwen3.6-27B IQ3_K_R4 at every memory OC level from stock to +5000 MHz — flat, stable, all the way to 139k context. Still running your GPU on stock settings? 🧵
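(One way to reproduce the sweep, reusing the nvoc and llama-bench commands from the setup tweet. The offset steps are illustrative, and how your card behaves at each level will vary.)

```bash
# Step the memory offset and rerun the same llama-bench invocation at each level.
for off in 0 1000 2000 3000 4000 5000; do
  sudo nvoc -m "$off"
  echo "=== mem offset +${off} MHz ==="
  llama-bench -m Qwen3.6-27B-IQ3_K_R4.gguf \
    -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 \
    -p 0 -n 128 -r 3
done
sudo nvoc -m 0   # back to stock when done
```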
AboveSpec
AboveSpec@above_spec·
@georgecursor $429 MSRP in the US. Prices have been rising though. Around $500 is the lowest right now.
George Saoulidis
George Saoulidis@georgecursor·
Oh hey I'm running Qwen 3.5 on my 5060 Ti! I kinda need the workstation for other things so I don't need to squeeze everything out of it. But I might try this. Also my card costs 700 euro. Don't know where you got that 400 euro price.
AboveSpec
AboveSpec@above_spec·
I am running the ASUS ProArt B650-Creator. Don't think it does much to improve the numbers. It supports only PCIe 4.0; the 5000 series can go up to PCIe 5.0, but the 5060 Ti is limited to x8 anyway (not x16 like the 5070 and better GPUs). The 5060 Ti 16GB isn't that much better than the 4060 Ti 16GB, I wouldn't upgrade. The B70 should be interesting.
keithofaptos
keithofaptos@keithofaptos·
That motherboard is helping those numbers too, I'd imagine. Which is great. Thanks for sharing. This local AI and these models are getting more awesome by the month. Loving it. I'm setting up a couple of systems myself: a 4060 Ti 16GB, a few other much older cards, and a B70 I'm about to figure out too. But I'm working towards what you're working on. So I'll be listening.
AboveSpec
AboveSpec@above_spec·
@ProofOfPrints I see! I usually use Onshape and saw people on X use Claude with Onshape. Have you tried giving it more prompts to make it more sophisticated? Fit 120mm fans everywhere, lol.
Proof of Prints
Proof of Prints@ProofOfPrints·
@above_spec I actually tried using Claude to create a case in FreeCAD and it came out pretty basic, with no unity between the parts. I may use it as a template if all the measurements were correct. Let me know if you have any luck.
Proof of Prints
Proof of Prints@ProofOfPrints·
And so it begins. I need to print a case…
AJ
AJ@ItsmeAjayKV·
Heterogeneous speculative decoding, aka the Dovetail method (ingenious naming, btw). arXiv:2412.18934. How this differs from standard speculative decoding: unlike SD, where both the target and the draft model live on the GPU, here the draft is the only one on the GPU, while the target lives on the CPU doing validation. i.e. GPU drafts, CPU verifies. The interesting part about this paper, which gives me and you massive hope, is that it runs the models (in the paper, older models) on very cheap and affordable consumer hardware.
Sakura Yuki@sakurayukiai

We've been doing speculative decoding completely backwards. The Dovetail paper keeps the draft model on the GPU for fast generation, and dumps the massive target model on your CPU where cheap system RAM is totally fine for a single parallel verification pass.
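(The placement idea can be mimicked with stock llama.cpp flags, though that is not the paper's actual implementation, and the filenames below are illustrative. Small draft fully on the GPU, big target verifying from CPU and system RAM.)

```bash
# Dovetail-style placement sketch using mainline llama.cpp's llama-server:
# draft entirely on the GPU (-ngld 99), target entirely on CPU/system RAM (-ngl 0).
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf -ngl 0 \
  -md Qwen3.6-1.7B-Q4_0.gguf -ngld 99 \
  --draft-max 16 \
  -c 16384 --port 8080
```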

AboveSpec
AboveSpec@above_spec·
@keithofaptos It may be more now, I am from Canada and got it for $579 CAD last year, which is around $400 USD. Looking at your link, looks like the last time they were $429 was in December... There is a refurbished PNY for $499 on newegg USA: newegg.com/pny-technologi…