AboveSpec

232 posts


@above_spec

Love 3D printing, playing with local LLMs and learning Claude Code

Ontario, Canada · Joined December 2017
162 Following · 761 Followers
Pinned Tweet
AboveSpec
AboveSpec@above_spec·
RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used!🤯 Special thanks to @MakJoris for sharing ik_llama.cpp with us! Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵
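(For context on how a coding agent can talk to that local model: llama-server, set up later in this thread, exposes an OpenAI-compatible API, so most agents only need the base URL pointed at it. A minimal sketch; the environment variable names are illustrative and differ per agent.)

```bash
# Sketch: point an OpenAI-compatible coding agent at the local llama-server
# started later in this thread (port 8080 is the value used in the setup tweet).
# Variable names are illustrative; check your agent's docs for the exact ones.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="none"   # llama-server doesn't require a real key
# then launch whichever agent/CLI you use against that endpoint
```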
AboveSpec
AboveSpec@above_spec·
@MakJoris Oh, sorry, I had no followers like 4 days ago. People just started following me after I made some GPU local LLM posts.
AboveSpec
AboveSpec@above_spec·
@HealthRanger How many tokens per second do you get on those machines, Mike? Isn't 27B very slow when running on just a CPU? Although I did see someone on X mention they get 24 tps on a 9950X with no GPU.
HealthRanger
HealthRanger@HealthRanger·
If you want to run local inference with Qwen 3.6-27b or other excellent medium-sized models without buying huge, bulky, expensive workstations and NVIDIA GPUs, I've found that the GMKtec EVO-X2 Mini PC (based on AMD Ryzen with 128GB of unified RAM) is very, very good. It's small, quiet and uses very little electricity.

It runs LM Studio, Ollama or other inference software, and it's fast enough with Qwen models to make it practical and usable. I've had one running for about 30 days now, non-stop, with zero issues, running inference 24/7. It has enough RAM to run even 120 billion parameter models. In my mini data center, I have this replacing bulkier, more power-hungry workstations.

Only downside? It doesn't handle the common image generation models, nor video generation. But for text-based inference, it's solid, and it works with all the common text models like Qwen.

Expect to pay around $3300 for this unit right now. That price will probably rise soon due to RAM shortages, resulting from the over-investment bubble into AI data centers.
Vitalii Khomenko
Vitalii Khomenko@VitaliiKhomenk1·
@above_spec Yeah, something big is happening—a complete miss on the initial hardware components… )))))))
AboveSpec
AboveSpec@above_spec·
Something big is coming!
AboveSpec
AboveSpec@above_spec·
@lollipop_stat Incredible! What CPU? So far I have only seen 3090s and higher do speculative decoding. My limited testing on 16GB GPUs hasn't unlocked any gains. Maybe it would work on smaller 9B models, e.g. a 0.8B draft + 9B target.
Lollipop
Lollipop@lollipop_stat·
@above_spec Running at 200 prefill/sec and 24 inference tokens/sec on CPU only, 64 GB DDR5 RAM, same config as you but Q4. I cannot find a way to apply speculative decoding: it is slower when using Qwen 3.x 0.8/1.7 as the draft, since the vocabulary is different, only a 67% hit rate. Do you have a spec decode lead?
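(For anyone trying the same thing, here's a hedged sketch of what a draft-model run looks like on mainline llama.cpp's llama-server. Flag names are from recent mainline builds and may differ in ik_llama.cpp; the draft filename is hypothetical. As discussed above, a vocab mismatch between draft and target drags the accept rate down, so a same-family draft is the first thing to fix.)

```bash
# Sketch of a draft-model run with mainline llama.cpp's llama-server.
# Flags below exist in recent mainline builds; ik_llama.cpp may differ.
# The draft .gguf name is hypothetical; use a draft that shares the target's vocab.
llama-server \
  -m Qwen3.6-27B-IQ3_K_R4.gguf \
  -md Qwen3.6-0.8B-Q4_0.gguf \
  --draft-max 8 --draft-min 1 \
  -ngl 99 -fa 1 -c 32768 --port 8080
```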
AboveSpec
AboveSpec@above_spec·
Hardware: RTX 50-series with 16GB VRAM, Ubuntu 24.04

**Step 1 — Install nvoc**
```bash
git clone github.com/martinstark/nv…
cd nvoc && cargo build --release
sudo cp target/release/nvoc /usr/local/bin/
```

**Step 2 — Build ik_llama.cpp**
```bash
git clone github.com/ikawrakow/ik_l…
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```

**Step 3 — Apply OC + benchmark**
```bash
sudo nvoc -m 5000
llama-bench \
  -m Qwen3.6-27B-IQ3_K_R4.gguf \
  -ngl 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 \
  -p 0 -n 128 -r 3
```

**Step 4 — Run as a server**
```bash
llama-server \
  -m Qwen3.6-27B-IQ3_K_R4.gguf \
  -ngl 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 \
  -c 131072 --temp 0.6 --jinja --port 8080
```

Model on HuggingFace: huggingface.co/abovespec/Qwen…
Full results + replication guide: github.com/abovespec/loca…
Driver version tested: 580.126.20.
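(Once the server is up, a quick smoke test: llama-server exposes an OpenAI-compatible chat API, so a plain curl call should answer.)

```bash
# Smoke test against the local llama-server's OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
        "max_tokens": 64
      }'
```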
AboveSpec
AboveSpec@above_spec·
Quick sanity check: does ik_llama.cpp actually matter for dense 27B, or is mainline just as good? Ran Unsloth's UD-IQ3_XXS (3.06 bpw, 11.2 GiB) on mainline llama.cpp across the same context depths:

Mainline average: **~28.6 t/s**
ik_llama.cpp IQ3_K_R4 average: **~28.3 t/s**

Essentially identical. Flat in both cases, 0 → 139k.

**ik_llama.cpp's advantage is specific to MoE models + R4 quants.** For the 35B MoE we hit 128 t/s — that's because of `--n-cpu-moe` freeing VRAM and the IQ3_K_R4 custom CUDA kernels. On a dense model with standard quants, both engines are running the same CUDA code and the results match. If you're running a dense model, mainline llama.cpp is fine.
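(For reference, the MoE launch is the same shape as Step 4 above plus `--n-cpu-moe`. A sketch only: the 35B filename and the layer count here are illustrative, not the exact values from the run; tune `--n-cpu-moe` until the model fits in 16GB of VRAM.)

```bash
# Rough shape of a 35B MoE launch: keep attention and shared weights on the GPU,
# push some expert layers into CPU RAM with --n-cpu-moe so everything else fits.
llama-server \
  -m Qwen3.6-35B-IQ3_K_R4.gguf \
  -ngl 99 --n-cpu-moe 8 \
  -fa 1 -ctk q4_0 -ctv q4_0 \
  -c 131072 --port 8080
```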
AboveSpec
AboveSpec@above_spec·
RTX 5060 Ti 16GB. Free +19% token speed with one command. Benchmarked Qwen3.6-27B IQ3_K_R4 at every memory OC level from stock to +5000 MHz — flat, stable, all the way to 139k context. Still running your GPU on stock settings? 🧵
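(One way to reproduce the sweep, reusing the nvoc and llama-bench commands from the setup tweet. The offset steps are illustrative, and how your card behaves at each level will vary.)

```bash
# Step the memory offset and rerun the same llama-bench invocation at each level.
for off in 0 1000 2000 3000 4000 5000; do
  sudo nvoc -m "$off"
  echo "=== mem offset +${off} MHz ==="
  llama-bench -m Qwen3.6-27B-IQ3_K_R4.gguf \
    -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 \
    -p 0 -n 128 -r 3
done
sudo nvoc -m 0   # back to stock when done
```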
AboveSpec
AboveSpec@above_spec·
@georgecursor $429 MSRP in the US. Prices have been rising though. Around $500 is the lowest right now.
George Saoulidis
George Saoulidis@georgecursor·
Oh hey I'm running Qwen 3.5 on my 5060 Ti! I kinda need the workstation for other things so I don't need to squeeze everything out of it. But I might try this. Also my card costs 700 euro. Don't know where you got that 400 euro price.
AboveSpec
AboveSpec@above_spec·
I am running the ASUS ProArt B650-Creator. Don't think it does much to improve the numbers. It supports only PCIe 4.0; the 5000 series can go up to PCIe 5.0, but the 5060 Ti is limited to x8 anyway (not x16 like the 5070 and better GPUs). The 5060 Ti 16GB isn't that much better than the 4060 Ti 16GB, I wouldn't upgrade. The B70 should be interesting.
keithofaptos
keithofaptos@keithofaptos·
That motherboard is helping those numbers too, I'd imagine. Which is great. Thanks for sharing. This local AI and these models are getting more awesome by the month. Loving it. I'm setting up a couple of systems myself: a 4060 Ti 16GB, a few other much older cards, and a B70 I'm about to figure out too. But I'm working towards what you're working on. So I'll be listening.
AboveSpec
AboveSpec@above_spec·
@ProofOfPrints I see! I usually use Onshape and saw people on X use Claude with Onshape. Have you tried giving it more prompts to make it more sophisticated? Fit 120mm fans everywhere, lol.
Proof of Prints
Proof of Prints@ProofOfPrints·
@above_spec I actually tried using Claude to create a case in FreeCAD and it came out pretty basic, with no unity between the parts. I may use it as a template if all the measurements were correct. Let me know if you have any luck.
Proof of Prints
Proof of Prints@ProofOfPrints·
And so it begins. I need to print a case…
AJ
AJ@ItsmeAjayKV·
Heterogeneous speculative decoding, aka the Dovetail method (ingenious naming, btw). arXiv:2412.18934. How this differs from standard speculative decoding: unlike SD, where both the target and the draft model live on the GPU, here the draft is the only one on the GPU, while the target lives on the CPU doing validation. i.e. GPU drafts, CPU verifies. The interesting part about this paper, which gives me and you massive hope, is that it runs the models (in the paper, older models) on very cheap and affordable consumer hardware.
Sakura Yuki@sakurayukiai

We've been doing speculative decoding completely backwards. The Dovetail paper keeps the draft model on the GPU for fast generation, and dumps the massive target model on your CPU where cheap system RAM is totally fine for a single parallel verification pass.
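(The placement idea can be mimicked with stock llama.cpp flags, though that is not the paper's actual implementation, and the filenames below are illustrative. Small draft fully on the GPU, big target verifying from CPU and system RAM.)

```bash
# Dovetail-style placement sketch using mainline llama.cpp's llama-server:
# draft entirely on the GPU (-ngld 99), target entirely on CPU/system RAM (-ngl 0).
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf -ngl 0 \
  -md Qwen3.6-1.7B-Q4_0.gguf -ngld 99 \
  --draft-max 16 \
  -c 16384 --port 8080
```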

AboveSpec
AboveSpec@above_spec·
@keithofaptos It may be more now, I am from Canada and got it for $579 CAD last year, which is around $400 USD. Looking at your link, looks like the last time they were $429 was in December... There is a refurbished PNY for $499 on newegg USA: newegg.com/pny-technologi…