David YT

310 posts


@coffeecup2020

Self-taught local LLM dev with an RTX 5060 Ti 16GB rig, fine-tuning models. CEO of Boiler AI 😂 Author of TQ3_1S & TQ3_4S (Turbo Quant AI weight compression) 🦙⚡

Joined February 2023
265 Following · 599 Followers
Pinned Tweet
David YT
David YT@coffeecup2020·
Turbo Quant is not just for KV; you can use it on weights too. I bought an RTX 5060 Ti 16GB around Christmas and had one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with open claw. I did not come into this with a quantization background; I only learned about llama.cpp, LM Studio and Ollama two months ago. I just wanted something better than the usual Q3-class compromise (see my first post for benchmarks). Many times I wanted to buy a 24GB card, but looking at the price I quickly turned away. When the TurboQuant paper came out and showed that memory can be saved on the KV cache, I started wondering whether the same style of idea could help on weights, not just the KV cache. P.S. I nearly had the KV version done with CUDA support, but someone beat me to it.

After many long nights (until 2am) after work, that turned into a llama.cpp fork with a 3.5-bit weight format I'm calling TQ3_1S:
- Walsh-Hadamard rotation
- 8-centroid quantization
- dual half-block scales
- CUDA runtime support in llama.cpp

This work is inspired by the broader transform-based quantization line, especially RaBitQ-style Walsh-Hadamard rotation ideas and the recent TurboQuant result (Tom). The thing I wanted to test was whether that same geometry could help on weights, not just the KV cache.

Main result on Qwen3.5-27B (perplexity)
- Q4_0: 7.2431 +/- 0.04822
- TQ3_1S: 7.2570 +/- 0.04802

That is a gap of only +0.0139 PPL, about 0.19%, on the full wiki.test.raw pass (580 chunks, c=512).

Size
- Q4_0: about 14.4 GB
- TQ3_1S: about 12.9 GB

So TQ3_1S is about 10% smaller while staying near Q4_0 quality. The practical point for me is simple:
- TQ3_1S fits fully on my 16GB RTX 5060 Ti
- Q4_0 does not fit fully on GPU in the same setup

So I'm not claiming "better than Q4_0" in general. I'm claiming something narrower and, I think, useful: near-Q4_0 quality, materially smaller than Q4_0, and enough to make a 27B model practical on a 16GB card.

Caveats
- this is the strongest result I have, on the 27B; it is not a blanket claim that plain TQ3 works equally well on every model size
- I am pretty new to this, so I may be missing a lot of tests, and I only have one card to test on :-) Be skeptical; I can hardly believe I am publishing my own model
- the speed story here is mainly a deployment/fit win on this GPU class, not a blanket claim that native TQ3 kernels are always faster than native Q4_0

Links
- GitHub fork: github.com/turbo-tan/llam…
- Hugging Face GGUF: huggingface.co/YTan2000/Qwen3…

I will open source the quantization steps when I have enough feedback and testing.
David YT tweet media
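For readers who want a feel for what those bullet points mean in practice, here is a minimal NumPy sketch of the general recipe (rotate a block of weights with a Walsh-Hadamard transform, snap each rotated value to one of 8 centroids, keep a separate scale per half-block). The block size, centroid grid, and packing below are assumptions for illustration only; this is not the actual TQ3_1S code from the fork.

```python
# Minimal sketch of the TQ3_1S-style recipe described above: Walsh-Hadamard
# rotation, 8-centroid (3-bit) quantization, dual half-block scales.
# BLOCK size and the centroid grid are assumptions, not the real format.
import numpy as np

BLOCK = 32
CENTROIDS = np.linspace(-1.0, 1.0, 8)   # assumed 8-level grid

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Walsh-Hadamard matrix, normalized to be orthonormal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_block(w: np.ndarray):
    """Rotate one block, then per-half-block scale + nearest-centroid index."""
    r = hadamard(BLOCK) @ w                 # Walsh-Hadamard rotation
    scales, codes = [], []
    for half in np.split(r, 2):             # dual half-block scales
        s = float(np.max(np.abs(half))) or 1.0
        idx = np.argmin(np.abs(half[:, None] / s - CENTROIDS[None, :]), axis=1)
        scales.append(s)
        codes.append(idx.astype(np.uint8))  # 3-bit codes (stored in a byte here)
    return scales, codes

def dequantize_block(scales, codes) -> np.ndarray:
    """Centroid lookup, rescale, rotate back (H is orthonormal, so H.T inverts it)."""
    r = np.concatenate([s * CENTROIDS[c] for s, c in zip(scales, codes)])
    return hadamard(BLOCK).T @ r

w = np.random.randn(BLOCK).astype(np.float32)
scales, codes = quantize_block(w)
print("max abs reconstruction error:", np.max(np.abs(w - dequantize_block(scales, codes))))
```

The point of the rotation is that it spreads outliers across the whole block, so a coarse 8-level grid loses less than it would on the raw weights; the two half-block scales are the overhead that takes the format above a flat 3 bits per weight.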
David YT
David YT@coffeecup2020·
@kis Please enjoy! Thank you for waiting.
きしだൠ(K1S)
@coffeecup2020 I couldn't build llama-server in llama.cpp-tq3 on Windows, because the CMake-generated llama-server.vcxproj file contains rt.lib, which is a Linux library. After I removed it, it built and runs well.
きしだൠ(K1S) tweet media
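If you hit the same thing, one way to automate the manual fix described above is to strip the rt.lib reference from the generated project file before building. This is a workaround sketch, not a change that exists in the fork, and the path is a guess at where the Visual Studio generator puts the file; adjust it for your build tree.

```python
# Workaround sketch for the Windows build issue above: remove the Linux-only
# rt.lib entry from the CMake-generated llama-server.vcxproj before building.
# The path below is an assumption about the build tree layout.
from pathlib import Path

vcxproj = Path("build/tools/server/llama-server.vcxproj")
text = vcxproj.read_text(encoding="utf-8")
print("rt.lib occurrences found:", text.count("rt.lib"))
vcxproj.write_text(text.replace("rt.lib;", ""), encoding="utf-8")
```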
David YT
David YT@coffeecup2020·
DeepSeek (ship) V4 is here. Trying to find disk space to fit it.
David YT tweet media
David YT
David YT@coffeecup2020·
@TeksEdge The main problem is power consumption. Who needs a 70B when a 27B beats models 15x its size?
David Hendrickson
David Hendrickson@TeksEdge·
❓ Is a dual RTX 3090 rig worth $4K for home inferencing?
🔹 2× RTX 3090 GPUs in one PC with shared mem
🔹 + NVLink bridge for best performance
🔹 Gives effective 48 GB VRAM via tensor parallel (TP=2)
✅ Memory is pooled for large models (70B+ Q4/Q5)
✅ vLLM / llama.cpp / ExLlamaV2 split & run across both
❌ 4090 & 5090 lack NVLink → slower PCIe-only multi-GPU
Still a top-value local LLM rig in 2026.
David Hendrickson tweet media
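For context on what the TP=2 pooling looks like in practice, here is a hedged sketch using vLLM's Python API: tensor_parallel_size=2 shards each layer's weights across the two cards so one model can use the combined 48 GB. The model id is a placeholder rather than a recommendation, and a 70B-class checkpoint would need to be quantized to fit.

```python
# Hedged sketch of the dual-GPU tensor-parallel setup described above (vLLM).
# tensor_parallel_size=2 splits each weight matrix across both 3090s so the
# pooled VRAM holds the model; the model id below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-70b-quantized-checkpoint",  # placeholder repo id
    tensor_parallel_size=2,          # shard across the two RTX 3090s
    gpu_memory_utilization=0.90,     # leave a little headroom per card
)
outputs = llm.generate(["Hello from a dual-3090 rig"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```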
David YT
David YT@coffeecup2020·
@soyhenryxyz No worries if you are new; just try something else first. Mine is a special build, so it is not easy to follow. Good luck!
Henry Moran
Henry Moran@soyhenryxyz·
@coffeecup2020 Thanks a lot. It's currently way over my head, so I'm going through it slowly. It's unclear to me whether the omlx.ai server already supports this or not; it seems to be designed for agents (hermes) and inference.
Henry Moran
Henry Moran@soyhenryxyz·
@coffeecup2020 How do I run this model? I tried lmstudio and olmx pass, but neither worked 😩
David YT
David YT@coffeecup2020·
@Laythe_li_suwi What do you want to use it for? Maybe it is a blessing in disguise, so that you can still think and code by yourself :-)
Laythe
Laythe@Laythe_li_suwi·
@coffeecup2020 35b works! MoE models are insane, but yes, I do hope for a future where 4B models are that capable :p
David YT
David YT@coffeecup2020·
@Laythe_li_suwi Qwen3.5 4B, maybe? It will get smaller and more clever. The models are really bloated. 8 GB was enough to run Windows XP with plenty of storage to spare. People are too lazy to make things smaller :-)
Laythe
Laythe@Laythe_li_suwi·
@coffeecup2020 3050 8GB, I'm not surprised it's running that badly :p
David YT
David YT@coffeecup2020·
@soyhenryxyz Unfortunately not. It could, but I don't have a Mac to test and build on. Want to send me one or crowdfund one? :-)
David YT
David YT@coffeecup2020·
@apptor_at Yes, you will love it. Absolutely amazing buy. My only problem is that if I develop these models and want to test them, I need an extra card running AI :-)
Apptor
Apptor@apptor_at·
@coffeecup2020 Amazing! Got a 5060 ti 16gb myself. Will follow you!
David YT
David YT@coffeecup2020·
@Laythe_li_suwi 24-26 tok/s. Surprisingly fast; if we run a chat, it is faster than you can read. But nowhere near a 35B.
David YT
David YT@coffeecup2020·
Time to prepare to run AI locally! Sooner or later you will see one of these:
David YT tweet media
David YT
David YT@coffeecup2020·
@petllama Mine is better than Q2 and it sits between Q3 and Q4. Wait a few days and I will publish an updated version that is better than the 35B above. I just need time to finish testing.
Keith
Keith@petllama·
@coffeecup2020 Thanks, I've been having good luck with 3.6 35B Q2, but I see everyone saying Q2 gets too "dumbed down". Seeing all the praise for the 27B got me curious to try and compare. (4080, 7800X3D w/ 32GB)
David YT
David YT@coffeecup2020·
@petllama Yes, if you set ngl to less than 99 it will offload some layers to the CPU, but it will be slower. I recommend you use the 35B for simpler tasks. Coding tasks are also OK but a lot faster: huggingface.co/YTan2000/Qwen3…
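As a rough illustration of that offload knob, here is what a partial-offload run looks like through llama-cpp-python, which exposes the same setting as the CLI's -ngl flag. The file name, layer count, and context size are placeholders, and a TQ3_1S file would need the fork's build rather than stock llama.cpp.

```python
# Sketch of partial GPU offload with llama-cpp-python: n_gpu_layers below the
# total layer count keeps the remaining layers on the CPU, freeing VRAM for a
# larger context at the cost of speed. File name and numbers are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-27B-TQ3_1S.gguf",  # placeholder path; needs the fork's build
    n_gpu_layers=40,   # fewer than all layers -> some run on CPU (slower, but fits)
    n_ctx=65536,       # a 64k context; the CPU-side layers' KV lives in system RAM
)
out = llm("Summarize partial offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```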
Keith
Keith@petllama·
@coffeecup2020 Will this work with a higher context by spilling onto CPU/RAM? Sorry, really new to all this. Hermes needs 64k, and even that seems low to me?
David YT
David YT@coffeecup2020·
@gadevenyi Btw, I hate this enum thing. It is so inflexible. I may change it to carry other metadata.
Gabriel A. Devenyi
Gabriel A. Devenyi@gadevenyi·
@coffeecup2020 Here's the problem. Both of you are using the same tensor IDs for your quants but they mean different things. Someone needs to change!
Gabriel A. Devenyi tweet media
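To make the collision being pointed out concrete, here is a hedged sketch of the kind of thing "other metadata" could mean: instead of two forks silently reusing the same bare integer tensor-type ID for different quant formats, a small registry keyed by ID plus a unique name and some format metadata can at least refuse to alias two different quants. The IDs, field names, and values are illustrative only; this is not the GGUF spec or either fork's code.

```python
# Illustrative registry: detect two quant formats trying to claim the same
# numeric tensor-type ID. IDs and fields are made up for the example.
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantFormat:
    type_id: int            # numeric ID written into the file
    name: str               # unique human-readable name
    bits_per_weight: float
    block_size: int

REGISTRY: dict[int, QuantFormat] = {}

def register(fmt: QuantFormat) -> None:
    existing = REGISTRY.get(fmt.type_id)
    if existing is not None and existing.name != fmt.name:
        raise ValueError(
            f"type_id {fmt.type_id} is already {existing.name}; refusing to alias it to {fmt.name}"
        )
    REGISTRY[fmt.type_id] = fmt

register(QuantFormat(type_id=36, name="TQ3_1S", bits_per_weight=3.5, block_size=32))
try:
    register(QuantFormat(type_id=36, name="OTHER_FORK_Q3", bits_per_weight=3.3, block_size=64))
except ValueError as err:
    print("collision detected:", err)
```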