TurboQuant is not just for KV cache: you can use it on weights too.
I bought an RTX 5060 Ti 16GB around Christmas with one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with OpenClaw.
I did not come into this with a quantization background; I only learned about llama.cpp, LM Studio, and Ollama two months ago.
I just wanted something better than the usual Q3-class compromise (see my first post for benchmarks). I was tempted more than once to buy a 24GB card, but one look at the prices quickly turned me away.
When the TurboQuant paper came out and showed that memory can be saved on the KV cache, I started wondering whether the same style of idea could help on weights, not just the KV cache.
P.S. I nearly had the KV cache part done with CUDA support, but someone beat me to it.
After many long nights after work (often until 2am), that turned into a llama.cpp fork with a 3.5-bit weight format I'm calling TQ3_1S (a rough sketch of the pipeline follows the list):
Walsh-Hadamard rotation
8-centroid quantization
dual half-block scales
CUDA runtime support in llama.cpp
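
Here is a minimal NumPy sketch of the encode/decode idea behind that list. To be clear, the specifics are illustrative assumptions, not the exact on-disk layout: the 64-weight block, the uniform 8-level codebook, and the fp16-style scales are stand-ins, and the fork's real CUDA kernels look nothing like this.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = x.astype(np.float32)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x / np.sqrt(len(x))

# Illustrative symmetric 8-level codebook: 3 bits per weight.
CENTROIDS = np.array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5], dtype=np.float32)

def encode_block(w):
    """Quantize one 64-weight block: rotate, then 3-bit codes + two half-block scales."""
    r = fwht(w)                                    # rotation spreads outliers across the block
    idx = np.empty(64, dtype=np.uint8)
    scales = np.empty(2, dtype=np.float32)
    for half in range(2):                          # dual scales: one per 32-weight half
        h = r[half * 32:(half + 1) * 32]
        s = max(np.abs(h).max() / np.abs(CENTROIDS).max(), 1e-8)
        idx[half * 32:(half + 1) * 32] = np.abs(h / s - CENTROIDS[:, None]).argmin(axis=0)
        scales[half] = s
    return idx, scales

def decode_block(idx, scales):
    """Dequantize: centroid lookup, rescale, undo the rotation (this FWHT is self-inverse)."""
    r = CENTROIDS[idx] * np.repeat(scales, 32)
    return fwht(r)

w = np.random.randn(64).astype(np.float32)
idx, scales = encode_block(w)
print(np.abs(w - decode_block(idx, scales)).max())  # small reconstruction error
```

The normalized Walsh-Hadamard transform is its own inverse, so decode is just a table lookup, a rescale, and the same transform again; that is what keeps the CUDA dequant path cheap.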
This work is inspired by the broader transform-based quantization line, especially RaBitQ-style Walsh-Hadamard rotation ideas and the recent TurboQuant result (Tom).
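
If the rotation step sounds like magic, its job is outlier smoothing: after a Hadamard rotation, the energy of one extreme weight is spread across the whole block, so a tiny fixed codebook stops wasting its range on a single coordinate. A toy demo (scipy's dense hadamard is just a stand-in for the fast transform):

```python
import numpy as np
from scipy.linalg import hadamard

H = hadamard(64).astype(np.float32) / np.sqrt(64)  # orthonormal Walsh-Hadamard matrix
w = np.zeros(64, dtype=np.float32)
w[0] = 8.0                                         # one extreme outlier in the block
r = H @ w
print(np.abs(w).max(), np.abs(r).max())            # 8.0 -> 1.0: same energy, spread flat
```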
Main Result on Qwen3.5-27B
Q4_0: 7.2431 +/- 0.04822
TQ3_1S: 7.2570 +/- 0.04802
That is a gap of only +0.0139 PPL, about 0.19%, on the full wiki.test.raw pass (580 chunks, ctx = 512).
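
Checking my own arithmetic:

```python
q4_0, tq3_1s = 7.2431, 7.2570
gap = tq3_1s - q4_0
print(gap, gap / q4_0)  # ~0.0139 PPL, ~0.0019 -> about 0.19% relative
```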
Size
Q4_0: about 14.4 GB
TQ3_1S: about 12.9 GB
So TQ3_1S is about 10% smaller while staying near Q4_0 quality.
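
Those numbers line up with a back-of-envelope bits-per-weight count. The Q4_0 block layout below is how llama.cpp actually stores it; the TQ3_1S layout is the same assumed one as in the sketch above:

```python
q4_0_bpw = 18 * 8 / 32    # Q4_0: 16 B of 4-bit nibbles + 2 B fp16 scale per 32 weights
tq3_bpw  = 28 * 8 / 64    # TQ3_1S (assumed): 24 B of 3-bit codes + 2 fp16 scales per 64 weights
print(q4_0_bpw, tq3_bpw)  # 4.5 and 3.5 bits per weight
```

The file gap (~10%) is smaller than 3.5/4.5 would suggest because some tensors (embeddings, output head) typically stay at higher precision in both files.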
The practical point for me is simple:
TQ3_1S fits fully on my 16GB RTX 5060 Ti
Q4_0 does not fit fully on GPU in the same setup (rough budget below)
So I’m not claiming “better than Q4_0” in general. I’m claiming something narrower and, I think, useful:
near-Q4_0 quality
materially smaller than Q4_0
enough to make a 27B model practical on a 16GB card
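
On paper the fit story looks like this; the overhead figures are my rough estimates for this setup, not measurements:

```python
vram = 16.0                      # RTX 5060 Ti
overhead = 0.5 + 1.5             # guess: runtime/compute buffers + KV cache at modest context
print(12.9 + overhead <= vram)   # True  -> TQ3_1S stays fully on GPU
print(14.4 + overhead <= vram)   # False -> Q4_0 has to offload some layers to CPU
```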
Caveats
this is the strongest result I have, on this 27B model, not a blanket claim that plain TQ3 works equally well at every model size
I am pretty new to this, so I may be missing a lot of tests. I only have one card to test on :-)
be skeptical, since I can hardly believe I'm publishing my own model
the speed story here is mainly a deployment/fit win on this GPU class, not a blanket claim that native TQ3 kernels are always faster than native Q4_0
Links
GitHub fork: github.com/turbo-tan/llam…
Hugging Face GGUF: huggingface.co/YTan2000/Qwen3…
I will open-source the quantization steps once I have enough feedback and testing.
