David YT

310 posts


@coffeecup2020

Self-taught local LLM dev with an RTX 5060 Ti 16GB rig, fine-tuning models. CEO of Boiler AI 😂 Author of TQ3_1S & TQ3_4S (Turbo Quant AI weight compression) 🦙⚡

Joined February 2023
265 Following · 599 Followers
Pinned Tweet
David YT
David YT@coffeecup2020·
Turbo Quant is not just for KV; you can use it on weights too. I bought an RTX 5060 Ti 16GB around Christmas and had one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with open claw. I did not come into this with a quantization background; I only learned about llama.cpp, LM Studio and Ollama two months ago. I just wanted something better than the usual Q3-class compromise (see my first post for benchmarks). Many times I wanted to buy a 24GB card, but looking at the price I quickly turned away. When the TurboQuant paper came out and showed that memory can be saved on the KV cache, I started wondering whether the same style of idea could help on weights, not just the KV cache. P.S. I nearly had the KV version done with CUDA support, but someone beat me to it.

After many long nights (until 2am) after work, that turned into a llama.cpp fork with a 3.5-bit weight format I'm calling TQ3_1S:
- Walsh-Hadamard rotation
- 8-centroid quantization
- dual half-block scales
- CUDA runtime support in llama.cpp

This work is inspired by the broader transform-based quantization line, especially RaBitQ-style Walsh-Hadamard rotation ideas and the recent TurboQuant result (Tom). The thing I wanted to test was whether that same geometry could help on weights, not just the KV cache.

Main result on Qwen3.5-27B (perplexity)
- Q4_0: 7.2431 +/- 0.04822
- TQ3_1S: 7.2570 +/- 0.04802

That is a gap of only +0.0139 PPL, about 0.19%, on the full wiki.test.raw pass (580 chunks, c=512).

Size
- Q4_0: about 14.4 GB
- TQ3_1S: about 12.9 GB

So TQ3_1S is about 10% smaller while staying near Q4_0 quality. The practical point for me is simple:
- TQ3_1S fits fully on my 16GB RTX 5060 Ti
- Q4_0 does not fit fully on GPU in the same setup

So I'm not claiming "better than Q4_0" in general. I'm claiming something narrower and, I think, useful: near-Q4_0 quality, materially smaller than Q4_0, and enough to make a 27B model practical on a 16GB card.

Caveats
- this is the strongest result I have, on the 27B; it is not a blanket claim that plain TQ3 works equally well on every model size
- I am pretty new to this, so I may be missing a lot of tests, and I only have one card to test on :-) Be skeptical; I can hardly believe I am publishing my own model
- the speed story here is mainly a deployment/fit win on this GPU class, not a blanket claim that native TQ3 kernels are always faster than native Q4_0

Links
- GitHub fork: github.com/turbo-tan/llam…
- Hugging Face GGUF: huggingface.co/YTan2000/Qwen3…

I will open source the quantization steps when I have enough feedback and testing.
David YT tweet media
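For readers who want a feel for what those bullet points mean in practice, here is a minimal NumPy sketch of the general recipe (rotate a block of weights with a Walsh-Hadamard transform, snap each rotated value to one of 8 centroids, keep a separate scale per half-block). The block size, centroid grid, and packing below are assumptions for illustration only; this is not the actual TQ3_1S code from the fork.

```python
# Minimal sketch of the TQ3_1S-style recipe described above: Walsh-Hadamard
# rotation, 8-centroid (3-bit) quantization, dual half-block scales.
# BLOCK size and the centroid grid are assumptions, not the real format.
import numpy as np

BLOCK = 32
CENTROIDS = np.linspace(-1.0, 1.0, 8)   # assumed 8-level grid

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Walsh-Hadamard matrix, normalized to be orthonormal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_block(w: np.ndarray):
    """Rotate one block, then per-half-block scale + nearest-centroid index."""
    r = hadamard(BLOCK) @ w                 # Walsh-Hadamard rotation
    scales, codes = [], []
    for half in np.split(r, 2):             # dual half-block scales
        s = float(np.max(np.abs(half))) or 1.0
        idx = np.argmin(np.abs(half[:, None] / s - CENTROIDS[None, :]), axis=1)
        scales.append(s)
        codes.append(idx.astype(np.uint8))  # 3-bit codes (stored in a byte here)
    return scales, codes

def dequantize_block(scales, codes) -> np.ndarray:
    """Centroid lookup, rescale, rotate back (H is orthonormal, so H.T inverts it)."""
    r = np.concatenate([s * CENTROIDS[c] for s, c in zip(scales, codes)])
    return hadamard(BLOCK).T @ r

w = np.random.randn(BLOCK).astype(np.float32)
scales, codes = quantize_block(w)
print("max abs reconstruction error:", np.max(np.abs(w - dequantize_block(scales, codes))))
```

The point of the rotation is that it spreads outliers across the whole block, so a coarse 8-level grid loses less than it would on the raw weights; the two half-block scales are the overhead that takes the format above a flat 3 bits per weight.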
David YT
David YT@coffeecup2020·
@kis Please enjoy! Thank you for waiting.
きしだൠ(K1S)
@coffeecup2020 I couldn't build llama-server in llama.cpp-tq3 on Windows, because the CMake-generated llama-server.vcxproj file contains rt.lib, which is a Linux library. After I removed it, it built and runs well.
きしだൠ(K1S) tweet media
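If you hit the same thing, one way to automate the manual fix described above is to strip the rt.lib reference from the generated project file before building. This is a workaround sketch, not a change that exists in the fork, and the path is a guess at where the Visual Studio generator puts the file; adjust it for your build tree.

```python
# Workaround sketch for the Windows build issue above: remove the Linux-only
# rt.lib entry from the CMake-generated llama-server.vcxproj before building.
# The path below is an assumption about the build tree layout.
from pathlib import Path

vcxproj = Path("build/tools/server/llama-server.vcxproj")
text = vcxproj.read_text(encoding="utf-8")
print("rt.lib occurrences found:", text.count("rt.lib"))
vcxproj.write_text(text.replace("rt.lib;", ""), encoding="utf-8")
```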
David YT
David YT@coffeecup2020·
DeepSeek (ship) V4 is here. Trying to find disk space to fit it.
David YT tweet media
David YT
David YT@coffeecup2020·
@TeksEdge The main problem is power consumption. Who needs a 70B when a 27B beats models 15x its size?
David Hendrickson
David Hendrickson@TeksEdge·
❓ Is a dual RTX 3090 rig worth $4K for home inferencing?
🔹 2× RTX 3090 GPUs in one PC with shared mem
🔹 + NVLink bridge for best performance
🔹 Gives effective 48 GB VRAM via tensor parallel (TP=2)
✅ Memory is pooled for large models (70B+ Q4/Q5)
✅ vLLM / llama.cpp / ExLlamaV2 split & run across both
❌ 4090 & 5090 lack NVLink → slower PCIe-only multi-GPU
Still a top-value local LLM rig in 2026.
David Hendrickson tweet media
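For context on what the TP=2 pooling looks like in practice, here is a hedged sketch using vLLM's Python API: tensor_parallel_size=2 shards each layer's weights across the two cards so one model can use the combined 48 GB. The model id is a placeholder rather than a recommendation, and a 70B-class checkpoint would need to be quantized to fit.

```python
# Hedged sketch of the dual-GPU tensor-parallel setup described above (vLLM).
# tensor_parallel_size=2 splits each weight matrix across both 3090s so the
# pooled VRAM holds the model; the model id below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-70b-quantized-checkpoint",  # placeholder repo id
    tensor_parallel_size=2,          # shard across the two RTX 3090s
    gpu_memory_utilization=0.90,     # leave a little headroom per card
)
outputs = llm.generate(["Hello from a dual-3090 rig"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```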
David YT
David YT@coffeecup2020·
@soyhenryxyz No worries if you are new; just try something else first. Mine is a special build, so it is not easy to follow. Good luck!
Henry Moran
Henry Moran@soyhenryxyz·
@coffeecup2020 Thanks a lot. It's currently way over my head, so I'm going through it slowly. It's unclear to me whether the omlx.ai server already supports this or not; it seems to be designed for agents (hermes) and inference.
Henry Moran
Henry Moran@soyhenryxyz·
@coffeecup2020 How do I run this model? I tried lmstudio and olmx pass, but neither worked 😩
David YT
David YT@coffeecup2020·
@Laythe_li_suwi What do you want to use it for? Maybe it is a blessing in disguise, so that you can still think and code by yourself :-)
Laythe
Laythe@Laythe_li_suwi·
@coffeecup2020 35b works! MoE models are insane, but yes, I do hope for a future where 4B models are that capable :p
David YT
David YT@coffeecup2020·
@Laythe_li_suwi Qwen3.5 4B, maybe? It will get smaller and more clever. The models are really bloated. 8 GB was enough to run Windows XP with plenty of storage to spare. People are too lazy to make things smaller :-)
Laythe
Laythe@Laythe_li_suwi·
@coffeecup2020 3050 8GB, I'm not surprised it's running that badly :p
David YT
David YT@coffeecup2020·
@soyhenryxyz Unfortunately not. It could, but I don't have a Mac to test and build on. Want to send me one or crowdfund one? :-)
David YT
David YT@coffeecup2020·
@apptor_at Yes, you will love it. Absolutely amazing buy. My only problem is that if I develop these models and want to test them, I need an extra card running AI :-)
Apptor
Apptor@apptor_at·
@coffeecup2020 Amazing! Got a 5060 ti 16gb myself. Will follow you!
David YT
David YT@coffeecup2020·
@Laythe_li_suwi 24-26 tok/s. Surprisingly fast; if we run a chat, it is faster than you can read. But nowhere near a 35B.
David YT
David YT@coffeecup2020·
Time to prepare to run AI locally! Sooner or later you will see one of these:
David YT tweet media
David YT
David YT@coffeecup2020·
@petllama Mine is better than Q2 and it sits between Q3 and Q4. Wait a few days and I will publish an updated version that is better than the 35B above. I just need time to finish testing.
Keith
Keith@petllama·
@coffeecup2020 Thanks, I've been having good luck with 3.6 35B Q2, but I see everyone saying Q2 gets too "dumbed down". Seeing all the praise for the 27B got me curious to try and compare. (4080, 7800X3D w/ 32GB)
David YT
David YT@coffeecup2020·
@petllama Yes, if you set ngl to less than 99 it will offload some layers to the CPU, but it will be slower. I recommend you use the 35B for simpler tasks. Coding tasks are also OK but a lot faster: huggingface.co/YTan2000/Qwen3…
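As a rough illustration of that offload knob, here is what a partial-offload run looks like through llama-cpp-python, which exposes the same setting as the CLI's -ngl flag. The file name, layer count, and context size are placeholders, and a TQ3_1S file would need the fork's build rather than stock llama.cpp.

```python
# Sketch of partial GPU offload with llama-cpp-python: n_gpu_layers below the
# total layer count keeps the remaining layers on the CPU, freeing VRAM for a
# larger context at the cost of speed. File name and numbers are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-27B-TQ3_1S.gguf",  # placeholder path; needs the fork's build
    n_gpu_layers=40,   # fewer than all layers -> some run on CPU (slower, but fits)
    n_ctx=65536,       # a 64k context; the CPU-side layers' KV lives in system RAM
)
out = llm("Summarize partial offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```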
Keith
Keith@petllama·
@coffeecup2020 Will this work with a higher context by spilling onto CPU/RAM? Sorry, really new to all this. Hermes needs 64k, and even that seems low to me?
David YT
David YT@coffeecup2020·
@gadevenyi Btw, I hate this enum thing. It is so inflexible. I may change it to carry other metadata.
Gabriel A. Devenyi
Gabriel A. Devenyi@gadevenyi·
@coffeecup2020 Here's the problem. Both of you are using the same tensor IDs for your quants but they mean different things. Someone needs to change!
Gabriel A. Devenyi tweet media
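To make the collision being pointed out concrete, here is a hedged sketch of the kind of thing "other metadata" could mean: instead of two forks silently reusing the same bare integer tensor-type ID for different quant formats, a small registry keyed by ID plus a unique name and some format metadata can at least refuse to alias two different quants. The IDs, field names, and values are illustrative only; this is not the GGUF spec or either fork's code.

```python
# Illustrative registry: detect two quant formats trying to claim the same
# numeric tensor-type ID. IDs and fields are made up for the example.
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantFormat:
    type_id: int            # numeric ID written into the file
    name: str               # unique human-readable name
    bits_per_weight: float
    block_size: int

REGISTRY: dict[int, QuantFormat] = {}

def register(fmt: QuantFormat) -> None:
    existing = REGISTRY.get(fmt.type_id)
    if existing is not None and existing.name != fmt.name:
        raise ValueError(
            f"type_id {fmt.type_id} is already {existing.name}; refusing to alias it to {fmt.name}"
        )
    REGISTRY[fmt.type_id] = fmt

register(QuantFormat(type_id=36, name="TQ3_1S", bits_per_weight=3.5, block_size=32))
try:
    register(QuantFormat(type_id=36, name="OTHER_FORK_Q3", bits_per_weight=3.3, block_size=64))
except ValueError as err:
    print("collision detected:", err)
```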