Draneil Mifa @DragonGroky
220 posts · Joined February 2025 · 259 Following · 5 Followers
Draneil Mifa @DragonGroky:
@0xCVYH I use FLUX klein 9B with ComfyUI. Is there a node to load this new PolarQuant?
CV.YH @0xCVYH:
@DragonGroky PolarQuant uses a different approach than GGUF. Hadamard rotation + Lloyd-Max optimal centroids give better quality at the same bitrate. GGUF packs bits, PolarQuant optimizes the quantization itself. Working on llama.cpp integration tho
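A minimal sketch of the two ingredients named above: a Hadamard rotation spreads outliers across a weight block, then a 1-D Lloyd-Max quantizer fits optimal centroids to the rotated values. Block size, bit width, and iteration count here are illustrative assumptions, not PolarQuant's actual settings:

    import numpy as np

    def hadamard(n):
        # Sylvester construction; n must be a power of two.
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H / np.sqrt(n)  # orthonormal, so H.T undoes the rotation

    def lloyd_max(x, bits=5, iters=30):
        # 1-D Lloyd-Max: alternate nearest-centroid assignment with
        # recomputing each centroid as the mean of its cell.
        k = 2 ** bits
        centroids = np.quantile(x, np.linspace(0, 1, k))  # spread init
        for _ in range(iters):
            idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
            for j in range(k):
                cell = x[idx == j]
                if cell.size:
                    centroids[j] = cell.mean()
        return centroids, idx

    # Rotate a toy weight block with one outlier, quantize, un-rotate.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(256)
    w[3] = 12.0                        # outlier that plain quantization hates
    H = hadamard(256)
    w_rot = H @ w                      # Hadamard rotation
    c, idx = lloyd_max(w_rot, bits=5)  # "PQ5"-style 5-bit codebook
    w_hat = H.T @ c[idx]               # dequantize, then undo the rotation
    print("cosine:", w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))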
CV.YH @0xCVYH:
PolarQuant now quantizes IMAGE models. FLUX.2 klein 9B at Q5: 0.9986 similarity with the original. Practically identical. 121 layers quantized, 304 preserved in BF16. Hadamard + Lloyd-Max working on diffusion models. Someone asked for it today and it's already done. huggingface.co/caiovicentino1…
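For reference, a per-layer score like the 0.9986 above is typically just the cosine between the original and dequantized weights, flattened. A sketch; the load_tensor call and tensor name below are hypothetical placeholders, not the actual repo layout:

    import numpy as np

    def cosine_similarity(a, b):
        # Flatten both tensors and compare directions in float64.
        a = np.asarray(a, dtype=np.float64).ravel()
        b = np.asarray(b, dtype=np.float64).ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical usage; names are placeholders:
    # w_ref = load_tensor("FLUX.2-klein-BF16", "double_blocks.0.img_mlp.0.weight")
    # w_pq5 = load_tensor("FLUX.2-klein-PQ5",  "double_blocks.0.img_mlp.0.weight")
    # print(cosine_similarity(w_ref, w_pq5))   # e.g. ~0.9986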
Draneil Mifa @DragonGroky:
@0xCVYH Cool, but why is it not a GGUF as usual on GitHub?
CV.YH @0xCVYH:
@DragonGroky Yes, fits perfectly on a 3090 24GB. PQ5 quality is actually better than standard Q8 at similar size because of the Hadamard rotation + Lloyd-Max centroids. 0.996 cosine similarity vs original. Give it a try and compare
Draneil Mifa @DragonGroky:
@MyopicRaccoon @LottoLabs @no_stp_on_snek I have a 3090 but it reaches only 20 tokens/s! Can you share your llama.cpp parameters? I also have the turboquant version.
llama-server -m models-d/Qwopus3.5-27B-v3-Q5_K_M.gguf -c 128000 \
  -ctk turbo3 -ctv turbo3 \
  -fa on -np 1 \
  --cache-ram 4096 \
  --ctx-checkpoints 4 \
  --fit on
Myopic Raccoon @MyopicRaccoon:
@LottoLabs Here's what I am running: Qwen3.5-27B-UD-Q5_K_XL.gguf with @no_stp_on_snek's turboquant (turbo4), 128K context (I could double that if I dropped to Q4), on a single RTX 3090/24GB. Getting a solid 31 tok/s and Hermes 0.70. Heaven.
Lotto @LottoLabs:
People yearn for Qwen 27B and the Hermes agent. Tailscale, a used desktop... you're cooking.
Eric ⚡️ Building...:
REQUESTED: Qwen3 30B Coder-Next? The community delivered 🔥
mradermacher/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-Fp32-i1-GGUF
⚡ Distilled from 480B into 30B Coder
📈 SWE-Rebench / agentic coding
✅ Runs perfectly local with GGUF quants
Exactly what you wanted, consumer-hardware friendly: huggingface.co/mradermacher/Q…
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX I tried your config but I'm at about 20 tokens/sec. Is it because you have a 4090 and I have a 3090?
IRMC @IRMC16:
@DragonGroky @sudoingX
  - "--cache-type-k"
  - "q4_0"
  - "--cache-type-v"
  - "q4_0"
  - "--host"
  - "0.0.0.0"
  - "--port"
  - "8000"
  - "--alias"
  - "Qwen3.5-27B"
Sudo su @sudoingX:
people keep asking me what model to run on a single 3090. it's not even close. Qwen 3.5 27B dense Q4_K_M. undisputed.
kumikumi (Ankkala) @ankkala:
@sudoingX to be clear, which model / quantization did you run on the 3090?
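Rough arithmetic on why a dense 27B at Q4_K_M is the 3090 pick. Q4_K_M in llama.cpp averages a bit under 5 bits per weight; 4.8 below is an illustrative figure, not a measured one:

    # Back-of-envelope VRAM check for a dense 27B at Q4_K_M (assumed numbers).
    params = 27e9
    bits_per_weight = 4.8                              # assumed Q4_K_M average
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"weights: ~{weights_gb:.1f} GB")            # ~16.2 GB
    print(f"headroom on a 24 GB card: ~{24 - weights_gb:.1f} GB for KV cache and overhead")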
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX I had the same expectations for Gemma-4 31B, but I saw that it is only better for chatting, not for coding. We have to wait for Qwen 3.6 ☺️
IRMC @IRMC16:
@DragonGroky @sudoingX I was waiting for Gemma-4-31B with TurboQuant cache to swap out Qwen3.5-27B. My expectation was that bigger is better. Apparently not for my use case. x.com/TeksEdge/statu… x.com/leftcurvedev_/…
left curve dev @leftcurvedev_:
everyone is wondering the same thing: qwen3.5 27b or gemma 4 31b? new benchmarks from @ArtificialAnlys are out, let's dig into the numbers:
💻 coding index > gemma wins. surprisingly it was the best and scored 42; it managed to handle more coding tasks successfully than qwen, very interesting! 👀
🤖 agentic index > qwen destroys gemma. when it comes to tool calls, multi-step reasoning, and autonomous task execution, there's no need to talk about it: qwen is the absolute winner in the category, scoring 55 (!) vs 41 for gemma 🤯
👑 the winner > qwen3.5 27b stays undefeated. gemma could have been an amazing contender, but in agentic tasks it's just too far behind compared to what qwen has to offer; makes no sense to use it if your tasks are heavy and need reasoning.
what are your thoughts?
ハカセ アイ(Ai-Hakase)🐾最新トレンドAIのためのX 🐾:
③ Complete in one click, in Japanese! The ultimate fully automatic workflow 🤖 If you're thinking "setup sounds difficult...", don't worry! Combine it with the latest AI, Qwen 3.5, and all you have to do is type "a girl walking at dusk" in Japanese ✨ The AI automatically writes the optimal English prompt and camera-work directions for you. Leave the tedious calculations to the AI and make your best video!
ハカセ アイ(Ai-Hakase)🐾最新トレンドAIのためのX 🐾:
[LTX-2.3 awakened] The shock of "3-Pass Sampling", which makes footage dramatically come alive! 🚀✨ Struggling with stiff motion in video-generation AI? A cheat-level trick has arrived that pushes quality to the limit with no change in generation time! 👇️ Explained in this thread! #AI動画 #LTX2 #動画生成AI #生成AI
Draneil Mifa @DragonGroky:
@loktar00 For a 3090, are these good parameters for using TurboQuant?
export PATH=/ia/llama-cpp-turboquant-cuda/build/bin:$PATH
llama-server -m models-d/Qwopus3.5-27B-v3-Q4_K_M.gguf \
  -c 132000 -n 4096 \
  --cache-type-k q8_0 --cache-type-v turbo3 \
  -fa on -np 1 -ngl 99
Loktar 🇺🇸 @loktar00:
TurboQuant is getting ported to llama.cpp, and if it actually delivers 6x memory compression with zero accuracy loss... your 24GB 3090 basically becomes a 144GB card for KV cache. The local inference cost equation keeps getting more insane every week.
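The "24GB becomes 144GB" line is just 24 x 6, but the useful number is tokens of context per gigabyte of cache. A sketch with assumed dimensions for a generic dense ~27B; layer count, KV heads, and head size are illustrative, not a published architecture:

    # KV cache bytes per token: K and V, per layer, per KV head, per head dim.
    n_layers, n_kv_heads, head_dim = 48, 8, 128        # assumed dimensions
    fp16_kv_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # ~192 KiB
    print(f"fp16 KV: {fp16_kv_per_token / 1024:.0f} KiB/token")
    for name, factor in [("fp16", 1), ("q4-ish", 4), ("6x (claimed)", 6)]:
        tokens = 8e9 / (fp16_kv_per_token / factor)
        print(f"{name:12s} -> ~{tokens / 1000:.0f}K tokens in an 8 GB cache budget")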
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX Really good, 34 t/s. Would you mind sharing the llama.cpp command? I'm at 17 t/s with a 3090; I guess I should be able to reach a similar speed to yours?
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX Thanks. Are you also on Win11 with WSL Linux for TurboQuant? This is my old conf:
llama-server -m models/Qwen3.5-27B-Q4_K_M.gguf --alias "Qwen3.5-27B-Q4_K_M" \
  --n-gpu-layers 99 \
  -c 262144 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --port 8000 --host 0.0.0.0 \
  -fa on
IRMC @IRMC16:
@DragonGroky @sudoingX For me, 196608 is the sweet spot. Larger contexts get flushed into RAM, dropping performance.
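The "sweet spot" behavior follows from the same arithmetic: once weights plus KV cache outgrow VRAM, llama.cpp spills to system RAM and throughput drops. Reusing the assumed figures from the sketches above, so this is a ballpark, not a prediction for this exact model:

    weights_gb = 16.2            # assumed 27B @ ~4.8 bits/weight (see above)
    kv_per_token_q4 = 49152      # bytes/token: the fp16 figure above, ~4x smaller at q4_0
    max_ctx = (24e9 - weights_gb * 1e9) / kv_per_token_q4
    print(f"max context before spilling: ~{max_ctx / 1000:.0f}K tokens")
    # ~159K with these guesses; the same ballpark as IRMC's observed 196608.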
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX How many tokens/sec do you get with your setup? I was using Q4_K_M with 200k context length but switched to Q5 with TurboQuant and reduced the context length.
IRMC @IRMC16:
@DragonGroky @sudoingX My Qwen3.5-27B-Q4_K_M implementation (third compose) has a context length of 196608. It runs flawlessly with OpenClaw, OpenCode/GSD, and my home-brewed LangGraph agents, such as a Docling/LangExtract agent. I was hunting 512k with TQ3_4S.
Draneil Mifa @DragonGroky:
@sudoingX Using Hermes and coPaw, I get better results with coPaw (LLM: Qwen 27B Q5 TurboQuant). Running on a 3090.
Sudo su @sudoingX:
if you're still on openclaw bloatware in april 2026, you're not a builder. you're a tourist who has zero idea how fast things are accelerating. try hermes agent for yourself; take that step and make the switch today, anon. and i'm not the one calling it bloat: that came from builders in my DMs and timeline. bloat is their word, not mine. if you have already made the switch, let people know your experience so more builders on bloated tools become aware. you deserve a better tool in this fast-changing field.
Teknium (e/λ) @Teknium:
Hermes Agent is the third fastest growing GitHub repo this week!
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX I tried Hermes and it was not so good; it got lost in a loop. Now I use coPaw (Alibaba's agent) and it works really well: Python code generating an HTML dashboard from different API sources.
Draneil Mifa @DragonGroky:
@IRMC16 @sudoingX I have a 3090 and also run everything locally. For now I'm using Qwopus3.5 27B Q5_K_M and it gets about 15-20 t/s with 130k context length.
Draneil Mifa @DragonGroky:
Using Qwopus3 Q5 with Hermes is not so good: Hermes quickly runs in a loop trying to patch. With coPaw I don't have this problem; it looks much better and doesn't loop.
Draneil Mifa @DragonGroky:
@sudoingX On my 3090 I'm running Q5_K_M:
llama-server -m models-d/Qwopus3.5-27B-v3-Q5_K_M.gguf --alias "Qwen3.5-27B-Q5-Qwopus3" \
  -c 128000 \
  -ctk turbo3 -ctv turbo3 \
  --port 8000 --host 0.0.0.0 \
  -fa on -np 1 \
  --cache-ram 4096 \
  --ctx-checkpoints 4 \
  --fit on \
  --reasoning off
Ahmad @TheAhmadOsman:
You like Chinese open-source models? Then use Qwen 3.5 27B. You like American open-source models? Then use Gemma 4 31B. Both can run easily on consumer hardware at home, and they're state-of-the-art models.
Draneil Mifa @DragonGroky:
@TheAhmadOsman Personally I don't care whether it is Chinese, American, French, or whatever; I just want an LLM that runs on my 3090 and can generate quality code like Sonnet 4.5 or Opus 4.6. For now I am using Qwopus3 Q5 with the TurboQuant llama.cpp version, but I'm happy to try Gemma if it's better.