bgeneto

115 posts

@netobge

Joined November 2022

138 Following 9 Followers

bgeneto@netobge·1d

@CarlosZarattini Let me see if I understand your position: so the figure that dismantles this lie once and for all says that fewer than 3% of beneficiaries hold a formal job... Wow, 3% really is a dismantling...

Carlos Zarattini@CarlosZarattini·3d

This statement against Bolsa Família is absurd and uninformed. It is unacceptable that the obvious still has to be repeated. Bolsa Família is indeed a stimulus to social mobility. The program puts food on the table for almost 50 million Brazilians and helps entire families get through poverty with dignity. Between 2023 and 2024, 8.6 million people left poverty and 1.9 million left extreme poverty in Brazil. And the figure that dismantles this lie once and for all: in 2024, Bolsa Família beneficiaries filled 1.2 million formal jobs. Anyone who says Bolsa Família makes people "complacent" simply disregards the reality of the Brazilian people. poder360.com.br/poder-economia…

bgeneto@netobge·13 May

@cjzafir Have you published your model? Where?

CJ Zafir@cjzafir·12 May

Qwen 3.5 has the best SLMs to fine-tune! Its 4B model is really smart if you train it on a well structured dataset.

I fine-tuned the model on a 135M dataset generated by Codex 5.5 + DeepSeek v4 Pro. I achieved 96%+ accurate results with Qwen 3.5 4B. And 95% on Qwen 3.5 2B (that only requires 3.5GB RAM).

For context, on the same pipeline:
> Sonnet 4.6 achieved 89%
> GPT 5.4 Mini achieved 85%
> Haiku 4.5 achieved 72%

I don't trust evals, so I ran a 7000+ row hard-boundary test, and the results of Qwen 3.5 were consistent. A 4B fine-tuned model beating a 20x bigger model in accuracy and latency is no joke.

It cost me $173 in total to generate the dataset and cover the cloud GPU cost to fine-tune both models.

I said this before, and I'll say it again: not everything requires a 1T-parameter LLM. We need ELMs (Expert Language Models) that are specialized for one domain only. ELMs > LLMs.

I'll be writing more about how SLM fine-tuning works. So stay tuned.

bgeneto@netobge·13 May

@TeksEdge Meanwhile, anyone with a much cheaper RTX 3090 can run Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound on vLLM at 150 tok/s without MTP, using an fp8 KV cache and 128k context.

David Hendrickson@TeksEdge·12 May

🤯 Unsloth released the fastest Qwen3.6-27B MTP GGUF I've tested. Time to upgrade. Compared to the previous GGUF, Q4/Q6 XL versions are 👀 ~55% faster!

On a single RTX 5090:
✅ 114 tok/s: UD-IQ2_M (MTP)
✅ 93 tok/s: UD-Q4_K_XL (MTP)
✅ 75 tok/s: UD-Q6_K_XL (MTP)

💨 Fastest MTP quant is 3.3x faster than the old Q8_0 baseline (35 tps)

262K context + tool calling. All on one 5090.

* compiled from the MTP PR branch ('am17an:mtp-clean', build b9117-ebe4fca4b)
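
The headline multiplier in this post can be checked from the two figures it quotes. A minimal sketch in Python using only numbers from the post (114 tok/s for UD-IQ2_M with MTP, 35 tok/s for the old Q8_0 baseline). The helper name `speedup` is ours, and the "implied previous speed" lines assume the ~55% gain applies equally to both XL quants, which the post does not state.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """Return how many times faster new_tps is than old_tps."""
    return new_tps / old_tps


# Figures quoted in the post (single RTX 5090).
mtp_iq2_tps = 114.0  # UD-IQ2_M with MTP
old_q8_tps = 35.0    # previous Q8_0 baseline

ratio = speedup(mtp_iq2_tps, old_q8_tps)
print(f"{ratio:.1f}x")  # 114 / 35 = 3.257..., prints "3.3x"

# Assumption: "~55% faster" means new = 1.55 * previous for each XL quant.
for name, new_tps in [("UD-Q4_K_XL", 93.0), ("UD-Q6_K_XL", 75.0)]:
    implied_previous = new_tps / 1.55
    print(f"{name}: implied previous speed ~{implied_previous:.0f} tok/s")
```

Note that the 3.3x figure compares different quantization levels (IQ2_M against Q8_0), so it mixes the effect of MTP with the effect of a much smaller quant.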

bgeneto@netobge·3 May

@loktar00 27B quality at 35B speeds would be a dream on a 3090. But all those inflated 27B numbers on a 3090 are unrealistic and come with compromised quality... The 35B on vLLM is so stable that I stopped searching for a better alternative for a single 3090.

Loktar 🇺🇸@loktar00·2 May

I wish 3.6 35B was just a little better.. the speeds I'm getting are insane.

bgeneto@netobge·3 May

@malikwas1f @largePrawn I did... Several times, since day 0 and also today. No more than 60 tps with 2x 3090. I get much better tps with a single GPU and Qwen3.6 35B (130 tps), without even using spec dec.

noname@malikwas1f·2 May

@netobge @largePrawn Those numbers are real life and real time. Go check out the repo and do it yourself.

Tony Ge@largePrawn·2 May

Hitting 140 tok/s on Qwen 3.6 27B running vLLM with 2x 3090s using @malikwas1f's repo github.com/noonghunna/clu… Literally just pointed claude at it and walked away. Came back to a 2.5x speed bump 🤯🤯🤯

bgeneto@netobge·2 May

@rafaon3 @luksamuk Your A3B is slow. Try the Intel/Qwen3.6 35B AutoRound with vLLM: I get 130 tok/s with it and 60 tok/s with Qwen3.6 27B, but I don't use it because these models think too much, and 60 tps becomes extremely slow for coding with 128k tokens.

rafaon3@rafaon3·2 May

@luksamuk I run the A3B at 90 tok/s on the 3090 and the 27B at 41 tok/s. I don't understand it, but I accept it.

Lucas@luksamuk·2 May

So far, the coding champion here, on an RTX 3050 with 6GB of VRAM, has been Qwen 3.6 35B-A3B. Quantization: UD-Q3_K_L. The MoE architecture helps with speed; unmatched quality; good tradeoff with speed. I don't doubt that the 27B does better, but it is painfully slow (my own limitation).

bgeneto@netobge·29 Apr

@EnioViterbo What moral standing does he have to talk like that? What a crazy world: an STF justice pretending that pocketing over 80 million is all fine, life goes on. Only in Brazil.

Enio Viterbo@EnioViterbo·28 Apr

For the love of God. Get a grip, Alexandre. Justice Alexandre de Moraes took advantage of the trial of a case brought by congressman Gustavo Gayer against another congressman and simply started taking veiled jabs at Romeu Zema. A complete misuse of purpose. Disrespect for public money. Disrespect for the law and for criminal procedure. Disrespect for the STF. Justices Alexandre de Moraes and Gilmar Mendes need to learn that having a microphone on the bench does not mean they can say whatever they want. The session for judging cases is ABOUT THE CASES. It is not for singing. It is not for reciting poetry. It is not for sending political messages. If they want a microphone and a podium to send political messages, they should run for Congress.

bgeneto@netobge·28 Apr

@MemoryReboot_ I'm getting 130-160 tok/s with one RTX 3090 and Intel AutoRound Qwen3.6 35B A3B without any spec dec. So you certainly have a regression in speed here.

Mass@MemoryReboot_·27 Apr

DFlash benchmarks on dual RTX 3090

Qwen3.6-35B-A3B AWQ-INT4 + DFlash drafter on vLLM nightly, TP=2

Tried different num_speculative_tokens to find what works:
- n=4: 96.6 tok/s
- n=8: 96.0 tok/s
- n=15 (z-lab's recommended): 20-40 tok/s

n=4 is a sweet spot

For comparison, Qwen3.6-35B-A3B Q6 on llama.cpp gives me 102 tok/s on the same hardware ☹️

What am I doing wrong?

bgeneto@netobge·25 Apr

@spiritbuun Still slower... I forgot to mention that the GGUF model used is lmstudio-community/Qwen3.6-27B-GGUF

cmake -B build --fresh \
  -DGGML_CUDA=ON \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86

buun@spiritbuun·25 Apr

@netobge Can you repull from master, build, and try it again? Might have fixed it

buun@spiritbuun·23 Apr

Pushed: DFlash implementation for llama-cpp.

buun-llama-cpp/llama-server -m Qwen3.6-27B.gguf -md dflash-draft-q4_k_m.gguf --spec-type dflash

bgeneto@netobge·25 Apr

@spiritbuun
./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf -md ~/models/dflash-draft-3.6-q4_k_m.gguf --spec-type dflash \
  --reasoning on \
  --reasoning-budget -1 \
  --ctx-size 32000 \
  --fit off \
  -ngl 99 -ngld 99 \
  --flash-attn on \
  -ctk q8_0 \
  -ctv q8_0...

buun@spiritbuun·25 Apr

@netobge Can you paste me your llama flags so I can reproduce?

bgeneto@netobge·25 Apr

@unbug @elliotarledge Yes. buun's llama.cpp has DFlash support, but no luck for me: it worked, but slower.

unbug@unbug·25 Apr

@elliotarledge Any news for llamacpp?

Elliot Arledge@elliotarledge·25 Apr

DFlash is the future of inference. huggingface.co/z-lab/Qwen3.6-…

bgeneto@netobge·25 Apr

@PelicanInvasion @elliotarledge Not for me, no speed gains with the already fast 35B MoE

PelicanInvasion@PelicanInvasion·25 Apr

@elliotarledge How fast on 5090 GPUs? Any work on Qwen 3.6 35b for even more speed?

bgeneto@netobge·25 Apr

@spiritbuun I'm getting around 0.54x tok/s with this. RTX 3090 here: 37 tok/s vanilla and 20 tok/s with dflash. 😟

draft acceptance rate = 0.25519 ( 1401 accepted / 5490 generated)
statistics dflash: #calls(b,g,a) = 1 366 304, #gen drafts = 366, #acc drafts = 304, #gen tokens = 5490, #acc
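
The figures in this log are internally consistent, and they hint at why speculative decoding is a net loss here. A minimal sketch in Python that recomputes them from the quoted numbers only. Reading "#gen drafts" as the number of draft rounds is our interpretation of the log, not something the post confirms.

```python
# Numbers copied from the post and its log excerpt.
accepted_tokens = 1401
generated_tokens = 5490
generated_drafts = 366  # assumption: one entry per draft round
vanilla_tps = 37.0
dflash_tps = 20.0

acceptance_rate = accepted_tokens / generated_tokens
tokens_per_draft = generated_tokens / generated_drafts
accepted_per_draft = accepted_tokens / generated_drafts
relative_speed = dflash_tps / vanilla_tps

print(f"acceptance rate:    {acceptance_rate:.5f}")   # 0.25519
print(f"tokens per draft:   {tokens_per_draft:.1f}")  # 15.0
print(f"accepted per draft: {accepted_per_draft:.2f}")
print(f"relative speed:     {relative_speed:.2f}x")   # 0.54x
```

Under that reading, each round drafts 15 tokens and keeps fewer than 4 of them, so roughly three quarters of the drafting work is thrown away.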

buun@spiritbuun·23 Apr

I get around 2x tok/s on average with this, more if it's simple code/json.

bgeneto@netobge·25 Apr

@pupposandro Still around 10 s time to first token? 👀

Sandro@pupposandro·24 Apr

TQ3_0 (TurboQuant) KV cache just landed in Lucebox Hub. 22% less VRAM than Q4_0, same decode speed.

262K context on a single RTX 3090 with 1024 MiB to spare. Qwen3.5-27B, Q4_K_M target, DFlash speculative decode.

TurboQuant 3.5 bpv with FWHT rotation, CUDA kernels end-to-end, flash-attention plugged in for both K and V. Prefill pays ~12% for the rotation, decode pays nothing.

Huge thanks to @dusterbloom for providing this to the community. Repo as usual in the first comment ⬇️
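
The 22% figure is consistent with simple bits-per-value arithmetic. A minimal sketch in Python, reading "bpv" as bits per value and assuming Q4_0 costs 4.5 bits per value (in ggml, a Q4_0 block of 32 values stores 16 bytes of 4-bit codes plus a 2-byte fp16 scale). The 3.5 bpv for TQ3_0 is taken from the post. The constant names are ours.

```python
# ggml Q4_0 block layout: 32 values -> 16 bytes of nibbles + 2-byte fp16 scale.
Q4_0_BLOCK_VALUES = 32
Q4_0_BLOCK_BYTES = 16 + 2

q4_0_bpv = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_VALUES  # 4.5 bits per value
tq3_0_bpv = 3.5  # figure quoted in the post

# Relative KV cache saving at equal context length and model shape.
saving = 1 - tq3_0_bpv / q4_0_bpv
print(f"Q4_0: {q4_0_bpv} bpv, TQ3_0: {tq3_0_bpv} bpv, saving: {saving:.1%}")
```

The ratio does not depend on the model's layer count or head dimensions, since those multiply both cache sizes equally.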

bgeneto@netobge·24 Apr

@ray5ar @outsource_ Doesn't work for me, not even with ik_llama: it runs at 34 tok/s, slower than without spec dec.

X Æ A-12@ray5ar·24 Apr

@outsource_ i will do the test on my 3090 !

Eric ⚡️ Building...@outsource_·24 Apr

Quick update: pushed the 4090 further!💡 192K context at 152 tok/s on Qwen3.6-27B, single GPU. 128K hits 159. Same Q4_K_M. Vanilla Qwen3-1.7B draft beat the distilled 4B draft. Smaller > smarter for spec-dec. Next: 1M context locally + 250-400 tok/s via DFlash + TurboQuant. Receipts coming.

Eric ⚡️ Building...@outsource_

My 4090 went from 26 -> 154 tok/s Qwen 3.6 27B🤯 Same GPU. Same Q4_K_M . No FP8, no extra quant. The unlock: ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model. 85% acceptance rate. Full config + benchmarks 👇🏻

bgeneto@netobge·24 Apr

@outsource_ Does NOT work! At least for me... I'm getting only 34 tok/s with my RTX 3090, versus 45 tok/s without the Qwen3 drafter model. Has anyone else reached those speeds, or only the OP?! 🤔

Eric ⚡️ Building...@outsource_·24 Apr

The exact command I'm running

llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  -md Qwen3-1.7B-Q4_K_M.gguf \
  -ngl 99 -ngld 99 -c 196608 -cd 32768 \
  -fa on -ctk q4_0 -ctv q4_0 \
  --draft-max 12 --draft-min 3 --draft-p-min 0.6 \
  --host 0.0.0.0 --port 8081

For short-context, swap -c 196608 → -c 8192 and -ctk/-ctv → q8_0.

bgeneto@netobge·24 Apr

@slap__tjips @outsource_ I think 4-bit is fine for V, but 4-bit for cached keys will kill model precision. Maybe to the point that it is better to use the Qwen3.6 35B MoE instead (much faster and close in quality).

slap tjips@slap__tjips·24 Apr

@outsource_ This is very impressive but the quality of output will take a knock at 4-bit KV cache as the context increases...

bgeneto@netobge·24 Apr

@malikwas1f @TheAhmadOsman My main concern with your approach is the 3-bit turboquant for keys. We need some KL divergence benchmarks to ensure that using it does not make the dense model "dumber" than the Qwen3.6 35B MoE that runs at 130 tok/s with Intel AutoRound.
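
One way such a benchmark could be set up: run the same prompts through a reference configuration (for example, an unquantized KV cache) and through the quantized one, then compare the next-token distributions position by position. A minimal sketch in plain Python, assuming the raw logits from both runs are already available. The function names and the toy logits are illustrative and not taken from any of the tools mentioned in this thread.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits: Sequence[float], test_logits: Sequence[float]) -> float:
    """KL(P_ref || P_test) in nats for a single token position."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("both rows must cover the same vocabulary")
    ref_logp = log_softmax(ref_logits)
    test_logp = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(ref_logp, test_logp))


def mean_kl(ref_rows: Sequence[Sequence[float]],
            test_rows: Sequence[Sequence[float]]) -> float:
    """Average per-position KL divergence over a sequence of positions."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)


# Toy data: two positions over a three-token vocabulary (made-up logits).
reference = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quantized = [[1.9, 1.1, 0.1], [0.4, 0.7, 2.8]]
print(f"mean KL: {mean_kl(reference, quantized):.6f} nats")
```

Running the same measurement for the dense 27B with quantized keys and for the 35B MoE, each against its own full-precision reference, would show how much each configuration drifts from its baseline. It would not by itself say which model is better at a task.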

noname@malikwas1f·24 Apr

@netobge @TheAhmadOsman Thank you!

noname@malikwas1f·23 Apr

Qwen3.6-27B on ONE RTX 3090:
⚡ 85 TPS sustained (106 peak)
📏 125K context
👁 Vision + tool calls
🔌 230W cap, quiet & cool

Consumer 24GB. Full OpenAI-compatible API. Single card.

Further testing in progress; stay tuned for the write-up.

@TheAhmadOsman @LottoLabs @KyleHessling1 @sudoingX @stevibe @0xSero @TeksEdge @Alibaba_Qwen @ivanfioravanti
