bgeneto (@netobge)

115 posts

Joined November 2022
138 Following · 9 Followers
bgeneto
bgeneto@netobge·
@CarlosZarattini Let me see if I understand your position: so the data point that debunks this lie once and for all says that fewer than 3% of beneficiaries hold formal jobs... Wow, 3% really is a debunking...
Português
0
0
0
9
Carlos Zarattini
Carlos Zarattini@CarlosZarattini·
This statement against Bolsa Família is absurd and uninformed. It is unacceptable that the obvious still has to be repeated. Bolsa Família is indeed a driver of social mobility. The program puts food on the table for almost 50 million Brazilians and helps entire families get through poverty with dignity. Between 2023 and 2024, 8.6 million people left poverty and 1.9 million left extreme poverty in Brazil. And the data point that debunks this lie once and for all: in 2024, Bolsa Família beneficiaries filled 1.2 million formal jobs. Anyone who says Bolsa Família makes people "complacent" simply disregards the reality of the Brazilian people. poder360.com.br/poder-economia…
Português
74
284
975
6.4K
bgeneto
bgeneto@netobge·
@cjzafir Have you published your model? Where?
English
0
0
0
39
CJ Zafir
CJ Zafir@cjzafir·
Qwen 3.5 has the best SLMs to fine-tune! Its 4B model is really smart if you train it on a well structured dataset.

I fine-tuned the model on a 135M dataset generated by Codex 5.5 + DeepSeek v4 Pro. I achieved 96%+ accurate results with Qwen 3.5 4B. And 95% on Qwen 3.5 2B (that only requires 3.5GB RAM).

For context, on the same pipeline:
> Sonnet 4.6 achieved 89%
> GPT 5.4 Mini achieved 85%
> Haiku 4.5 achieved 72%

I don't trust evals, so I ran a 7000+ row hard-boundary test, and the results of Qwen 3.5 were consistent. A 4B fine-tuned model beating a 20x bigger model in accuracy and latency is no joke.

It cost me $173 in total to generate the dataset and cover the cloud GPU cost to fine-tune both models.

I said this before, and I'll say it again: not everything requires a 1T-parameter LLM. We need ELMs (Expert Language Models) that are specialized for one domain only. ELMs > LLMs.

I'll be writing more about how SLM fine-tuning works. So stay tuned.
English
33
69
695
27.6K
bgeneto
bgeneto@netobge·
@TeksEdge Meanwhile, anyone with a much cheaper RTX 3090 can run Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound with vLLM at 150 tok/s, without MTP, with an fp8 KV cache and 128k context.
English
3
0
2
580
David Hendrickson
David Hendrickson@TeksEdge·
🤯 Unsloth released the fastest Qwen3.6-27B MTP GGUF I've tested. Time to upgrade. Compared to the previous GGUF, Q4/Q6 XL versions are 👀 ~55% faster!

On a single RTX 5090:
✅ 114 tok/s - UD-IQ2_M (MTP)
✅ 93 tok/s - UD-Q4_K_XL (MTP)
✅ 75 tok/s - UD-Q6_K_XL (MTP)

💨 Fastest MTP quant is 3.3x faster than the old Q8_0 baseline (35 tps)

262K context + tool calling. All on one 5090.

* compiled from the MTP PR branch ('am17an:mtp-clean', build b9117-ebe4fca4b)
English
33
51
523
46.6K
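A quick sanity check of the ratios quoted in the post above. This is only a sketch that restates the post's own figures; the helper name `speedup` is illustrative and not part of any tool mentioned in the thread.

```python
# Throughput figures (tok/s) as quoted in the post above, single RTX 5090.
mtp_tok_s = {
    "UD-IQ2_M": 114,
    "UD-Q4_K_XL": 93,
    "UD-Q6_K_XL": 75,
}
baseline_q8_tok_s = 35  # old Q8_0 baseline quoted in the post


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Return how many times faster new_tok_s is than old_tok_s."""
    return new_tok_s / old_tok_s


# Compare every MTP quant against the quoted Q8_0 baseline.
for quant, tok_s in mtp_tok_s.items():
    print(f"{quant}: {speedup(tok_s, baseline_q8_tok_s):.2f}x vs Q8_0")
```

The fastest quant works out to roughly 3.3x the baseline, which matches the figure in the post. Note that the quants differ in precision, so the ratios compare speed only, not output quality.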
bgeneto
bgeneto@netobge·
@loktar00 27B quality with 35B tps would be a dream on a 3090. But all those inflated numbers for the 27B on a 3090 are unrealistic and come with compromised quality... 35B with vLLM is so stable that I stopped searching for a better alternative for a single 3090.
English
0
0
2
53
Loktar 🇺🇸
Loktar 🇺🇸@loktar00·
I wish 3.6 35B was just a little better.. the speeds I'm getting are insane.
English
26
5
173
13.4K
bgeneto
bgeneto@netobge·
@malikwas1f @largePrawn I did... Several times, since day 0 and also today. No more than 60 tps with 2x3090. Much better tps with a single GPU and Qwen3.6 35B (130 tps), without even spec dec.
English
1
0
1
57
noname
noname@malikwas1f·
@netobge @largePrawn Those numbers are real life and real time. Go check out the repo and do it yourself.
English
1
0
0
73
Tony Ge
Tony Ge@largePrawn·
Hitting 140 tok/s on Qwen 3.6 27B running vLLM with 2x 3090s using the following @malikwas1f's repo github.com/noonghunna/clu… Literally just pointed claude at it and walked away. Came back to a 2.5x speed bump 🤯🤯🤯
English
18
30
347
27.8K
bgeneto
bgeneto@netobge·
@rafaon3 @luksamuk Your A3B is slow; try Intel/Qwen3.6 35B AutoRound with vLLM. I get 130 tok/s with it and 60 tok/s with Qwen3.6 27B, but I don't use it because these models think too much, and 60 tps becomes extremely slow for coding with 128k tokens.
Português
0
0
0
44
rafaon3
rafaon3@rafaon3·
@luksamuk I run the A3B at 90 tok/s on the 3090 and the 27B at 41 tok/s. I don't understand it, but I accept it.
Português
2
0
1
88
Lucas
Lucas@luksamuk·
So far, the coding champion here, on an RTX 3050 with 6GB of VRAM, has been Qwen 3.6 35B-A3B. Quantization: UD-Q3_K_L. The MoE architecture helps with speed; unmatched quality; good tradeoff with speed. I don't doubt the 27B does better, but it is painfully slow (my own limitation).
Português
18
2
63
3.5K
bgeneto
bgeneto@netobge·
@EnioViterbo What moral standing does he have to talk like that? What a crazy world: an STF justice pretending that pocketing more than 80 million is perfectly fine, life goes on. Only in Brazil.
Português
0
0
1
109
Enio Viterbo
Enio Viterbo@EnioViterbo·
For the love of God. Control yourself, Alexandre. Justice Alexandre de Moraes took advantage of the trial of a case brought by Congressman Gustavo Gayer against another congressman and simply started taking veiled shots at Romeu Zema. A complete misuse of purpose. Disrespect for public money. Disrespect for the law and for criminal procedure. Disrespect for the STF. Justices Alexandre de Moraes and Gilmar Mendes need to learn that just because there is a microphone on the table does not mean they can say whatever they want. The session for judging cases is FOR THE CASES. It is not for singing. It is not for reciting poetry. It is not for sending political messages. If they want a microphone and a podium to send political messages, they should run for Congress.
Português
512
2.2K
11.8K
264.6K
bgeneto
bgeneto@netobge·
@MemoryReboot_ I'm getting 130-160 tok/s with one RTX 3090 and Intel AutoRound Qwen3.6 35B A3B without any spec dec. So you certainly have a regression in speed here.
English
1
0
2
296
Mass
Mass@MemoryReboot_·
DFlash benchmarks on dual RTX 3090
Qwen3.6-35B-A3B AWQ-INT4 + DFlash drafter on vLLM nightly, TP=2

Tried different num_speculative_tokens to find what works:
- n=4: 96.6 tok/s
- n=8: 96.0 tok/s
- n=15 (z-lab's recommended): 20-40 tok/s

n=4 is a sweet spot

For comparison, Qwen3.6-35B-A3B Q6 on llama.cpp gives me 102 tok/s on the same hardware ☹️ What am I doing wrong?
English
12
3
50
4.7K
bgeneto
bgeneto@netobge·
@spiritbuun Still slower... forgot to mention that gguf model used is: lmstudio-community/Qwen3.6-27B-GGUF
cmake -B build --fresh \
  -DGGML_CUDA=ON \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86
English
1
0
1
59
buun
buun@spiritbuun·
@netobge Can you repull from master, build, and try it again? Might have fixed it
English
1
0
0
38
buun
buun@spiritbuun·
Pushed: DFlash implementation for llama-cpp. buun-llama-cpp/llama-server -m Qwen3.6-27B.gguf -md dflash-draft-q4_k_m.gguf --spec-type dflash
English
37
37
419
39.6K
bgeneto
bgeneto@netobge·
@spiritbuun
./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf -md ~/models/dflash-draft-3.6-q4_k_m.gguf --spec-type dflash \
  --reasoning on \
  --reasoning-budget -1 \
  --ctx-size 32000 \
  --fit off \
  -ngl 99 -ngld 99 \
  --flash-attn on \
  -ctk q8_0 \
  -ctv q8_0...
English
1
0
0
94
buun
buun@spiritbuun·
@netobge Can you paste me your llama flags so I can reproduce?
English
1
0
0
36
bgeneto
bgeneto@netobge·
@unbug @elliotarledge Yes. buun's llama.cpp fork has DFlash support, but no luck for me: it worked, but slower.
English
0
0
0
47
bgeneto
bgeneto@netobge·
@spiritbuun I'm getting around 0.54x tok/s with this. RTX 3090 here: 37 tok/s vanilla and 20 tok/s with dflash. 😟
draft acceptance rate = 0.25519 ( 1401 accepted / 5490 generated)
statistics dflash: #calls(b,g,a) = 1 366 304, #gen drafts = 366, #acc drafts = 304, #gen tokens = 5490, #acc
English
1
0
0
73
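The figures in the statistics quoted above can be cross-checked with a few divisions. This is a minimal sketch that uses only the numbers from that post; the variable names are illustrative and the interpretation of the log fields is an assumption.

```python
# Figures copied from the llama-server statistics quoted above.
accepted_tokens = 1401
generated_draft_tokens = 5490
draft_calls = 366          # "#gen drafts" in the log
vanilla_tok_s = 37.0       # RTX 3090 without speculative decoding
dflash_tok_s = 20.0        # RTX 3090 with the DFlash drafter

# Share of drafted tokens that the target model accepted.
acceptance_rate = accepted_tokens / generated_draft_tokens
# Average number of tokens proposed per draft call.
tokens_per_draft = generated_draft_tokens / draft_calls
# Average number of proposed tokens that survived verification per draft call.
accepted_per_draft = accepted_tokens / draft_calls
# Throughput with DFlash relative to the vanilla run.
relative_speed = dflash_tok_s / vanilla_tok_s

print(f"acceptance rate:    {acceptance_rate:.5f}")
print(f"tokens per draft:   {tokens_per_draft:.1f}")
print(f"accepted per draft: {accepted_per_draft:.2f}")
print(f"relative speed:     {relative_speed:.2f}x")
```

On these numbers each draft proposes 15 tokens and fewer than 4 of them are kept on average, which would be consistent with the drafting overhead outweighing the savings in this particular run.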
buun
buun@spiritbuun·
I get around 2x tok/s on average with this, more if it's simple code/json.
English
1
0
14
3.5K
Sandro
Sandro@pupposandro·
TQ3_0 (TurboQuant) KV cache just landed in Lucebox Hub. 22% less VRAM than Q4_0, same decode speed. 262K context on a single RTX 3090 with 1024 MiB to spare. Qwen3.5-27B, Q4_K_M target, DFlash speculative decode. TurboQuant 3.5 bpv with FWHT rotation, CUDA kernels end-to-end, flash-attention plugged in for both K and V. Prefill pays ~12% for the rotation, decode pays nothing. Huge thanks to @dusterbloom for providing this to the community. Repo as usual in the first comment ⬇️
English
21
13
137
7K
bgeneto
bgeneto@netobge·
@ray5ar @outsource_ Doesn't work for me, not even using ik_llama. It runs at 34 tok/s, slower than without spec dec.
English
1
0
2
29
Eric ⚡️ Building...
Eric ⚡️ Building...@outsource_·
Quick update: pushed the 4090 further!💡 192K context at 152 tok/s on Qwen3.6-27B, single GPU. 128K hits 159. Same Q4_K_M. Vanilla Qwen3-1.7B draft beat the distilled 4B draft. Smaller > smarter for spec-dec. Next: 1M context locally + 250-400 tok/s via DFlash + TurboQuant. Receipts coming.
Eric ⚡️ Building...@outsource_

My 4090 went from 26 -> 154 tok/s Qwen 3.6 27B🤯 Same GPU. Same Q4_K_M . No FP8, no extra quant. The unlock: ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model. 85% acceptance rate. Full config + benchmarks 👇🏻

English
49
42
481
41.4K
bgeneto
bgeneto@netobge·
@outsource_ Does NOT work! At least for me... Getting 34 tok/s only with my RTX 3090. 45 tok/s without Qwen3 drafter model. Anyone reached those speeds or only the OP?! 🤔
English
0
0
3
81
Eric ⚡️ Building...
Eric ⚡️ Building...@outsource_·
The exact command I'm running:
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  -md Qwen3-1.7B-Q4_K_M.gguf \
  -ngl 99 -ngld 99 -c 196608 -cd 32768 \
  -fa on -ctk q4_0 -ctv q4_0 \
  --draft-max 12 --draft-min 3 --draft-p-min 0.6 \
  --host 0.0.0.0 --port 8081
For short-context, swap -c 196608 → -c 8192 and -ctk/-ctv → q8_0.
English
9
3
42
3.7K
bgeneto
bgeneto@netobge·
@slap__tjips @outsource_ I think 4-bit is fine for V, but 4-bit for cached keys will kill model precision. Maybe to the point that it is better to use Qwen3.6 35B MoE instead (much faster and close in quality).
English
0
0
1
43
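To put the K/V precision trade-off discussed above in memory terms, here is a rough sizing sketch. It assumes a plain attention-only transformer, and the model shape is HYPOTHETICAL, chosen only to make the arithmetic easy to follow; it is not the real Qwen3.6 configuration, which may cache less. The bits-per-value figures follow from the llama.cpp block layouts for `q4_0` and `q8_0` (32 values plus one fp16 scale per block).

```python
# Bits per cached value for llama.cpp cache types: each block stores 32 values
# plus one 2-byte fp16 scale (q4_0: 16 + 2 bytes, q8_0: 32 + 2 bytes).
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, k_type, v_type):
    """Estimate KV cache size in GiB for a plain attention-only transformer."""
    # Number of cached values in K (and, separately, in V) across all layers.
    values_per_tensor = n_layers * n_kv_heads * head_dim * ctx_len
    total_bits = values_per_tensor * (BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type])
    return total_bits / 8 / 2**30


# HYPOTHETICAL model shape, for illustration only (not Qwen3.6's real config).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=131072)

for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    size = kv_cache_gib(**shape, k_type=k_type, v_type=v_type)
    print(f"K={k_type:5s} V={v_type:5s}: {size:.2f} GiB")
```

The mixed row (8-bit keys, 4-bit values) corresponds to the compromise suggested in the reply above: it keeps keys at higher precision while still saving memory on values. Whether the quality difference matters in practice is exactly what the thread leaves open.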
slap tjips
slap tjips@slap__tjips·
@outsource_ This is very impressive but the quality of output will take a knock at 4-bit KV cache as the context increases...
English
12
0
2
1.1K
bgeneto
bgeneto@netobge·
@malikwas1f @TheAhmadOsman My main concern with your approach is the 3-bit turboquant for keys. We need some KL divergence benchmarks to ensure that using it does not make the dense model "dumber" than the Qwen3.6 35B MoE that runs at 130 tok/s with Intel AutoRound.
English
1
0
1
41
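For readers unfamiliar with the metric requested above, this is a minimal sketch of a per-token KL divergence between a reference run and a quantized run. The logits are HYPOTHETICAL toy values over a 4-token vocabulary; a real benchmark would average this over many positions of an evaluation text, using logits dumped from both configurations.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; P is the reference run, Q is the quantized run."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# HYPOTHETICAL next-token logits, for illustration only.
reference_logits = [4.0, 2.0, 0.5, -1.0]   # e.g. unquantized KV cache
quantized_logits = [3.6, 2.3, 0.4, -0.8]   # e.g. 3-bit key cache

p = softmax(reference_logits)
q = softmax(quantized_logits)
print(f"KL(reference || quantized) = {kl_divergence(p, q):.5f} nats")
```

A value of zero means the quantized run predicts exactly the same distribution as the reference; larger values mean the cache quantization is shifting the model's predictions more.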