bgeneto

122 posts

@netobge

Joined November 2022
139 Following 9 Followers
Tech2Wild
Tech2Wild@Tech2Wild·
Hmm wondering which is better: Gemma 4 12B or Qwen 3.6 35B-A3? 🤔
English
35
1
50
18.7K
bgeneto
bgeneto@netobge·
@acadictive This 30x price increase should be considered illegal by any standard!
English
0
0
0
8
Ehsan
Ehsan@acadictive·
Dear GitHub Copilot team, I am happy to announce that I successfully burned all of my monthly tokens in under 3 days thanks to your garbage new pricing model. I'd also like to inform you that I won't be renewing my subscription or adding more budget. Best, A former customer.
English
311
269
3.8K
723.8K
bgeneto
bgeneto@netobge·
@TheAhmadOsman Qwen3.7 Max is not open-source/open-weights, right? So OP is probably right. Let's see how M3 stacks up against Kimi, though.
English
0
0
0
672
Ahmad
Ahmad@TheAhmadOsman·
Kimi K2.6 is still the current opensource SoTA model
English
53
13
558
54.2K
bgeneto
bgeneto@netobge·
@danieltvela @The_Only_Signal It really is... but then what would we call Qwen3.6 35B A3B (which Alibaba calls Qwen3.6 Flash) running at 160 tps on a single RTX 3090 with 262k context in vLLM, versus 37 tps for the 27B?
English
0
0
1
117
Mike Bradley
Mike Bradley@The_Only_Signal·
Spent the past 48 hours benchmarking Step3.7-Flash vs Qwen3.5-397B. Conclusion: People don’t appreciate Qwen3.6-27B enough.
English
44
14
579
37.2K
bgeneto
bgeneto@netobge·
@leo_linsky @The_Only_Signal Interesting to see Qwen3.6 Flash, a 35B A3B MoE, ahead of the dense 27B while all the hype goes to the much slower 27B.
English
0
0
1
104
Leo Linsky
Leo Linsky@leo_linsky·
@The_Only_Signal Qwen 3.6 27B is such a ridiculous outlier in our testing that we had to re-evaluate our whole methodology. Data holds up. Incredible amount of intelligence AND agentic tool progress packed into a small param count. Data at gertlabs.com/rankings
English
3
0
15
1.7K
bgeneto
bgeneto@netobge·
@CarlosZarattini Let me see if I understand your position: so the data point that dismantles this lie once and for all says that fewer than 3% of beneficiaries hold a formal job... Wow, 3% really is a dismantling...
Português
0
0
0
9
Carlos Zarattini
Carlos Zarattini@CarlosZarattini·
This statement against Bolsa Família is absurd and uninformed. It is unacceptable that we still have to repeat the obvious. Bolsa Família is, indeed, a stimulus to social mobility. The program puts food on the table for nearly 50 million Brazilians and helps entire families get through poverty with dignity. Between 2023 and 2024, 8.6 million people left poverty and 1.9 million left extreme poverty in Brazil. And the data point that dismantles this lie once and for all: in 2024, Bolsa Família beneficiaries filled 1.2 million formal jobs. Anyone who says Bolsa Família makes people "complacent" simply disregards the reality of the Brazilian people. poder360.com.br/poder-economia…
Português
74
281
972
6.5K
bgeneto
bgeneto@netobge·
@cjzafir Have you published your model? Where?
English
0
0
0
42
CJ Zafir
CJ Zafir@cjzafir·
Qwen 3.5 has the best SLMs to fine-tune! Its 4B model is really smart if you train it on a well structured dataset.
I fine-tuned the model on a 135M dataset generated by Codex 5.5 + DeepSeek v4 Pro.
I achieved 96%+ accurate results with Qwen 3.5 4B. And 95% on Qwen 3.5 2B (that only requires 3.5GB RAM).
For context, on the same pipeline:
> Sonnet 4.6 achieved 89%
> GPT 5.4 Mini achieved 85%
> Haiku 4.5 achieved 72%
I don't trust evals, so I ran a 7000+ row hard-boundary test, and the results of Qwen 3.5 were consistent.
A 4B fine-tuned model beating a 20x bigger model in accuracy and latency is no joke.
It cost me $173 in total to generate the dataset and cover the cloud GPU cost to fine-tune both models.
I said this before, and I'll say it again: not everything requires a 1T-parameter LLM. We need ELMs (Expert Language Models) that are specialized for one domain only. ELMs > LLMs.
I'll be writing more about how SLM fine-tuning works. So stay tuned.
English
33
69
695
28.1K
bgeneto
bgeneto@netobge·
@TeksEdge Meanwhile, anyone with a much cheaper RTX 3090 can run Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound with vLLM at 150 t/s, without MTP, with fp8 KV cache and 128k context.
English
3
0
2
583
David Hendrickson
David Hendrickson@TeksEdge·
🤯 Unsloth released the fastest Qwen3.6-27B MTP GGUF I've tested. Time to upgrade.
Compared to the previous GGUF, Q4/Q6 XL versions are 👀 ~55% faster!
On a single RTX 5090:
✅ 114 tok/s — UD-IQ2_M (MTP)
✅ 93 tok/s — UD-Q4_K_XL (MTP)
✅ 75 tok/s — UD-Q6_K_XL (MTP)
💨 Fastest MTP quant is 3.3x faster than the old Q8_0 baseline (35 tps)
262K context + tool calling. All on one 5090.
* compiled from the MTP PR branch ('am17an:mtp-clean', build b9117-ebe4fca4b)
English
33
51
521
46.7K
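A quick arithmetic check of the "3.3x faster" figure in the post above, as a minimal sketch. The throughput numbers are copied from the post; the variable and function names are only illustrative. Note that the ratios compare different quantization levels against the Q8_0 baseline, so they are not like-for-like quality comparisons.

```python
# Throughput figures quoted in the post (tokens per second, single RTX 5090).
mtp_quants = {"UD-IQ2_M": 114, "UD-Q4_K_XL": 93, "UD-Q6_K_XL": 75}
baseline_q8_0 = 35  # old Q8_0 baseline quoted in the post


def speedup(new_tps, old_tps):
    """Return how many times faster new_tps is than old_tps."""
    return new_tps / old_tps


# Ratio of each MTP quant to the Q8_0 baseline, rounded to two decimals.
speedups = {
    name: round(speedup(tps, baseline_q8_0), 2)
    for name, tps in mtp_quants.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio}x over Q8_0")
```

The fastest quant comes out at about 3.26x, which matches the post's rounded "3.3x".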
bgeneto
bgeneto@netobge·
@loktar00 27B quality with 35B tps would be a dream on a 3090. But all those inflated numbers for the 27B on a 3090 are unrealistic and come with compromised quality... 35B with vLLM is so stable that I stopped searching for a better alternative on a single 3090.
English
0
0
2
53
Loktar 🇺🇸
Loktar 🇺🇸@loktar00·
I wish 3.6 35B was just a little better.. the speeds I'm getting are insane.
English
26
5
172
13.4K
bgeneto
bgeneto@netobge·
@malikwas1f @largePrawn I did... Several times, since day 0 and also today. No more than 60 tps with 2x 3090. Much better tps with a single GPU and Qwen3.6 35B (130 tps), without even spec dec.
English
1
0
1
57
noname
noname@malikwas1f·
@netobge @largePrawn Those numbers are real life and real time. Go check out the repo and do it yourself.
English
1
0
0
73
Tony Ge
Tony Ge@largePrawn·
Hitting 140 tok/s on Qwen 3.6 27B running vLLM with 2x 3090s using @malikwas1f's repo github.com/noonghunna/clu… Literally just pointed claude at it and walked away. Came back to a 2.5x speed bump 🤯🤯🤯
English
18
30
346
27.8K
bgeneto
bgeneto@netobge·
@rafaon3 @luksamuk Your A3B is slow; try Intel/Qwen3.6 35B AutoRound with vLLM. I get 130 tok/s with it and 60 tok/s with Qwen3.6 27B, but I don't use it because these models think too much and 60 tps gets extremely slow for coding with 128k tokens.
Português
0
0
0
45
rafaon3
rafaon3@rafaon3·
@luksamuk I run the A3 at 90 tk/s on the 3090 and the 27 at 41 tk/s. I don't understand it, but I accept it.
Português
2
0
1
88
Lucas
Lucas@luksamuk·
So far, the coding champion here, on an RTX 3050 with 6GB of VRAM, has been Qwen 3.6 35B-A3B. Quantization: UD-Q3_K_L. The MoE architecture helps with speed; unmatched quality; good tradeoff with speed. I don't doubt the 27B does better, but it's painfully slow (my own limitation).
Português
18
2
63
3.5K
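The reply in this thread argues that 60 tok/s feels very slow for coding with reasoning-heavy models. A rough sketch of why, using the 130 and 60 tok/s rates quoted in that reply and assuming a hypothetical 8,000-token reasoning trace (that token count is an illustration, not a figure from the thread):

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Wall-clock seconds needed to generate num_tokens at a steady rate."""
    return num_tokens / tokens_per_second


# Hypothetical length of one long "thinking" response; not from the thread.
reasoning_tokens = 8000

# Decode rates quoted in the reply (tokens per second).
rates = {
    "Qwen3.6 35B AutoRound (vLLM)": 130,
    "Qwen3.6 27B": 60,
}

# Time per response for each model, rounded to one decimal place.
wait_times = {
    name: round(generation_seconds(reasoning_tokens, tps), 1)
    for name, tps in rates.items()
}

for name, seconds in wait_times.items():
    print(f"{name}: {seconds} s per {reasoning_tokens}-token response")
```

Under that assumption the wait per response is roughly one minute versus more than two, and the gap is paid again on every turn of an agentic coding session. Prompt processing time is ignored here.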
bgeneto
bgeneto@netobge·
@EnioViterbo What moral standing does he have to talk like that? What a crazy world: an STF justice pretending that pocketing 80+ million is all fine, life goes on. Only in Brazil.
Português
0
0
1
109
Enio Viterbo
Enio Viterbo@EnioViterbo·
For the love of God. Control yourself, Alexandre. Justice Alexandre de Moraes took advantage of the trial of a case brought by Deputy Gustavo Gayer against another deputy and simply started sending veiled jabs at Romeu Zema. A complete misuse of purpose. Disrespect for public money. Disrespect for the law and for criminal procedure. Disrespect for the STF. Justices Alexandre de Moraes and Gilmar Mendes need to learn that having a microphone on the table does not mean they can say whatever they want. The trial session is FOR THE CASES. It's not for singing. It's not for reciting poetry. It's not for sending political messages. If they want a microphone and a bench for sending political messages, they should run for Congress.
Português
510
2.1K
11.8K
264.7K
bgeneto
bgeneto@netobge·
@MemoryReboot_ I'm getting 130-160 tok/s with one RTX 3090 and Intel AutoRound Qwen3.6 35B A3B without any spec dec. So you certainly have a regression in speed here.
English
1
0
2
298
Mass
Mass@MemoryReboot_·
DFlash benchmarks on dual RTX 3090
Qwen3.6-35B-A3B AWQ-INT4 + DFlash drafter on vLLM nightly, TP=2
Tried different num_speculative_tokens to find what works:
- n=4: 96.6 tok/s
- n=8: 96.0 tok/s
- n=15 (z-lab's recommended): 20-40 tok/s
n=4 is a sweet spot
For comparison, Qwen3.6-35B-A3B Q6 on llama.cpp gives me 102 tok/s on the same hardware ☹️
What am I doing wrong?
English
12
3
50
4.7K
bgeneto
bgeneto@netobge·
@spiritbuun Still slower... forgot to mention that gguf model used is: lmstudio-community/Qwen3.6-27B-GGUF
cmake -B build --fresh \
  -DGGML_CUDA=ON \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86
English
1
0
1
59
buun
buun@spiritbuun·
@netobge Can you repull from master, build, and try it again? Might have fixed it
English
1
0
0
38
buun
buun@spiritbuun·
Pushed: DFlash implementation for llama-cpp.
buun-llama-cpp/llama-server -m Qwen3.6-27B.gguf -md dflash-draft-q4_k_m.gguf --spec-type dflash
Dansk
37
37
416
39.7K
bgeneto
bgeneto@netobge·
@spiritbuun
./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf -md ~/models/dflash-draft-3.6-q4_k_m.gguf --spec-type dflash \
  --reasoning on \
  --reasoning-budget -1 \
  --ctx-size 32000 \
  --fit off \
  -ngl 99 -ngld 99 \
  --flash-attn on \
  -ctk q8_0 \
  -ctv q8_0...
English
1
0
0
94
buun
buun@spiritbuun·
@netobge Can you paste me your llama flags so I can reproduce?
English
1
0
0
36
bgeneto
bgeneto@netobge·
@unbug @elliotarledge Yes. buun's llama-cpp has DFlash support, but no luck for me: it worked, but it was slower.
English
0
0
0
47