Pinned Tweet
Marcus Eisele
5.9K posts

Marcus Eisele
@eiselems
10y+ Engineer leveraging AI to ship MVPs while working 9-5. Sharing the tech stack & workflows to build faster without burnout.
Stuttgart Area, Germany · Joined August 2015
1.1K Following · 3.2K Followers

@yago_r0d @knowRowan @SynapticaTech Just to check: Qwen 35B-A3B, not 27B, right?
For me, the fit parameter did the trick by handling everything automatically.
What kind of hardware do you have besides the 3080? Do you have 32 GB of RAM?

@eiselems @knowRowan @SynapticaTech I'm struggling to make it work on my 3080. Aren't you using -ngl 99 --n-cpu-moe 23?
Values other than -b 1024 -ub 1024 seem to produce a big delay.
With low context it works well, but as soon as it reaches 50k it starts to take a lot of time.
Any advice?

@Hikari_07_jp This is the noise I "heard" when seeing your first picture.
Jealous of that setup, put it to good use :)

@leftcurvedev_ MTP is where my 8 GB VRAM build just can't take it anymore at 130k context. It fails to load due to OOM; it feels like MTP for Qwen needs an additional 2 GB on the GPU (which I don't have).

2 draft tokens is the sweet spot for Unsloth's new Qwen3.6 MTP GGUF models. Going higher drops the acceptance rate significantly (as Daniel mentioned)
--spec-type mtp --spec-draft-n-max 2
40% speed-up over the original 27B
15-20% over the original 35B
Daniel Han@danielhanchen
We released experimental MTP Qwen3.6 Unsloth GGUFs! Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU. Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy. Guide + GGUFs + Benchmarks: unsloth.ai/docs/models/qw…
In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x. We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial. Use `--spec-type mtp --spec-draft-n-max 2` Thanks to Aman for github.com/ggml-org/llama…!
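
A rough back-of-the-envelope sketch of what those figures imply, assuming the quoted speed-up factors apply directly to the quoted tokens/s numbers (the function name is made up for illustration, and this is not a benchmark):

```python
def implied_baseline_tps(mtp_tps: float, speedup: float) -> float:
    """Tokens/s the non-MTP model would run at if the quoted speed-up holds."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return mtp_tps / speedup


# Figures quoted in the tweet: 27B MTP at 140 t/s with ~1.4x,
# 35B-A3B MTP at 220 t/s with ~1.15x to 1.2x.
dense = implied_baseline_tps(140, 1.4)
moe_low = implied_baseline_tps(220, 1.2)
moe_high = implied_baseline_tps(220, 1.15)

print(f"27B implied baseline: ~{dense:.0f} t/s")
print(f"35B-A3B implied baseline: ~{moe_low:.0f} to ~{moe_high:.0f} t/s")
```

The MoE range is wide because the tweet only gives "around 1.15 to 1.2x"; real baselines depend on hardware and quantization.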

@yago_r0d @knowRowan @SynapticaTech Build llama.cpp from source with CUDA:
win64: .\llama-server.exe -m .\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -fa 1 -ctk q4_0 -ctv q4_0 -b 4096 -ub 2048 --no-mmap --jinja -t 6 -c 131072 --fit on --alias qwen36
As you can see, this is 35B-A3B Q4_K_M.
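
For anyone scripting this, a minimal Python sketch that assembles the same flags into an argument list. The flag values are copied from the command in the tweet; the helper name, the bare `llama-server` binary name, and the flag descriptions in the comments are my own reading and should be checked against `llama-server --help` for your build.

```python
import shlex


def build_llama_server_args(
    model_path: str,
    ctx_size: int = 131072,
    threads: int = 6,
    alias: str = "qwen36",
) -> list[str]:
    """Return the llama-server argument list from the tweet (hypothetical helper)."""
    return [
        "llama-server",        # assumed binary name; the tweet runs llama-server.exe on win64
        "-m", model_path,      # GGUF model file
        "-fa", "1",            # flash attention on
        "-ctk", "q4_0",        # quantize the K cache
        "-ctv", "q4_0",        # quantize the V cache
        "-b", "4096",          # logical batch size
        "-ub", "2048",         # physical (micro) batch size
        "--no-mmap",           # load the model without memory-mapping
        "--jinja",             # use the model's Jinja chat template
        "-t", str(threads),    # CPU threads
        "-c", str(ctx_size),   # context length in tokens
        "--fit", "on",         # per the thread: handles the fitting automatically
        "--alias", alias,      # model name exposed by the server
    ]


args = build_llama_server_args("Qwen3.6-35B-A3B-UD-Q4_K_M.gguf")
print(shlex.join(args))
```

Keeping the flags in a list makes it easy to change the context size or thread count per machine before handing the list to `subprocess.run`.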


@eiselems That “let the model benchmark the setup” loop is the interesting part. The next failure mode is whether it can preserve the artifacts and explain exactly why one run beats another.

@NVIDIAGeForceDE #007FirstLightRTX Imagine being early on the post and then it goes trending 😅
I'd like to have it too, like everyone else here. Good luck to everyone!

Recruits, your first prize is here …
A custom GeForce RTX 5080 Founders Edition plus a PC copy of the game.
Comment with #007FirstLightRTX to win 👇

@opencode OpenCode creates more bugs than it fixes.
I have reported this issue on GitHub, wasting too much time. Other users are experiencing the same problems, and it feels like nobody gives a F
github.com/anomalyco/open…


@thekitze Got a mail from IT at work because how dare I compile llama-server from source 😢 (it was flagged by some heuristic scan shit ...)

ok craziest interaction with my agent I had so far ...
It read a mail I received from @owocki after clearing a gitcoin bounty in 2018 ...
Had to purge the convo because it didn't stop hyping me up for interacting with a celebrity ... and forgot about the task🤣

@HermesAgentTips I used to be a crypto miner; the way I run my cards now for local inference is downright kind compared to that.

@thekitze @OpenAIDevs Guess it was busy one shotting the landing page of.... 😂

@knowRowan @SynapticaTech I was REALLY surprised how well qwen 3.6 35b a3b runs on my limited hardware.
Only having 3b active parameters makes all the difference.
5600x 32gb ddr4 + rtx3070 8gb vram.
~800 prefill / ~25 t/s at 130k context

@SynapticaTech Prolly gon break the bank but I suppose I ought to transition to that soon

For me it was not worth it but the thing is that it is almost running at capacity with heavy offloading.
Have a 5600x with 32gb ddr4 + rtx3070 (8GB Vram).
Can drive qwen 3.6 35b a3b to around ~150k context while keeping it usable ~20-25t/s.
MTP and ik_llama didn't make any difference for me either; DDR4 is the limit here.

@steipete With that release interval, I am not surprised tbh. Still, that repo is kind of crazy ngl
More commits in a day than an enterprise repo in a year 🤣


@leftcurvedev_ @UnslothAI :/ but still impressed with the baseline so far... (was with -c 131072)


@eiselems @UnslothAI Same for 35B so far, ggufs are heavier in size and I feel vram stress is not helping us here. I’ll try again and let you know if I manage to make it work

Here are my results for Qwen3.6 27B MTP model vs base setup: ~30% extra speed 🔥
Used the specific MTP PR branch and downloaded the new GGUF from @UnslothAI
git clone -b mtp-clean github.com/am17an/llama.c…
--spec-type draft-mtp --spec-draft-n-max 2
huggingface.co/unsloth/Qwen3.…















