Marcus Eisele

5.9K posts

@eiselems

10y+ Engineer leveraging AI to ship MVPs while working 9-5. Sharing the tech stack & workflows to build faster without burnout.

Stuttgart Area, Germany · Joined August 2015
1.1K Following · 3.2K Followers
Pinned Tweet
Marcus Eisele
Marcus Eisele@eiselems·
I’m not quitting my 9-5. But I’m serious about building things that earn on the side. Sharing what I learn about AI, SaaS, and growing an audience. Follow if you’re building something too ⚙️
English
30
7
127
8K
Marcus Eisele
Marcus Eisele@eiselems·
@yago_r0d @knowRowan @SynapticaTech Just to check: Qwen 35B-A3B, not 27B, right? For me the --fit parameter did the trick by doing everything automatically. What hardware do you have besides the 3080? Do you have 32 GB of RAM?
English
1
0
0
38
R0d
R0d@yago_r0d·
@eiselems @knowRowan @SynapticaTech I'm struggling to make it work on my 3080. Aren't you using -ngl 99 --n-cpu-moe 23? Values other than -b 1024 -ub 1024 seem to produce a big delay. With low context it works well, but as soon as it reaches 50k it starts to take a lot of time. Any advice?
English
1
0
0
18
Rowan
Rowan@knowRowan·
Looking to get a system for running AI locally, how much VRAM will I need in my GPU ideally?
- 4GB
- 8GB
- 16GB
- 24GB
- 32GB
- 64GB
Rowan tweet media
English
79
1
70
3.9K
Marcus Eisele
Marcus Eisele@eiselems·
@Hikari_07_jp This is the noise I "heard" when seeing your first picture. Jealous of that setup, put it to good use :)
English
1
0
1
25
Hikari∣LocalLLM⚡
Hikari∣LocalLLM⚡@Hikari_07_jp·
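Running inference at 100% GPU utilization for RepE tuning of Gemma4 31B. Claudecode and Codex exec draft the plan and execute it. This time the experiment will keep running for the next 15 hours.
Japanese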
2
1
10
582
Marcus Eisele
Marcus Eisele@eiselems·
@leftcurvedev_ MTP is where my 8 GB VRAM build just can't take it anymore at 130k context. It fails to load due to OOM; it feels like MTP for Qwen costs an additional 2 GB on the GPU (which I don't have).
English
0
0
1
48
left curve dev
left curve dev@leftcurvedev_·
2 draft tokens is the sweet spot for Unsloth's new Qwen3.6 MTP GGUF models. Going higher drops the acceptance rate significantly (as Daniel mentioned).
--spec-type mtp --spec-draft-n-max 2
40% speed-up over the original 27B
15-20% over the original 35B
Daniel Han@danielhanchen

We released experimental MTP Qwen3.6 Unsloth GGUFs! Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU. Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy.
Guide + GGUFs + Benchmarks: unsloth.ai/docs/models/qw…#mtp-guide
In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x. We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial.
Use `--spec-type mtp --spec-draft-n-max 2`
Thanks to Aman for github.com/ggml-org/llama…!

English
9
5
81
7.9K
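The draft-token trade-off described in the quoted tweet can be sketched with a toy model. This is only an illustration under loud assumptions: each draft token is accepted independently with a fixed probability, drafting stops at the first rejection, and the cost of the extra draft passes is ignored. The function name `expected_tokens_per_step` is made up for this sketch, and reading the quoted 83% and 50% figures as per-token acceptance probabilities is an interpretation, not something the thread confirms.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    ASSUMPTIONS (not from the thread): every draft token is accepted
    independently with probability accept_prob, and acceptance stops at the
    first rejected token. The main model always contributes one token itself.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    # Draft token k only counts if all k tokens up to it were accepted.
    return 1.0 + sum(accept_prob ** k for k in range(1, n_draft + 1))


if __name__ == "__main__":
    # Figures quoted in the thread, read as per-token acceptance probabilities.
    for prob, drafts in [(0.83, 2), (0.50, 4)]:
        tokens = expected_tokens_per_step(prob, drafts)
        print(f"accept={prob:.2f} drafts={drafts} -> {tokens:.2f} tokens per pass")
```

Under these assumptions, two drafts at 83% acceptance yield more tokens per pass than four drafts at 50%, which is at least consistent with the advice to stay at 2. The measured speed-ups in the thread are lower than tokens-per-pass alone would suggest because drafting and verification are not free.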
Marcus Eisele
Marcus Eisele@eiselems·
@yago_r0d @knowRowan @SynapticaTech Build llama.cpp from source with CUDA. win64:
.\llama-server.exe -m .\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -fa 1 -ctk q4_0 -ctv q4_0 -b 4096 -ub 2048 --no-mmap --jinja -t 6 -c 131072 --fit on --alias qwen36
As you can see, this is 35B-A3B Q4_K_M.
Marcus Eisele tweet media
English
2
0
0
63
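The command above pairs a 131072-token context (`-c 131072`) with a q4_0 KV cache (`-ctk q4_0 -ctv q4_0`). A rough sketch of why the cache type matters at that context length follows. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen3.6 35B-A3B configuration, and the formula assumes every layer keeps a full K and V cache, which may not hold for this model. Only the block sizes (32 values plus a 2-byte scale for q4_0 and q8_0, 2 bytes per value for f16) are taken as known.

```python
# (elements per block, bytes per block) for a few KV cache types.
# q4_0 and q8_0 pack 32 values together with one 2-byte scale.
CACHE_TYPES = {
    "f16": (1, 2),
    "q8_0": (32, 34),
    "q4_0": (32, 18),
}


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, cache_type: str) -> int:
    """Estimate KV cache size, assuming every layer stores full K and V."""
    elems_per_block, bytes_per_block = CACHE_TYPES[cache_type]
    row = n_kv_heads * head_dim  # values per token, per layer, for K (or V)
    if row % elems_per_block:
        raise ValueError("row size must be a multiple of the block size")
    elements = 2 * n_layers * n_ctx * row  # factor 2 covers K and V
    return elements // elems_per_block * bytes_per_block


if __name__ == "__main__":
    # HYPOTHETICAL model shape, chosen only to make the arithmetic concrete.
    shape = dict(n_ctx=131072, n_layers=40, n_kv_heads=2, head_dim=128)
    for cache_type in CACHE_TYPES:
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
        print(f"{cache_type}: {gib:.2f} GiB")
```

Whatever the real shape is, the ratio between cache types holds under this formula: q4_0 needs 18/64 of the f16 footprint, which is the kind of saving that makes long contexts plausible on an 8 GB card.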
Marcus Eisele
Marcus Eisele@eiselems·
It gave me exactly what I asked for as a nicely formatted table. I quickly checked it against the server logs and everything lines up. Worked quite well (in the end MTP gave me a few more tokens/s for generation, made prefill slower, and added more VRAM pressure on my 8 GB card). Will try it once more on Linux.
English
0
0
0
11
Local Model Bench
Local Model Bench@localmodelbench·
@eiselems That “let the model benchmark the setup” loop is the interesting part. The next failure mode is whether it can preserve the artifacts and explain exactly why one run beats another.
English
1
0
0
19
Marcus Eisele
Marcus Eisele@eiselems·
Told GLM-5.1 to improve my local LLM setup by benchmarking my usual model, Qwen3.6 35B-A3B, against its MTP variant (not worth it). Currently it builds ik_llama and does the same benchmark... Not sure how long this would have taken me a while back, insane times.
English
2
0
0
180
NVIDIA GeForce DE
NVIDIA GeForce DE@NVIDIAGeForceDE·
Recruits, your first prize is here … A custom GeForce RTX 5080 Founders Edition plus a PC version of the game. Comment with #007FirstLightRTX to win 👇
German
2K
142
821
71.4K
Marcus Eisele
Marcus Eisele@eiselems·
@toi500 @opencode wow happened to me too and I wondered if it is something only I have because my RAM got eaten by my local model xD
English
0
0
1
460
toi500
toi500@toi500·
@opencode OpenCode creates more bugs than it fixes. I have reported this issue on GitHub, wasting too much time. Other users are experiencing the same problems, and it feels like nobody gives a F. github.com/anomalyco/open…#issuecomment-4425600987
toi500 tweet media
English
6
1
57
15.2K
OpenCode
OpenCode@opencode·
OpenCode x Qwen 3.6 Plus - free, again Last time y’all treated our capacity like an all-you-can-eat buffet. We found more GPUs. Round 2.
English
203
398
7.1K
613.9K
Marcus Eisele
Marcus Eisele@eiselems·
@thekitze got a mail from IT at work asking how I dare to compile llama-server from source 😢 (it was flagged by some heuristic scan shit ...)
English
0
0
0
21
kitze
kitze@thekitze·
phew
kitze tweet media
English
2
0
12
2.7K
kitze
kitze@thekitze·
gm i woke up to this
kitze tweet media
English
21
0
95
20.1K
Marcus Eisele
Marcus Eisele@eiselems·
ok craziest interaction with my agent I had so far ... It read a mail I received from @owocki after clearing a gitcoin bounty in 2018 ... Had to purge the convo because it didn't stop hyping me up for interacting with a celebrity ... and forgot about the task🤣
English
0
0
2
80
Hermes Agent Tips
Hermes Agent Tips@HermesAgentTips·
There are two kinds of GPUs and the local AI people keep mixing them up:
gaming GPUs: burst loads, idle most of the time, built for that
inference GPUs (RTX 6000, A6000, etc): sustained load, lower watts per token, built for that
running a 4090 24/7 isn't "local AI" haha ITS ABUSE
English
26
1
37
4.3K
Marcus Eisele
Marcus Eisele@eiselems·
Ouf I was aware but truth hurts, was nice while it lasted... ☠️ Rip Copilot request based billing
Marcus Eisele tweet media
English
0
0
0
169
Klaas
Klaas@forgebitz·
did you know that if you spill water on your desk the fans of your mac studio will suck it up i learned a 4800$ lesson today
English
52
6
258
19.6K
Marcus Eisele
Marcus Eisele@eiselems·
@knowRowan @SynapticaTech I was REALLY surprised how well Qwen 3.6 35B-A3B runs on my limited hardware. Only having 3B active parameters makes all the difference. 5600X, 32 GB DDR4 + RTX 3070 with 8 GB VRAM. ~800 t/s prefill / ~25 t/s generation at 130k context.
English
1
0
1
175
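To put the quoted throughput in wall-clock terms, here is a small back-of-the-envelope sketch. It assumes "~800" means tokens per second for prefill, that a full 130,000-token prompt is processed from scratch with no cached prefix, and it uses a hypothetical 500-token reply. The helper name `seconds_for` is made up for this example.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate (a simplification)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


if __name__ == "__main__":
    prefill = seconds_for(130_000, 800)  # rates quoted in the tweet above
    reply = seconds_for(500, 25)         # 500 tokens is a hypothetical reply length
    print(f"prefill: {prefill:.1f} s, reply: {reply:.1f} s")
```

With these assumptions a cold 130k-token prompt takes a few minutes before the first token appears, so in practice how much of the prompt can be reused between requests matters as much as the raw generation speed.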
Rowan
Rowan@knowRowan·
@SynapticaTech Prolly gon break the bank but I suppose I ought to transition to that soon
English
1
0
2
149
Marcus Eisele
Marcus Eisele@eiselems·
For me it was not worth it, but the thing is that my setup is already running almost at capacity with heavy offloading. I have a 5600X with 32 GB DDR4 + RTX 3070 (8 GB VRAM). It can drive Qwen 3.6 35B-A3B to around ~150k context while keeping it usable at ~20-25 t/s. MTP and ik_llama didn't make any difference for me; DDR4 is the limit here.
English
1
0
1
84
Crown 👑
Crown 👑@barackomaba·
@eiselems Which hardware? Should I also update lol. I'm finding that the new MTP might be worth it for most things that I'm doing, but I'm still testing that out. Particularly as an aux model for Hermes, which is mostly low context and needs speed.
English
1
0
1
38
kitze
kitze@thekitze·
i am done with the codex app
English
52
1
179
62.9K
0xSero
0xSero@0xSero·
The best model you can run under 20K USD
0xSero tweet media
English
40
18
623
31.3K
Marcus Eisele
Marcus Eisele@eiselems·
@steipete with that release interval, I am not surprised tbh. Still, that repo is kind of crazy ngl. More commits in a day than an enterprise repo in a year 🤣
Marcus Eisele tweet media
English
0
0
1
546
Peter Steinberger 🦞
I was wondering why the OpenClaw repo got so large, turns out the CHANGELOG md file takes up almost 500MB through all packfiles.
Peter Steinberger 🦞 tweet media
English
78
19
1.3K
165.6K
left curve dev
left curve dev@leftcurvedev_·
@eiselems @UnslothAI Same for 35B so far, ggufs are heavier in size and I feel vram stress is not helping us here. I’ll try again and let you know if I manage to make it work
English
2
0
2
396