Marcus Eisele

5.9K posts

@eiselems

10y+ Engineer leveraging AI to ship MVPs while working 9-5. Sharing the tech stack & workflows to build faster without burnout.

Stuttgart Area, Germany · Joined August 2015

1.1K Following · 3.2K Followers

Pinned Tweet

Marcus Eisele@eiselems·Nov 7

I’m not quitting my 9-5. But I’m serious about building things that earn on the side. Sharing what I learn about AI, SaaS, and growing an audience. Follow if you’re building something too ⚙️

Marcus Eisele@eiselems·17h

@yago_r0d @knowRowan @SynapticaTech Just to check: Qwen 35B-A3B, not 27B, right? For me the fit parameter did the trick by doing everything automatically. What hardware do you have besides the 3080? Do you have 32 GB of RAM?

R0d@yago_r0d·21h

@eiselems @knowRowan @SynapticaTech I'm struggling to make it work on my 3080. Aren't you using -ngl 99 --n-cpu-moe 23? Values other than -b 1024 -ub 1024 seem to produce a big delay. With low context it works well, but as soon as it reaches 50k it starts to take a lot of time. Any advice?

Rowan@knowRowan·4d

Looking to get a system for running AI locally. How much VRAM will I need in my GPU, ideally?
- 4GB
- 8GB
- 16GB
- 24GB
- 32GB
- 64GB

Marcus Eisele@eiselems·21h

@Hikari_07_jp This is the noise I "heard" when seeing your first picture. Jealous of that setup, put it to good use :)

Hikari∣LocalLLM⚡@Hikari_07_jp·2d

Running inference at 100% GPU utilization for RepE tuning of Gemma4 31B. Claudecode and Codex exec draw up the plan and execute it. This time the experiment will keep running for the next 15 hours.

Marcus Eisele@eiselems·1d

@leftcurvedev_ MTP is where my 8 GB VRAM build just can't take it anymore at 130k context. It fails to load due to OOM. It feels like MTP for Qwen needs an additional 2 GB on the GPU (which I don't have).

left curve dev@leftcurvedev_·4d

2 draft tokens is the sweet spot for Unsloth's new Qwen3.6 MTP GGUF models. Going higher drops the acceptance rate significantly (as Daniel mentioned).
--spec-type mtp --spec-draft-n-max 2
40% speed-up over the original 27B
15-20% over the original 35B

Daniel Han@danielhanchen

We released experimental MTP Qwen3.6 Unsloth GGUFs! Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU. Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy. Guide + GGUFs + Benchmarks: unsloth.ai/docs/models/qw… In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x. We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial. Use `--spec-type mtp --spec-draft-n-max 2` Thanks to Aman for github.com/ggml-org/llama…!
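To see why more draft tokens can hurt, here is a minimal sketch of the arithmetic. It assumes the quoted acceptance rates (83% and 50%) are per-token probabilities, that drafted tokens are accepted independently, and that drafting stops at the first rejection. None of that is stated in the thread, and `expected_tokens_per_pass` is a hypothetical helper, not part of llama.cpp.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumptions (not from the thread): each drafted token is accepted
    independently with probability accept_prob, the first rejection
    discards the rest of the draft, and the main model always contributes
    one token of its own per pass.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # 1 (main model's token) + p + p^2 + ... + p^draft_len
    return sum(accept_prob ** k for k in range(draft_len + 1))


if __name__ == "__main__":
    for p, n in [(0.83, 2), (0.50, 4)]:
        tokens = expected_tokens_per_pass(p, n)
        print(f"p={p:.2f}, draft tokens={n}: {tokens:.2f} tokens per pass")
```

Under these assumptions, 2 draft tokens at 83% yield about 2.52 tokens per pass, while 4 draft tokens at 50% yield about 1.94, even though the longer draft costs more work per pass.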

Marcus Eisele@eiselems·1d

@yago_r0d @knowRowan @SynapticaTech Runs even better on the Linux partition, with almost the same prompt:

Marcus Eisele@eiselems·1d

@yago_r0d @knowRowan @SynapticaTech Build llama.cpp from source with CUDA. On win64:
.\llama-server.exe -m .\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -fa 1 -ctk q4_0 -ctv q4_0 -b 4096 -ub 2048 --no-mmap --jinja -t 6 -c 131072 --fit on --alias qwen36
As you can see, this is 35B-A3B Q4_K_M.
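For anyone scripting this, the command above can be assembled as an argument list. This is a rough sketch: the flags and values are copied from the post, the helper name `llama_server_args` and its defaults are illustrative, and the flag comments give the usual meaning of these llama.cpp options rather than anything stated in the thread.

```python
import shlex


def llama_server_args(
    model_path: str,
    binary: str = "llama-server",  # on Windows the post uses .\llama-server.exe
    ctx: int = 131072,
    threads: int = 6,
    alias: str = "qwen36",
) -> list[str]:
    """Build the llama-server argument list quoted in the post (hypothetical helper)."""
    return [
        binary,
        "-m", model_path,      # GGUF model file
        "-fa", "1",            # flash attention on
        "-ctk", "q4_0",        # quantize the K part of the KV cache
        "-ctv", "q4_0",        # quantize the V part of the KV cache
        "-b", "4096",          # logical batch size
        "-ub", "2048",         # physical (micro) batch size
        "--no-mmap",           # load the model instead of memory-mapping it
        "--jinja",             # use the model's Jinja chat template
        "-t", str(threads),    # CPU threads
        "-c", str(ctx),        # context length in tokens
        "--fit", "on",         # per the thread: handles the fitting automatically
        "--alias", alias,      # model name exposed by the server
    ]


if __name__ == "__main__":
    args = llama_server_args("Qwen3.6-35B-A3B-UD-Q4_K_M.gguf")
    # shlex.join uses POSIX quoting; pass the list to subprocess.run to launch it.
    print(shlex.join(args))
```

Keeping the arguments as a list avoids shell-quoting problems and makes it easy to vary one value, such as `-c`, between benchmark runs.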

Marcus Eisele@eiselems·2d

It gave me exactly what I asked for as a nicely formatted table. I quickly checked it against the server logs and everything lines up. Worked quite well (in the end MTP gave me a few more tokens/s for generation, made prefill slower, and also added more VRAM pressure on my 8 GB card). Will try it once more on Linux.

Local Model Bench@localmodelbench·2d

@eiselems That “let the model benchmark the setup” loop is the interesting part. The next failure mode is whether it can preserve the artifacts and explain exactly why one run beats another.

Marcus Eisele@eiselems·4d

Told GLM-5.1 to improve my local LLM setup by benchmarking my usual model, qwen3.6 35b a3b, vs its MTP variant (not worth it). Currently it builds ik_llama and does the same benchmark... Not sure how long this would have taken me a while back, insane times.

Marcus Eisele@eiselems·2d

@NVIDIAGeForceDE #007FirstLightRTX imagine being early on the post and then it goes trending 😅 I'd love to have it too, like everyone else here. Good luck to all!

NVIDIA GeForce DE@NVIDIAGeForceDE·2d

Recruits, your first prize is here … A custom GeForce RTX 5080 Founders Edition plus a PC version of the game. Comment with #007FirstLightRTX to win 👇

Marcus Eisele@eiselems·2d

@toi500 @opencode wow happened to me too and I wondered if it is something only I have because my RAM got eaten by my local model xD

toi500@toi500·3d

@opencode OpenCode creates more bugs than it fixes. I have reported this issue on GitHub, wasting too much time. Other users are experiencing the same problems, and it feels like nobody gives a F github.com/anomalyco/open…

OpenCode@opencode·3d

OpenCode x Qwen 3.6 Plus - free, again Last time y’all treated our capacity like an all-you-can-eat buffet. We found more GPUs. Round 2.

Marcus Eisele@eiselems·2d

@thekitze got a mail from IT at work because, how dare I, I compiled llama-server from source 😢 (it was flagged by some heuristic scan shit ...)

kitze@thekitze·3d

phew

kitze@thekitze·3d

gm i woke up to this

Marcus Eisele@eiselems·2d

ok craziest interaction with my agent I had so far ... It read a mail I received from @owocki after clearing a gitcoin bounty in 2018 ... Had to purge the convo because it didn't stop hyping me up for interacting with a celebrity ... and forgot about the task🤣

Marcus Eisele@eiselems·3d

@Cha0tikDino @HermesAgentTips Like a pension home for GPUs? 🤣

Dino Vitale AI Enhanced@Cha0tikDino·3d

@HermesAgentTips I used to be a crypto miner. The way I run my cards now for local inference is downright kind compared to that.

Hermes Agent Tips@HermesAgentTips·3d

There are two kinds of GPUs, and the local AI people keep mixing them up:
gaming GPUs: burst loads, idle most of the time, built for that
inference GPUs (RTX 6000, A6000, etc): sustained load, lower watts per token, built for that
running a 4090 24/7 isn't "local AI" haha ITS ABUSE

Marcus Eisele@eiselems·3d

Ouf I was aware but truth hurts, was nice while it lasted... ☠️ Rip Copilot request based billing

Marcus Eisele@eiselems·3d

@thekitze @OpenAIDevs Guess it was busy one shotting the landing page of.... 😂

kitze@thekitze·3d

@OpenAIDevs we get it

OpenAI Developers@OpenAIDevs·3d

Marcus Eisele@eiselems·3d

@forgebitz for real? ouch, sorry for your loss

Klaas@forgebitz·3d

did you know that if you spill water on your desk the fans of your mac studio will suck it up i learned a 4800$ lesson today

Marcus Eisele@eiselems·3d

@knowRowan @SynapticaTech I was REALLY surprised how well qwen 3.6 35b a3b runs on my limited hardware. Only having 3B active parameters makes all the difference. Hardware: 5600X, 32 GB DDR4, RTX 3070 (8 GB VRAM). ~800 t/s prefill / ~25 t/s generation at 130k context.
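As a back-of-the-envelope check on what those numbers mean for a long prompt, here is a small sketch. It assumes both figures are tokens per second, that the rates stay constant over the whole run, and that nothing is served from a prompt cache. The 500-token reply is a made-up example, and `turn_latency_seconds` is a hypothetical helper.

```python
def turn_latency_seconds(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    gen_tps: float,
) -> float:
    """Rough wall-clock time for one request: prefill time plus generation time."""
    if prefill_tps <= 0 or gen_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


if __name__ == "__main__":
    # 130k-token prompt at ~800 t/s prefill, 500-token reply at ~25 t/s.
    total = turn_latency_seconds(130_000, 500, prefill_tps=800, gen_tps=25)
    print(f"about {total:.1f} s for a cold 130k-token prompt")
```

With these inputs the prefill alone takes 162.5 s and the reply another 20 s, so a full cold context is a matter of minutes, while short follow-ups are dominated by generation speed.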

Rowan@knowRowan·4d

@SynapticaTech Prolly gon break the bank but I suppose I ought to transition to that soon

Marcus Eisele@eiselems·3d

For me it was not worth it, but the thing is that my setup is already running almost at capacity with heavy offloading. I have a 5600X with 32 GB DDR4 + RTX 3070 (8 GB VRAM). I can drive qwen 3.6 35b a3b to around ~150k context while keeping it usable at ~20-25 t/s. MTP and ik_llama didn't make any difference for me; DDR4 is the limit here.
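A rough way to sanity-check the claim that DDR4 is the limit: if every active weight has to be streamed from system RAM for each generated token, memory bandwidth caps generation speed. The sketch below assumes dual-channel DDR4-3200 (about 51.2 GB/s peak) and roughly 4.5 bits per weight for a Q4_K_M quant. Neither figure comes from the thread, real setups keep part of the model on the GPU, and `bandwidth_bound_tps` is a hypothetical helper.

```python
def bandwidth_bound_tps(
    active_params: float,
    bits_per_weight: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on tokens/s when each token must read all active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


# Assumed memory configuration: 2 channels x 8 bytes x 3200 MT/s = 51.2 GB/s peak.
DDR4_3200_DUAL_CHANNEL = 2 * 8 * 3200e6

if __name__ == "__main__":
    # 3B active parameters (from the thread), ~4.5 bits per weight (assumption).
    bound = bandwidth_bound_tps(3e9, 4.5, DDR4_3200_DUAL_CHANNEL)
    print(f"bandwidth-bound ceiling: {bound:.1f} tokens/s")
```

Under these assumptions the ceiling comes out near 30 tokens/s, the same ballpark as the reported 20-25 t/s, which is consistent with faster decoding tricks not helping much on this machine.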

Crown 👑@barackomaba·3d

@eiselems Which hardware? Should I also update lol. I'm finding that the new MTP might be worth it for most things I'm doing, but I'm still testing that out. Particularly as an aux model for Hermes, which is mostly low context and needs speed.

Marcus Eisele@eiselems·3d

@thekitze See you on Monday 😂

kitze@thekitze·3d

i am done with the codex app

Marcus Eisele@eiselems·3d

@0xSero @Flo408800717920 flex 🤣

0xSero@0xSero·4d

@Flo408800717920 No quant

0xSero@0xSero·4d

The best model you can run under 20K USD

Marcus Eisele@eiselems·3d

@steipete With that release interval, I am not surprised tbh. Still, that repo is kind of crazy ngl. More commits in a day than an enterprise repo in a year 🤣

Peter Steinberger 🦞@steipete·3d

lol half the replies have no idea how git works

Peter Steinberger 🦞@steipete·4d

I was wondering why the OpenClaw repo got so large. Turns out the CHANGELOG.md file takes up almost 500MB through all packfiles.

Marcus Eisele@eiselems·3d

@leftcurvedev_ @UnslothAI :/ but still impressed with the baseline so far... (was with -c 131072)

left curve dev@leftcurvedev_·4d

@eiselems @UnslothAI Same for 35B so far, ggufs are heavier in size and I feel vram stress is not helping us here. I’ll try again and let you know if I manage to make it work

left curve dev@leftcurvedev_·4d

Here are my results for Qwen3.6 27B MTP model vs base setup: ~30% extra speed 🔥
Used the specific MTP PR branch and downloaded the new GGUF from @UnslothAI
git clone -b mtp-clean github.com/am17an/llama.c…
--spec-type draft-mtp --spec-draft-n-max 2
huggingface.co/unsloth/Qwen3.…
