witcheer ☯︎

@witcheer

12.4K posts

Head of Growth @YariFinance | Founder @Broad_Land | Prev @KPMG

huggingface.co/witcheer · Joined October 2021
1.2K Following · 9.6K Followers
witcheer ☯︎@witcheer·
ollama 0.24.0 is out and wtf is their new desktop coding app called codex?

built-in browser for annotating web pages. review mode for commenting and iterating on code. model switching between local and cloud. all running against whatever model you have pulled locally.

recommended models from the release notes:
> kimi-k2.6 and glm-5.1 for hard coding and agentic tasks (cloud via ollama)
> nemotron-3-super, gemma4:31b, and qwen3.6 for running entirely local

also in this release: the MLX sampler got reworked for better generation quality on apple silicon. if your mac mini outputs have felt slightly off compared to the same model on CUDA, this is the fix.

github.com/ollama/ollama/…
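For context on the local side of that model-switching workflow, pulling and running one of the recommended local models is a standard ollama CLI flow; a minimal sketch, assuming the model tags match the names given in the release notes (the exact registry tags may differ):

~~~
# pull one of the locally-runnable models named in the release notes
# (tag assumed from the post; check the registry for the exact name)
ollama pull gemma4:31b

# run it interactively against the local weights
ollama run gemma4:31b "review this function for off-by-one errors"

# the cloud-routed picks (kimi-k2.6, glm-5.1) would be run the same way,
# with ollama handling the cloud hop per the release notes
~~~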
Jake@JakeKAllDay·
@witcheer FYI I know you've been trying out moe offload as well
Jake@JakeKAllDay·
A fun finding for the 12GB VRAM + MoE RAM offload gang: 60 tok/s sustained with Qwen 3.6 35B A3B! (MTP + Turboquant and a few other tricks) 🧵
witcheer ☯︎@witcheer·
This morning I was looking for a Codex-ish app for Claude Code, is there really nothing?? I’ve been loyal to Anthropic for a year now but I’ve never been this close to switching
Jay | The Midwest Dad |
@witcheer Have you tried squeezing in a 27b model? I realize it's not really ideal, but found Qwen 3.5 surprisingly functional if you don’t mind waiting a bit lol
witcheer ☯︎@witcheer·
tested Mistral 7B v0.3 Instruct on my RTX 4060 Ti 8GB.

7.3B params, all active on every token. dense transformer, no MoE tricks. 4.1 GB on disk (Q4_K_M).

> 56.4 tok/s at baseline
> degrades to 26.4 tok/s at 32K context (-53%), but no hard cliff

the interesting part is quality. I ran 6 targeted tests (JSON, code gen, logic, system prompt adherence, hallucination resistance, format switching). passed 5/6: best score in my test set. the extra params show, especially on reasoning tasks.

however, it eats 6.7 GB VRAM at idle. only 1.5 GB free. every parameter is active, no offloading tricks to play. you get the full model or nothing (could have missed something tho).

no thinking mode, no reasoning_content overhead. every token is content. simple to use with a local agent.

full results with prompt/response traces: huggingface.co/datasets/witch…
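A rough way to reproduce numbers like these with llama.cpp, assuming a Q4_K_M GGUF of Mistral 7B Instruct v0.3 is already on disk; the file name, context size, and prompt below are placeholders rather than the author's exact invocation:

~~~
# dense 7B at Q4_K_M fits fully on an 8GB card, so offload all layers (-ngl 99)
# -c 32768 mirrors the 32K-context data point in the post
llama-cli -m ./mistral-7b-instruct-v0.3.Q4_K_M.gguf \
  -ngl 99 -c 32768 -n 256 \
  -p "Return a JSON object with keys name and version."

# llama-bench reports prefill/decode tok/s without an interactive session
llama-bench -m ./mistral-7b-instruct-v0.3.Q4_K_M.gguf -ngl 99 -p 512 -n 128
~~~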
If Beli@beli_if·
@witcheer Keep sharing brother :) Right now you are my favorite Twitter account.
witcheer ☯︎@witcheer·
Seeing 3.5k bookmarks in just 2 weeks means more to me than any impression count ever could. It shows y’all find my content interesting enough to bookmark and come back to later when you have more time. I really appreciate it!
Jake@JakeKAllDay·
@witcheer What's the MT/s on your RAM? DDR4 yea?
tieuho2k7@tieuho2k7·
@witcheer can u share cli command to run it with params?
AJ@ItsmeAjayKV·
@witcheer Oh those are crazy stats, but not surprising considering you're sharing valuable information. Keep on going!
DZ@prodbitz·
@witcheer Have you tested qwen3.6-27b?
witcheer ☯︎@witcheer·
I used to run everything through ollama and LM Studio. wrappers handle the complexity, one click, it works.

then I needed the -ncmoe flag for MoE partial offload on my 4060 Ti and neither wrapper exposed it. so i compiled llama.cpp from source in WSL2. cmake, ninja, cuda toolkit, 40 minutes. 629 build targets, zero errors.

turboquant KV cache types (turbo2, turbo3) exist in forks right now. wrappers get them weeks later, if ever.

every flag is yours. -ncmoe, --cache-type-v turbo3, -DCMAKE_CUDA_ARCHITECTURES=89 targeting only your GPU’s architecture. the binary is smaller and faster because it’s built for your exact hardware.

debugging is possible. when hermes decode dropped from 31 to 9 tok/s, I could trace it to graph splits jumping from 62 to 82. in a wrapper that’s a black box.

ollama and LM Studio are the right starting point. once you’re running agents 24/7 and hitting limits, compile from source. the complexity is worth it because the control is real.

~~~
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 -G Ninja
cmake --build build -j$(nproc)
~~~
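Once the build finishes, launching the binary with the flags named above would look roughly like this. The -ncmoe flag and the turbo* cache types are the fork-specific options described in the post (flag spelling differs between forks and upstream builds), and the model path, offload count, and port are placeholders:

~~~
# serve the freshly built binary with full control over every flag
# -ncmoe and --cache-type-v turbo3 are the fork-specific options from the post
./build/bin/llama-server \
  -m ./qwen3.6-35b-a3b.Q4_K_M.gguf \
  -c 32768 -ngl 99 \
  -ncmoe 22 \
  --cache-type-v turbo3 \
  --port 8080
~~~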
Yaori@Pashoke1·
@witcheer 3.5k bookmarks is insane growth man
witcheer ☯︎@witcheer·
I might have found my new local LLM, and I don't see a lot of people talking about it. if you have more data and context about this model please let me know.

I tested LFM2 from @liquidai. that's a hybrid SSM + conv + MoE, only 10 of 40 layers use attention, the rest are state-space layers. 2.3B active params out of 24B total.

my results:
> 52.2 tok/s decode at 32K context
> ncmoe 22 sweet spot (18 of 38 expert layers on GPU)
> 13.4 GB Q4_K_M - lightest of all

three-way head-to-head (same rig, same method, same quant):
> LFM2: 52.2 tok/s - new champion
> Qwen3.6 35B A3B: 35.4 tok/s
> Gemma 4 26B A4B: 29.3 tok/s

I then ran 12 stress tests beyond speed: JSON compliance, reasoning, coding, long context, multi-turn tracking, repetition. quality is solid for technical work, but it fails on creative diversity at temp=0 and cross-entity reasoning. but hey, on 8GB VRAM, it works pretty damn well!

SSM layers need no KV cache, so context scaling is nearly free. 65K at 37.6 tok/s while Qwen drops to 17.4.

data + methodology on HF: huggingface.co/datasets/witch…
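Finding an "ncmoe 22 sweet spot" like the one above amounts to sweeping the MoE offload count and reading back the decode speed. A sketch of that sweep, reusing the fork-specific -ncmoe flag from the earlier post; the model filename and the sweep range are assumptions, and the right values depend on your VRAM:

~~~
#!/usr/bin/env bash
# sweep MoE offload counts and print the decode tok/s line for each run
# model path, -ncmoe spelling, and range are assumptions, not the author's script
MODEL=./lfm2-24b-a2.3b.Q4_K_M.gguf

for n in 14 18 22 26 30; do
  echo "=== ncmoe $n ==="
  ./build/bin/llama-cli -m "$MODEL" \
    -ngl 99 -ncmoe "$n" -c 32768 -n 128 \
    -p "Write a haiku about KV caches." 2>&1 |
    grep -i "tokens per second"
done
~~~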