witcheer ☯︎

@witcheer

12.4K posts

Head of Growth @YariFinance | Founder @Broad_Land | Prev @KPMG

huggingface.co/witcheer · Joined October 2021
1.2K Following · 9.6K Followers
witcheer ☯︎@witcheer·
ollama 0.24.0 is out and wtf is their new desktop coding app called codex?

built-in browser for annotating web pages. review mode for commenting and iterating on code. model switching between local and cloud. all running against whatever model you have pulled locally.

recommended models from the release notes:
> kimi-k2.6 and glm-5.1 for hard coding and agentic tasks (cloud via ollama)
> nemotron-3-super, gemma4:31b, and qwen3.6 for running entirely local

also in this release: the MLX sampler got reworked for better generation quality on apple silicon. if your mac mini outputs have felt slightly off compared to the same model on CUDA, this is the fix.

github.com/ollama/ollama/…
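For context on the local side of that model-switching workflow, pulling and running one of the recommended local models is a standard ollama CLI flow; a minimal sketch, assuming the model tags match the names given in the release notes (the exact registry tags may differ):

~~~
# pull one of the locally-runnable models named in the release notes
# (tag assumed from the post; check the registry for the exact name)
ollama pull gemma4:31b

# run it interactively against the local weights
ollama run gemma4:31b "review this function for off-by-one errors"

# the cloud-routed picks (kimi-k2.6, glm-5.1) would be run the same way,
# with ollama handling the cloud hop per the release notes
~~~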
Jake@JakeKAllDay·
@witcheer FYI I know you've been trying out moe offload as well
Jake@JakeKAllDay·
A fun finding for the 12GB VRAM + MoE RAM offload gang: 60 tok/s sustained with Qwen 3.6 35B A3B! (MTP + Turboquant and a few other tricks) 🧵
witcheer ☯︎@witcheer·
This morning I was looking for a Codex-ish app for Claude Code, is there really nothing?? I’ve been loyal to Anthropic for a year now but I’ve never been this close to switching
Jay | The Midwest Dad |
@witcheer Have you tried squeezing in a 27b model? I realize it's not really ideal, but found Qwen 3.5 surprisingly functional if you don’t mind waiting a bit lol
witcheer ☯︎@witcheer·
tested Mistral 7B v0.3 Instruct on my RTX 4060 Ti 8GB.

7.3B params, all active on every token. dense transformer, no MoE tricks. 4.1 GB on disk (Q4_K_M).

> 56.4 tok/s at baseline
> degrades to 26.4 tok/s at 32K context (-53%), but no hard cliff

the interesting part is quality. I ran 6 targeted tests (JSON, code gen, logic, system prompt adherence, hallucination resistance, format switching). passed 5/6: best score in my test set. the extra params show, especially on reasoning tasks.

however, it eats 6.7 GB VRAM at idle. only 1.5 GB free. every parameter is active, no offloading tricks to play. you get the full model or nothing (could have missed something tho).

no thinking mode, no reasoning_content overhead. every token is content. simple to use with a local agent.

full results with prompt/response traces: huggingface.co/datasets/witch…
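A rough way to reproduce numbers like these with llama.cpp, assuming a Q4_K_M GGUF of Mistral 7B Instruct v0.3 is already on disk; the file name, context size, and prompt below are placeholders rather than the author's exact invocation:

~~~
# dense 7B at Q4_K_M fits fully on an 8GB card, so offload all layers (-ngl 99)
# -c 32768 mirrors the 32K-context data point in the post
llama-cli -m ./mistral-7b-instruct-v0.3.Q4_K_M.gguf \
  -ngl 99 -c 32768 -n 256 \
  -p "Return a JSON object with keys name and version."

# llama-bench reports prefill/decode tok/s without an interactive session
llama-bench -m ./mistral-7b-instruct-v0.3.Q4_K_M.gguf -ngl 99 -p 512 -n 128
~~~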
If Beli@beli_if·
@witcheer Keep sharing brother :) Right now you are my favorite Twitter account.
witcheer ☯︎@witcheer·
Seeing 3.5k bookmarks in just 2 weeks means more to me than any impression count ever could. It shows y’all find my content interesting enough to bookmark and come back to later when you have more time. I really appreciate it!
Jake@JakeKAllDay·
@witcheer What's the MT/s on your RAM? DDR4 yea?
tieuho2k7@tieuho2k7·
@witcheer can u share cli command to run it with params?
AJ@ItsmeAjayKV·
@witcheer Oh those are crazy stats, but not surprising considering you're sharing valuable information. Keep on going!
DZ@prodbitz·
@witcheer Have you tested qwen3.6-27b?
witcheer ☯︎@witcheer·
I used to run everything through ollama and LM Studio. wrappers handle the complexity, one click, it works.

then I needed the -ncmoe flag for MoE partial offload on my 4060 Ti and neither wrapper exposed it. so i compiled llama.cpp from source in WSL2. cmake, ninja, cuda toolkit, 40 minutes. 629 build targets, zero errors.

turboquant KV cache types (turbo2, turbo3) exist in forks right now. wrappers get them weeks later, if ever.

every flag is yours. -ncmoe, --cache-type-v turbo3, -DCMAKE_CUDA_ARCHITECTURES=89 targeting only your GPU’s architecture. the binary is smaller and faster because it’s built for your exact hardware.

debugging is possible. when hermes decode dropped from 31 to 9 tok/s, I could trace it to graph splits jumping from 62 to 82. in a wrapper that’s a black box.

ollama and LM Studio are the right starting point. once you’re running agents 24/7 and hitting limits, compile from source. the complexity is worth it because the control is real.

~~~
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 -G Ninja
cmake --build build -j$(nproc)
~~~
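Once the build finishes, launching the binary with the flags named above would look roughly like this. The -ncmoe flag and the turbo* cache types are the fork-specific options described in the post (flag spelling differs between forks and upstream builds), and the model path, offload count, and port are placeholders:

~~~
# serve the freshly built binary with full control over every flag
# -ncmoe and --cache-type-v turbo3 are the fork-specific options from the post
./build/bin/llama-server \
  -m ./qwen3.6-35b-a3b.Q4_K_M.gguf \
  -c 32768 -ngl 99 \
  -ncmoe 22 \
  --cache-type-v turbo3 \
  --port 8080
~~~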
Yaori@Pashoke1·
@witcheer 3.5k bookmarks is insane growth man
witcheer ☯︎@witcheer·
I might have found my new local LLM, and I don't see a lot of people talking about it. if you have more data and context about this model please let me know.

I tested LFM2 from @liquidai. that's a hybrid SSM + conv + MoE, only 10 of 40 layers use attention, the rest are state-space layers. 2.3B active params out of 24B total.

my results:
> 52.2 tok/s decode at 32K context
> ncmoe 22 sweet spot (18 of 38 expert layers on GPU)
> 13.4 GB Q4_K_M - lightest of all

three-way head-to-head (same rig, same method, same quant):
> LFM2: 52.2 tok/s - new champion
> Qwen3.6 35B A3B: 35.4 tok/s
> Gemma 4 26B A4B: 29.3 tok/s

I then ran 12 stress tests beyond speed: JSON compliance, reasoning, coding, long context, multi-turn tracking, repetition. quality is solid for technical work, but it fails on creative diversity at temp=0 and cross-entity reasoning. but hey, on 8GB VRAM, it works pretty damn well!

SSM layers need no KV cache, so context scaling is nearly free. 65K at 37.6 tok/s while Qwen drops to 17.4.

data + methodology on HF: huggingface.co/datasets/witch…
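Finding an "ncmoe 22 sweet spot" like the one above amounts to sweeping the MoE offload count and reading back the decode speed. A sketch of that sweep, reusing the fork-specific -ncmoe flag from the earlier post; the model filename and the sweep range are assumptions, and the right values depend on your VRAM:

~~~
#!/usr/bin/env bash
# sweep MoE offload counts and print the decode tok/s line for each run
# model path, -ncmoe spelling, and range are assumptions, not the author's script
MODEL=./lfm2-24b-a2.3b.Q4_K_M.gguf

for n in 14 18 22 26 30; do
  echo "=== ncmoe $n ==="
  ./build/bin/llama-cli -m "$MODEL" \
    -ngl 99 -ncmoe "$n" -c 32768 -n 128 \
    -p "Write a haiku about KV caches." 2>&1 |
    grep -i "tokens per second"
done
~~~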