witcheer ☯︎
@witcheer
Head of Growth @YariFinance | Founder @Broad_Land | Prev @KPMG

Gemma 4 26B A4B on RTX 4060 Ti 8GB. Tested it.

> 29.3 tok/s decode at 32K context
> ncmoe 23 sweet spot (7 of 30 expert layers on GPU)
> 16 GB Q4_K_M - fits comfortably in 32 GB RAM
> 490 MiB VRAM headroom vs Qwen's razor-thin 37 MiB

Head-to-head vs Qwen3.6 35B A3B (same rig, same method):

> Qwen: 35.4 tok/s at 32K - faster raw decode
> Gemma: 29.3 tok/s at 32K - 6 GB lighter on disk
> at 65K: Gemma 25.8 tok/s, Qwen 17.4 tok/s - Qwen hits the VRAM cliff, Gemma doesn't

The sliding-window attention (5:1 SWA-to-full pattern) does for context scaling what Qwen's hybrid SSM does: keeps KV-cache growth sublinear.

Data + methodology on HF: huggingface.co/datasets/witch…
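
The ncmoe sweet spot is just a VRAM budget search: put as many expert layers on the GPU as the card can hold, push the rest to system RAM. A minimal Python sketch of that search; the per-layer expert size (500 MiB) and fixed attention/KV cost (4500 MiB) here are illustrative assumptions, not measured values from this benchmark:

```python
MIB = 1024 * 1024  # one mebibyte, for readers converting to bytes

def pick_n_cpu_moe(total_moe_layers, per_layer_expert_mib,
                   fixed_cost_mib, vram_mib):
    """Smallest ncmoe (i.e. most experts on GPU) that still fits in VRAM.

    ncmoe layers' experts live in system RAM; the remaining
    (total_moe_layers - ncmoe) layers' experts stay in VRAM alongside
    a fixed cost (attention weights, KV cache, CUDA overhead).
    """
    for n_cpu in range(total_moe_layers + 1):
        gpu_layers = total_moe_layers - n_cpu
        used = fixed_cost_mib + gpu_layers * per_layer_expert_mib
        if used <= vram_mib:
            return n_cpu, vram_mib - used  # (ncmoe, headroom in MiB)
    raise ValueError("model does not fit even with all experts on CPU")

# 30 MoE layers, 8 GiB card: with these assumed sizes the search lands
# on ncmoe 23, i.e. 7 expert layers resident in VRAM.
ncmoe, headroom_mib = pick_n_cpu_moe(30, 500, 4500, 8192)
print(ncmoe, headroom_mib)
```

In practice you'd find the real per-layer cost empirically (raise ncmoe until the runtime stops OOMing), which is what the sweet-spot number in the post reflects.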

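Why the 5:1 SWA pattern keeps KV-cache growth sublinear: only the full-attention layers cache the whole context, while sliding-window layers cap out at the window size. A rough Python model of that effect; the hyperparameters (layer count, window size, head counts) are illustrative assumptions, not Gemma's actual config:

```python
def kv_cache_bytes(ctx, n_layers=36, swa_ratio=(5, 6), window=4096,
                   n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Total KV-cache bytes at context length `ctx`.

    swa_ratio=(5, 6) means 5 of every 6 layers use sliding-window
    attention and cache only the last `window` tokens; the remaining
    full-attention layers cache all `ctx` tokens.
    """
    swa_layers = n_layers * swa_ratio[0] // swa_ratio[1]
    full_layers = n_layers - swa_layers
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    swa_tokens = min(ctx, window)
    return (swa_layers * swa_tokens + full_layers * ctx) * per_token

# Doubling the context doubles only the full-attention share of the
# cache; the SWA share stays flat, so total growth is well under 2x.
ratio = kv_cache_bytes(65_536) / kv_cache_bytes(32_768)
print(ratio)
```

With every layer full-attention the same doubling would exactly double the cache, which is the VRAM cliff the 65K numbers above show Qwen hitting.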