payne

361 posts

@____payne_____

opinions = my own

Joined December 2018
1.8K Following · 1.1K Followers
payne@____payne_____·
@bstnxbt Since when is 8k context “long” lol
0 replies · 0 reposts · 0 likes · 161 views
bstn 👁️@bstnxbt·
DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus a significant peak-memory reduction at long context. M5 Max 40-core GPU, 64GB, stock mlx_lm baseline:

Qwen3.6-35B-A3B-4bit:
► @ 1024 · 138.3 → 300.3 tok/s (2.20x)
► @ 2048 · 135.6 → 246.4 tok/s (1.81x)
► @ 4096 · 134.5 → 208.4 tok/s (1.56x)
► @ 8192 · 133.2 → 177.4 tok/s (1.33x)

Qwen3.5-27B-4bit:
► @ 1024 · 33.5 → 79.0 tok/s (2.37x)
► @ 2048 · 33.1 → 70.2 tok/s (2.12x)
► @ 4096 · 31.5 → 55.7 tok/s (1.77x)
► @ 8192 · 33.9 → 45.3 tok/s (1.34x)

Working on making this usable for agentic workloads; the goal is to never drop below baseline at any context depth. LLM decode is memory-bandwidth bound. The M5 Max runs at 614 GB/s, about 1.5x the M1-M4 Max (400-410 GB/s). Results will vary on lower-bandwidth chips.
22 replies · 33 reposts · 272 likes · 23.2K views
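The "memory-bandwidth bound" claim above can be sanity-checked with a back-of-the-envelope roofline: per decoded token the GPU must read roughly the active weight footprint, so peak tok/s is bounded by bandwidth divided by bytes-per-token. A minimal sketch (the 3B-active-parameter figure is an assumption for illustration, not a spec from the post, and this ignores KV-cache reads):

```python
def decode_tokens_per_sec(bandwidth_gb_s: float,
                          active_params_billion: float,
                          bits_per_weight: int = 4) -> float:
    """Upper-bound decode speed for a bandwidth-bound model.

    Assumes each token requires reading the active weights once;
    KV-cache traffic (which grows with context) is ignored, which is
    why real throughput drops as context deepens.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# M5 Max at 614 GB/s vs. an M4 Max-class chip at ~410 GB/s,
# for a hypothetical MoE model with ~3B active params at 4-bit:
print(decode_tokens_per_sec(614, 3.0))  # ~409 tok/s ceiling
print(decode_tokens_per_sec(410, 3.0))  # ~273 tok/s ceiling
```

The gap between this ceiling and the measured ~300 tok/s at 1024 context is consistent with overheads like KV-cache reads and kernel launch costs, and the ratio of the two ceilings mirrors the ~1.5x bandwidth advantage cited in the post.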
payne@____payne_____·
@leopardracer 16k ctx is basically a goldfish ai lol
2 replies · 0 reposts · 7 likes · 612 views