Seba
@CulStory

361 posts

TLDR Current proj: Realtime Text-to-Speech on Apple NPU https://t.co/kJWvH9C7TK

Joined July 2011
2.3K Following · 293 Followers
Ivan Fioravanti ᯅ@ivanfioravanti·
MLX DeepSeek-V4-Flash-2bit-DQ MLX 4K context issue solved! Benchmark results on Apple M5 Max, 128.0GB RAM, 18 CPU cores, 40 GPU cores. A comparison of M3 Ultra vs M5 Max, including batch performance, will follow shortly.
0.5k: pp 446, tg 42 t/s, mem 97.8GB, kv 0.02GB
1k: pp 578, tg 42 t/s, mem 98.1GB, kv 0.02GB
2k: pp 622, tg 40 t/s, mem 99.2GB, kv 0.03GB
4k: pp 570, tg 37 t/s, mem 100.7GB, kv 0.04GB
8k: pp 513, tg 37 t/s, mem 101.4GB, kv 0.06GB
16k: pp 390, tg 37 t/s, mem 102.7GB, kv 0.12GB
32k: pp 343, tg 36 t/s, mem 104.5GB, kv 0.23GB
64k: pp 297, tg 34 t/s, mem 109.4GB, kv 0.45GB
This is using this PR from @0xClandestine 🔥 It's faster than yesterday! I bet it's using matmul in hardware much more. github.com/Blaizzy/mlx-lm…
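
A rough way to read those kv numbers: for the longer runs the cache grows close to linearly with context, so a per-token figure can be backed out directly from the reported values. Plain-Python arithmetic on the numbers quoted above (it assumes the kv column is dominated by the cache itself rather than allocator overhead):

```python
# Back-of-the-envelope: per-token kv-cache cost from the longer runs above
# (context length in tokens -> reported kv-cache size in GB).
kv_gb = {16_384: 0.12, 32_768: 0.23, 65_536: 0.45}

for tokens, gb in kv_gb.items():
    per_token_kb = gb * 1e9 / tokens / 1e3
    print(f"{tokens:>6} tokens: ~{per_token_kb:.1f} KB of kv-cache per token")
# all three land around 7 KB/token, i.e. roughly 0.7 GB of cache per 100k-token request
```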
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
I've been rereading V4 paper too, and getting dizzy again from how insane it is. And you know what's funny. The real problem they were solving, the "pivotal design goal of V4" – at any cost! – was not "1M context", it was *batch invariance*. Opus proposes a conspiracy theory:
Arthur Zucker@art_zucker

Reading @deepseek_ai 's v4 paper.... absolute hats off. Every problem has a mathematical solution, nothing is left to chance. I have so much respect for them, putting out months or years of efforts entirely for free, in the open for anyone to benefit. Real goats 🫡

Seba@CulStory·
the most important part about v4 flash/pro: you can probably serve 100s of users at >100k context each on a single gpu/node.
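
Rough arithmetic behind that claim, reusing the ~7 KB/token kv-cache figure that falls out of the benchmark above (an estimate, not a measured serving number; weight footprint and scheduler overhead are ignored):

```python
# Hypothetical capacity estimate; the 7 KB/token figure is taken from the
# benchmark numbers quoted earlier in this thread, not from a published spec.
KV_KB_PER_TOKEN = 7
users = 100
context_per_user = 100_000  # tokens

kv_total_gb = users * context_per_user * KV_KB_PER_TOKEN * 1e3 / 1e9
print(f"{kv_total_gb:.0f} GB of kv-cache for {users} users "
      f"at {context_per_user:,} tokens each")
# ~70 GB of cache, which fits next to the weights on a single large-memory GPU or node.
```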
Seba@CulStory·
@Is36E @huggingface you should look into making hf and mlx weights compatible, hate having to download both
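
Worth noting the file format itself is already shared: MLX can read Hugging Face .safetensors files directly, so the duplicate downloads are mostly about pre-quantized or re-laid-out weights rather than the container. A minimal sketch, assuming a local model.safetensors file (hypothetical path) and an MLX install:

```python
import mlx.core as mx

# Load a Hugging Face-style .safetensors shard straight into MLX arrays.
# "model.safetensors" is a hypothetical local path; mx.load returns a dict
# mapping parameter names to mx.array for safetensors files.
weights = mx.load("model.safetensors")

for name, tensor in list(weights.items())[:5]:
    print(name, tensor.shape, tensor.dtype)
```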
Isalia20@Is36E·
This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team. My first project: safetensors on Mac is up to 3x faster🚀 Parallel reads straight into MPS unified memory, no CPU staging. MB Pro M5 Pro - Cold 16 GB: **2.97 → 8.23 GB/s** (2.8×) - Warm 3 GB: **10.3 → 26.6 GB/s** (2.6×)
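
The user-facing side of that is just the usual safetensors call with an MPS target; the parallel-read fast path described above happens underneath it. A minimal sketch, assuming a local model.safetensors file (hypothetical path) and a Metal-capable PyTorch build:

```python
from safetensors.torch import load_file

# Load tensors directly onto the MPS device.
# "model.safetensors" is a hypothetical local path.
state_dict = load_file("model.safetensors", device="mps")

total_gb = sum(t.numel() * t.element_size() for t in state_dict.values()) / 1e9
print(f"{len(state_dict)} tensors, {total_gb:.2f} GB, "
      f"device={next(iter(state_dict.values())).device}")
```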
Seba@CulStory·
@maderix hopefully someday they'll allow implementing flash attention on their npu; for now you have to chunk it
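
"Chunk it" here means tiling attention over key/value blocks and carrying a running softmax, so the full n x n score matrix never materializes; flash attention fuses the same trick into one kernel. A numpy sketch of that chunked computation (single head, no mask, illustration only, not the ANE code being discussed):

```python
import numpy as np

def chunked_attention(q, k, v, chunk=256):
    """Exact attention computed over K/V chunks with a running (online) softmax."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    running_max = np.full(q.shape[0], -np.inf)   # row-wise max seen so far
    denom = np.zeros(q.shape[0])                 # running softmax denominator
    out = np.zeros((q.shape[0], v.shape[-1]))    # running weighted sum of V

    for start in range(0, k.shape[0], chunk):
        k_blk, v_blk = k[start:start + chunk], v[start:start + chunk]
        s = (q @ k_blk.T) * scale                # partial scores, (n_q, chunk)
        new_max = np.maximum(running_max, s.max(axis=-1))
        rescale = np.exp(running_max - new_max)  # fix up old accumulators
        p = np.exp(s - new_max[:, None])
        denom = denom * rescale + p.sum(axis=-1)
        out = out * rescale[:, None] + p @ v_blk
        running_max = new_max
    return out / denom[:, None]

# check against full softmax attention
q, k, v = np.random.randn(8, 64), np.random.randn(1024, 64), np.random.randn(1024, 64)
s = q @ k.T / np.sqrt(q.shape[-1])
p = np.exp(s - s.max(-1, keepdims=True))
reference = (p / p.sum(-1, keepdims=True)) @ v
assert np.allclose(chunked_attention(q, k, v), reference)
```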
maderix@maderix·
Sigmoid self-attention runs quite well on ANE, hitting almost 90% of fp16 peak via ANE private APIs. While sigmoid can't be naively used, with normalisation tricks it probably can be done. Currently investigating if softmax self-attention can be improved as well
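
The normalisation trick usually cited for sigmoid attention (e.g. in Apple's sigmoid-attention paper) is a -log(n) bias on the logits, so each query's total gate mass stays comparable to a softmax row. A small numpy sketch of that variant (illustrative only, not maderix's ANE implementation):

```python
import numpy as np

def sigmoid_attention(q, k, v):
    """Sigmoid attention: elementwise gates with a -log(n) logit bias, no softmax reduction."""
    n = k.shape[0]
    scale = 1.0 / np.sqrt(q.shape[-1])
    logits = q @ k.T * scale - np.log(n)   # bias keeps total gate mass roughly O(1) per query
    gates = 1.0 / (1.0 + np.exp(-logits))  # purely elementwise: no row-wise max/sum reduction
    return gates @ v

q, k, v = np.random.randn(4, 64), np.random.randn(256, 64), np.random.randn(256, 64)
print(sigmoid_attention(q, k, v).shape)    # (4, 64)
```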
Seba@CulStory·
@PrismML we need a per-channel quant to make it run on npu 🙃
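
Per-channel here means one scale per output channel rather than per small group of weights, which is the layout fixed-function NPUs generally prefer. A minimal symmetric int8 sketch in numpy (illustrative; not tied to any particular model's scheme):

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric int8 quantization with one scale per output channel (row of w)."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(8, 512).astype(np.float32)
q, s = quantize_per_channel(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```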
PrismML@PrismML·
Today we’re announcing Ternary Bonsai: Top intelligence at 1.58 bits.

Using ternary weights {-1, 0, +1}, we built a family of models that are 9x smaller than their 16-bit counterparts while outperforming most models in their respective parameter classes on standard benchmarks. We’re open-sourcing the models under the Apache 2.0 license in three sizes: 8B (1.75 GB), 4B (0.86 GB), and 1.7B (0.37 GB).
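
The 1.58 figure is just log2(3): a weight with three possible values carries at most about 1.58 bits. One common recipe for mapping full-precision weights onto {-1, 0, +1} is the absmean rounding used by BitNet b1.58; the announcement doesn't say whether Bonsai uses exactly this, so the sketch below is illustrative:

```python
import numpy as np

print(f"bits per ternary weight: {np.log2(3):.2f}")   # ~1.58
# Size sanity check on the quoted numbers: 1.75 GB for 8B params is ~1.75 bits/param,
# consistent with ternary weights plus some higher-precision layers.

def ternarize_absmean(w, eps=1e-8):
    """BitNet-b1.58-style ternarization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    return np.clip(np.round(w / scale), -1, 1), scale

w = np.random.randn(4, 8)
t, scale = ternarize_absmean(w)
print(np.unique(t))   # only -1, 0, +1 remain
```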
Seba@CulStory·
@ronaldmannak idk if this is still the case, but some time ago qmv (small-batch-size mm) added latency on m1/m2, so i made a custom kernel for that, may be useful for batched requests. github.com/0seba/mlx-eagl…
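
The shape of that measurement is easy to reproduce: time the same matmul at batch 1 versus a few rows and see whether latency stays flat (i.e. the batch-1 case was latency-bound rather than compute-bound). A rough MLX sketch (assumes an MLX install; the original issue concerned quantized matvec kernels specifically, which this plain-matmul version only approximates):

```python
import time
import mlx.core as mx

d_in, d_out = 4096, 4096
w = mx.random.normal((d_in, d_out))

def avg_latency_ms(batch, iters=50):
    x = mx.random.normal((batch, d_in))
    mx.eval(x, w)                      # materialize inputs before timing
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(x @ w)                 # force each matmul to actually run
    return (time.perf_counter() - start) / iters * 1e3

for b in (1, 2, 4, 8):
    print(f"batch {b}: {avg_latency_ms(b):.3f} ms")
```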
Ronald Mannak@ronaldmannak·
Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total, a 21% throughput gain on a six-year-old MacBook Pro M1 Max!
Prince Canuma@Prince_Canuma·
@runsonai @liranringel That’s awesome! I tried ddtree yesterday on MLX-VLM but didn’t see a significant speed up, maybe I missed something
Thanh Pham@runsonai·
Currently porting @liranringel ddtree to accelerate speculative decoding on mlx (apple). Looking very promising on first test.
Seba@CulStory·
@anemll for ddtree, compute grows exponentially to achieve a sub-linear speedup. i like dflash, but imo beyond that point it is better to spend the compute on higher quality tokens rather than more tokens
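
The exponential-compute point can be made concrete with simple counting: a full draft tree of branching factor b and depth d has on the order of b^d nodes to verify per step, while the accepted prefix grows by at most d tokens per step. A quick sketch of that accounting (generic tree speculative decoding; ddtree's actual tree shape isn't described in this thread):

```python
# Drafted tokens a full tree of branching factor b and depth d asks the target
# model to verify in one step, versus at most d tokens accepted per step.
def tree_nodes(b, d):
    return sum(b**i for i in range(1, d + 1))   # root excluded (it is the known last token)

b = 3
for d in range(1, 7):
    print(f"depth {d}: verify up to {tree_nodes(b, d):>4} drafted tokens, "
          f"accept at most {d}")
```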
Seba@CulStory·
@danveloper not so sure about lower power, slow disk storage uses a lot more power and heat is an additional issue
Dan Woods@danveloper·
Somehow this remains an easily overlooked aspect of the flash-moe strategy... you can run a 26 billion parameter model in a very very small memory footprint. You pay with token throughput, but that might be an acceptable tradeoff. Lower power, less RAM, bigger models.
Anemll@anemll

@Alexey_CA @twostraws @jeremyphoward This is what Flash-MoE is trying to address: running in low-RAM environments. This runs 26B in a 3GB footprint. Improving the iPhone path also helps with the M5 Max 128GB and M3U optimizations for me.
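
The general mechanism behind that kind of footprint is keeping expert weights on disk and reading in only the experts the router selects for each token; the throughput cost comes from disk bandwidth. A generic numpy memmap illustration (not Anemll's implementation; the file name, shapes, and routing here are made up):

```python
import numpy as np

# Hypothetical layout: one file holding n_experts contiguous [d_in, d_out] fp16 expert matrices.
# "experts.bin" and all shapes here are invented for the demo.
n_experts, d_in, d_out = 16, 256, 512
experts = np.memmap("experts.bin", dtype=np.float16, mode="w+",
                    shape=(n_experts, d_in, d_out))
experts[:] = np.random.randn(n_experts, d_in, d_out).astype(np.float16)

def moe_forward(x, router_logits, top_k=2):
    """Apply only the top-k routed experts; only those slices are paged in from the file."""
    idx = np.argsort(router_logits)[-top_k:]
    gate = np.exp(router_logits[idx])
    gate /= gate.sum()
    out = np.zeros(d_out, dtype=np.float32)
    for g, e in zip(gate, idx):
        out += g * (x.astype(np.float32) @ experts[e].astype(np.float32))
    return out

print(moe_forward(np.random.randn(d_in), np.random.randn(n_experts)).shape)  # (512,)
```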

Demis Hassabis@demishassabis·
Gemma 4 outperforms models over 10x its size! (note the x-axis is log scale!)