korale
@korale77
573 posts

Founder @Eclaire_Labs | Builder | ex-Microsoft

Joined January 2021
184 Following · 167 Followers
korale
korale@korale77·
Update
- Added models: Marvis-TTS 100M/250M
- Streaming for: CSM-1B, Qwen3-TTS 0.6B, Chatterbox Turbo.
- Reduced streaming interval to 0.08s (single frame). Dropped TTFA dramatically.
Fastest local voice pipeline: 71ms → 20ms (13ms STT + 7ms TTS).
[4 images]
korale@korale77

MLX-Audio benchmarks on M5 Max 128GB @Prince_Canuma @lllucas
10 STT and 19 TTS models (3 runs each)
Lowest latency: SenseVoiceSmall (13ms) + pocket-tts-4bit stream (59ms) = 71ms
Fastest: SenseVoiceSmall (625×) + kitten-tts-nano (112×) vs real-time
Repo with details in replies

0 replies · 1 repost · 2 likes · 109 views
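For reference, a minimal sketch of how an end-to-end time-to-first-audio figure like 20ms (13ms STT + 7ms TTS) is typically measured: wall-clock time from audio-in to the first streamed TTS frame. `stt_transcribe` and `tts_stream` here are hypothetical stand-ins for whatever STT/TTS calls the pipeline uses, not the MLX-Audio API.

```python
import time

def measure_voice_pipeline_ttfa(stt_transcribe, tts_stream, audio_chunk):
    """Wall-clock from audio-in to the first synthesized frame out.

    stt_transcribe: callable(audio) -> text                     (hypothetical)
    tts_stream:     callable(text)  -> iterator of audio frames (hypothetical)
    """
    t0 = time.perf_counter()
    text = stt_transcribe(audio_chunk)            # e.g. ~13 ms with SenseVoiceSmall
    t_stt = time.perf_counter()

    first_frame = next(iter(tts_stream(text)))    # stop timing at the first yielded frame
    t_end = time.perf_counter()

    return {
        "stt_ms": (t_stt - t0) * 1e3,
        "tts_ttfa_ms": (t_end - t_stt) * 1e3,
        "total_ms": (t_end - t0) * 1e3,           # the post reports 13 + 7 ≈ 20 ms
    }
```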
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
Interesting video of the M5 Max on the impact of Low, Automatic and High power modes on inference.
- No external monitor attached
- Model not relevant, but it's DS4 Flash Q2.
Results:
- Low: ~25W, ~12 tok/s
- High: ~120W, ~32 tok/s
- Automatic: varies from 40W / ~14 tok/s to 90W / ~29 tok/s depending on fan speed and the temperature of the Mac.
If you really want to push your MacBook to the max: High Power mode and no external monitors. With monitors attached I see a very strange behavior that I'm investigating 🧐
16 replies · 6 reposts · 110 likes · 12.3K views
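A quick back-of-the-envelope on those figures: High Power is roughly 2.7× faster but draws roughly 4.8× more power, so energy per token actually favors Low. A small sketch using only the approximate readings in the post (not measured energy):

```python
# Approximate power draw and throughput readings from the post.
modes = {
    "Low":  {"watts": 25,  "tok_s": 12},
    "High": {"watts": 120, "tok_s": 32},
}

for name, m in modes.items():
    tok_per_joule = m["tok_s"] / m["watts"]   # tokens generated per joule of energy
    print(f"{name:>4}: {tok_per_joule:.2f} tok/J")

# Low : 0.48 tok/J
# High: 0.27 tok/J  -> faster, but less energy-efficient per token
```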
korale
korale@korale77·
Tried Marvis-TTS v0.2 100M but I'm getting substantially higher numbers. Calling model.generate(text, stream=True); the timer stops when the first chunk is yielded on a short prompt (7 words). Seems to be because each call to generate re-encodes the voice ref and rebuilds the KV cache. If you have info on how you measured 12ms, let me know.
1 reply · 0 reposts · 0 likes · 47 views
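A minimal sketch of the measurement described above, separating the first (cold) call from repeated (warm) calls so any per-call setup cost (voice-ref encoding, KV cache build) shows up explicitly. The `model.generate(text, stream=True)` call follows the signature mentioned in the post; the rest of the harness is hypothetical, not Marvis-TTS benchmark code.

```python
import time

def ttfa_ms(model, text):
    """Time from calling generate() until the first audio chunk is yielded."""
    t0 = time.perf_counter()
    next(iter(model.generate(text, stream=True)))   # signature as described in the post
    return (time.perf_counter() - t0) * 1e3

def cold_vs_warm(model, text="Hello there, how are you doing today?", runs=3):
    cold = ttfa_ms(model, text)                      # pays any one-time setup cost
    warm = [ttfa_ms(model, text) for _ in range(runs)]
    return cold, sum(warm) / len(warm)               # warm ≈ cold if setup repeats every call
```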
korale
korale@korale77·
MLX-Audio benchmarks on M5 Max 128GB @Prince_Canuma @lllucas
10 STT and 19 TTS models (3 runs each)
Lowest latency: SenseVoiceSmall (13ms) + pocket-tts-4bit stream (59ms) = 71ms
Fastest: SenseVoiceSmall (625×) + kitten-tts-nano (112×) vs real-time
Repo with details in replies
[4 images]
3 replies · 2 reposts · 8 likes · 980 views
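For readers new to the metric: the "× vs real-time" factors above are the ratio of audio duration handled to wall-clock time. The durations and timings below are illustrative placeholders chosen to land near 625× and 112×, not values from the repo.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio duration processed (STT) or produced (TTS) divided by wall-clock time."""
    return audio_seconds / wall_seconds

print(realtime_factor(10.0, 0.016))   # ~625x: transcribing 10 s of speech in 16 ms
print(realtime_factor(5.6, 0.050))    # ~112x: synthesizing 5.6 s of audio in 50 ms
```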
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
No please, no! Don't push people to think that a cluster of 4 M5 Max is a viable solution. If you set them to High Power they will drain the battery after a few hours even when connected. Moreover, it will generate an incredible amount of heat and noise. For me it's a no-go, sorry.
Ivan Kuleshov@Merocle

M5 Max cluster: 72 CPU and 128 GPU cores, 512GB unified RAM. Each MacBook is connected to all the others with Thunderbolt 5 (120Gbit/s). But I'll have to use Wi-Fi to connect to the cluster.

37 replies · 10 reposts · 166 likes · 25.7K views
antirez
antirez@antirez·
DS4 is now called DwarfStar4, since you can put a lot of mass into a tiny space... And in a few minutes it is going to be much better on 128GB Macs, because I'll be pushing much better 2-bit quants generated with an in-house iMatrix magic recipe.
35 replies · 40 reposts · 753 likes · 61.4K views
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
M5 Max in Automatic instead of High Power is much slower, I see at least a 15% difference. So when you really want to push... High Power mode on!
8 replies · 0 reposts · 62 likes · 6.8K views
korale
korale@korale77·
Benchmarked DeepSeek V4 Flash (284B) @ q2 on MacBook Pro M5 Max 128GB using ds4.c from @antirez
- 360 t/s prefill peak
- 20-36 t/s generation across various context lengths
- 128K context still going at 20 t/s!
[image]
4 replies · 4 reposts · 46 likes · 23.5K views
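A rough way to turn those throughput numbers into expected wall-clock latency for a single request. Defaults use the post's peak prefill speed and its worst-case (128K-context) generation speed; it ignores sampling overhead and the fact that both speeds vary with context length, so treat it as a first-order estimate only.

```python
def estimate_seconds(prompt_tokens, output_tokens, prefill_tps=360.0, gen_tps=20.0):
    """First-order latency estimate: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# e.g. an 8K-token prompt and a 500-token answer at long-context speeds:
print(f"{estimate_seconds(8_000, 500):.1f} s")   # ~47.2 s
```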
korale
korale@korale77·
So many cool things to try out in this release!
Prince Canuma@Prince_Canuma

mlx-vlm v0.5.0 is here 🚀 This is the largest release ever 🙌🏽
→ Continuous batching server + KV cache quantization
→ MTP and DFlash speculative decoding (single, batch, server)
→ Distributed inference: Qwen3.5, Kimi K2.5 & K2.6
→ Prompt caching w/ warm-disk persistence
→ Gemma 4 video (multi-video) + MTP drafter @googlegemma
→ New models: Youtu-VL, Nemotron 3 Nano Omni, SAM 3D Body
→ Server: json_schema response_format, thinking mode flag
Huge thanks to all 21 contributors and in particular the 18 new contributors, welcome aboard 🚢
Get started today:
> uv pip install -U mlx-vlm
Leave us a star ⭐️ github.com/Blaizzy/mlx-vlm

0 replies · 0 reposts · 0 likes · 467 views
Prince Canuma
Prince Canuma@Prince_Canuma·
DeepSeek V4 MLX Quants now on the MLX community HF repo, made possible by @LambdaAPI and @TheZachMueller ❤️ Without a GPU cluster it would take me a week to upload the quants… Model collection 👇🏽
Prince Canuma@Prince_Canuma

DeepSeek-v4 now runs at ~23-26 tok/s on MLX! I made some custom kernels for the sinkhorn and it took gen speeds from 17 -> 26 tok/s. The weights are also significantly smaller thanks to @pcuenq's tip about keeping the experts in MXFP4! Now you can use it to power your local coding agents (PI, Open code, Hermes agent or even CC) PR: github.com/ml-explore/mlx…

4 replies · 10 reposts · 60 likes · 7.3K views
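The "sinkhorn" here presumably refers to a Sinkhorn-style normalization in the model's routing path. For readers unfamiliar with the op, a generic, unfused Sinkhorn-Knopp iteration in MLX looks roughly like the sketch below; this is an illustration of the algorithm being fused, not the custom kernel from the PR, and the shapes and iteration count are assumptions.

```python
import mlx.core as mx

def sinkhorn(logits: mx.array, n_iters: int = 3, eps: float = 1e-9) -> mx.array:
    """Generic Sinkhorn-Knopp normalization: alternately normalize rows and
    columns of exp(logits) toward a doubly-stochastic matrix."""
    p = mx.exp(logits)
    for _ in range(n_iters):
        p = p / (mx.sum(p, axis=-1, keepdims=True) + eps)  # each token's row sums to 1
        p = p / (mx.sum(p, axis=-2, keepdims=True) + eps)  # each expert's column sums to 1
    return p

# toy example: routing scores for 4 tokens over 3 experts
print(sinkhorn(mx.random.normal((4, 3))))
```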
korale retweeted
Julien Chaumond
Julien Chaumond@julien_c·
This is where we are right now. And I'm not gonna lie, it feels pretty magical 🧚‍♀️
Qwen3.6 27B running inside of the Pi coding agent via Llama.cpp on the MacBook Pro.
For non-trivial tasks on the @huggingface codebases, this feels very, very close to hitting the latest Opus in Claude Code, or whatever shiny monopolistic closed-source API of the day. In full airplane mode.
Most people haven't realized this yet. If you have, it means you have a huge head start on what I call the second revolution of AI.
Powerful local models for efficiency, security, privacy, sovereignty 🔥
[image]
262 replies · 453 reposts · 5.3K likes · 650.1K views
korale
korale@korale77·
Been using Presidio before. Have to try this!
Prince Canuma@Prince_Canuma

Congratulations @OpenAI on the release of the privacy filter! It comes with day-0 support on MLX; now developers can run PII filtering and more completely on-device. PR will be merged in a couple of minutes 🚀

0 replies · 0 reposts · 1 like · 52 views
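For context, the Presidio workflow referenced above (the existing approach an on-device filter would replace) looks roughly like this; the sample text and redacted output are illustrative.

```python
# Microsoft Presidio: the analyzer finds PII spans, the anonymizer redacts them.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe and my phone number is 212-555-0123."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")        # detected PII entities

anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)   # e.g. "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
```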
korale retweeted
Qwen
Qwen@Alibaba_Qwen·
🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇
What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours
Smaller model. Bigger results. Community's favorite. ❤️
We can't wait to see what you build with Qwen3.6-27B! 👀 🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…
[image]
532 replies · 1.7K reposts · 12.5K likes · 3.7M views
korale
korale@korale77·
Took some time to run a full NIAH sweep to get more data before replying. 490 runs across both Gemma 4 models, 8k to 200k context, 5 needle depths per cell.
Short answer: the 13% rule was about multi-fact QA (Northwind). Single-needle NIAH is more forgiving on 31B:
- TA-8192: 5/5 at 100k (8.2% ratio), cliff at 128k (6.4%)
- TA-16384: 5/5 at 128k, still 4/5 at 200k
- TA-32768: 5/5 everywhere through 200k
Big factor is architecture. 31B's 10 full-attention layers tolerate pruning much better than 26B's 5. Same budget, 26B fails 3x earlier (TA-8192 breaks at 48k vs 128k).
0 replies · 0 reposts · 0 likes · 106 views
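The ratio arithmetic behind the percentages quoted above (attention budget divided by context length), for anyone mapping the sweep grid; the context lengths listed are just the ones named in this thread.

```python
budgets = [8192, 16384, 32768]                          # TriAttention budgets from the thread
contexts = [8_000, 48_000, 100_000, 128_000, 200_000]   # context lengths named in the thread

for b in budgets:
    row = ", ".join(f"{c // 1000}k: {100 * b / c:.1f}%" for c in contexts)
    print(f"TA-{b}: {row}")

# TA-8192: 8k: 102.4%, 48k: 17.1%, 100k: 8.2%, 128k: 6.4%, 200k: 4.1%
# (the 8.2% and 6.4% figures in the post are exactly these ratios)
```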
Prince Canuma
Prince Canuma@Prince_Canuma·
@korale77 @no_stp_on_snek Yep, I got to the same conclusion with my benchmarks. Could you speak more on the NIAH benchmarks where TA worked when it was at least 13% of the context?
[image]
1 reply · 0 reposts · 0 likes · 31 views
korale
korale@korale77·
Benchmarked @Prince_Canuma's TriAttention + TurboQuant in MLX-VLM.
TriAttention: flat KV + stable decode from 8k → 200k context vs baseline KV growth and ~50% slowdown.
TurboQuant: 30-65% KV savings, no tuning, accuracy preserved.
Repo w/ writeup, charts, and code in replies.
[3 images]
4 replies · 3 reposts · 26 likes · 4.7K views
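Why a fixed attention budget gives the flat-KV behavior described above: per-layer KV memory scales with the number of cached tokens, so capping the tokens caps the memory. The sketch below uses hypothetical model dimensions and applies the cap to every layer, which is a simplification (in the design discussed in this thread, the full-attention layers keep the whole cache).

```python
def kv_bytes(tokens_kept, layers=48, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes.
    All dimensions here are hypothetical, just to show the shape of the curve."""
    return 2 * layers * kv_heads * head_dim * tokens_kept * bytes_per_elem

for ctx in (8_000, 50_000, 100_000, 200_000):
    baseline = kv_bytes(ctx)               # keeps every token -> grows linearly with context
    budgeted = kv_bytes(min(ctx, 16_384))  # budget-capped layers stay flat past the cap
    print(f"{ctx // 1000:>3}k ctx: baseline {baseline / 2**30:.1f} GiB, "
          f"budget-16k {budgeted / 2**30:.1f} GiB")
```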
korale
korale@korale77·
31B (10 full-attn layers):
- TA-8192 perfect to 100k, cliff at 128k
- TA-16384 5/5 to 128k, 4/5 at 200k
- TA-32768 perfect everywhere
26B (5 full-attn layers):
- Same budgets fail ~3x earlier
- Early positions (10-25%) drop first
- Needs larger budgets to match 31B
1 reply · 0 reposts · 1 like · 127 views
korale
korale@korale77·
Follow-up on TurboQuant + TriAttention benchmarks in MLX-VLM. Ran full NIAH sweep on Gemma 4 31B and 26B-A4B. 490 runs, 8k–200k context, 5 needle positions each. Same TA budgets, 3x earlier failure. Full-attention layer count is the bottleneck.
[2 images]
1 reply · 0 reposts · 1 like · 67 views