korale
@korale77
573 posts

Founder @Eclaire_Labs | Builder | ex-Microsoft

Joined January 2021
184 Following · 167 Followers
korale
korale@korale77·
Update
- Added models: Marvis-TTS 100M/250M
- Streaming for: CSM-1B, Qwen3-TTS 0.6B, Chatterbox Turbo.
- Reduced streaming interval to 0.08s (single frame). Dropped TTFA dramatically.
Fastest local voice pipeline: 71ms → 20ms (13ms STT + 7ms TTS).
[4 images]
korale@korale77

MLX-Audio benchmarks on M5 Max 128GB @Prince_Canuma @lllucas
10 STT and 19 TTS models (3 runs each)
Lowest latency: SenseVoiceSmall (13ms) + pocket-tts-4bit stream (59ms) = 71ms
Fastest: SenseVoiceSmall (625×) + kitten-tts-nano (112×) vs real-time
Repo with details in replies

0 replies · 1 repost · 2 likes · 109 views
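For reference, a minimal sketch of how an end-to-end time-to-first-audio figure like 20ms (13ms STT + 7ms TTS) is typically measured: wall-clock time from audio-in to the first streamed TTS frame. `stt_transcribe` and `tts_stream` here are hypothetical stand-ins for whatever STT/TTS calls the pipeline uses, not the MLX-Audio API.

```python
import time

def measure_voice_pipeline_ttfa(stt_transcribe, tts_stream, audio_chunk):
    """Wall-clock from audio-in to the first synthesized frame out.

    stt_transcribe: callable(audio) -> text                     (hypothetical)
    tts_stream:     callable(text)  -> iterator of audio frames (hypothetical)
    """
    t0 = time.perf_counter()
    text = stt_transcribe(audio_chunk)            # e.g. ~13 ms with SenseVoiceSmall
    t_stt = time.perf_counter()

    first_frame = next(iter(tts_stream(text)))    # stop timing at the first yielded frame
    t_end = time.perf_counter()

    return {
        "stt_ms": (t_stt - t0) * 1e3,
        "tts_ttfa_ms": (t_end - t_stt) * 1e3,
        "total_ms": (t_end - t0) * 1e3,           # the post reports 13 + 7 ≈ 20 ms
    }
```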
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
Interesting video of the M5 Max on the impact of Low, Automatic and High power modes on inference.
- No external monitor attached
- Model not relevant, but it's DS4 Flash Q2.
Results:
- Low: ~25W, ~12 tok/s
- High: ~120W, ~32 tok/s
- Automatic: varies from 40W / ~14 tok/s to 90W / ~29 tok/s depending on fan speed and the temperature of the Mac.
If you really want to push your MacBook to the max: High Power mode and no external monitors. With monitors attached I see a very strange behavior that I'm investigating 🧐
16 replies · 6 reposts · 110 likes · 12.3K views
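A quick back-of-the-envelope on those figures: High Power is roughly 2.7× faster but draws roughly 4.8× more power, so energy per token actually favors Low. A small sketch using only the approximate readings in the post (not measured energy):

```python
# Approximate power draw and throughput readings from the post.
modes = {
    "Low":  {"watts": 25,  "tok_s": 12},
    "High": {"watts": 120, "tok_s": 32},
}

for name, m in modes.items():
    tok_per_joule = m["tok_s"] / m["watts"]   # tokens generated per joule of energy
    print(f"{name:>4}: {tok_per_joule:.2f} tok/J")

# Low : 0.48 tok/J
# High: 0.27 tok/J  -> faster, but less energy-efficient per token
```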
korale
korale@korale77·
Tried Marvis-TTS v0.2 100M but I'm getting substantially higher numbers. Calling model.generate(text, stream=True); the timer stops when the first chunk is yielded on a short prompt (7 words). Seems to be because each call to generate re-encodes the voice ref and rebuilds the KV cache. If you have info on how you measured 12ms, let me know.
1 reply · 0 reposts · 0 likes · 47 views
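A minimal sketch of the measurement described above, separating the first (cold) call from repeated (warm) calls so any per-call setup cost (voice-ref encoding, KV cache build) shows up explicitly. The `model.generate(text, stream=True)` call follows the signature mentioned in the post; the rest of the harness is hypothetical, not Marvis-TTS benchmark code.

```python
import time

def ttfa_ms(model, text):
    """Time from calling generate() until the first audio chunk is yielded."""
    t0 = time.perf_counter()
    next(iter(model.generate(text, stream=True)))   # signature as described in the post
    return (time.perf_counter() - t0) * 1e3

def cold_vs_warm(model, text="Hello there, how are you doing today?", runs=3):
    cold = ttfa_ms(model, text)                      # pays any one-time setup cost
    warm = [ttfa_ms(model, text) for _ in range(runs)]
    return cold, sum(warm) / len(warm)               # warm ≈ cold if setup repeats every call
```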
korale
korale@korale77·
MLX-Audio benchmarks on M5 Max 128GB @Prince_Canuma @lllucas
10 STT and 19 TTS models (3 runs each)
Lowest latency: SenseVoiceSmall (13ms) + pocket-tts-4bit stream (59ms) = 71ms
Fastest: SenseVoiceSmall (625×) + kitten-tts-nano (112×) vs real-time
Repo with details in replies
[4 images]
3 replies · 2 reposts · 8 likes · 980 views
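For readers new to the metric: the "× vs real-time" factors above are the ratio of audio duration handled to wall-clock time. The durations and timings below are illustrative placeholders chosen to land near 625× and 112×, not values from the repo.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio duration processed (STT) or produced (TTS) divided by wall-clock time."""
    return audio_seconds / wall_seconds

print(realtime_factor(10.0, 0.016))   # ~625x: transcribing 10 s of speech in 16 ms
print(realtime_factor(5.6, 0.050))    # ~112x: synthesizing 5.6 s of audio in 50 ms
```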
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
No please, no! Don't push people to think that a cluster of 4 M5 Max is a viable solution. If you set them to High Power they will drain the battery after a few hours even when connected. Moreover, it will generate an incredible amount of heat and noise. For me it's a no-go, sorry.
Ivan Kuleshov@Merocle

M5 Max cluster: 72 CPU and 128 GPU cores, 512GB unified RAM. Each MacBook is connected to all the others with Thunderbolt 5 (120Gbit/s). But I'll have to use Wi-Fi to connect to the cluster.

37 replies · 10 reposts · 166 likes · 25.7K views
antirez
antirez@antirez·
DS4 is now called DwarfStar4, since you can put a lot of mass into a tiny space... And in a few minutes it is going to be much better on 128GB Macs, because I'll be pushing much better 2-bit quants generated with an in-house iMatrix magic recipe.
35 replies · 40 reposts · 753 likes · 61.4K views
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
M5 Max in Automatic instead of High Power is much slower, I see at least a 15% difference. So when you really want to push... High Power mode on!
8 replies · 0 reposts · 62 likes · 6.8K views
korale
korale@korale77·
Benchmarked DeepSeek V4 Flash (284B) @ q2 on MacBook Pro M5 Max 128GB using ds4.c from @antirez
- 360 t/s prefill peak
- 20-36 t/s generation across various context lengths
- 128K context still going at 20 t/s!
[image]
4 replies · 4 reposts · 46 likes · 23.5K views
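A rough way to turn those throughput numbers into expected wall-clock latency for a single request. Defaults use the post's peak prefill speed and its worst-case (128K-context) generation speed; it ignores sampling overhead and the fact that both speeds vary with context length, so treat it as a first-order estimate only.

```python
def estimate_seconds(prompt_tokens, output_tokens, prefill_tps=360.0, gen_tps=20.0):
    """First-order latency estimate: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# e.g. an 8K-token prompt and a 500-token answer at long-context speeds:
print(f"{estimate_seconds(8_000, 500):.1f} s")   # ~47.2 s
```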
korale
korale@korale77·
So many cool things to try out in this release!
Prince Canuma@Prince_Canuma

mlx-vlm v0.5.0 is here 🚀 This is the largest release ever 🙌🏽
→ Continuous batching server + KV cache quantization
→ MTP and DFlash speculative decoding (single, batch, server)
→ Distributed inference: Qwen3.5, Kimi K2.5 & K2.6
→ Prompt caching w/ warm-disk persistence
→ Gemma 4 video (multi-video) + MTP drafter @googlegemma
→ New models: Youtu-VL, Nemotron 3 Nano Omni, SAM 3D Body
→ Server: json_schema response_format, thinking mode flag
Huge thanks to all 21 contributors and in particular the 18 new contributors, welcome aboard 🚢
Get started today:
> uv pip install -U mlx-vlm
Leave us a star ⭐️ github.com/Blaizzy/mlx-vlm

0 replies · 0 reposts · 0 likes · 467 views
Prince Canuma
Prince Canuma@Prince_Canuma·
DeepSeek V4 MLX Quants now on the MLX community HF repo, made possible by @LambdaAPI and @TheZachMueller ❤️ Without a GPU cluster it would take me a week to upload the quants… Model collection 👇🏽
Prince Canuma@Prince_Canuma

DeepSeek-v4 now runs at ~23-26 tok/s on MLX! I made some custom kernels for the sinkhorn and it took gen speeds from 17 -> 26 tok/s. The weights are also significantly smaller thanks to @pcuenq's tip about keeping the experts in MXFP4! Now you can use it to power your local coding agents (PI, Open code, Hermes agent or even CC) PR: github.com/ml-explore/mlx…

4 replies · 10 reposts · 60 likes · 7.3K views
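The "sinkhorn" here presumably refers to a Sinkhorn-style normalization in the model's routing path. For readers unfamiliar with the op, a generic, unfused Sinkhorn-Knopp iteration in MLX looks roughly like the sketch below; this is an illustration of the algorithm being fused, not the custom kernel from the PR, and the shapes and iteration count are assumptions.

```python
import mlx.core as mx

def sinkhorn(logits: mx.array, n_iters: int = 3, eps: float = 1e-9) -> mx.array:
    """Generic Sinkhorn-Knopp normalization: alternately normalize rows and
    columns of exp(logits) toward a doubly-stochastic matrix."""
    p = mx.exp(logits)
    for _ in range(n_iters):
        p = p / (mx.sum(p, axis=-1, keepdims=True) + eps)  # each token's row sums to 1
        p = p / (mx.sum(p, axis=-2, keepdims=True) + eps)  # each expert's column sums to 1
    return p

# toy example: routing scores for 4 tokens over 3 experts
print(sinkhorn(mx.random.normal((4, 3))))
```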
korale retweeted
Julien Chaumond
Julien Chaumond@julien_c·
This is where we are right now. And I'm not gonna lie, it feels pretty magical 🧚‍♀️
Qwen3.6 27B running inside of the Pi coding agent via Llama.cpp on the MacBook Pro.
For non-trivial tasks on the @huggingface codebases, this feels very, very close to hitting the latest Opus in Claude Code, or whatever shiny monopolistic closed-source API of the day. In full airplane mode.
Most people haven't realized this yet. If you have, it means you have a huge head start on what I call the second revolution of AI.
Powerful local models for efficiency, security, privacy, sovereignty 🔥
[image]
262 replies · 453 reposts · 5.3K likes · 650.1K views
korale
korale@korale77·
Been using Presidio before. Have to try this!
Prince Canuma@Prince_Canuma

Congratulations @OpenAI on the release of the privacy filter! It comes with day-0 support on MLX; now developers can run PII filtering and more completely on-device. PR will be merged in a couple of minutes 🚀

0 replies · 0 reposts · 1 like · 52 views
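For context, the Presidio workflow referenced above (the existing approach an on-device filter would replace) looks roughly like this; the sample text and redacted output are illustrative.

```python
# Microsoft Presidio: the analyzer finds PII spans, the anonymizer redacts them.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe and my phone number is 212-555-0123."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")        # detected PII entities

anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)   # e.g. "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
```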
korale retweeted
Qwen
Qwen@Alibaba_Qwen·
🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇
What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours
Smaller model. Bigger results. Community's favorite. ❤️
We can't wait to see what you build with Qwen3.6-27B! 👀 🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…
[image]
532 replies · 1.7K reposts · 12.5K likes · 3.7M views
korale
korale@korale77·
Took some time to run a full NIAH sweep to get more data before replying. 490 runs across both Gemma 4 models, 8k to 200k context, 5 needle depths per cell.
Short answer: the 13% rule was about multi-fact QA (Northwind). Single-needle NIAH is more forgiving on 31B:
- TA-8192: 5/5 at 100k (8.2% ratio), cliff at 128k (6.4%)
- TA-16384: 5/5 at 128k, still 4/5 at 200k
- TA-32768: 5/5 everywhere through 200k
Big factor is architecture. 31B's 10 full-attention layers tolerate pruning much better than 26B's 5. Same budget, 26B fails 3x earlier (TA-8192 breaks at 48k vs 128k).
0 replies · 0 reposts · 0 likes · 106 views
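The ratio arithmetic behind the percentages quoted above (attention budget divided by context length), for anyone mapping the sweep grid; the context lengths listed are just the ones named in this thread.

```python
budgets = [8192, 16384, 32768]                          # TriAttention budgets from the thread
contexts = [8_000, 48_000, 100_000, 128_000, 200_000]   # context lengths named in the thread

for b in budgets:
    row = ", ".join(f"{c // 1000}k: {100 * b / c:.1f}%" for c in contexts)
    print(f"TA-{b}: {row}")

# TA-8192: 8k: 102.4%, 48k: 17.1%, 100k: 8.2%, 128k: 6.4%, 200k: 4.1%
# (the 8.2% and 6.4% figures in the post are exactly these ratios)
```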
Prince Canuma
Prince Canuma@Prince_Canuma·
@korale77 @no_stp_on_snek Yep, I got to the same conclusion with my benchmarks. Could you speak more on the NIAH benchmarks where TA worked when it was at least 13% of the context?
[image]
1 reply · 0 reposts · 0 likes · 31 views
korale
korale@korale77·
Benchmarked @Prince_Canuma's TriAttention + TurboQuant in MLX-VLM.
TriAttention: flat KV + stable decode from 8k → 200k context vs baseline KV growth and ~50% slowdown.
TurboQuant: 30-65% KV savings, no tuning, accuracy preserved.
Repo w/ writeup, charts, and code in replies.
[3 images]
4 replies · 3 reposts · 26 likes · 4.7K views
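Why a fixed attention budget gives the flat-KV behavior described above: per-layer KV memory scales with the number of cached tokens, so capping the tokens caps the memory. The sketch below uses hypothetical model dimensions and applies the cap to every layer, which is a simplification (in the design discussed in this thread, the full-attention layers keep the whole cache).

```python
def kv_bytes(tokens_kept, layers=48, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes.
    All dimensions here are hypothetical, just to show the shape of the curve."""
    return 2 * layers * kv_heads * head_dim * tokens_kept * bytes_per_elem

for ctx in (8_000, 50_000, 100_000, 200_000):
    baseline = kv_bytes(ctx)               # keeps every token -> grows linearly with context
    budgeted = kv_bytes(min(ctx, 16_384))  # budget-capped layers stay flat past the cap
    print(f"{ctx // 1000:>3}k ctx: baseline {baseline / 2**30:.1f} GiB, "
          f"budget-16k {budgeted / 2**30:.1f} GiB")
```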
korale
korale@korale77·
31B (10 full-attn layers):
- TA-8192 perfect to 100k, cliff at 128k
- TA-16384 5/5 to 128k, 4/5 at 200k
- TA-32768 perfect everywhere
26B (5 full-attn layers):
- Same budgets fail ~3x earlier
- Early positions (10-25%) drop first
- Needs larger budgets to match 31B
1 reply · 0 reposts · 1 like · 127 views
korale
korale@korale77·
Follow-up on TurboQuant + TriAttention benchmarks in MLX-VLM. Ran full NIAH sweep on Gemma 4 31B and 26B-A4B. 490 runs, 8k–200k context, 5 needle positions each. Same TA budgets, 3x earlier failure. Full-attention layer count is the bottleneck.
[2 images]
1 reply · 0 reposts · 1 like · 67 views