Malek Ould-Oulhadj

178 posts

@malekoo

Principal Engineer, Founder @AICTechPro | Obsessed with local AI & MLX on Apple Silicon.

Massachusetts, USA · Joined May 2009
337 Following · 223 Followers
Pinned Tweet
Malek Ould-Oulhadj @malekoo
We've been building a native MLX macOS app that runs large language models entirely on your Mac. No cloud. No API keys. No subscriptions. Fully private and secure. Think Qwen 3.5 models running locally, optimized for full agentic workflows with tool integration: Calendar, Reminders, Files, Music, Web Search, Home automation, and more. Built for Apple Silicon. LLMs run fully on-device with Swift MLX. Multiple ways to interact with your local model: the Mac app directly, a paired iPhone remote app, or server APIs for custom integrations. We're getting close to our first release on the Mac App Store and looking for beta testers. Drop a reply or DM if you want early access.
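For illustration only: a minimal Python mlx-lm sketch of the tool-calling pattern described above (the app itself runs on Swift MLX, and its code is not shown here). The model id, the create_reminder tool, and the <tool_call> parsing convention are assumptions, not the app's actual implementation.

import json
from mlx_lm import load, generate

# Placeholder model id; any tool-capable instruct model converted to MLX works similarly.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

# Hypothetical tool schema standing in for a local Reminders integration.
tools = [{
    "type": "function",
    "function": {
        "name": "create_reminder",
        "description": "Add a reminder to the macOS Reminders app.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "due": {"type": "string", "description": "ISO 8601 date/time"},
            },
            "required": ["title"],
        },
    },
}]

messages = [{"role": "user", "content": "Remind me to review the beta feedback tomorrow at 9am."}]

# Tool-aware chat templates (e.g. Qwen's) render the tool schema into the prompt.
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)

# Qwen-style models wrap tool calls in <tool_call> ... </tool_call>; dispatch locally.
if "<tool_call>" in text:
    call = json.loads(text.split("<tool_call>")[1].split("</tool_call>")[0])
    print("would invoke:", call["name"], call.get("arguments", {}))
else:
    print(text)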
Malek Ould-Oulhadj @malekoo
Gemma 4 MLX benchmarks on M3 Ultra, fully local. 26B MoE: 104 tok/s decode, 3.3x faster than the 31B dense (32 tok/s); 286 vs 105 tok/s prefill; 3 GB less memory. Outstanding work by @GoogleDeepMind on Gemma 4. The MoE architecture is genuinely impressive: quality and efficiency at this level, running fully on-device. Thanks to @Prince_Canuma for the MLX conversions. Benchmarked here with Python mlx-vlm. We're also building a macOS app powered by MLX Swift with agentic tool calling, currently in development. More tests soon.
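For reference, a minimal timing harness of the kind behind decode tok/s figures like these, sketched here with text-only mlx-lm rather than mlx-vlm and a placeholder model id; the numbers above are the author's own measurements and are not reproduced by this snippet.

import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")  # placeholder model id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the benefits of on-device inference."}],
    add_generation_prompt=True,
    tokenize=False,
)

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

gen_tokens = len(tokenizer.encode(text))
print(f"{gen_tokens} tokens in {elapsed:.2f}s -> {gen_tokens / elapsed:.1f} tok/s")
# Note: this lumps prefill and decode together; generate(..., verbose=True) reports
# prompt (prefill) and generation tok/s separately, which is what benchmarks usually quote.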
Malek Ould-Oulhadj retweeted
Ivan Fioravanti ᯅ @ivanfioravanti
Hybrid AI is the way to go! 🤷🏻‍♂️ I've been running LLMs locally on Apple Silicon for over two years now: in @CoreViewHQ, in my investment company, and for my family's use. Inference, classification, clustering, fine-tuning, agentic coding, chat, all on Mac Studios that sit on my desk. No cloud dependency. No data leaving the building. Total privacy. 💪 With M5 Ultra pushing unified memory further, the gap between "local" and "data center" will get even narrower. Most companies don't need a GPU cluster. They need a Mac Studio and the right stack of software to make the magic happen. Apple's AI play isn't Siri. It's Apple Silicon. Are you ready? 🚀
Malek Ould-Oulhadj @malekoo
🚀 Qwen3.5-Omni just landed, and Qwen3.6 could be next!
Malek Ould-Oulhadj @malekoo
@mweinbach @ivanfioravanti Great minds think alike. Like a swarm, no central coordination needed. Individual contributors solving pieces of the puzzle independently and the collective intelligence just emerges. This is how we bring frontier AI to local hardware for everyone.
Ivan Fioravanti ᯅ @ivanfioravanti
MLX: what if we ask Gemini CLI & 3.1 Pro Preview to create a custom Metal kernel for Qwen3.5 MoE using Morton order, accumulation loop synchronization, and more suggestions from the Metal Performance Primitives (MPP) Programming Guide? MAGIC in bf16! 🔥 Int8 and benchmarks tomorrow!
Malek Ould-Oulhadj @malekoo
For production single-model apps, yes. When you strip the multi-architecture abstraction layer and know your exact shapes, head dims, expert count, and attention pattern at compile time, custom kernels become the obvious next step. Generic frameworks optimize for flexibility. Custom kernels optimize for the one model you actually ship.
Malek Ould-Oulhadj @malekoo
This is exactly the direction. Qwen3.5's hybrid architecture (GatedDeltaNet + MoE + full attention) means generic kernels leave performance on the table. When you know the exact layer types, head dimensions, expert routing and tensor shapes, you can tune Metal threadgroup sizes and memory access patterns specifically for that model. Waiting on those int8 benchmarks.
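Not the kernel being discussed above, just a sketch of the mechanism MLX exposes for writing one from Python (mx.fast.metal_kernel), using a trivial elementwise op. A shipped Qwen3.5 kernel would bake in the exact head dims, expert count, and threadgroup sizes mentioned in these replies; everything below is illustrative.

import mlx.core as mx

# Kernel body only; MLX generates the surrounding Metal function signature.
source = """
    uint elem = thread_position_in_grid.x;
    T x = inp[elem];
    out[elem] = x / (T(1) + metal::exp(-x));   // silu(x) = x * sigmoid(x)
"""

silu_kernel = mx.fast.metal_kernel(
    name="custom_silu",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

x = mx.random.normal((4096,))                  # float32 input
(y,) = silu_kernel(
    inputs=[x],
    template=[("T", mx.float32)],
    grid=(x.size, 1, 1),                       # one thread per element
    threadgroup=(256, 1, 1),                   # the kind of knob you pin per known shape
    output_shapes=[x.shape],
    output_dtypes=[x.dtype],
)
mx.eval(y)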
Malek Ould-Oulhadj @malekoo
A lot of LLM intelligence is being spent on adapting traditional tools to the AI era, on frontier labs developing apps overnight to kill competition, and on chasing money-making trends. Instead, that intelligence should be fulfilling its long-promised potential for humanity: health and wellbeing research, learning how to evolve in harmony together, and building on the breakthroughs we’ve already achieved rather than trying to kill others’ creativity and coexistence.
Malek Ould-Oulhadj @malekoo
@Prince_Canuma In the case of Qwen3.5 models where 75% of layers are GatedDeltaNet with no KV cache and 25% are full attention, what do the total inference memory savings look like when TurboQuant only applies to those full attention layers? Also, TurboQuant on MLX same day as Google's paper. Thank you for this 🙏
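A back-of-the-envelope version of that question, with made-up shapes rather than Qwen3.5's real config: if only the full-attention quarter of the layers carries a KV cache, TurboQuant can only shrink that slice.

# All shapes below are hypothetical, chosen only to show the arithmetic.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    return layers * 2 * kv_heads * head_dim * seq_len * bits / 8  # K and V per layer

total_layers = 48
full_attn_layers = total_layers // 4           # 25% full attention (assumption)
kv_heads, head_dim, seq_len = 8, 128, 32_768   # made-up shapes

fp16 = kv_cache_bytes(full_attn_layers, kv_heads, head_dim, seq_len, bits=16)
tq3  = kv_cache_bytes(full_attn_layers, kv_heads, head_dim, seq_len, bits=3)

print(f"full-attention KV cache: {fp16/2**30:.2f} GiB fp16 -> {tq3/2**30:.2f} GiB at ~3-bit")
# The 75% GatedDeltaNet layers keep a small fixed-size recurrent state instead of a
# KV cache, so total inference-memory savings are bounded by this remaining 25% slice.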
Google Research @GoogleResearch
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
Malek Ould-Oulhadj @malekoo
Great question. Every AI app brings its own flavor and experience to local models, and that diversity is a win for everyone. More options means you find what fits your workflow best, and there's no reason you can't use more than one. What we're focused on is the assistant experience: built-in integrations with your Mac (Calendar, Reminders, Files, Music, Web Search, Home automation), a first-party iPhone companion app, and a design that's optimized around daily workflows rather than tinkering. You open it and it just works. The local AI space is better when there are many great tools to choose from.
Malek Ould-Oulhadj @malekoo
Google's TurboQuant is a genuine breakthrough for local AI. Here's what it actually does and doesn't do, since there's been some confusion.
When an LLM processes your conversation, it stores keys and values for every token it's seen. That's called the KV cache, and it grows fast. Long conversations eat memory quickly, slow down, and eventually hit a wall.
TurboQuant shrinks that KV cache from 16 bits down to about 3 bits per value with minimal to zero quality loss. No retraining. No new model files. Just smarter compression at inference time. The paper demonstrates a 6x reduction in KV cache memory.
What it does:
→ 6x smaller KV cache memory footprint
→ Significantly longer conversations before running out of memory
→ Works with your existing models, nothing to re-download
What it doesn't do:
→ Make models smaller to download or load. A 32GB model still needs 32GB.
The real story: if you run models locally, your biggest constraint after loading the model is how much context you can fit before memory runs out. TurboQuant pushes that limit way back.
Credit to Google for publishing this openly instead of keeping it internal. This benefits everyone running local AI.
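A rough illustration of the "longer conversations" point, using hypothetical 7B-class shapes rather than any specific model's config: how many tokens of context fit in a fixed KV-cache budget at 16-bit versus roughly 3-bit per value.

# Hypothetical shapes; only the ratio matters here.
def max_context_tokens(budget_gib, layers, kv_heads, head_dim, bits):
    bytes_per_token = layers * 2 * kv_heads * head_dim * bits / 8  # K and V per token
    return int(budget_gib * 2**30 / bytes_per_token)

layers, kv_heads, head_dim = 32, 8, 128   # made-up 7B-class shapes
budget_gib = 4                            # memory left over after loading the weights

print("fp16 cache  :", max_context_tokens(budget_gib, layers, kv_heads, head_dim, 16), "tokens")
print("~3-bit cache:", max_context_tokens(budget_gib, layers, kv_heads, head_dim, 3), "tokens")
# Roughly 5x more tokens in the same budget; the model weights themselves are unchanged.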
Malek Ould-Oulhadj @malekoo
Small but important distinction: TurboQuant doesn't make models 6x smaller or 8x faster across the board. It compresses KV caches (the memory used to hold context during generation) to ~3 bits, giving you roughly 4-6x more context in the same memory. The 8x speedup was on attention computation specifically, on H100 GPUs. Your Mac Mini won't suddenly run models it couldn't before, but it will handle much longer conversations with models it already runs. That's the real unlock for local inference. Agreed on Google though. Publishing this openly is a big deal.
Prajwal Tomar @PrajwalTomar_
Google just dropped TurboQuant, and I'm about to get so much more out of my Mac Mini now lol. It makes LLMs 6x smaller and 8x faster with zero quality loss. Now I can run insane AI models locally for free.
→ Bigger context windows
→ Way faster processing
→ Completely secure
Google could have kept this to themselves. They didn't. Huge respect for pushing the entire industry forward. We're just 3 months into 2026 and so much has been happening already. If you're not paying attention, you're already behind.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

Malek Ould-Oulhadj @malekoo
One thing to note: TurboQuant compresses KV caches (the memory used during generation), not the model weights themselves. It doesn't make models smaller to download or load. A model that needs 32GB of RAM to load still needs 32GB. The real win is context length. By compressing key/value caches to ~3 bits, you can fit roughly 4x more context in the same memory. That's genuinely huge for local inference, where memory is the bottleneck for long conversations and complex tasks. Props to Google for publishing this openly. This is a real breakthrough for anyone running models locally.
Alex Finn @AlexFinn
This is potentially the biggest news of the year.
Google just released TurboQuant. An algorithm that makes LLMs smaller and faster, without losing quality.
Meaning that 16GB Mac Mini can now run INCREDIBLE AI models. Completely locally, free, and secure.
This also means:
• Much larger context windows possible with way less slowdown and degradation
• You’ll be able to run high quality AI on your phone
• Speed and quality up. Prices down.
The people who made fun of you for buying a Mac Mini now have major egg on their face. This pushes all of AI forward in such a MASSIVE way.
It can’t be stated enough: props to Google for releasing this for all. They could have gatekept it for themselves like I imagine a lot of other big AI labs would have. They didn't. They decided to advance humanity.
2026 is going to be the biggest year in human history.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

Malek Ould-Oulhadj @malekoo
@MatthewBerman @Prince_Canuma TurboQuant compresses KV caches, so the primary win is longer context at the same memory budget. Speed gains depend on kernel maturity. Nice to see 2x on a 5060 already! Should improve further as implementations get fused attention and Hadamard rotation optimizations.
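On the Hadamard rotation point: a generic sketch of why rotating before low-bit quantization helps (it spreads outlier channels so a per-token absmax scale wastes fewer bits). This is the standard trick from the quantization literature, not a claim about TurboQuant's actual recipe; the shapes and the injected outlier are arbitrary.

import numpy as np
from scipy.linalg import hadamard

d = 128                                    # head dim (power of two for Hadamard)
H = hadamard(d) / np.sqrt(d)               # orthonormal Hadamard matrix
k = np.random.standard_normal((1024, d)).astype(np.float32)
k[:, 7] *= 25.0                            # inject an outlier channel

def quantize(x, bits):
    # Per-token absmax quantization, then dequantize back for error measurement.
    scale = np.abs(x).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

plain   = quantize(k, 3)
rotated = quantize(k @ H, 3) @ H.T         # rotate, quantize, rotate back

print("3-bit error, plain  :", np.mean((plain - k) ** 2))
print("3-bit error, rotated:", np.mean((rotated - k) ** 2))
# With the outlier present, the rotated version should show a markedly lower error.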
Prince Canuma @Prince_Canuma
Just implemented Google’s TurboQuant in MLX and the results are wild! Needle-in-a-haystack using Qwen3.5-35B-A3B across 8.5K, 32.7K, and 64.2K context lengths:
→ 6/6 exact match at every quant level
→ TurboQuant 2.5-bit: 4.9x smaller KV cache
→ TurboQuant 3.5-bit: 3.8x smaller KV cache
The best part: zero accuracy loss compared to the full KV cache.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

Awni Hannun @awnihannun
I joined Anthropic as a member of the technical staff. Excited to work on frontier modeling at a place with unwavering values and a generational mission.
Prince Canuma @Prince_Canuma
Exciting News: FastMLX and I are Joining @arcee_ai! 🎉
This move marks an exciting new chapter in my journey to advance ML research, development and production.
The Incredible Team 👨🏽‍💻
I'm honored to be part of an exceptional team of superstars at Arcee – some of the brightest and most dedicated professionals I've had the privilege to work with. A special shoutout to @markmcquade, Jacob, @FernandoNetoAi, @latkins, @MaziyarPanahi, @qnguyen3, and the entire Arcee team for their warm welcome and shared vision.
The Focus 🎯
Since the beginning of this year, I've been dedicated to advancing the field of machine learning, with a particular emphasis on MLX and MLOps. My efforts have centered on several key areas:
1. Training
2. Pruning techniques
3. Inference Optimisation (cloud and on-device)
4. MLOps and more.
Regarding MLX, the objectives were to expand its capabilities, pioneer new model architectures, and create a robust ecosystem of tools and frameworks. This work is driven by a vision to empower developers like you, enabling the creation of groundbreaking native AI applications specifically optimized for Apple Silicon on platforms such as macOS and iOS.
What This Means for FastMLX 😎
FastMLX will remain open-source for the foreseeable future. I'll be leading its next phase of development at Arcee, backed by increased resources and the support of our fantastic team. We're excited to continue collaborating with the open-source community that has been instrumental in FastMLX's success.
The Future of MLX 🚀
My commitment to MLX remains stronger than ever. I'll continue contributing to MLX with:
- New models
- Innovative tools
- Continuous improvements
Rest assured, the MLX King lives on! 👑
Gratitude ❤️
A heartfelt thank you to my parents for their unwavering support and encouragement since I embarked on my ML journey in 2017. Your belief in me has been invaluable. Special thanks to @awnihannun, @vietdle, @nlauchande and the entire ML and MLX community for your support and inspiration.
Stay tuned for more exciting developments as we push the boundaries of ML together!
Malek Ould-Oulhadj @malekoo
One of those products here. We're building a fully local Mac app powered entirely by Qwen. LLM, TTS, and vision, all running on-device through MLX. The architecture maps so naturally to Apple Silicon it feels native to macOS. MoE + unified memory is a perfect pairing. Grateful to the Qwen team and to you Awni for MLX. The combination is something special.
Awni Hannun @awnihannun
I remember when Qwen 1.0 came out (fall 2023, not that long ago!) and we added support to mlx-lm. And they didn't stop releasing models, every one pushing the frontier of open-weights. @JustinLin610 always reached out to make sure the new models were well supported in MLX. I don't know how many research papers were written thanks to Qwen, hundreds, maybe thousands. I don't know how many products or startups are being built thanks to Qwen. Probably a lot. Thanks @JustinLin610, @huybery and the rest of the Qwen team for your contributions to AI.
Malek Ould-Oulhadj @malekoo
@awnihannun Really sorry to see you go, Awni. Thank you for everything you and the team built. MLX is incredible! Wishing you all the best in the next chapter 🙌
Awni Hannun @awnihannun
Today is my last day at Apple. Building MLX with our amazing team and community has been an absolute pleasure. It's still early days for AI on Apple silicon. Apple makes the best consumer hardware on the planet. There's so much potential for it to be the leading platform for AI. And I'm confident MLX will continue to have a big role in that. To the future: MLX remains in the exceptionally capable hands of our team including @angeloskath, @zcbenz, @DiganiJagrit, @NasFilippova, @trebolloc (and others not on X). Follow them or @shshnkp for future updates.