Akram Shehadi

Took me a bit to understand how to do the local setup, so here's a guide:

1) Install oMLX
Download the Apple Silicon macOS build of oMLX from the official oMLX releases/site.

2) Create a Hugging Face account and a read-only token

3) Install the Hugging Face CLI
pip install -U huggingface_hub
Then log in interactively:
hf auth login

4) Know where oMLX stores models
By default, oMLX uses:
~/.omlx/models

I tested two model options:
- Option A: Qwen3.6 + DFlash draft model
- Option B: Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit + Qwen3.5-4B-MLX-4bit

Option A: Qwen3.6 + DFlash

5A) Download the correct model pair
Base model: use the 4-bit MLX build:
hf download mlx-community/Qwen3.6-35B-A3B-4bit \
  --local-dir ~/.omlx/models/Qwen3.6-35B-A3B-4bit
Draft model: use the DFlash draft:
hf download z-lab/Qwen3.6-35B-A3B-DFlash \
  --local-dir ~/.omlx/models/Qwen3.6-35B-A3B-DFlash

Note: the full Qwen/Qwen3.6-35B-A3B setup previously failed for me:
- model size: ~70 GB
- oMLX max model memory: ~23 GB
because I only have 32 GB RAM. So I used mlx-community/Qwen3.6-35B-A3B-4bit, not Qwen/Qwen3.6-35B-A3B.

6A) Start oMLX
/Applications/oMLX.app/Contents/MacOS/omlx-cli serve

7A) Configure DFlash in oMLX
Open the oMLX admin dashboard.

For Qwen3.6-35B-A3B-4bit:
- enable DFlash
- set Draft Model to Qwen3.6-35B-A3B-DFlash
- set Draft quant bits to 4
- optional: set this as default

For Qwen3.6-35B-A3B-DFlash:
- leave DFlash disabled
- do not set it as default
- do not use it as the Pi model

Settings used:
- thinking budget enabled
- thinking budget tokens: 8192
- TurboQuant KV: on
- TurboQuant KV bits: 4
- TurboQuant skip last: on
- DFlash: on
- DFlash draft quant bits: 4

Option B: Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit

5B) Download the main model
hf download mlx-community/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit \
  --local-dir ~/.omlx/models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit

6B) Optional SpecPrefill draft model: Qwen3.5-4B-MLX-4bit
Download it with:
hf download mlx-community/Qwen3.5-4B-MLX-4bit \
  --local-dir ~/.omlx/models/Qwen3.5-4B-MLX-4bit
Lower-memory alternative that was also recommended:
hf download mlx-community/Qwen3.5-0.8B-4bit \
  --local-dir ~/.omlx/models/Qwen3.5-0.8B-4bit

7B) Configure the 27B model in oMLX
For Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit, the config I used:
- max context window: 65536
- max tokens: 8192
- enable thinking: true
- thinking budget enabled: true
- thinking budget tokens: 1024
- TurboQuant KV: on
- TurboQuant KV bits: 4
- TurboQuant skip last: true
- SpecPrefill: on
- SpecPrefill draft model: Qwen3.5-4B-MLX-4bit
- SpecPrefill keep rate: 0.2
- DFlash: off

8) Recommended fast coding preset
If you want speed over depth:
- max context window: 8192 or 16384
- max tokens: 1024
- TurboQuant KV: on
- KV bits: 4
- SpecPrefill: on only when prompts are long
- SpecPrefill keep rate: 0.2
- draft: Qwen3.5-4B-MLX-4bit or Qwen3.5-0.8B-4bit

9) Run Pi with the local oMLX models
For the Qwen3.6 + DFlash setup, use the base 4-bit model:
'/Applications/oMLX.app/Contents/MacOS/omlx-cli' launch pi \
  --model 'Qwen3.6-35B-A3B-4bit' \
  --api-key "$OMLX_API_KEY"
For the Qwen3.5 27B distilled model:
'/Applications/oMLX.app/Contents/MacOS/omlx-cli' launch pi \
  --model 'Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit' \
  --api-key "$OMLX_API_KEY"
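Optional sanity check before launching Pi. This is a minimal sketch that assumes oMLX exposes an OpenAI-compatible HTTP API once serve is running; the port (8080) and the endpoint paths below are my assumptions, so check the admin dashboard for the actual address and API key of your install.

# Assumption: oMLX serves an OpenAI-compatible API; swap in the host/port
# shown in the admin dashboard for your install.
export OMLX_BASE_URL="http://localhost:8080"

# List the models the server picked up from ~/.omlx/models
curl -s "$OMLX_BASE_URL/v1/models" -H "Authorization: Bearer $OMLX_API_KEY"

# Minimal chat completion against the 4-bit base model (DFlash is applied server-side)
curl -s "$OMLX_BASE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $OMLX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-35B-A3B-4bit", "messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'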



Great thread with super useful information. I ended up testing Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit with Qwen3.5-4B-MLX-4bit for SpecPrefill, and it runs pretty well on my M2 Max 32 GB. I wish I had bought the 64 GB version when I had the money though, because I have to close almost everything to have enough available RAM for it. However, I'm impressed at how well it works with @badlogicgames' Pi harness. I originally tried using Qwen3.6-35B-A3B-DFlash as the draft model for Qwen3.6-35B-A3B, but it requires much more RAM :( Definitely worth trying.
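If you're in the same 32 GB boat, here's a quick way to see how much headroom you actually have before loading the model; these are plain macOS tools (nothing oMLX-specific), and 16384 bytes is the Apple Silicon page size that vm_stat reports.

# Total physical memory in bytes
sysctl hw.memsize
# Free pages; multiply "Pages free" by the 16384-byte page size for free RAM
vm_stat
# System-wide memory free percentage (last line of the report)
memory_pressure | tail -1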



🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/De… 🤗 Open Weights: huggingface.co/collections/de… 1/n
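For anyone wiring this into existing tooling: DeepSeek's API has historically been OpenAI-compatible, and the sketch below assumes that still holds for V4; the model id "deepseek-chat" is a guess at the alias for the new default, so check the updated API docs for the exact V4 model names.

# Assumption: the V4 API keeps the OpenAI-compatible chat/completions endpoint
# and "deepseek-chat" resolves to the new default model.
curl -s https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-chat", "messages": [{"role": "user", "content": "What is new in DeepSeek-V4?"}]}'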

Qwen3.6-27B is the first model that made me want that Mac Studio. My current Mac runs the 4-bit version at just 15 tok/s, and loading the initial opencode system prompt takes like 30-50 seconds. Is there a draft model I could be using inside LM Studio for faster inference?

Folks talk about the mental burden of working with multiple agents and context switching. I don't hear much about how tiring working with a single agent can be when you have a good planning workflow and a streamlined I/O interface (voice, comprehensive visual plan artifacts presented on screen, etc). Even working in serial mode you have to process a lot of information, digest it, and think deeply about the next decision before sending the agent off on its next task. To me, this is more tiring than interruptions from concurrent small back-and-forth interactions.









My @postgresconf talk on “Postgres for Production AI Agents” is at 1pm in San Pedro room (level C). Hope to see you there :)





