
ComputeLeap

oMLX is working really well as a single-machine inference engine for coding agents! Caching is managed perfectly (be aware that it can use a ton of disk space!) and oQ quantization delivers great results.

Behind the scenes it uses the standard MLX building blocks (75% created by @Prince_Canuma 🙏):
- mlx-lm
- mlx-vlm
- mlx-embeddings
- mlx-audio

I tested Qwen3.6-35B-A3B-oQ6 on an M5 Max with two pi instances, and it was fast and furious, leveraging the cache like crazy, as you can see in the video. Let me try to create some oQ versions (2, 4, 6?) of MiniMax M2.7 now, and then I'll move on to distributed inference. I must win! 💪

Ollama is now updated to run the fastest on Apple silicon, powered by MLX, Apple's machine learning framework. This change unlocks much faster performance to accelerate demanding work on macOS:
- Personal assistants like OpenClaw
- Coding agents like Claude Code, OpenCode, or Codex

Holy shit… Someone built a production-grade LLM inference server that runs entirely on your Mac, persists KV cache across RAM and SSD so your AI never recomputes context it has already seen, and manages the whole thing from a menu bar icon.

It's called oMLX, and it turns your Apple Silicon Mac into the kind of local AI infrastructure that used to require a dedicated GPU server.

Here is what it actually does:

→ Serves any MLX-format model with continuous batching, the same architecture that powers production inference at scale
→ Tiered KV cache keeps hot blocks in RAM and automatically offloads cold blocks to SSD in safetensors format, so past context survives server restarts and gets restored from disk instead of recomputed from scratch
→ Runs multiple models simultaneously (LLMs, vision-language models, OCR models, embeddings, and rerankers) with LRU eviction, model pinning, and per-model idle timeouts
→ Drop-in OpenAI and Anthropic API compatibility means every tool you already use (Claude Code, OpenClaw, OpenCode, Codex) connects with zero config changes
→ Special Claude Code optimization scales reported token counts so auto-compact triggers at the right time, and SSE keep-alive prevents timeouts during long prefill
→ A web admin dashboard gives you real-time monitoring, one-click benchmarking, model downloading from HuggingFace, and per-model settings that apply instantly without a server restart
→ A native PyObjC menu bar app (not Electron) lets you start, stop, and monitor everything without opening a terminal

No cloud API. No monthly bill. No context window limits you did not set yourself.

6,600 stars. Apache 2.0. 100% Open Source.

Link is in the comments.
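
To make the tiered-cache idea from the post concrete, here is a toy Python sketch of a two-tier LRU cache: hot entries stay in RAM, the least recently used entry spills to disk, and a fresh instance pointed at the same directory can still read spilled entries. This is not oMLX's code. The `TieredCache` class, its method names, and the `ram_capacity` parameter are all hypothetical, and `pickle` stands in for the safetensors format the post mentions just to keep the example standard-library only.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TieredCache:
    """Toy two-tier cache (hypothetical, not oMLX's implementation)."""

    def __init__(self, ram_capacity, spill_dir):
        self.ram_capacity = ram_capacity
        self.spill_dir = Path(spill_dir)
        self.spill_dir.mkdir(parents=True, exist_ok=True)
        self.hot = OrderedDict()  # least recently used entry comes first

    def _path(self, key):
        # Assumption: keys are short strings that are safe as file names.
        return self.spill_dir / f"{key}.pkl"

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)  # mark as most recently used
        while len(self.hot) > self.ram_capacity:
            # Spill the coldest entry to disk instead of discarding it.
            cold_key, cold_block = self.hot.popitem(last=False)
            self._path(cold_key).write_bytes(pickle.dumps(cold_block))

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if path.exists():
            block = pickle.loads(path.read_bytes())
            self.put(key, block)  # promote back into the hot tier
            return block
        return None  # a real server would recompute the block here


spill_dir = tempfile.mkdtemp()
cache = TieredCache(ram_capacity=2, spill_dir=spill_dir)
for name, block in [("a", [1]), ("b", [2]), ("c", [3])]:
    cache.put(name, block)  # "a" is spilled once "c" arrives

# Simulate a restart: a new instance starts with an empty hot tier
# but can still restore "a" from the spill directory.
restarted = TieredCache(ram_capacity=2, spill_dir=spill_dir)
restored = restarted.get("a")
```

Only entries that were spilled before the "restart" survive in this sketch; a real server would presumably also flush hot blocks on shutdown, but the post does not say how oMLX handles that.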





