
Ollama 0.19 shipped this week with a meaningful architecture shift. The local inference engine now runs on Apple's MLX framework, and on M5-series chips the results are concrete: 1,851 tokens per second on prefill and 134 tokens per second on decode when running Qwen3.5-35B-A3B quantized to NVFP4.
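Those prefill and decode figures map directly onto the timing fields Ollama's `/api/generate` endpoint already reports (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). A minimal sketch of the arithmetic, using illustrative sample values rather than a real measurement:

```python
# Sketch: derive prefill and decode throughput from the timing fields
# in an Ollama /api/generate response. Durations are in nanoseconds.
# The sample dict below is illustrative, not a benchmark result.

def throughput(resp: dict) -> tuple[float, float]:
    """Return (prefill_tok_s, decode_tok_s) from Ollama response stats."""
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    decode = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prefill, decode

# Hypothetical run: 1,024 prompt tokens prefilled in ~0.55 s,
# 256 tokens decoded in ~1.9 s.
sample = {
    "prompt_eval_count": 1024,
    "prompt_eval_duration": 550_000_000,
    "eval_count": 256,
    "eval_duration": 1_900_000_000,
}
prefill_tps, decode_tps = throughput(sample)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```

In a real session you would feed `throughput` the JSON body returned with `"done": true`; the point is only that both headline numbers come straight from these four fields.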
The NVFP4 detail is the part practitioners should actually care about. NVIDIA's 4-bit floating point format is increasingly standard in production cloud deployments, and by supporting it locally, Ollama closes the gap between what runs on your machine and what runs in production. Quantization variance, the longstanding bugbear of local AI development, shrinks: you are no longer debugging a Q4_K_M artifact that behaves differently from the int4 deployment your team uses in production.
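Why the format matters is easy to see in miniature. Below is a toy sketch of block-scaled 4-bit float quantization in the spirit of NVFP4: values are snapped to the signed FP4 (E2M1) grid with a per-block scale. Real NVFP4 stores an FP8 scale per 16-element block; the full-precision scale and 8-element block here are simplifications for readability.

```python
# Toy block-scaled FP4 quantizer. Not NVIDIA's implementation --
# real NVFP4 uses 16-element blocks with FP8 (E4M3) scales.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive FP4 values

def quantize_block(xs: list[float]) -> tuple[list[float], float]:
    """Snap one block to the signed E2M1 grid; return (codes, scale)."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # map block max to 6.0
    codes = []
    for x in xs:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale

def dequantize(codes: list[float], scale: float) -> list[float]:
    return [c * scale for c in codes]

block = [0.07, -0.31, 0.12, 0.55, -0.02, 0.44, -0.18, 0.29]
codes, s = quantize_block(block)
restored = dequantize(codes, s)
err = max(abs(a - b) for a, b in zip(block, restored))
print(f"max abs error: {err:.4f}")
```

The error is small but nonzero, and crucially it depends on the grid and scaling scheme: a Q4_K_M artifact and an NVFP4 artifact round the same weights differently, which is exactly the behavioral drift the format alignment removes.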
The catch sits where it always has with Apple Silicon: 32GB of unified memory minimum. This is not a democratic release. It is a performance release for users who already own the right hardware. Older Intel Macs, 16GB baseline machines, and anyone outside the Apple ecosystem see little change. For this tier of users, the benchmark numbers are largely academic.
That said, the trajectory matters. Apple Silicon's memory architecture has been theoretically ideal for LLM inference since the M1 launched, but the software stack never fully exploited it. Ollama's MLX work suggests the gap is finally narrowing. If this performance advantage holds for larger models, it raises a genuine question for the local AI community: is Apple quietly becoming the dominant inference platform for developers who can afford the ticket price?