
Gemma 4 architecture analysis thread. Just like Gemma 3n, this thing has a galaxy-brained architecture, very much not a standard transformer.
Mirai Labs (@trymirai)
Frontier on-device AI lab. Models, runtime & infrastructure to make on-device AI interactive, ambient & continuous.

Day 0 support across the stack:
> Hardware: @AMD, @Intel, @Qualcomm
> On-device: @lmstudio, @Cactuscompute, @RunAnywhereAI, @zeticai_, @trymirai
> Customization: @distil_labs





(1/n) I recently joined @trymirai, where we are working on LLM inference targeting Apple Silicon. Lately I've been digging into quantization. LLM inference is mostly memory-bound. The byte/FLOP ratio is high enough that a lot of the machine's time goes to moving data around instead of doing compute. Quantization helps with that in general, but on Apple Silicon there's an extra payoff: the GPU has a fast W8A8 path. If both weights and activations are INT8, you can use that path for prefill and speculative-decoding verification. Weights are easy since they're static and can be quantized offline. Activations are where the real pain starts.
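To make the activation problem concrete, here is a minimal numpy sketch of symmetric per-tensor INT8 quantization with a dynamic scale taken from the tensor's absmax. This is an illustrative toy, not Mirai's actual pipeline: real W8A8 prefill kernels would compute scales per-token or per-channel and keep everything on the GPU.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric per-tensor quantization: one scale derived from the absmax.
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
# Rounding error per element is at most half a quantization step.
assert np.abs(x - x_hat).max() <= s / 2 + 1e-6
```

The pain point the tweet alludes to is visible even here: the scale depends on the activation values, so unlike weights it can't be fixed offline, and a single outlier inflates the scale and crushes the resolution of everything else.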

We are doing really cool hard tech at @trymirai, but until recently our social media feeds were full of LinkedIn-ish cringe. We decided to fix that and share more technical content. I'm currently working on our quantization pipeline, so here's a thread about LLM quantization.




Why does Muon beat Adam for training quantized networks? It comes down to what each optimizer treats as "distance" in weight space. Adam treats a weight matrix as a flat vector of numbers. Muon treats it as a linear map and measures change by how much the input-output mapping moved. The gradient G has SVD G = U Sigma V^T; Muon's update is just U V^T. Keep the directions, throw away the magnitudes.
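The "keep the directions, throw away the magnitudes" step can be sketched in a few lines of numpy. Note this uses an exact SVD for clarity; the real Muon optimizer approximates U V^T with a cheap Newton-Schulz iteration instead of computing the decomposition.

```python
import numpy as np

def muon_update(G: np.ndarray) -> np.ndarray:
    # Orthogonalize the gradient: G = U @ diag(S) @ Vt, update = U @ Vt.
    # All singular values are replaced by 1, so every singular direction
    # of the input-output map moves by the same amount.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

G = np.random.randn(16, 4)
O = muon_update(G)
# The update is semi-orthogonal: its columns are orthonormal.
assert np.allclose(O.T @ O, np.eye(4), atol=1e-6)
```

Because the update has unit singular values, no single direction dominates, which is plausibly why it interacts well with quantized weights: the per-step change to the mapping is uniformly bounded.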

1/ Apple shipped Metal Performance Primitives — a GPU matmul API built on cooperative_tensor. If you look at Apple's open-source code for an example of how to use MPP, you'll find a hardcoded M5 memory layout.





Personal AI should run on your personal devices. So, we built OpenJarvis: a personal AI that lives, learns, and works on-device. Try it today and top the OpenJarvis Leaderboard for a chance to win a Mac Mini! Collab w/ @Avanika15, John Hennessy, @HazyResearch, and @Azaliamirh. Details in thread.

