Muhammad Ayan: "oh my.. this shouldn't be possible a 1B model that runs inside your browser, be"

Muhammad Ayan@socialwithaayan·6d

oh my.. this shouldn't be possible: a 1B model that runs inside your browser, beats every model its size, and comes with its own desktop pet. MiniCPM-5 1B just changed the game for on-device AI. here's everything you need to know 🧵

Muhammad Ayan@socialwithaayan·6d

so what is it? MiniCPM-5 1B is a 1-billion-parameter text model by @OpenBMB, part of the "Pocket Rocket" series. it's built for one thing: running powerful AI locally on any device you own. your laptop. your phone. even inside a browser tab.

Muhammad Ayan@socialwithaayan·6d

the benchmarks are insane for 1B. MiniCPM-5 1B vs the competition:
> 48.85 on MMLU-Pro (Qwen3.5: 42.74)
> 70.06 on MMLU-Redux (Qwen3.5: 61.50)
> 91.60 on MATH-500 (Qwen3.5: 30.40)
> 40.42 on AIME-2025 (Qwen3.5: 1.04)
> 79.53 on τ²-Bench (Qwen3.5: 19.60)
it destroys Qwen3.5-0.8B, Qwen3-0.6B, and LFM2.5-1.2B-Thinking across the board. knowledge. math. code. tool calling. it leads everywhere.

Muhammad Ayan@socialwithaayan·6d

but here's the coolest part.. it comes with a desktop pet app. a small AI companion that lives on your screen like a pixel buddy. I installed it on my Mac. loaded the model. and within minutes it was chatting with me right on my desktop. no cloud. no API costs. no internet needed. just a local AI pet you actually own.

Muhammad Ayan@socialwithaayan·6d

and it runs literally everywhere. here's the breakdown:
> FP16: ~2GB VRAM (GPU / MacBook / server, zero loss)
> INT8: ~1GB (laptop / edge box, near-lossless)
> INT4/Q4: ~0.5GB (phone / tablet / even a car system)
inference via llama.cpp, Ollama, vLLM, SGLang, Hugging Face, and ArcLight. ArcLight is their open-source CPU inference framework. you can run a full LLM inside a Chrome tab. 0.5GB. on a phone. let that sink in.
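
The memory figures in the breakdown follow from simple arithmetic: parameter count times bytes per weight. A minimal sketch, assuming "1B" means exactly 1e9 parameters and counting weights only (no KV cache, activations, or runtime overhead); the names `PARAMS`, `BITS_PER_WEIGHT`, and `weight_gb` are illustrative, not from any library:

```python
# Back-of-envelope weight memory: parameters x bytes per parameter.
# Assumption: "1B" is taken as exactly 1e9 parameters.
# Weights only; KV cache, activations, and runtime overhead are ignored.
PARAMS = 1_000_000_000

# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_gb(params: int, bits: int) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_gb(PARAMS, bits):.1f} GB")
```

This prints roughly 2.0, 1.0, and 0.5 GB, matching the thread's numbers. Real quantized files (for example Q4 variants in llama.cpp) usually come out somewhat larger, since they store scale factors alongside the 4-bit weights.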

kextcache@kextcache·6d

@socialwithaayan 0.5GB numbers look clean, but sustained inference is where it gets ugly. KV cache on edge quants blows up fast with ctx length. Test under real prompts, not cold load, and watch nvidia-smi through the whole session
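
The reply's point about KV cache growth can be made concrete with the standard size formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. A rough sketch; the layer count, KV head count, and head dimension below are hypothetical values for a generic 1B-class model, not MiniCPM-5's published architecture, and the cache is assumed to be stored in FP16:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence.

    The factor 2 covers keys and values; each layer stores one vector of
    n_kv_heads * head_dim elements per token for each of them.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


# Hypothetical 1B-class configuration (assumption, NOT MiniCPM-5's real config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 24, 8, 64

for ctx in (2_048, 8_192, 32_768):
    gb = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx) / 1e9
    print(f"ctx={ctx:>6}: KV cache ~{gb:.2f} GB (FP16)")
```

Under these assumed dimensions the cache is about 0.10 GB at 2K tokens and about 1.61 GB at 32K, several times the 0.5GB of INT4 weights. Growth is linear in context length, which is why memory measured at cold load says little about a long session.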
