Pinned Tweet
stevibe
2.5K posts

stevibe
@stevibe
Fullstack | LLM | Local AI addict | Learning ML | Builds things nobody asked for. Benchmarks things for fun.
Joined July 2009
1.5K Following · 12.4K Followers

@digitalix Always set up as new; it's the only chance you get with a new Mac

I gave 6 frontier coding models the same task: turn this emoji into an SVG, from scratch, in real time.
Watching them stream their thinking before a single shape appears is wild — some plan meticulously, others just wing it.
Models:
- GPT-5.3 Codex
- Claude Opus 4.6
- Gemini 3.1 Pro
- MiniMax M2.7
- GLM-5
- Kimi K2.5

NVIDIA just dropped Nemotron-3-Nano:4b — a tiny 2.8GB model. Guess whose hardware runs it the fastest?
- RTX 4090: 226 tok/s
- RTX 3090: 187 tok/s
- Mac Studio M2 Ultra: 86 tok/s
- Mac Mini M4: 25 tok/s
Home court advantage is real.
Also trying a new layout with live performance charts. Lmk what you think!

@stevibe The RTX 5070 Ti did around 214 tok/s.
I’m really liking the capabilities of this nano.

@zhaoxiongding Definitely. We are just comparing one of the factors here

@Nice1774036 Hey, if you want to run local models, the easiest options are Ollama (the one this test uses) or LM Studio; for advanced usage, llama.cpp and vLLM are good choices.

You don't need a cloud API for great OCR anymore.
GLM-OCR runs locally with just ~2GB VRAM, handles tables, math equations, and hits ~260 tok/s on a Mac Studio M2 Ultra.
Local models are getting better AND smaller at a crazy pace. If you have a GPU or a Mac, you're already ready for the AI era.
@Zai_org

@changtimwu The real optimised version should be the NVFP4, but I am using a normal Q4 version here

@abbly298 @Alibaba_Qwen Yeah, the MLX version usually doubles the GGUF version

@stevibe @Alibaba_Qwen I tested MLX Qwen3.5 9B on the 2020 MacBook Pro 13-inch with the M1 chip and 16GB of RAM. It reached a throughput of 13 tok/s using 5.116 GB RAM, which is double what Ollama can do! I got 23 tok/s for Qwen3.5 4B 4-bit, using 2.456 GB RAM.

Qwen3.5:9b reasoning head-to-head:
Mac Studio M2 Ultra 64GB: 43.08 tok/s
Mac Mini M4 16GB: 13.07 tok/s
@Alibaba_Qwen

@meta_alex Yeah my DGX Spark is arriving next week, will add that in future tests!

@stevibe Now try it on DGX Spark or any other unified-memory-based setup. It's not just NVIDIA vs. others; these are pure native-VRAM cards

@stevibe seems like the 2.8GB size hints at this one: huggingface.co/nvidia/NVIDIA-…

@ernestyalumni Grabbing the chunks from the API docs.ollama.com/capabilities/s…
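This isn't the author's exact script, but Ollama's /api/generate response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) in its final chunk, so the tok/s stat can be derived with a minimal sketch like this:

```python
def tokens_per_second(final_chunk: dict) -> float:
    """Compute generation throughput from the final chunk of an
    Ollama /api/generate response.

    eval_count is the number of tokens generated; eval_duration
    is the generation time in nanoseconds.
    """
    return final_chunk["eval_count"] / final_chunk["eval_duration"] * 1e9


# Example with made-up numbers: 500 tokens generated in 2.0 seconds.
chunk = {"eval_count": 500, "eval_duration": 2_000_000_000}
print(tokens_per_second(chunk))  # 250.0
```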

@stevibe ok, noted. btw, how do you do the benchmark to get the tok/second stat?

@stevibe Have you tried setting temperature to 0, top_p to 1 and identical seed?
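For anyone curious what those settings look like in practice, here's a minimal sketch of a deterministic Ollama request body (model name and prompt are placeholders, not from the original tests):

```python
import json

# Greedy, reproducible decoding for Ollama's /api/generate endpoint:
# temperature 0 and top_p 1 disable sampling randomness, and a fixed
# seed makes any residual sampling repeatable across runs.
payload = {
    "model": "nemotron-3-nano:4b",  # placeholder model name
    "prompt": "Turn this emoji into an SVG: 🙂",  # placeholder prompt
    "options": {"temperature": 0, "top_p": 1, "seed": 42},
    "stream": False,
}

print(json.dumps(payload["options"], sort_keys=True))
```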
