Pinned Tweet
stevibe
2.5K posts

stevibe
@stevibe
Fullstack | LLM | Local AI addict | Learning ML | Builds things nobody asked for. Benchmarks things for fun.
Joined July 2009
1.5K Following · 12.4K Followers

@digitalix Always set up as new; it's the only chance you get with a new Mac

I gave 6 frontier coding models the same task: turn this emoji into an SVG, from scratch, in real time.
Watching them stream their thinking before a single shape appears is wild — some plan meticulously, others just wing it.
Models:
- GPT-5.3 Codex
- Claude Opus 4.6
- Gemini 3.1 Pro
- MiniMax M2.7
- GLM-5
- Kimi K2.5

NVIDIA just dropped Nemotron-3-Nano:4b — a tiny 2.8GB model. Guess whose hardware runs it the fastest?
- RTX 4090: 226 tok/s
- RTX 3090: 187 tok/s
- Mac Studio M2 Ultra: 86 tok/s
- Mac Mini M4: 25 tok/s
Home court advantage is real.
Also trying a new layout with live performance charts. Lmk what you think!

@stevibe The RTX 5070 Ti did around 214 tok/s.
I’m really liking the capabilities of this nano.

@zhaoxiongding Definitely. We are just comparing one of the factors here

@Nice1774036 Hey, if you want to run local models, the easiest options are Ollama (the one this test uses) or LM Studio; for advanced usage, llama.cpp and vLLM are good choices.

You don't need a cloud API for great OCR anymore.
GLM-OCR runs locally with just ~2GB VRAM, handles tables, math equations, and hits ~260 tok/s on a Mac Studio M2 Ultra.
Local models are getting better AND smaller at a crazy pace. If you have a GPU or a Mac, you're already ready for the AI era.
@Zai_org

@changtimwu The real optimised version should be the NVFP4, but I am using a normal Q4 version here

@abbly298 @Alibaba_Qwen Yeah, the MLX version usually doubles the GGUF version

@stevibe @Alibaba_Qwen I tested MLX Qwen3.5 9B on the 2020 MacBook Pro 13-inch with the M1 chip and 16GB of RAM. It reached a throughput of 13 tok/s using 5.116 GB RAM, which is double what Ollama can do! I got 23 tok/s for Qwen3.5 4B 4-bit, using 2.456 GB RAM.

Qwen3.5:9b reasoning head-to-head:
Mac Studio M2 Ultra 64GB: 43.08 tok/s
Mac Mini M4 16GB: 13.07 tok/s
@Alibaba_Qwen

@meta_alex Yeah my DGX Spark is arriving next week, will add that in future tests!

@stevibe Now try it on DGX Spark or any other unified-memory-based setup. It's not just NVIDIA vs. others; these are pure native-VRAM cards

@stevibe seems like the 2.8GB size hints at this one: huggingface.co/nvidia/NVIDIA-…

@ernestyalumni Grabbing the chunks from the API docs.ollama.com/capabilities/s…
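This isn't the author's exact script, but Ollama's /api/generate response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) in its final chunk, so the tok/s stat can be derived with a minimal sketch like this:

```python
def tokens_per_second(final_chunk: dict) -> float:
    """Compute generation throughput from the final chunk of an
    Ollama /api/generate response.

    eval_count is the number of tokens generated; eval_duration
    is the generation time in nanoseconds.
    """
    return final_chunk["eval_count"] / final_chunk["eval_duration"] * 1e9


# Example with made-up numbers: 500 tokens generated in 2.0 seconds.
chunk = {"eval_count": 500, "eval_duration": 2_000_000_000}
print(tokens_per_second(chunk))  # 250.0
```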

@stevibe ok, noted. btw, how do you do the benchmark to get the tok/second stat?

@stevibe Have you tried setting temperature to 0, top_p to 1 and identical seed?
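For anyone curious what those settings look like in practice, here's a minimal sketch of a deterministic Ollama request body (model name and prompt are placeholders, not from the original tests):

```python
import json

# Greedy, reproducible decoding for Ollama's /api/generate endpoint:
# temperature 0 and top_p 1 disable sampling randomness, and a fixed
# seed makes any residual sampling repeatable across runs.
payload = {
    "model": "nemotron-3-nano:4b",  # placeholder model name
    "prompt": "Turn this emoji into an SVG: 🙂",  # placeholder prompt
    "options": {"temperature": 0, "top_p": 1, "seed": 42},
    "stream": False,
}

print(json.dumps(payload["options"], sort_keys=True))
```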
