

Sid Dev
69 posts







The market's giving MiniMax-M3 a vote of no confidence, as shares tumble down a massive 15.71% after running up into the M3 release. Hearing from more and more people that this feels like a benchmaxxed model. Again, hope to be proven wrong. But this seems quite bad.




everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K_M (same 4-bit budget) >Q6_K - same llama.cpp b9365 and same harness. ~~~ prefill (processing your prompt): NVFP4 wins big, and it's real. +32 to 42% over equal-bit Q4_K_M at every context from 512 to 16k, so that gain is pure FP4-tensor-core compute. vs Q6 it's +52 to 68%. concretely at pp512: 5415 tok/s vs 3826 (Q4) vs 3222 (Q6). ~~~ decode (generating tokens): here's the myth. vs an equal-size Q4 it moves only +9% (84 vs 77 tok/s). the headline "+36% vs Q6" decode number isn't the FP4 cores at all but it's just NVFP4 being smaller (14.6GB vs 21GB). decode is memory-bandwidth bound, so it tracks footprint, not how the weights are packed. prefill = compute, decode = size. ~~~ the 4-bit tax is almost nothing: 93.2 vs 94.0 q_avg across five tasks vs Q6. MMLU, ARC, HellaSwag, GSM8K all land within half a point; only code dips meaningfully (HumanEval 90.2 vs 92.7). net, vs the Q6 a lot of people serve: ~+60% prefill +36% decode -30% VRAM (17.3 vs 23.5GB) for -0.8 quality. for an always-on local agent that's an easy yes - faster replies, more context headroom, and 6GB of VRAM handed back.




I've got Gemma 4 12B (Q4) running on my M2 MacBook Air. Hermes Agent Desktop is running, and... it's too slow! 😑 Fellow 16GB MacBook users, stick to 8B class in the meantime please. There is still hope however, will tweak further with quant and also been hearing good things about omlx (which caches context to SSD). I'm not giving up yet!



@OnlyTerp Currently trying to have Nemo 3 Ultra (Hermes harness) simply update a blog post, write a new blog post, and upload to my website and it is looping badly with a failed tool calls. DSv4 flash / Mimo 2.7 etc handled this repeatable task no problem.


Nemotron 3 Ultra is now free on OpenCode text · 1M context · fully open source NVIDIA's latest open source model

In this benchmark deep-dive, Sapient’s founders William and Guan are joined by research team members Changling and Yasin to unpack HRM-Text’s performance across MATH, DROP, ARC-Challenge, and MMLU. 📊 Beyond the scores, they discuss what each benchmark measures, how HRM-Text compares with larger models, and why efficiency matters. Watch the full discussion to learn more about HRM-Text and Sapient’s leaner path toward general intelligence.



StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis Intelligence Index and is served at over 400 output tokens/s Step 3.7 Flash (open weights, Apache 2.0) is a significant upgrade on Step 3.5 Flash and stands out for its speed and gains in agentic performance (particularly GDPval-AA). 400 output tokens/s is more than double other models of a similar size class. Contributing to this speed is that the model has only 11B active parameters and the model ships with trained Multi-Token Prediction heads (3) that predict several tokens in a single forward pass, letting it decode multiple tokens at once using speculative decoding. Key results for Step 3.7 Flash with the high reasoning level: ➤ 4 point Intelligence Index improvement: Step 3.7 Flash scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash 2603 (38.5). It is equivalent to Qwen3.5 122B A10B (41.6) and trails MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (Max Effort, 46.5) ➤ Speed-intelligence frontier: Step 3.7 Flash achieves ~400 output tokens/s on StepFun's first-party API, placing the model on the Intelligence vs Output Speed Pareto frontier. StepFun has released the weights for this model and we expect several third-party providers to serve this model ➤ Agentic capability improvements: Step 3.7 Flash improves over Step 3.5 Flash 2603 across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and TerminalBench Hard (agentic coding and terminal use). It achieves a GDPval-AA Elo of 1298, up from 1070 for Step 3.5 Flash 2603, and it's TerminalBench Hard score increases to 35.6% from 32.6%. AA-LCR (Long Context Reasoning) improves to 63.7% from 54.3%. Scores for other evals remain relatively flat ➤ Weaker on knowledge and hallucination than peers: While Step 3.7 Flash trails competitors overall on AA-Omniscience (-38), it improves from Step 3.5 Flash 2603 (-44). It has an AA-Omniscience accuracy of 25.4% and a hallucination rate of 84.4% ➤ Native multimodal support, new in this generation: Step 3.7 Flash introduces a 1.8B-parameter vision encoder for native image understanding, where Step 3.5 Flash was text-only. On MMMU-Pro (multimodal reasoning) it scores 75.3%, roughly matching Qwen3.5 122B A10B (75.0%). Among its same-size open weights peers, MiniMax-M2.7, DeepSeek V4 Flash, and gpt-oss-120b are text-only Key model details: ➤ Context window: 256K tokens ➤ Parameters: 198B total, 11B active (MoE). At BF16 native precision, Step 3.7 Flash requires ~400GB to store the weights. StepFun has also released FP8 (~200GB) and NVFP4 (~100GB) versions for lower-memory deployment ➤ License: Apache 2.0 ➤ Availability: Currently Step 3.7 Flash is available on @StepFun_ai 's first-party API