Sid Dev

69 posts

@sid_dev1

Chewing on life.

Joined May 2026

48 Following · 3 Followers

Sid Dev@sid_dev1·43m

@DevaBuilds @Sapient_Int Naive goal, but I was trying to see if I could get the base model + HRM architecture to push into Bonsai / LFM / Gemma territory in terms of performance (based on the simple benchmarks I did on them). I wasn't eval'ing on latency, purely curious about accuracy / answer quality.

Deva@DevaBuilds·3h

@sid_dev1 @Sapient_Int Fine tuning gets so much less airtime than RAG. What task were you targeting, and did accuracy or latency move more?

Sid Dev@sid_dev1·11h

I messed around with fine-tuning @Sapient_Int's HRM 1B model. I find HRM fascinating! Goal was to learn how fine-tuning works and see if I could modestly improve on the base model. Note the benchmarking is crude, as I'm basing it on my workflows and am severely hardware limited.

Sid Dev@sid_dev1·3h

@0xSero Think people are sleeping on @ZyphraAI and smaller Nvidia models (self speculation / diffusion etc) wrt DeepSeek-esque envelope pushing

0xSero@0xSero·3h

Read a Deepseek model paper and then compare it to an Anthropic paper. How TF is one worth 1T I kid you not the Mythos card had 150+ em-dashes

Sid Dev@sid_dev1·4h

Honestly I like this minimax m3 / opencode combo much better than antigravity / flash 3.5, probably not controversial

Sid Dev@sid_dev1·3d

Well this shows how bad Stepfun 2.7 and other stuff I've been using have been given how other people feel about this combo x.com/Mayhem4Markets…

Markets & Mayhem@Mayhem4Markets

The market's giving MiniMax-M3 a vote of no confidence, as shares tumble down a massive 15.71% after running up into the M3 release. Hearing from more and more people that this feels like a benchmaxxed model. Again, hope to be proven wrong. But this seems quite bad.

Sid Dev@sid_dev1·3d

Really impressed with the @opencode / @MiniMax_AI M3 combo. Used it for the first time and it pulled Nvidia Diffusion 3B 4 bit MLX out of a ditch by writing a custom MLX loader, ran it on my tiny M2 mini 8gb, and came up with some great insights.

Sid Dev@sid_dev1·5h

@PavloMolchanov My apologies! I meant does it *only* apply to smaller deployments w/ less concurrency (and more free cores). But I think you answered both!

Pavlo Molchanov@PavloMolchanov·5h

Diffusion models are a great fit for small concurrency. In normal AR, model weights are transferred from HBM to cache for every single token at decode, so this regime is memory bound. With diffusion, you transfer the weights once but generate multiple tokens, and this is a win. The amount of work to be done is larger, but there are free cores at small concurrency.
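A rough back-of-envelope sketch of the argument in the post above, assuming a simple roofline-style model: every forward pass streams all weights from memory once, and compute costs about 2 FLOPs per parameter per token. The function names and all hardware figures below are hypothetical placeholders, not measurements. `tokens_per_pass` stands for the net number of tokens committed per weight load; a real diffusion LLM runs several denoising passes per block, so its effective value is lower than the block size.

```python
def pass_time_s(params, bytes_per_param, tokens_per_pass, bandwidth_bytes_s, flops_s):
    """Time for one forward pass: the slower of memory traffic and compute."""
    memory_time = params * bytes_per_param / bandwidth_bytes_s  # stream all weights once
    compute_time = 2 * params * tokens_per_pass / flops_s       # ~2 FLOPs per weight per token
    return max(memory_time, compute_time)


def tokens_per_second(params, bytes_per_param, tokens_per_pass, bandwidth_bytes_s, flops_s):
    """Throughput when each pass yields `tokens_per_pass` tokens."""
    t = pass_time_s(params, bytes_per_param, tokens_per_pass, bandwidth_bytes_s, flops_s)
    return tokens_per_pass / t


# Hypothetical small-device numbers, chosen only for illustration.
PARAMS = 3e9            # 3B-parameter model
BYTES_PER_PARAM = 0.5   # 4-bit weights
BANDWIDTH = 100e9       # 100 GB/s memory bandwidth (assumed)
FLOPS = 3e12            # 3 TFLOP/s usable compute (assumed)

ar = tokens_per_second(PARAMS, BYTES_PER_PARAM, 1, BANDWIDTH, FLOPS)
block4 = tokens_per_second(PARAMS, BYTES_PER_PARAM, 4, BANDWIDTH, FLOPS)
block16 = tokens_per_second(PARAMS, BYTES_PER_PARAM, 16, BANDWIDTH, FLOPS)

print(f"autoregressive (1 token/pass): {ar:.1f} tok/s")
print(f"4 tokens/pass:                 {block4:.1f} tok/s")
print(f"16 tokens/pass:                {block16:.1f} tok/s")
```

Under these assumed numbers, one token per pass is limited by weight traffic, four tokens per pass reuse the same weight load, and at sixteen tokens per pass the estimate becomes compute bound, which is where idle cores at low concurrency matter.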

Pavlo Molchanov@PavloMolchanov·6h

Imagine what we will achieve with diffusion LLMs. Decode/generation is compute bound in this case, so nvfp4 will provide even higher gains than just q4, as we will not pay for quantization.

witcheer@witcheer

everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090:
> NVFP4
> plain Q4_K_M (same 4-bit budget)
> Q6_K
Same llama.cpp b9365 and same harness.
prefill (processing your prompt): NVFP4 wins big, and it's real. +32 to 42% over equal-bit Q4_K_M at every context from 512 to 16k, so that gain is pure FP4-tensor-core compute. vs Q6 it's +52 to 68%. concretely at pp512: 5415 tok/s vs 3826 (Q4) vs 3222 (Q6).
decode (generating tokens): here's the myth. vs an equal-size Q4 it moves only +9% (84 vs 77 tok/s). the headline "+36% vs Q6" decode number isn't the FP4 cores at all but it's just NVFP4 being smaller (14.6GB vs 21GB). decode is memory-bandwidth bound, so it tracks footprint, not how the weights are packed. prefill = compute, decode = size.
the 4-bit tax is almost nothing: 93.2 vs 94.0 q_avg across five tasks vs Q6. MMLU, ARC, HellaSwag, GSM8K all land within half a point; only code dips meaningfully (HumanEval 90.2 vs 92.7).
net, vs the Q6 a lot of people serve: ~+60% prefill, +36% decode, -30% VRAM (17.3 vs 23.5GB) for -0.8 quality. for an always-on local agent that's an easy yes - faster replies, more context headroom, and 6GB of VRAM handed back.
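The relative gains quoted in the post above can be recomputed from the raw figures it reports. This is a minimal sketch: `pct_gain` is a hypothetical helper, and the "footprint ceiling" line assumes decode throughput scales inversely with model file size, which is a simplification of the memory-bandwidth argument and not something the post measures directly.

```python
def pct_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Figures as reported in the post (tok/s and GB).
prefill_vs_q4 = pct_gain(5415, 3826)  # pp512, NVFP4 vs Q4_K_M
prefill_vs_q6 = pct_gain(5415, 3222)  # pp512, NVFP4 vs Q6_K
decode_vs_q4 = pct_gain(84, 77)       # decode, NVFP4 vs Q4_K_M

# Assumption: if decode were purely bandwidth bound, throughput would scale
# with the inverse of the model size, so the size ratio gives a rough ceiling.
footprint_ceiling_vs_q6 = pct_gain(21.0, 14.6)

print(f"prefill vs Q4: +{prefill_vs_q4:.1f}%")
print(f"prefill vs Q6: +{prefill_vs_q6:.1f}%")
print(f"decode vs Q4:  +{decode_vs_q4:.1f}%")
print(f"decode ceiling vs Q6 from size alone: +{footprint_ceiling_vs_q6:.1f}%")
```

Under that assumption, the size ratio alone allows for a decode gain somewhat larger than the reported +36% vs Q6, which is consistent with the post's reading that the decode number tracks footprint and not the FP4 tensor cores.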

Sid Dev@sid_dev1·5h

@itsPaulAi Had antigravity send it

Paul Couvert@itsPaulAi·5h

@sid_dev1 Very interesting. What are you using to run the LiteRT version? llama cpp can handle it?

Paul Couvert@itsPaulAi·7h

That's massive for local AI. Google has just released Gemma 4 QAT and it runs with 3x less memory! Remember GPT-4o? Gemma 4 E4B is better and can now run on your phone (!!) with just 2GB RAM. And Gemma 4 31B (~ Opus 4 level) can now run on your laptop.

Sid Dev@sid_dev1·5h

@itsPaulAi Here's what I got so far

Sid Dev@sid_dev1·5h

@itsPaulAi E4B QAT still crashed it without limiting GPU offload, etc. But the E4B lite version cruised at ~27tps!

Sid Dev@sid_dev1·5h

@PaulGugAI @Teknium @NeoAIForecast M2 mini 8gb checking in.

GooGZ AI@PaulGugAI·15h

Update on my side-quest to get Hermes Desktop & a local LLM running on a modest MacBook Air M2 16GB. Got Qwen 3.5 9B (Q4) running, but still only seeing ~12 tok/s with llama.cpp. I switched to rapid-mlx server and this seems to bring the speed closer to 15-18 t/s, however it's still too slow for Hermes Agent based usage. It's taking many minutes between responses. I thought a 9B model could work acceptably here. But this is nowhere near it 🫤. Perhaps only 4B class is possible? I'm a real newbie with local LLMs, so am open to ideas..

GooGZ AI@PaulGugAI

I've got Gemma 4 12B (Q4) running on my M2 MacBook Air. Hermes Agent Desktop is running, and... it's too slow! 😑 Fellow 16GB MacBook users, stick to 8B class in the meantime please. There is still hope however, will tweak further with quant and also been hearing good things about omlx (which caches context to SSD). I'm not giving up yet!

Sid Dev@sid_dev1·9h

@osanseviero This is awesome, thanks for thinking of the little guy / edge.

Omar Sanseviero@osanseviero·9h

Get started today! blog.google/innovation-and…

Omar Sanseviero@osanseviero·9h

Introducing Gemma 4 QAT 🤏
- Quantization-aware training to reduce models' precision while preserving quality
- Introducing a new mobile quantization format that reduces the memory footprint of E2B to 1GB
- Q4 for all your favorite libraries ✨

Sid Dev@sid_dev1·11h

Here's the AI-written summary of the fine-tuning journey (a bit too tired to do this by hand right now). reddeerinv.com/ai/hrm-fine-tu…

Sid Dev@sid_dev1·11h

Update, swapped to Deepseek v4 flash...it crushed it and then some.

Sid Dev@sid_dev1

@OnlyTerp Currently trying to have Nemo 3 Ultra (Hermes harness) simply update a blog post, write a new blog post, and upload to my website, and it is looping badly with failed tool calls. DSv4 flash / Mimo 2.7 etc handled this repeatable task no problem.

Sid Dev@sid_dev1·12h

@Teknium Loving the desktop app, kudos to the whole team.

Teknium 🪽@Teknium·13h

🫡🫡

Absolut3Chang3@TrialByFireLab

Damn @Teknium! Hermes Agent feels lightning fast! Massive improvements in speed. Super impressed, I'm used to giving my agent a command and waiting a minute or two but not anymore.

Sid Dev@sid_dev1·12h

Terp@OnlyTerp·21h

Tested Nemotron 3 Ultra all day. If this dashboard is correct, which it should be, it says I used like 250 million tokens today 😂 In some tasks it just feels SOTA & amazing, but in a lot of tasks, unfortunately, it drops the ball and goes down a path of wasting a ton of your time

Sid Dev@sid_dev1·12h

@rishiiyer01 How would it be different than stuff posted on Zyphra official blog?

rishi@rishiiyer01·1d

should i make a blog

Sid Dev@sid_dev1·1d

@LottoLabs We eat good on @NousResearch portal too 🍽️

Lotto@LottoLabs·1d

Gpu poor bros eating good on opencode Holy smokes

OpenCode@opencode

Nemotron 3 Ultra is now free on OpenCode text · 1M context · fully open source NVIDIA's latest open source model

Sid Dev@sid_dev1·1d

Learning a lot running SFT on this base 1b HRM model. All running on an 8gb Mac mini and Gemini flash 3.5. It isn’t easy but we’re in it to climb hills.

Sapient Intelligence@Sapient_Int

In this benchmark deep-dive, Sapient’s founders William and Guan are joined by research team members Changling and Yasin to unpack HRM-Text’s performance across MATH, DROP, ARC-Challenge, and MMLU. 📊 Beyond the scores, they discuss what each benchmark measures, how HRM-Text compares with larger models, and why efficiency matters. Watch the full discussion to learn more about HRM-Text and Sapient’s leaner path toward general intelligence.

Sid Dev@sid_dev1·1d

@Youssofal_ I found it to be unusable, thoroughly confused by the hype

Youssof Altoukhi@Youssofal_·1d

Oh man you can’t make this up, New 198B parameter model loses to 27B parameter model from 1 month ago. QWEN 3.6 27B is a monster.

Artificial Analysis@ArtificialAnlys

StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis Intelligence Index, and is served at over 400 output tokens/s.
Step 3.7 Flash (open weights, Apache 2.0) is a significant upgrade on Step 3.5 Flash and stands out for its speed and gains in agentic performance (particularly GDPval-AA). 400 output tokens/s is more than double other models of a similar size class. Contributing to this speed is that the model has only 11B active parameters and ships with trained Multi-Token Prediction heads (3) that predict several tokens in a single forward pass, letting it decode multiple tokens at once using speculative decoding.
Key results for Step 3.7 Flash with the high reasoning level:
➤ 4 point Intelligence Index improvement: Step 3.7 Flash scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash 2603 (38.5). It is equivalent to Qwen3.5 122B A10B (41.6) and trails MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (Max Effort, 46.5)
➤ Speed-intelligence frontier: Step 3.7 Flash achieves ~400 output tokens/s on StepFun's first-party API, placing the model on the Intelligence vs Output Speed Pareto frontier. StepFun has released the weights for this model and we expect several third-party providers to serve this model
➤ Agentic capability improvements: Step 3.7 Flash improves over Step 3.5 Flash 2603 across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and TerminalBench Hard (agentic coding and terminal use). It achieves a GDPval-AA Elo of 1298, up from 1070 for Step 3.5 Flash 2603, and its TerminalBench Hard score increases to 35.6% from 32.6%. AA-LCR (Long Context Reasoning) improves to 63.7% from 54.3%. Scores for other evals remain relatively flat
➤ Weaker on knowledge and hallucination than peers: While Step 3.7 Flash trails competitors overall on AA-Omniscience (-38), it improves from Step 3.5 Flash 2603 (-44). It has an AA-Omniscience accuracy of 25.4% and a hallucination rate of 84.4%
➤ Native multimodal support, new in this generation: Step 3.7 Flash introduces a 1.8B-parameter vision encoder for native image understanding, where Step 3.5 Flash was text-only. On MMMU-Pro (multimodal reasoning) it scores 75.3%, roughly matching Qwen3.5 122B A10B (75.0%). Among its same-size open weights peers, MiniMax-M2.7, DeepSeek V4 Flash, and gpt-oss-120b are text-only
Key model details:
➤ Context window: 256K tokens
➤ Parameters: 198B total, 11B active (MoE). At BF16 native precision, Step 3.7 Flash requires ~400GB to store the weights. StepFun has also released FP8 (~200GB) and NVFP4 (~100GB) versions for lower-memory deployment
➤ License: Apache 2.0
➤ Availability: Currently Step 3.7 Flash is available on @StepFun_ai's first-party API
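The weight-storage figures quoted above follow from parameter count times bits per parameter. A minimal sketch, assuming 16, 8, and 4 bits per weight for BF16, FP8, and NVFP4, and ignoring per-block scale factors, activations, and KV cache (so real files will be somewhat larger); `weight_gb` is a hypothetical helper name.

```python
def weight_gb(total_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 198e9  # 198B total parameters, as stated in the post

sizes = {
    name: weight_gb(TOTAL_PARAMS, bits)
    for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]
}

for name, gb in sizes.items():
    print(f"{name}: ~{gb:.0f} GB")
```

These land close to the post's rounded ~400GB, ~200GB, and ~100GB. Note that all 198B parameters must be stored even though only 11B are active per token in the MoE.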
