Sid Dev

69 posts

@sid_dev1

Chewing on life.

Joined May 2026
48 Following · 3 Followers
Sid Dev
Sid Dev@sid_dev1·
@DevaBuilds @Sapient_Int Naive goal, but I was trying to see if I could get the base model + HRM architecture to push into Bonsai / LFM / Gemma territory in terms of performance (based on the simple benchmarks I did on them). I wasn't eval'ing on latency, purely curious about accuracy / answer quality.
English
0
0
1
11
Deva
Deva@DevaBuilds·
@sid_dev1 @Sapient_Int Fine tuning gets so much less airtime than RAG. What task were you targeting, and did accuracy or latency move more?
English
1
0
1
10
Sid Dev
Sid Dev@sid_dev1·
I messed around with fine tuning @Sapient_Int's HRM 1B model. I find HRM fascinating! Goal was to learn how fine tuning works and see if I could modestly improve on the base model. Note the benchmarking is crude, as I'm basing it on my own workflows and am severely hardware limited.
English
2
0
1
15
Sid Dev
Sid Dev@sid_dev1·
@0xSero Think people are sleeping on @ZyphraAI and smaller Nvidia models (self speculation / diffusion etc) wrt DeepSeek-esque envelope pushing
English
0
0
0
888
0xSero
0xSero@0xSero·
Read a DeepSeek model paper and then compare it to an Anthropic paper. How TF is one worth 1T? I kid you not, the Mythos card had 150+ em-dashes
English
13
6
234
14.4K
Sid Dev
Sid Dev@sid_dev1·
Honestly I like this minimax m3 / opencode combo much better than antigravity / flash 3.5, probably not controversial
English
0
0
0
2
Sid Dev
Sid Dev@sid_dev1·
Really impressed with the @opencode / @MiniMax_AI M3 combo. Used it for the first time and it pulled Nvidia Diffusion 3B 4 bit MLX out of a ditch by writing a custom MLX loader, ran it on my tiny M2 mini 8GB, and came up with some great insights.
English
1
0
0
16
Sid Dev
Sid Dev@sid_dev1·
@PavloMolchanov My apologies! I meant does it *only* apply to smaller deployments w/ less concurrency (and more free cores). But I think you answered both!
English
0
0
0
9
Pavlo Molchanov
Pavlo Molchanov@PavloMolchanov·
Diffusion models are a great fit for small concurrency. In normal AR, model weights are transferred from HBM to cache for every single token at decode; this regime is memory bound. With diffusion, you transfer weights once but generate multiple tokens, and this is a win. The amount of work to be done is larger, but there are free cores with small concurrency.
English
1
0
0
31
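A rough back-of-the-envelope sketch of the memory-bound argument in the post above, in Python. It assumes weight transfer from memory is the only cost and ignores compute, KV cache traffic, and the number of denoising steps a diffusion model actually needs. The function name, the `tokens_per_weight_pass` parameter, and all numbers (3B parameters at 4 bits, 100 GB/s bandwidth, 8 tokens per pass) are hypothetical and chosen only for illustration.

```python
def decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s,
                             tokens_per_weight_pass=1):
    """Upper bound on decode speed if streaming the weights is the only cost.

    tokens_per_weight_pass is a hypothetical knob: how many tokens the model
    commits, on average, for each full read of its weights from memory.
    """
    seconds_per_pass = weight_bytes / bandwidth_bytes_per_s
    return tokens_per_weight_pass / seconds_per_pass


# Illustrative assumptions, not measurements.
weight_bytes = 3e9 * 0.5   # 3B parameters at 4 bits (0.5 bytes) each
bandwidth = 100e9          # 100 GB/s of memory bandwidth

# Autoregressive decode: one full weight read per generated token.
ar_tps = decode_tokens_per_second(weight_bytes, bandwidth)

# Diffusion-style decode: one weight read amortized over several tokens.
diffusion_tps = decode_tokens_per_second(weight_bytes, bandwidth,
                                         tokens_per_weight_pass=8)

print(f"AR bound:        {ar_tps:.1f} tok/s")
print(f"Diffusion bound: {diffusion_tps:.1f} tok/s")
```

Under these assumptions the bound scales linearly with the tokens committed per weight pass. In practice the gain is smaller because the extra compute has to fit on otherwise idle cores, which is the small-concurrency condition the post describes.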
Paul Couvert
Paul Couvert@itsPaulAi·
@sid_dev1 Very interesting. What are you using to run the LiteRT version? llama cpp can handle it?
English
1
0
0
46
Paul Couvert
Paul Couvert@itsPaulAi·
That's massive for local AI. Google has just released Gemma 4 QAT and it runs with 3x less memory! Remember GPT-4o? Gemma 4 E4B is better and can now run on your phone (!!) with just 2GB RAM. And Gemma 4 31B (~ Opus 4 level) can now run on your laptop.
English
17
16
220
16.2K
Sid Dev
Sid Dev@sid_dev1·
@itsPaulAi E4B QAT still crashed it without limiting GPU offload, etc. But the E4B lite version cruised at ~27tps!
English
1
0
0
54
GooGZ AI
GooGZ AI@PaulGugAI·
Update on my side-quest to get Hermes Desktop & a local LLM running on a modest MacBook Air M2 16GB. Got Qwen 3.5 9B (Q4) running, but still only seeing ~12 tok/s with llama.cpp. I switched to rapid-mlx server and this seems to bring the speed closer to 15-18 tok/s, however it's still too slow for Hermes Agent based usage. It's taking many minutes between responses. I thought a 9B model could work acceptably here. But this is nowhere near it 🫤. Perhaps only 4B class is possible? I'm a real newbie with local LLMs, so am open to ideas..
GooGZ AI@PaulGugAI

I've got Gemma 4 12B (Q4) running on my M2 MacBook Air. Hermes Agent Desktop is running, and... it's too slow! 😑 Fellow 16GB MacBook users, stick to 8B class in the meantime please. There is still hope however, will tweak further with quant and also been hearing good things about omlx (which caches context to SSD). I'm not giving up yet!

English
19
2
31
5.5K
Sid Dev
Sid Dev@sid_dev1·
@osanseviero This is awesome, thanks for thinking of the little guy / edge.
English
0
0
0
51
Omar Sanseviero
Omar Sanseviero@osanseviero·
Introducing Gemma 4 QAT 🤏
- Quantization aware training to reduce models' precision while preserving quality
- Introducing a new mobile quantization format that reduces memory footprint of E2B to 1GB
- Q4 for all your favorite libraries ✨
English
33
58
631
38.1K
Sid Dev
Sid Dev@sid_dev1·
Update, swapped to Deepseek v4 flash...it crushed it and then some.
Sid Dev@sid_dev1

@OnlyTerp Currently trying to have Nemo 3 Ultra (Hermes harness) simply update a blog post, write a new blog post, and upload to my website, and it is looping badly with failed tool calls. DSv4 flash / Mimo 2.7 etc handled this repeatable task no problem.

English
0
0
0
8
Sid Dev
Sid Dev@sid_dev1·
@Teknium Loving the desktop app, kudos to the whole team.
English
0
0
2
74
Sid Dev
Sid Dev@sid_dev1·
@OnlyTerp Currently trying to have Nemo 3 Ultra (Hermes harness) simply update a blog post, write a new blog post, and upload to my website, and it is looping badly with failed tool calls. DSv4 flash / Mimo 2.7 etc handled this repeatable task no problem.
English
0
0
0
30
Terp
Terp@OnlyTerp·
Tested Nemotron 3 Ultra all day. If this dashboard is correct, which it should be, it says I used like 250 million tokens today 😂 In some tasks it just feels SOTA & amazing, but in a lot of tasks, unfortunately, it drops the ball and goes down a path of wasting a ton of your time
English
4
0
26
1.4K
Sid Dev
Sid Dev@sid_dev1·
@rishiiyer01 How would it be different than stuff posted on Zyphra official blog?
English
0
0
0
11
rishi
rishi@rishiiyer01·
should i make a blog
English
1
0
1
126
Sid Dev
Sid Dev@sid_dev1·
@Youssofal_ I found it to be unusable, thoroughly confused by the hype
English
0
0
0
211
Youssof Altoukhi
Youssof Altoukhi@Youssofal_·
Oh man, you can’t make this up: new 198B parameter model loses to 27B parameter model from 1 month ago. QWEN 3.6 27B is a monster.
Artificial Analysis@ArtificialAnlys

StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis Intelligence Index and served at over 400 output tokens/s
Step 3.7 Flash (open weights, Apache 2.0) is a significant upgrade on Step 3.5 Flash and stands out for its speed and gains in agentic performance (particularly GDPval-AA). 400 output tokens/s is more than double other models of a similar size class. Contributing to this speed is that the model has only 11B active parameters and ships with trained Multi-Token Prediction heads (3) that predict several tokens in a single forward pass, letting it decode multiple tokens at once using speculative decoding.
Key results for Step 3.7 Flash with the high reasoning level:
➤ 4 point Intelligence Index improvement: Step 3.7 Flash scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash 2603 (38.5). It is equivalent to Qwen3.5 122B A10B (41.6) and trails MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (Max Effort, 46.5)
➤ Speed-intelligence frontier: Step 3.7 Flash achieves ~400 output tokens/s on StepFun's first-party API, placing the model on the Intelligence vs Output Speed Pareto frontier. StepFun has released the weights for this model and we expect several third-party providers to serve this model
➤ Agentic capability improvements: Step 3.7 Flash improves over Step 3.5 Flash 2603 across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and TerminalBench Hard (agentic coding and terminal use). It achieves a GDPval-AA Elo of 1298, up from 1070 for Step 3.5 Flash 2603, and its TerminalBench Hard score increases to 35.6% from 32.6%. AA-LCR (Long Context Reasoning) improves to 63.7% from 54.3%. Scores for other evals remain relatively flat
➤ Weaker on knowledge and hallucination than peers: While Step 3.7 Flash trails competitors overall on AA-Omniscience (-38), it improves from Step 3.5 Flash 2603 (-44). It has an AA-Omniscience accuracy of 25.4% and a hallucination rate of 84.4%
➤ Native multimodal support, new in this generation: Step 3.7 Flash introduces a 1.8B-parameter vision encoder for native image understanding, where Step 3.5 Flash was text-only. On MMMU-Pro (multimodal reasoning) it scores 75.3%, roughly matching Qwen3.5 122B A10B (75.0%). Among its same-size open weights peers, MiniMax-M2.7, DeepSeek V4 Flash, and gpt-oss-120b are text-only
Key model details:
➤ Context window: 256K tokens
➤ Parameters: 198B total, 11B active (MoE). At BF16 native precision, Step 3.7 Flash requires ~400GB to store the weights. StepFun has also released FP8 (~200GB) and NVFP4 (~100GB) versions for lower-memory deployment
➤ License: Apache 2.0
➤ Availability: Currently Step 3.7 Flash is available on @StepFun_ai's first-party API
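The weight-storage figures quoted above (~400GB, ~200GB, ~100GB) follow from parameter count times bytes per parameter. A minimal Python sketch of that arithmetic, assuming 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for NVFP4. It ignores per-block scale factors and any other format overhead, so real files would likely be somewhat larger. The `BYTES_PER_PARAM` table and `weight_gb` helper are hypothetical names for illustration, not part of any StepFun tooling.

```python
# Nominal storage cost per parameter for each format (overheads ignored).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(total_params, fmt):
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a format."""
    return total_params * BYTES_PER_PARAM[fmt] / 1e9


TOTAL_PARAMS = 198e9  # 198B total parameters, as quoted in the post

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(TOTAL_PARAMS, fmt):.0f} GB")
```

Note that all 198B parameters have to be stored even though only 11B are active per token; the active count affects speed, not the memory needed to hold the weights.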

English
18
10
282
28.9K