cheeker

128 posts

@realcheeker

ex fb ml ai fu ck u

Los Angeles, CA · Joined January 2023
150 Following · 61 Followers
cheeker@realcheeker·
@minviable_org you can try the dockerfile but i'm not sure if maxq will be the same. i don't really see why not, it's supposed to be the same card under the hood
Andrey Kolesnikov@minviable_org·
@realcheeker I fed the article to gpt 5.5 twice and it fails both times in the same spot. I wonder if I’m hitting hardware divergence of sorts: I have dual RTX6k on a Ryzen 9/B850 mobo, and mine are Max-Qs.
cheeker@realcheeker·
i got DeepSeek V4 Flash running on my 2x RTX 6000 Pro server. benchmarks put it close to Opus level, yet it's right next to me on the floor. 2100 tok/s at its very best case (TP=2, vLLM)
cheeker tweet media
cheeker@realcheeker·
@Hikari_07_jp oh sorry i am, i thought it was the liquid release today lol but still interested in this one too :p
Hikari∣LocalLLM⚡@Hikari_07_jp·
@realcheeker Are you mistaking this for something else? The original tweet says that Step 3.7 Flash is 198B-A11B. I'm planning to try this model with NVFP4 and I'm downloading it now.
Hikari∣LocalLLM⚡ tweet media
Hikari∣LocalLLM⚡@Hikari_07_jp·
I'll try this out with my setup! I'm so glad they released a model in this size.
StepFun@StepFun_ai

⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency.
#1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python.
Open weights under Apache 2.0.
Built for agentic, coding, search, and multimodal workflows, balancing speed, cost, and reliable execution.
- 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels.
- Understands UIs, charts, docs, images, then writes code or calls tools to act on what it sees.
- Web + visual search reaches further: more sources, deeper follow-up.
- Reliable tool use: less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels.
- Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP.
- Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395.
GitHub: github.com/stepfun-ai/Ste…
HuggingFace: huggingface.co/stepfun-ai/Ste…
GGUF: huggingface.co/stepfun-ai/Ste…
ModelScope: modelscope.cn/models/stepfun…
API: platform.stepfun.ai
Blog: static.stepfun.com/blog/step-3.7-…

cheeker@realcheeker·
@IanHailey @unwitty only the official, might try canadaquant at some point. lmk how yours goes if you go that route
Ian Hailey@IanHailey·
@realcheeker @unwitty Excellent, been trying to get a docker image built for this for a while. Which HF model did you try? Any with MTP (e.g. Canada-Quant or LordNeel)?
cheeker@realcheeker·
opus 4.8 is like insanely fast or what? first impressions seem crazy so far
cheeker@realcheeker·
@findwildruzz mixed precision, MoE experts are fp4, shared layers and kv cache are fp8
ruzz@findwildruzz·
@realcheeker What is the quantisation you are using? Results are insane since it is also my daily driver for everything (also code)
cheeker@realcheeker·
@unwitty gotchu b, i added the dockerfile at the top of the article too if you wanna skip all the bs
Unwitty@unwitty·
@realcheeker Nice work and thanks for writing this up. I’m gonna give this a go on my dual system too!
cheeker@realcheeker·
@Hikari_07_jp lol ya was about to do the same seeing @0xSero's SGLang stuff but want to stay vLLM native
Hikari∣LocalLLM⚡@Hikari_07_jp·
@realcheeker Thank you! I was just about to fork SGLang for the same purpose. This is a huge help. It's reassuring to know there are other FFs with the same setup.
cheeker@realcheeker·
@sakurayukiai mixed precision: MoE experts are fp4, shared layers and kv cache are fp8. the checkpoint came this way so i didn't have to do any of it, got lucky that it barely fit at like 94GB per card. also i understand the pain, i was doing work on a 4080 until recently :p
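A rough way to sanity-check a "barely fits" figure like this is to add up weight bytes by precision and split them across the tensor-parallel ranks. The sketch below is back-of-envelope arithmetic only. The function name and the parameter counts in the usage line are made-up placeholders, not DeepSeek V4 Flash's real numbers, and it ignores KV cache, activations, quantization scale factors, and runtime overhead.

```python
def weight_gb_per_gpu(expert_params, shared_params, tp_size=2,
                      expert_bits=4, shared_bits=8):
    """Estimate weight memory per GPU in GB (1 GB = 1e9 bytes).

    Assumes weights are split evenly across tp_size GPUs, which is a
    simplification of how tensor parallelism actually shards a model.
    """
    expert_bytes = expert_params * expert_bits / 8   # fp4 -> 0.5 byte per param
    shared_bytes = shared_params * shared_bits / 8   # fp8 -> 1 byte per param
    return (expert_bytes + shared_bytes) / tp_size / 1e9


# Hypothetical sizes, purely for illustration.
estimate = weight_gb_per_gpu(expert_params=300e9, shared_params=20e9)
print(f"~{estimate:.0f} GB of weights per card before KV cache")
```

Whatever is left between that estimate and the card's capacity is what the fp8 KV cache and everything else has to share.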
Sakura Yuki@sakurayukiai·
@realcheeker My 5070 Ti setup is crying just looking at this. 2100 tok/s is absurd, are you running straight FP8 or did you have to quantize to leave room for the KV cache?
cheeker@realcheeker·
@Hikari_07_jp a lot of pain lol. you need the jasl vLLM fork, the leavelet DeepGEMM fork, and then a lot of figuring out random bugs and environment variables. i'll comment with more info in a sec, but this thread was immensely helpful: github.com/deepseek-ai/De…
cheeker@realcheeker·
this model will clearly be the backbone of my overnight AI system and business overall. i see this as my plan mode model, and then i swap it out for coding workers like Qwen3.6-27B to execute while my mac orchestrator oversees the execution. all i know is the rig is nuts and i'm running this 24/7 now
cheeker tweet media
cheeker@realcheeker·
here's what i got on concurrency. two tests bounding real world conditions: one where each request shares context (best case) vs one where each request shares no context (worst case). not perfectly apples to apples, but it's just an attempt at finding upper / lower bounds
cheeker tweet media
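One way to set up two bounding workloads like these is sketched below. This is an assumption about how such a test could be built, not the method used for the numbers in the post. The helper names are hypothetical, and the best case only helps if the server reuses a cached prompt prefix across requests.

```python
import uuid


def build_prompts(n, shared_context, question="Summarize the context above."):
    """Return (best_case, worst_case) prompt lists, each of length n."""
    # Best case: every request starts with the same long context, so a server
    # with prefix caching can reuse the work done on that prefix.
    best = [f"{shared_context}\n\nRequest {i}: {question}" for i in range(n)]
    # Worst case: a random tag at the very start of each prompt means the
    # requests diverge immediately and no long prefix is shared.
    worst = [f"{uuid.uuid4().hex} {shared_context}\n\nRequest {i}: {question}"
             for i in range(n)]
    return best, worst


def aggregate_tok_per_s(completion_tokens, wall_seconds):
    """Total generated tokens across all concurrent requests per second."""
    return sum(completion_tokens) / wall_seconds


best, worst = build_prompts(4, "lorem ipsum " * 200)
print(len(best), len(worst))
```

Sending each list to the server at the same concurrency and comparing `aggregate_tok_per_s` for the two runs gives a rough upper and lower bound.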
cheeker@realcheeker·
@Hikari_07_jp nice! are you using both? this is the setup i ended up converging to as well, setting it up today
cheeker@realcheeker·
@Hikari_07_jp would highly recommend understanding the decision making nvidia does for their cutting edge tech. jensen at GTC literally gives out the future 1-2 years in advance, like their $20B acquisition of Groq and how they're integrating that with vera rubin, which is seeming like the future
cheeker@realcheeker·
@Hikari_07_jp ahh yeah definitely crazy engineering in them
Hikari∣LocalLLM⚡@Hikari_07_jp·
I bought a GPU to run local LLMs, but now I'm really drawn to GPUs themselves, lol. Is anyone else in the same situation?