cheeker

128 posts

@realcheeker

ex fb ml ai fu ck u

Los Angeles, CA · Joined January 2023

150 Following · 61 Followers

cheeker@realcheeker·2d

@minviable_org you can try the dockerfile, but i'm not sure if maxq will behave the same. i don't really see why not, it's supposed to be the same card under the hood

Andrey Kolesnikov@minviable_org·2d

@realcheeker I fed the article to gpt 5.5 twice and it fails both times in the same spot. I wonder if I’m hitting hardware divergence of sorts. I have dual RTX6k on a Ryzen 9/B850 mobo, and mine are Max-Qs.

cheeker@realcheeker·3d

i got DeepSeek V4 Flash running on my 2x RTX 6000 Pro server. benchmarks put it close to Opus level, yet it's right next to me on the floor. 2100 tok/s at its very best case (TP=2, vLLM)

cheeker@realcheeker·2d

@Hikari_07_jp oh sorry i am, i thought it was the liquid release today lol but still interested in this one too :p

Hikari∣LocalLLM⚡@Hikari_07_jp·2d

@realcheeker Are you mistaking this for something else? The original tweet says that Step 3.7 Flash is 198B-A11B. I'm planning to try this model with NVFP4 and I'm downloading it now.

Hikari∣LocalLLM⚡@Hikari_07_jp·2d

I'll try this out with my setup! I'm so glad they released a model in this size.

StepFun@StepFun_ai

⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency.
#1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python.
Open weights under Apache 2.0.
Built for agentic, coding, search, and multimodal workflows, balancing speed, cost, and reliable execution.
- 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels.
- Understands UIs, charts, docs, images, then writes code or calls tools to act on what it sees.
- Web + visual search reaches further: more sources, deeper follow-up.
- Reliable tool use: less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels.
- Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP.
- Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395.
GitHub: github.com/stepfun-ai/Ste…
HuggingFace: huggingface.co/stepfun-ai/Ste…
GGUF: huggingface.co/stepfun-ai/Ste…
ModelScope: modelscope.cn/models/stepfun…
API: platform.stepfun.ai
Blog: static.stepfun.com/blog/step-3.7-…

cheeker@realcheeker·2d

what are our thoughts on this? Qwen3.6-35B-A3B replacement?

Liquid AI@liquidai

Today, we're releasing LFM2.5-8B-A1B, a device-optimized model designed to power real-life applications on phones, laptops, PCs, robots, and fast & lightweight server-side use-cases.
> 8B MoE, 1.5B active
> Expanded 128K context
> LFM2.5 flagship hybrid MoE architecture
> Trained on 38T tokens + large-scale RL
> fast, reliable tool calling, punching above its weight, comparable to models with up to 4x its size
> customizable on a single GPU for any specialized task
> LFM2 open-weight license
🧵

cheeker@realcheeker·2d

@IanHailey @unwitty only the official one, might try Canada-Quant at some point. lmk how yours goes if you go that route

Ian Hailey@IanHailey·3d

@realcheeker @unwitty Excellent, I've been trying to get a docker image built for this for a while. Which HF model did you try? Any with MTP (e.g. Canada-Quant or LordNeel)?

cheeker@realcheeker·3d

x.com/i/article/2059…

cheeker@realcheeker·2d

opus 4.8 is like insanely fast or what? first impressions seem crazy so far

cheeker@realcheeker·3d

@findwildruzz mixed precision, MoE experts are fp4, shared layers and kv cache are fp8

ruzz@findwildruzz·3d

@realcheeker What is the quantisation you are using? Results are insane since it is also my daily driver for everything (also code)

cheeker@realcheeker·3d

@unwitty gotchu b, i added the dockerfile at the top of the article too if you wanna skip all the bs

Unwitty@unwitty·3d

@realcheeker Nice work and thanks for writing this up. I’m gonna give this a go on my dual system too!

cheeker@realcheeker·3d

@Hikari_07_jp lol ya was about to do the same seeing @0xSero's SGLang stuff but want to stay vLLM native

Hikari∣LocalLLM⚡@Hikari_07_jp·3d

@realcheeker Thank you! I was just about to fork SGLang for the same purpose. This is a huge help. It's reassuring to know there are other FFs with the same setup.

cheeker@realcheeker·3d

@sakurayukiai mixed precision: MoE experts are fp4, shared layers and kv cache are fp8. the checkpoint came this way so i didn't have to do any of it, got lucky that it barely fit at like 94GB per card. also i understand the pain, i was doing work on a 4080 until recently :p
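For readers who want to sanity-check a mixed-precision layout like the one described above, here is a minimal back-of-the-envelope sketch. It only uses the byte widths implied by the formats (fp4 is half a byte per parameter, fp8 is one byte). The parameter counts in the usage line are hypothetical placeholders, not the real figures for this checkpoint, and the estimate ignores quantization scale factors, the KV cache, and activations.

```python
# Bytes per parameter implied by each storage format.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0}


def weight_gb(expert_params: float, shared_params: float) -> float:
    """Weight-only footprint in GB: experts stored in fp4, shared layers in fp8."""
    total_bytes = (
        expert_params * BYTES_PER_PARAM["fp4"]
        + shared_params * BYTES_PER_PARAM["fp8"]
    )
    return total_bytes / 1e9


def per_gpu_gb(expert_params: float, shared_params: float, tensor_parallel: int = 2) -> float:
    """Rough per-GPU share, assuming weights split evenly across TP ranks."""
    return weight_gb(expert_params, shared_params) / tensor_parallel


# Hypothetical counts for illustration only (NOT this model's real sizes).
example_total = weight_gb(expert_params=100e9, shared_params=10e9)
example_per_gpu = per_gpu_gb(expert_params=100e9, shared_params=10e9)
```

Whatever VRAM is left on each card after the weights is what the fp8 KV cache and activations have to fit into, which is presumably why the fit was so tight.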

Sakura Yuki@sakurayukiai·3d

@realcheeker My 5070 Ti setup is crying just looking at this. 2100 tok/s is absurd, are you running straight FP8 or did you have to quantize to leave room for the KV cache?

cheeker@realcheeker·3d

@Hikari_07_jp here's an article from the info i pulled off my claude x.com/realcheeker/st…

cheeker@realcheeker

x.com/i/article/2059…

Hikari∣LocalLLM⚡@Hikari_07_jp·3d

@realcheeker I believe it's not compatible with the SM120. How did you deal with that?

cheeker@realcheeker·3d

@Hikari_07_jp a lot of pain lol. you need the jasl vLLM fork, the leavelet DeepGEMM fork, and then a lot of figuring out random bugs and the environment variables needed. i'll comment with more info in a sec, but this thread was immensely helpful: github.com/deepseek-ai/De…

cheeker@realcheeker·3d

this model will clearly be the backbone of my overnight AI system and business overall. i see this as my plan mode model, and then i swap it out for coding workers like Qwen3.6-27B to execute while my mac orchestrator oversees the execution. all i know is the rig is nuts and i'm running this 24/7 now

cheeker@realcheeker·3d

here's what i got on concurrency. two tests bounding real world conditions: one where each request shares context (best case) vs one where each request shares no context (worst case). not perfectly apples to apples, but it's just an attempt at finding upper / lower bounds
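The two workloads described above can be sketched roughly as follows. This is not the script used for the benchmark, just a minimal illustration of how such prompt sets might be generated. `build_prompts` and `common_prefix_len` are hypothetical helpers, and the "tokens" are whitespace-separated placeholder words rather than real tokenizer tokens. The best case is presumably faster because a serving engine can reuse cached work for an identical prefix.

```python
import random


def build_prompts(n_requests: int, context_tokens: int, shared: bool, seed: int = 0) -> list[str]:
    """Build n_requests prompts, each a long context followed by a short unique question.

    shared=True  -> every request reuses one context (best case for prefix reuse).
    shared=False -> every request gets its own random context (worst case).
    """
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(1000)]

    def make_context() -> str:
        return " ".join(rng.choice(vocab) for _ in range(context_tokens))

    shared_context = make_context()
    prompts = []
    for i in range(n_requests):
        context = shared_context if shared else make_context()
        prompts.append(f"{context}\n\nQuestion {i}: summarize the text above.")
    return prompts


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


best_case = build_prompts(8, 200, shared=True)
worst_case = build_prompts(8, 200, shared=False)
```

Sending each list concurrently to the same endpoint and comparing aggregate tok/s would give a rough upper and lower bound in the spirit of the two tests.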

cheeker@realcheeker·5d

@Hikari_07_jp nice! are you using both? this is the setup i ended up converging to as well, setting it up today

cheeker@realcheeker·6d

@Hikari_07_jp would highly recommend understanding the decision making nvidia does for their cutting edge tech. jensen at GTC literally gives out the future 1-2 years in advance. like their $20B acquisition of Groq, and how they're integrating that with vera rubin is seeming like the future

cheeker@realcheeker·6d

@Hikari_07_jp ahh yeah definitely crazy engineering in them

Hikari∣LocalLLM⚡@Hikari_07_jp·6d

I bought a GPU to run local LLMs, but now I'm really drawn to GPUs themselves, lol. Is anyone else in the same situation?
