Bleys Goodson

30 posts

@bleysg

Helping people engineer the future.

SF / LA · Joined March 2009

400 Following · 50 Followers

Bleys Goodson@bleysg·1d

Excellent details here. Congrats on the Kog release. Wonderful to see more across-the-stack optimized work on bringing latency in GPU inference towards the floor.

Kog@Kog__AI

We get those microseconds back by co-designing three layers that are normally tuned in isolation: the runtime, the low-level GPU code (including collective-communication), and the model architecture itself. The monokernel: our entire decode pass runs as one persistent, GPU-resident program. There is only one kernel launch for the whole sequence. This lets weight streaming run uninterrupted across kernel boundaries, and sampling stays on-GPU. We also rebuilt grid synchronization. Instead of a grid-wide barrier with HBM round-trips, each compute unit waits only on the values it actually depends on, with the readiness state being encoded directly in the data. On the AMD MI300X GPU that took the barrier from ~7 µs to under 1 µs. Before that, grid sync had been eating ~35% of token-generation time! 📖 Deep dive for the MI300X → blog.kog.ai/building-a-sin…

Bleys Goodson@bleysg·2d

@OfirPress Good hot take. The more interesting reveal is that your minimal harness shows which models are more robustly trained to work with just bash to achieve comparable or better results to more complex harnesses on novel, complex SWE tasks.

Ofir Press@OfirPress·3d

mini-swe-agent is our ~150-line-of-code agent, and it performs as well as Claude Code, Codex and Gemini CLI on this new coding benchmark.

Kilian Lieret@KLieret

DeepSWE finds that mini-swe-agent significantly outperforms ClaudeCode and Codex on the benchmark. The simpler the system, the better it generalizes (and mini's core agent class is just ~150 lines of code)

Bleys Goodson@bleysg·3d

Thanks much for the models and detailed writeup! I know DeepSWE is a new benchmark corpus, but just wanted to make sure it's on your team's radar as it points to a real practical gap in the M2 series. Hoping M3 can become a front-runner in this capability space. deepswe.datacurve.ai/blog

RyanLee@RyanLeeMiniMax·4d

Recently, we took time to consolidate all of the work behind M2 and published it here: our M2 paper on arXiv. It’s been just over six months since we first open-sourced M2 on December 23 last year. During that time, a number of our ideas and systems have been broadly adopted by the open-source community, including CISPO, Forge RL System, and Self-Evolution. Over the past six months, we’ve felt incredible enthusiasm from the open-source community. Nearly every model release reached the #1 spot on the Hugging Face leaderboard. Now it’s time for a new chapter. We’re getting ready for M3. MSA paper is on the road. arxiv.org/abs/2605.26494

Bleys Goodson@bleysg·3d

@badlogicgames Some form of sparse attention is essentially a requirement to economically serve 1M token contexts, so yes they most certainly are. The only real question is what flavor.

Mario Zechner@badlogicgames·4d

this is going to be super duper interesting! i wonder what sparse attention methods, if any, the closed big labs use. from the outside it looks like the open weights labs are innovating hard here. which is great for us plebs.

Skyler Miao@SkylerMiao7

Something BIG is coming

Bleys Goodson@bleysg·3d

@hungtran Are they planning to open the weights or is this operating on the assumption that it’s derivative of Qwen3.5-397B?

hung@hungtran·4d

new open-weight sota model on vals index

Vals AI@ValsAI

Qwen 3.7 Max is Alibaba's latest reasoning model ranking 5th on the Vals Index with a score of 57.3%. We ran it across our full benchmark suite. Full results below

Bleys Goodson@bleysg·4d

REAP on Kimi K2.6 is likely how Cerebras is managing to pilot the model today on WSE-3 without untenable costs. Not serving full-weights.

William Obino@ObinoWilliam

Looking forward to this deep dive by Cerebras! luma.com/reap?tk=SVrhb2

Bleys Goodson@bleysg·4d

Nice to see activation being addressed directly for FP8-stability. Though the framing in the paper that it is *the* fix is a bit strong, given DeepSeek V4 shows you can scale FP8+FP4 training to 1.6T SwiGLU-Clip-style clamping just with a routing trick. The paper sidesteps that elephant-in-the-room and pretends the DS4 approach doesn't exist. The real question to answer on a like-for-like basis from here is whether DeepSeek-style SwiGLU + QAT and routing tricks for spike control are superior to PowLU. Or perhaps, whether PowLU is a superior swap in place of clamped SwiGLU atop DeepSeek's activation strategy. That seems like the right approach to me. We still want QAT and anticipatory routing. Now we need to uncover whether PowLU has healthy gradient interactions with QAT and whether the compute cost of PowLU in low precision provides enough marginal advantages vs the faster alternative.

Ant Ling@AntLingAGI

SwiGLU is everywhere in modern LLMs — but for large inputs it behaves like x². That quadratic blow-up inflates activations, amplifies outliers, and makes deep network or low-precision (FP8/FP4) training prone to loss spikes. We propose PowLU, a drop-in activation built for stable large-scale pre-training. 🧵
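
To make the quadratic claim in the quoted post concrete, here is a minimal scalar sketch, assuming the common SwiGLU form SiLU(w·x)·(v·x). The weights `w` and `v` and the function names are illustrative and not taken from the PowLU paper. For large positive inputs SiLU(a) approaches a, so the product grows like x², and doubling x roughly quadruples the output.

```python
import math


def silu(a: float) -> float:
    """SiLU / Swish: a * sigmoid(a), written to avoid overflow for negative a."""
    if a >= 0:
        return a / (1.0 + math.exp(-a))
    e = math.exp(a)
    return a * e / (1.0 + e)


def swiglu_scalar(x: float, w: float = 1.0, v: float = 1.0) -> float:
    """Scalar stand-in for SwiGLU: gate branch silu(w*x) times linear branch v*x.

    w and v are hypothetical scalar weights used only for illustration.
    """
    return silu(w * x) * (v * x)


def growth_ratio(x: float) -> float:
    """How much the output grows when the input doubles (4 means quadratic)."""
    return swiglu_scalar(2.0 * x) / swiglu_scalar(x)


if __name__ == "__main__":
    for x in (5.0, 20.0, 50.0):
        print(f"x={x:5.1f}  swiglu={swiglu_scalar(x):10.3f}  ratio={growth_ratio(x):.4f}")
```

The ratio settling at 4 for large inputs is the x² behaviour the post describes; a linear activation would give a ratio of 2.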

Bleys Goodson@bleysg·4d

@antirez x.com/SkylerMiao7/st…

Skyler Miao@SkylerMiao7

Something BIG is coming

antirez@antirez·4d

I implemented MiMo 2.5 (very fast inference, too) in DwarfStar, including tool calling in ds4-agent. It is a very nice model, but I tried many hand-written tests with GPT 5.5 as a judge between it and DeepSeek v4 Flash. I used Frank's GGUF. MiMo lost every test. Either I have an inference bug that seems really non-obvious as the model behaves normally, or Frank uses it for very different things. I want a strong candidate for DwarfStar to add it as alternative model, so that if you have two 128GB systems you can run a multi-agent protocol of some kind. So far MiMo V2.5 and Minimax V2.7 seem weaker than DS4F *regardless of the benchmarks*.

Frank@jedisct1

I’ve just released MiMo V2.5-Coder. If you have 128 GB of RAM, this is one of the best models you can run locally. It’s fast, and in all my experiments it outperformed Qwen 3.6 and DeepSeek 4-Flash. huggingface.co/jedisct1/MiMo-…

Bleys Goodson@bleysg·4d

@antirez MiniMax has started teasing V3 features recently, so maybe a release on the horizon would be a good fit.

Bleys Goodson@bleysg·4d

@OGALANGLEY @yacineMTB Check out github.com/Entrpi/eemicro… you don't even need a GPU.

$1,776@OGALANGLEY·4d

@yacineMTB What kind of tiny models are you training that take less than a minute? I have a 6000 pro at home and never reached that on a full run

kache@yacineMTB·4d

if you're doing AI research at all, I recommend doing the "ETH Zurich" route: train models that use a single GPU. Make sure that it takes less than a minute to train models. Pufferlib is a great example. The more models you train the more you learn

Super PINTO@PINTO03091

BiternionNet finished training in just one minute.

Bleys Goodson@bleysg·4d

@METR_Evals @cerebras @NVIDIAAI @huggingface @Teknium Kimi K2.6 is notable here because it is being served by Cerebras in an enterprise pilot today, but it doesn't really look like an economical choice relative to the typical serving realities today, hence why that capability is in the next-gen projection zone instead.

Bleys Goodson@bleysg·4d

I have been investigating LLM serving economics and here are my METR-anchored projections of model capabilities as Vera Rubin (+ comparable TPUs) and next-gen Cerebras are deployed over the next couple of years. Relevant background: 10 years managing large, global R&D datacenter efficiency and the last 3 years working every angle of LLM engineering. (v2 post)

Bleys Goodson@bleysg·4d

The chart also hints at a new direction for pricing as Blackwell and Vera Rubin class hardware grows dominant for inference: Providers can and will offer the same models at a range of speeds, passing on the inference and opportunity cost multiple. We're just starting to see this with the fast tier rollouts, but as capacity allows new tiers will become available. Capacity is what's holding back these rollouts, as you can generally serve a lot more people and total tokens with the same hardware at lower per user tok/s. There needs to be excess capacity for it to be practical to offer, even with large premiums.

Bleys Goodson@bleysg·4d

These are conservative projections based on the 7-month METR-doubling cadence, anchored in actual hardware serving realities.
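
The doubling cadence mentioned above can be written as a simple exponential. This is a minimal sketch, assuming a task-horizon metric that doubles every 7 months; the baseline value and the function name are hypothetical and are not taken from the chart.

```python
def projected_horizon(baseline_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a task horizon that doubles every `doubling_months` months.

    baseline_hours is a hypothetical starting point, not a measured value.
    """
    return baseline_hours * 2.0 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Illustrative baseline of 1 hour; only the growth factor matters here.
    for months in (0, 7, 14, 21, 28):
        print(months, projected_horizon(1.0, months))
```

Under this assumption, two years (24 months) corresponds to a growth factor of 2^(24/7), a bit under 11x, whatever the baseline is.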

Bleys Goodson@bleysg·4d

@cerebras @ArtificialAnlys @Kimi_Moonshot I've updated my 2028 hardware realities projections chart!

Cerebras@cerebras·19 May

Cerebras is now running Kimi K2.6 – a trillion parameter model – in enterprise trials. At ~1,000 tokens/s, this is the fastest frontier model performance ever measured by Artificial Analysis @ArtificialAnlys.

Bleys Goodson@bleysg·22 Apr

@mckaywrigley This is great! I'm exploring some new UI concepts for this. What do you think?

Mckay Wrigley@mckaywrigley·27 Mar

Prompts just got more powerful. Chatbot UI now has prompt templates complete with support for prompt variables. Come save all of your custom prompts for easy reuse. GitHub: github.com/mckaywrigley/c…

Bleys Goodson@bleysg·9 Feb

Verifying myself: I am bleys on Keybase.io. vCGBk_qW27fqmIs_T4qOOVvOAJ9OXSBmIO2D / keybase.io/bleys/sigs/vCG…

Bleys Goodson@bleysg·16 Jan

hop in the ML pool, the water's warm: http://cli.gs/68tpH3

Bleys Goodson@bleysg·12 Dec

@meawoppl: Saw you liked Sage, wonder if you follow Lambda the Ultimate - http://cli.gs/NgdQYE
