Bleys Goodson

30 posts

@bleysg

Helping people engineer the future.

SF / LA · Joined March 2009
400 Following · 50 Followers
Bleys Goodson@bleysg·
@OfirPress Good hot take. The more interesting reveal is that your minimal harness shows which models are more robustly trained to work with just bash, achieving results comparable to or better than more complex harnesses on novel, complex SWE tasks.
Bleys Goodson@bleysg·
Thanks much for the models and detailed writeup! I know DeepSWE is a new benchmark corpus, but just wanted to make sure it's on your team's radar as it points to a real practical gap in the M2 series. Hoping M3 can become a front-runner in this capability space. deepswe.datacurve.ai/blog
RyanLee@RyanLeeMiniMax·
Recently, we took time to consolidate all of the work behind M2 and published it here: our M2 paper on arXiv. It’s been just over six months since we first open-sourced M2 on December 23 last year. During that time, a number of our ideas and systems have been broadly adopted by the open-source community, including CISPO, Forge RL System, Self-Evolution. Over the past six months, we’ve felt incredible enthusiasm from the open-source community. Nearly every model release reached the #1 spot on the Hugging Face leaderboard. Now it’s time for a new chapter. We’re getting ready for M3. MSA paper is on the road. arxiv.org/abs/2605.26494
RyanLee tweet media
Bleys Goodson@bleysg·
@badlogicgames Some form of sparse attention is essentially a requirement to economically serve 1M token contexts, so yes they most certainly are. The only real question is what flavor.
Mario Zechner@badlogicgames·
this is going to be super duper interesting! i wonder what sparse attention methods, if any, the closed big labs use. from the outside it looks like the open weights labs are innovating hard here. which is great for us plebs.
Skyler Miao@SkylerMiao7

Something BIG is coming

Bleys Goodson@bleysg·
@hungtran Are they planning to open the weights or is this operating on the assumption that it’s derivative of Qwen3.5-397B?
Bleys Goodson@bleysg·
Nice to see activation being addressed directly for FP8 stability. Though the framing in the paper that it is *the* fix is a bit strong, given DeepSeek V4 shows you can scale FP8+FP4 training to 1.6T with SwiGLU-Clip-style clamping and just a routing trick. The paper sidesteps that elephant in the room and pretends the DS4 approach doesn't exist. The real question to answer on a like-for-like basis from here is whether DeepSeek-style SwiGLU + QAT and routing tricks for spike control are superior to PowLU. Or perhaps, whether PowLU is a superior swap in place of clamped SwiGLU atop DeepSeek's activation strategy. That seems like the right approach to me. We still want QAT and anticipatory routing. Now we need to uncover whether PowLU has healthy gradient interactions with QAT and whether the compute cost of PowLU in low precision provides enough marginal advantage vs the faster alternative.
Ant Ling@AntLingAGI

SwiGLU is everywhere in modern LLMs — but for large inputs it behaves like x². That quadratic blow-up inflates activations, amplifies outliers, and makes deep network or low-precision (FP8/FP4) training prone to loss spikes. We propose PowLU, a drop-in activation built for stable large-scale pre-training. 🧵

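A minimal numerical sketch of the quoted "behaves like x²" claim, under a loud simplifying assumption: the gate and value branches both receive the same scalar x, so the unit reduces to SiLU(x) * x. Real SwiGLU layers use separate learned projections, and the function names below are hypothetical. PowLU itself is not sketched here, since its definition does not appear in the thread.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def scalar_swiglu(x: float) -> float:
    """Toy SwiGLU where gate == value == x (illustrative assumption only)."""
    return silu(x) * x


def quadratic_ratio(x: float) -> float:
    """Toy SwiGLU output divided by x**2; tends toward 1 as x grows large and positive."""
    return scalar_swiglu(x) / (x * x)


if __name__ == "__main__":
    # For growing positive x, sigmoid(x) saturates at 1, so the output tracks x**2.
    for x in (1.0, 5.0, 20.0):
        print(f"x={x:5.1f}  output={scalar_swiglu(x):10.4f}  ratio={quadratic_ratio(x):.6f}")
```

In this toy setting the ratio climbs toward 1 as x increases, which is the quadratic growth the quoted post describes; clamping approaches cap that growth rather than change its shape.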
antirez@antirez·
I implemented MiMo 2.5 (very fast inference, too) in DwarfStar, including tool calling in ds4-agent. It is a very nice model, but I tried many hand-written tests with GPT 5.5 as a judge between it and DeepSeek v4 Flash. I used Frank's GGUF. MiMo lost every test. Either I have an inference bug that seems really non-obvious, as the model behaves normally, or Frank uses it for very different things. I want a strong candidate for DwarfStar to add as an alternative model, so that if you have two 128GB systems you can run a multi-agent protocol of some kind. So far MiMo V2.5 and Minimax V2.7 seem weaker than DS4F *regardless of the benchmarks*.
Frank@jedisct1

I’ve just released MiMo V2.5-Coder. If you have 128 GB of RAM, this is one of the best models you can run locally. It’s fast, and in all my experiments it outperformed Qwen 3.6 and DeepSeek 4-Flash. huggingface.co/jedisct1/MiMo-…

Bleys Goodson@bleysg·
@antirez MiniMax has started teasing V3 features recently, so maybe a release on the horizon would be a good fit.
$1,776@OGALANGLEY·
@yacineMTB What kind of tiny models are you training that take less than a minute? I have a 6000 pro at home and never reached that on a full run
Bleys Goodson@bleysg·
@METR_Evals @cerebras @NVIDIAAI @huggingface @Teknium Kimi K2.6 is notable here because it is being served by Cerebras in an enterprise pilot today, but it doesn't really look like an economical choice relative to the typical serving realities today, hence why that capability is in the next-gen projection zone instead.
Bleys Goodson@bleysg·
I have been investigating LLM serving economics and here are my METR-anchored projections of model capabilities as Vera Rubin (+ comparable TPUs) and next-gen Cerebras are deployed over the next couple of years. Relevant background: 10 years managing large, global R&D datacenter efficiency and the last 3 years working every angle of LLM engineering. (v2 post)
Bleys Goodson tweet media
Bleys Goodson@bleysg·
The chart also hints at a new direction for pricing as Blackwell and Vera Rubin class hardware grows dominant for inference: providers can and will offer the same models at a range of speeds, passing on the inference and opportunity cost multiple. We're just starting to see this with the fast tier rollouts, but as capacity allows, new tiers will become available. Capacity is what's holding back these rollouts, as you can generally serve a lot more people and total tokens with the same hardware at lower per-user tok/s. There needs to be excess capacity for it to be practical to offer, even with large premiums.
Bleys Goodson@bleysg·
These are conservative projections based on the 7-month METR-doubling cadence, anchored in actual hardware serving realities.
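As a rough illustration of the arithmetic behind a 7-month doubling cadence: a quantity that doubles every 7 months grows by a factor of 2^(m/7) after m months. Only the 7-month figure comes from the post; the function name and the 24-month example are assumptions for illustration, and this is not the projection model behind the chart.

```python
def growth_multiplier(months_elapsed: float, doubling_months: float = 7.0) -> float:
    """Growth factor after `months_elapsed`, assuming one doubling every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Hypothetical horizon of two years, chosen only as an example.
    print(f"{growth_multiplier(24.0):.2f}x after 24 months")
```

Under that assumption, two years corresponds to a bit more than three doublings.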
Cerebras@cerebras·
Cerebras is now running Kimi K2.6 – a trillion parameter model – in enterprise trials. At ~1,000 tokens/s, this is the fastest frontier model performance ever measured by Artificial Analysis @ArtificialAnlys.
Cerebras tweet media
Bleys Goodson@bleysg·
@mckaywrigley This is great! I'm exploring some new UI concepts for this. What do you think?
Bleys Goodson tweet media
Mckay Wrigley@mckaywrigley·
Prompts just got more powerful. Chatbot UI now has prompt templates complete with support for prompt variables. Come save all of your custom prompts for easy reuse. GitHub: github.com/mckaywrigley/c…
Bleys Goodson@bleysg·
hop in the ML pool, the water's warm: http://cli.gs/68tpH3
Bleys Goodson@bleysg·
@meawoppl: Saw you liked Sage, wonder if you follow Lambda the Ultimate - http://cli.gs/NgdQYE