
relu



Why does Muon beat Adam for training quantized networks? It comes down to what each optimizer treats as "distance" in weight space. Adam treats a weight matrix as a flat vector of numbers. Muon treats it as a linear map and measures change by how much the input-output mapping moved. The gradient G has SVD G = U Σ V^T. Muon's update is just U V^T: keep the directions, throw away the magnitudes.
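A minimal NumPy sketch of the update described above, assuming an exact SVD on the raw gradient. The function name `orthogonalized_update` is hypothetical, and this is not how production Muon code is written: as far as I know, real implementations approximate U V^T with a Newton-Schulz iteration and apply it to a momentum buffer rather than computing a full SVD each step.

```python
import numpy as np


def orthogonalized_update(grad):
    """Return U @ V^T from the reduced SVD of grad (hypothetical helper).

    The singular values (the "magnitudes") are discarded; only the
    left and right singular directions are kept.
    """
    u, _s, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


# Toy gradient for a 4x3 weight matrix (random values, illustration only).
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

update = orthogonalized_update(G)

# Every singular value of the update is 1, so no direction dominates the step.
singular_values = np.linalg.svd(update, compute_uv=False)
```

Because every singular value of the result is 1, the step has the same size along each singular direction of the gradient, which is the "throw away the magnitudes" part.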


Next read is on how to generalize GRPO to domains without verifiers









Residual connections and pre-norm are not the whole story behind depth utilization. Our new paper shows that many seemingly different design choices (MoE, grouped-query attention, weight decay, and longer sequence length) can be understood through one unifying lens: sparsity. These components induce different forms of sparsity, which reduce output variance and in turn preserve healthier gradient flow across depth. The techniques also complement each other: when combined, they lead to substantial improvements in depth utilization and notable gains in downstream accuracy. Paper page: pumpkin-co.github.io/SparsityAndCoD/ Arxiv: arxiv.org/pdf/2603.15389 Led by @pumpkinnnnne

Was messing with the OpenAI base URL in Cursor and caught this: accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast. So Composer 2 is just Kimi K2.5 with RL. At least rename the model ID.

Costco CEO Ron Vachris did the “CEO eats his own product” challenge by destroying a hot dog (and confirms the Costco hot dog combo is staying at $1.50 forever). Legend.


🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵



Chess is 30 years ahead of every other profession in dealing with AI. The best case study we have for what's coming. 4 lessons: 1. Human-AI collaboration had a 15-year shelf life in chess. "Human in the loop" is a phase.












