rico

2.7K posts

@b_rich_now

@StepFun_ai

Joined September 2020

3.4K Following · 223 Followers

Pinned Tweet

rico@b_rich_now·5 Feb

github.com/stepfun-ai/Ste…

494

rico retweeted

Yuxiang Huang@yxyxyyy6·8h

[1/n] Can a model learn *where* and *how much* information it should attend to, and do so efficiently? We introduce DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention! This pushes the accuracy-efficiency frontier in LLMs.

15.5K

rico retweeted

Cleo Abram@cleoabram·1d

The Demis Hassabis HUGE* Conversation (in full)
00:00 What is the hardest problem AI has already solved?
12:30 What is the cutting edge of drug discovery with AI?
21:53 Why did Demis say he “would have left AI in the lab longer”?
43:09 How should militaries use AI?
50:13 What can humans do that AI won't?
58:17 What does Demis Hassabis want his legacy to be?
(And 1:04:40 Can I beat Demis at Jenga?)
Recorded March 5, 2026 in London.

282

1.8K

451.1K

rico retweeted

Rosinality@rosinality·15h

arxiv.org/abs/2605.20285 Annotating the data with quality labels and prefix conditioning using these labels during pretraining (or any other stage). Old idea but score changes here are powerful.

3.5K
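
The prefix-conditioning idea in the post above can be illustrated with a minimal sketch. It assumes each document already carries a quality label; the `<|quality:...|>` tag format and the function name are hypothetical, not taken from the paper.

```python
def add_quality_prefix(text: str, label: str) -> str:
    """Prepend a quality tag so a model can condition on it."""
    # The "<|quality:...|>" tag format is hypothetical, chosen for illustration.
    return f"<|quality:{label}|> {text}"


# Hypothetical annotated corpus: (document, quality label) pairs.
corpus = [
    ("Proof that the sum of two even numbers is even.", "high"),
    ("CLICK HERE best deals!!!", "low"),
]

# Training texts keep every document but expose its label as a prefix.
training_texts = [add_quality_prefix(doc, label) for doc, label in corpus]

# One natural use at inference time: prompt with the high-quality tag.
prompt = add_quality_prefix("Explain gradient clipping.", "high")
```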

rico@b_rich_now·21h

@YifeiZuoX how about grad norm 😍

382

Yifei Zuo@YifeiZuoX·1d

Always a pleasure when the long, messy work of pretraining research resolves into curves this satisfying

151

73.3K

rico retweeted

Konstantin Mishchenko@konstmish·1d

That's a nice paper, very neat.

159

19.9K

rico retweeted

Maria Esteban@Maria__Esteban·1d

🏎️Drift in the right direction🏎️ Introducing kernel-gradient drifting models: a reformulation of drifting models where the kernel itself defines the direction of motion through its gradient. 📜Paper: arxiv.org/pdf/2605.10727 💾Notebook: tinyurl.com/mv2jhuky

176

18.4K

rico retweeted

Jiarui Liu@Jiarui_Liu_·2d

Excited to share our new paper 🧵 MIXSD: Mixed Contextual Self-Distillation for Knowledge Injection
Supervised fine-tuning is the common way to teach LLMs new knowledge, but it often catastrophically forgets existing capabilities. We introduce MixSD: a simple, external-teacher-free method to inject knowledge with far less forgetting. 📄arxiv.org/abs/2605.16865
Why does SFT forget? Targets written by humans or external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability tokens. That's what drags pretrained capabilities down.
MixSD: We hypothesize that keeping supervision close to the model's own distribution is key to avoiding forgetting. Instead of training on fixed, externally authored targets, at every token we mix between two conditionals of the base model itself: an expert conditional that sees the injected fact in context, and a naive conditional reflecting the model's prior. The result is supervision the model already finds high-probability, while still carrying the new factual signal. A Bernoulli rate λ controls the balance between memorization and retention.
Findings: SFT retains as little as 1% of held-out capability. MixSD retains far more, up to ~100% on larger models, with near-perfect training accuracy. It also beats on-policy self-distillation at a fraction of the compute, and holds across Qwen3 1.7B, 4B, 8B and Llama-3.2.

105

8.8K
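
The per-token mixing described in the MixSD post can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: the two conditionals are stand-in functions returning token probabilities, λ is assumed to be the probability of choosing the expert conditional, and the target token is assumed to be sampled from the chosen distribution. All names are hypothetical.

```python
import random


def mixsd_targets(expert_fn, naive_fn, lam, length, rng):
    """Build a target sequence by mixing two conditionals token by token."""
    prefix, picks = [], []
    for _ in range(length):
        # Bernoulli(lam) draw; assumption: lam is the chance of using the expert.
        use_expert = rng.random() < lam
        probs = expert_fn(prefix) if use_expert else naive_fn(prefix)
        tokens = list(probs)
        # Assumption: the supervision token is sampled from the chosen conditional.
        token = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        prefix.append(token)
        picks.append(use_expert)
    return prefix, picks


# Toy stand-ins for the base model with and without the fact in context.
def expert_fn(prefix):
    return {"Paris": 0.9, "Lyon": 0.1}


def naive_fn(prefix):
    return {"unknown": 1.0}


rng = random.Random(0)
all_expert, expert_picks = mixsd_targets(expert_fn, naive_fn, 1.0, 5, rng)
all_naive, naive_picks = mixsd_targets(expert_fn, naive_fn, 0.0, 5, rng)
```

With `lam=1.0` every token comes from the expert conditional and with `lam=0.0` every token comes from the naive one; intermediate values interleave the two.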

rico retweeted

Beidi Chen@BeidiChen·2d

In line with how @cursor_ai has done its RL stage: AstraFlow is a new RL engine that natively enables asynchronous, heterogeneous, and geo-distributed RL through a dataflow abstraction. Like @FireworksAI_HQ’s sparse RL transfer design, it syncs only ≤1.1% of model weights, making remote rollout lightweight and efficient. Check it out!

Infini-AI-Lab@InfiniAILab

We’re excited to release AstraFlow, an open-source, dataflow-oriented RL system for training multi-agentic and multi-policy LLMs. 🚀 Built for scalable, flexible, and efficient agent RL, AstraFlow natively enables:
⚡ 2.7× faster multi-policy agent collaborative RL training: achieves comparable or better accuracy than a verl-based baseline.
🌍 Zero-code system flexibility: supports elastic multi-policy training and cross-region rollout across heterogeneous GPUs.
📦 ≤1.1% sparse transfer for remote rollout: similar to @FireworksAI_HQ’s sparse RL transfer design, AstraFlow cuts sync from ~28 GB to ~1.5 GB, with deltas ≤1.1% of weights, making remote rollout lightweight and efficient: fireworks.ai/blog/frontier-…
🔁 Substitutable rollout and trainer services: provides modular rollout and training components for flexible deployment.
🧵(1/5)

207

31.4K
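
The sparse-transfer point above can be illustrated with a generic sketch: send only the weights that changed, as index-to-value pairs, and patch them into the remote copy. This is not AstraFlow's or Fireworks' actual code; the function names and the flat-list weight layout are assumptions for illustration.

```python
def sparse_delta(old, new, tol=0.0):
    """Return {index: new_value} for every weight that changed by more than tol."""
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if abs(n - o) > tol}


def apply_delta(weights, delta):
    """Patch a copy of the weights with the sparse update."""
    out = list(weights)
    for i, value in delta.items():
        out[i] = value
    return out


# Toy example: 1,000 weights, 11 of which change (1.1%).
old = [0.0] * 1000
new = list(old)
for i in range(0, 1000, 91):  # indices 0, 91, ..., 910 -> 11 entries
    new[i] = 1.0

delta = sparse_delta(old, new)
synced = apply_delta(old, delta)
fraction_sent = len(delta) / len(old)
```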

rico retweeted

Xavier Gonzalez@xavierjgonzalez·2d

Fixed-point iterations for parallelizing nonlinear dynamics are all the rage: - Newton for RNNs - Picard for diffusion models - Jacobi for parallel decode of LLMs But how do these techniques relate, and when should you use them? We show you how in our new paper 🧵

169

19.1K
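
As a rough illustration of the general idea in the post above (not the paper's method), a sequential recurrence x_t = f(x_{t-1}) can be solved by guessing the whole trajectory and repeatedly updating every step from its predecessor's current value. Each sweep is independent across t, so it could run in parallel; here it is a plain Python list comprehension. The function names and the toy map `f` are assumptions.

```python
def sequential(f, x0, T):
    """Reference: unroll the recurrence one step at a time."""
    xs, x = [], x0
    for _ in range(T):
        x = f(x)
        xs.append(x)
    return xs


def jacobi_parallel(f, x0, T):
    """Jacobi-style fixed-point iteration over the whole trajectory."""
    xs = [x0] * T  # initial guess for all T states
    for _ in range(T):  # after k sweeps the first k states are exact
        prev = [x0] + xs[:-1]
        new = [f(p) for p in prev]  # every entry is independent of the others
        if new == xs:  # fixed point reached early
            break
        xs = new
    return xs


def f(x):
    return 0.5 * x + 1.0


trajectory_seq = sequential(f, 0.0, 8)
trajectory_par = jacobi_parallel(f, 0.0, 8)
```

The iteration needs at most T sweeps to match the sequential result exactly; the practical question is whether it converges in far fewer.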

rico retweeted

Jiaxin Wen@jiaxinwen22·3d

New post: "Generalization Dynamics of LM Pre-training" Most people (including me) assume that LMs smoothly mature from pattern-matching to generalizing. This mental model is wrong. The true dynamics are stranger, and far more fascinating! We call it Mode-Hopping.

454

51.1K

rico@b_rich_now·3d

@MikaStars39 @MiniMax_AI congrats😍

113

MikaStars★@MikaStars39·3d

a late update: I've joined @MiniMax_AI post-training team. Working on M3!🐙

362

18.1K

rico retweeted

Zeyi(Andy) Liu@ZeyiAndyLiu·5d

New paper: Spectral Lens. Loss curves can hide how LLMs actually learn. We show that activation and gradient spectra reveal hidden representation geometry, predict token efficiency early, and distinguish learning gains from throughput gains. arxiv.org/abs/2605.05683

231

13.3K

rico retweeted

Tony S.F.@tonysilveti·5d

@norxornor @StefanGliga We "fixed" this by doing clipping in this paper: arxiv.org/abs/2506.01913. If you are near the minimizer (grad norm small in the relevant norm), you will switch to steepest descent.

314

rico retweeted

Nous Research@NousResearch·6d

Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context. It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss. During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We then score all pyramid heads, and a top-k cascade selects a small hierarchical dense sub-sequence. After a sorting pass that enforces causality, we use standard attention for token mixing. A brief full-attention resume at the end converts the checkpoint back into a competent dense-attention model. We validated this using 530M-parameter Llama-3 models across 50B tokens, with up to 1M-token benchmarks across 32 B200s under context parallelism. The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.

230

155.5K

rico retweeted

Nabil Iqbal@nblqbl·6d

For the physicists: it seems a "gapless" mode is good for learning. Tuning to a critical point is one way to make one, and is the usual "edge of chaos" in deep learning. But having a Goldstone mode also works. arxiv link: arxiv.org/pdf/2605.14685 Let me know your thoughts!

1.9K

rico retweeted

Aaron Spieler@AaronSpieler·6d

1) I'm excited to present "Scaling Laws and Tradeoffs in Recurrent Network of Expressive Neurons" 🎉, where we question the optimality of simple neurons for network performance, and capture scaling behaviours using a minimal information theoretic model: arXiv.org/abs/2605.12049

5.5K

rico retweeted

Tony S.F.@tonysilveti·14 May

@norxornor @ShikaiQiu Totally agree with this. We actually did exactly this experiment vs muP and found the same T^{-1/3} factor (if you also scale (batch size * seq len) appropriately). Just doing muP is not enough when scaling data.

479

rico retweeted

Shenyang Deng ✈️ ICML2026@DengShenyang24·14 May

1/n Please stop by👋. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NN optimization. Quick overview below 🧵

105

15.3K

rico retweeted

Paria Rashidinejad@paria_rd·15 May

Looped Transformers: the dream was right. But there was trouble in paradise. The loop made them unstable, expensive, and memory-hungry, with gains hard to scale. So we asked: Can we reap the rewards without paying the loop tax?
Introducing Attractor Models for Language and Reasoning:
• A Backbone proposes an initial “guess” output embedding;
• An Attractor refines it: a fixed-point solver lets the model “think” before each token.
Implicit differentiation trains the model stably, with constant memory and without BPTT.
Training also revealed a surprising phenomenon: Equilibrium Internalization. Over the course of training, the Backbone learns to propose latents close to the equilibrium itself, making the Attractor almost unnecessary at inference.
Results:
• Pareto improvement on language modeling: up to 46.6% lower perplexity and 19.7% better downstream accuracy. A 770M Attractor Model beats a 1.3B Transformer, despite being trained on half as many tokens.
• Significant gains on hard reasoning tasks: a 27M Attractor Model trained on only 1K examples achieves 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard, while Transformers and frontier models like Claude and GPT o3 score 0%.
📝 arxiv.org/pdf/2605.12466 🧵 1/10

591

63.6K
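
The backbone-then-attractor loop in the post above can be pictured with a toy scalar sketch: a fixed-point solver refines an initial guess until it stops moving, and a guess that already sits near the equilibrium needs fewer refinement steps. The contraction map `g`, the function `refine`, and the tolerance are illustrative assumptions, not the paper's architecture or API.

```python
def refine(g, z0, tol=1e-8, max_iters=1000):
    """Iterate z <- g(z) from the initial guess until the update is below tol."""
    z = z0
    for step in range(1, max_iters + 1):
        z_next = g(z)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iters


def g(z):
    # Toy contraction with equilibrium z* = 2 (stand-in for a learned attractor).
    return 0.5 * z + 1.0


# A poor initial guess versus one already close to the equilibrium.
z_far, steps_far = refine(g, 0.0)
z_near, steps_near = refine(g, 1.99)
```

Both runs land on the same equilibrium, but the near guess converges in fewer steps, which is the intuition behind the solver becoming almost unnecessary once the initial proposal is good.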

rico retweeted

Zihao Zhao@zhaokevin1012·13 May

Excited to share that our paper "A Fully First-Order Layer for Differentiable Optimization" was accepted to #ICML2026 as a spotlight! We propose FFOLayer, a differentiable optimization layer that makes the backward pass fully first-order. Try our code: github.com/GT-KOALA/FFOLa… (1/n)

168

14.7K
