rico

2.7K posts

@b_rich_now

@StepFun_ai

Joined September 2020
3.4K Following · 223 Followers

rico retweeted
Yuxiang Huang @yxyxyyy6
[1/n] Can a model learn *where* and *how much* information it should attend to, and do so efficiently? We introduce DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention! This pushes the accuracy-efficiency frontier in LLMs.
1 reply · 13 reposts · 69 likes · 15.5K views

rico retweeted
Cleo Abram @cleoabram
The Demis Hassabis HUGE* Conversation (in full)
00:00 What is the hardest problem AI has already solved?
12:30 What is the cutting edge of drug discovery with AI?
21:53 Why did Demis say he “would have left AI in the lab longer”?
43:09 How should militaries use AI?
50:13 What can humans do that AI won't?
58:17 What does Demis Hassabis want his legacy to be?
(And 1:04:40 Can I beat Demis at Jenga?)
Recorded March 5, 2026 in London.
30 replies · 282 reposts · 1.8K likes · 451.1K views

rico retweeted
Rosinality @rosinality
arxiv.org/abs/2605.20285 Annotating the data with quality labels and prefix conditioning using these labels during pretraining (or any other stage). Old idea but score changes here are powerful.
0 replies · 9 reposts · 72 likes · 3.5K views

rico @b_rich_now
@YifeiZuoX how about grad norm 😍
1 reply · 0 reposts · 1 like · 382 views

Yifei Zuo @YifeiZuoX
Always a pleasure when the long, messy work of pretraining research resolves into curves this satisfying
2 replies · 5 reposts · 151 likes · 73.3K views

rico retweeted
Konstantin Mishchenko @konstmish
That's a nice paper, very neat.
2 replies · 29 reposts · 159 likes · 19.9K views

rico retweeted
Maria Esteban @Maria__Esteban
🏎️Drift in the right direction🏎️ Introducing kernel-gradient drifting models: a reformulation of drifting models where the kernel itself defines the direction of motion through its gradient. 📜Paper: arxiv.org/pdf/2605.10727 💾Notebook: tinyurl.com/mv2jhuky
4 replies · 37 reposts · 176 likes · 18.4K views

rico retweeted
Jiarui Liu @Jiarui_Liu_
Excited to share our new paper 🧵 MIXSD: Mixed Contextual Self-Distillation for Knowledge Injection
Supervised fine-tuning is the common way to teach LLMs new knowledge, but it often causes catastrophic forgetting of existing capabilities. We introduce MixSD: a simple, external-teacher-free method to inject knowledge with far less forgetting. 📄 arxiv.org/abs/2605.16865
Why does SFT forget? Targets written by humans or external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability tokens. That's what drags pretrained capabilities down.
MixSD: We hypothesize that keeping supervision close to the model's own distribution is key to avoiding forgetting. Instead of training on fixed, externally authored targets, at every token we mix between two conditionals of the base model itself: an expert conditional that sees the injected fact in context, and a naive conditional reflecting the model's prior. The result is supervision the model already finds high-probability, while still carrying the new factual signal. A Bernoulli rate λ controls the balance between memorization and retention.
Findings: SFT retains as little as 1% of held-out capability. MixSD retains far more, up to ~100% on larger models, with near-perfect training accuracy. It also beats on-policy self-distillation at a fraction of the compute, and holds across Qwen3 1.7B, 4B, 8B and Llama-3.2.
3 replies · 25 reposts · 105 likes · 8.8K views
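Editor's sketch: the token-level mixing described in the MixSD post can be illustrated roughly as below. This is a toy reading of the post, not the paper's implementation. The post does not say whether tokens are sampled or distributions are blended, nor which branch λ favors, so the sketch assumes a per-token Bernoulli draw that picks the expert conditional with probability `lam` and then samples one token from the chosen conditional. The names `mix_targets`, `toy_expert`, and `toy_naive` are hypothetical.

```python
import random


def mix_targets(expert_next, naive_next, prompt, length, lam, seed=0):
    """Build a target sequence by choosing a conditional per token.

    expert_next / naive_next are hypothetical callables that map a token
    prefix to a dict {token: probability}. lam is the assumed Bernoulli
    rate for choosing the expert conditional at each position.
    """
    rng = random.Random(seed)
    sequence = list(prompt)
    sources = []
    for _ in range(length):
        # Per-token Bernoulli draw: expert with probability lam, else naive.
        use_expert = rng.random() < lam
        dist = expert_next(sequence) if use_expert else naive_next(sequence)
        tokens = list(dist.keys())
        weights = list(dist.values())
        # Sample the next target token from the chosen conditional.
        sequence.append(rng.choices(tokens, weights=weights, k=1)[0])
        sources.append("expert" if use_expert else "naive")
    return sequence[len(prompt):], sources


# Toy stand-ins for the two conditionals of the same base model.
def toy_expert(prefix):
    return {"fact": 1.0}   # conditional that sees the injected fact in context


def toy_naive(prefix):
    return {"prior": 1.0}  # conditional reflecting the model's prior


targets, sources = mix_targets(toy_expert, toy_naive, ["<s>"], length=8, lam=0.5)
print(list(zip(targets, sources)))
```

In this toy setup, `lam=1.0` yields targets drawn only from the expert conditional and `lam=0.0` only from the naive one, which is one way to picture the memorization versus retention trade-off the post mentions.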

rico retweeted
Beidi Chen @BeidiChen
Aligned with how @cursor_ai has done its RL stage: AstraFlow is a new RL engine that enables asynchronous, heterogeneous, and geo-distributed RL in a native way through a dataflow abstraction. Like @FireworksAI_HQ’s sparse RL transfer design, it syncs only ≤1.1% of model weights, making remote rollout lightweight and efficient. Check it out!!!
Quoted post from Infini-AI-Lab @InfiniAILab:
We’re excited to release AstraFlow, an open-source, dataflow-oriented RL system for training multi-agentic and multi-policy LLMs. 🚀 Built for scalable, flexible, and efficient agent RL, AstraFlow natively enables:
⚡ 2.7× faster multi-policy agents collaborative RL training: achieves comparable or better accuracy than a verl-based baseline.
🌍 Zero-code system flexibility: supports elastic multi-policy training and cross-region rollout across heterogeneous GPUs.
📦 ≤1.1% sparse transfer for remote rollout: similar to @FireworksAI_HQ’s sparse RL transfer design, AstraFlow cuts sync from ~28 GB to ~1.5 GB, with deltas ≤1.1% of weights, making remote rollout lightweight and efficient: fireworks.ai/blog/frontier-…
🔁 Substitutable rollout and trainer services: provides modular rollout and training components for flexible deployment.
🧵(1/5)

3 replies · 29 reposts · 207 likes · 31.4K views
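Editor's sketch: the idea of syncing only the weights that changed, as the AstraFlow post describes for remote rollout, can be pictured with a toy delta over a flat list of floats. This is not AstraFlow's or Fireworks' actual transfer format, which the post does not specify. The function names `make_sparse_delta` and `apply_sparse_delta` are hypothetical, and a real system would work on tensors and handle encoding and transport.

```python
def make_sparse_delta(old_weights, new_weights):
    """Return {index: new_value} for every weight that differs from the old copy."""
    return {
        i: new
        for i, (old, new) in enumerate(zip(old_weights, new_weights))
        if old != new
    }


def apply_sparse_delta(weights, delta):
    """Return a copy of weights with the changed entries overwritten."""
    updated = list(weights)
    for index, value in delta.items():
        updated[index] = value
    return updated


# Toy example: 1000 weights, 3 of which change after a trainer step.
old = [0.0] * 1000
new = list(old)
for i in (3, 250, 999):
    new[i] = 0.5

delta = make_sparse_delta(old, new)       # what the trainer would send
remote = apply_sparse_delta(old, delta)   # what the rollout worker reconstructs
fraction_sent = len(delta) / len(old)
print(len(delta), fraction_sent)
```

The rollout side ends up with the same weights as the trainer while only the changed entries cross the network, which is the property the post attributes to its sparse transfer.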

rico retweeted
Xavier Gonzalez @xavierjgonzalez
Fixed-point iterations for parallelizing nonlinear dynamics are all the rage:
- Newton for RNNs
- Picard for diffusion models
- Jacobi for parallel decode of LLMs
But how do these techniques relate, and when should you use them? We show you how in our new paper 🧵
6 replies · 27 reposts · 169 likes · 19.1K views

rico retweeted
Jiaxin Wen @jiaxinwen22
New post: "Generalization Dynamics of LM Pre-training" Most people (including me) assume that LMs smoothly mature from pattern-matching to generalizing. This mental model is wrong. The true dynamics are stranger, and far more fascinating! We call it Mode-Hopping.
10 replies · 76 reposts · 454 likes · 51.1K views

MikaStars★ @MikaStars39
a late update: I've joined @MiniMax_AI post-training team. Working on M3!🐙
43 replies · 7 reposts · 362 likes · 18.1K views

rico retweeted
Zeyi(Andy) Liu @ZeyiAndyLiu
New paper: Spectral Lens
Loss curves can hide how LLMs actually learn. We show that activation and gradient spectra reveal hidden representation geometry, predict token efficiency early, and distinguish learning gains from throughput gains. arxiv.org/abs/2605.05683
5 replies · 34 reposts · 231 likes · 13.3K views

rico retweeted
Nous Research @NousResearch
Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context. It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss.
During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We then score every pyramid head, and a top-k cascade selects a small hierarchical dense sub-sequence. After a sorting pass that enforces causality, we use standard attention for token mixing. A brief full-attention resume at the end converts the checkpoint back into a competent dense-attention model.
We validated this using 530M-parameter Llama-3 models across 50B tokens, with up to 1M-token benchmarks across 32 B200s under context parallelism.
The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.
52 replies · 230 reposts · 2K likes · 155.5K views
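Editor's sketch: two of the steps named in the Lighthouse Attention post, pooling into a multi-resolution pyramid and a top-k selection followed by a sort that restores positional order, can be illustrated on plain lists. The post does not specify the pooling operator, the scoring function, or how the cascade moves across levels, so this sketch assumes average pooling by a factor of 2 and a single-level top-k over given scores. `build_pyramid` and `select_top_k` are hypothetical names, not the released code.

```python
def build_pyramid(values, levels):
    """Return [level0, level1, ...]; each level averages adjacent pairs of the previous one.

    A trailing unpaired element at any level is dropped (an assumption of this sketch).
    """
    pyramid = [list(values)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pooled = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        pyramid.append(pooled)
    return pyramid


def select_top_k(scores, k):
    """Pick the k highest-scoring positions, then sort them back into sequence order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


pyramid = build_pyramid([1.0, 3.0, 5.0, 7.0], levels=3)
kept = select_top_k([0.1, 0.9, 0.3, 0.8], k=2)
print(pyramid)
print(kept)
```

Sorting the selected positions after the top-k step keeps the shortened sequence in its original order, so an ordinary causal attention call could then run over it.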

rico retweeted
Nabil Iqbal @nblqbl
For the physicists: it seems a "gapless" mode is good for learning. Tuning to a critical point is one way to make one, and is the usual "edge of chaos" in deep learning. But having a Goldstone mode also works. arxiv link: arxiv.org/pdf/2605.14685 Let me know your thoughts!
1 reply · 4 reposts · 52 likes · 1.9K views

rico retweeted
Aaron Spieler @AaronSpieler
1) I'm excited to present "Scaling Laws and Tradeoffs in Recurrent Network of Expressive Neurons" 🎉, where we question the optimality of simple neurons for network performance, and capture scaling behaviours using a minimal information theoretic model: arXiv.org/abs/2605.12049
6 replies · 21 reposts · 93 likes · 5.5K views

rico retweeted
Tony S.F. @tonysilveti
@norxornor @ShikaiQiu Totally agree with this. We actually did exactly this experiment vs muP and found the same T^{-1/3} factor (if you also scale (batch size * seq len) appropriately). Just doing muP is not enough when scaling data.
0 replies · 3 reposts · 14 likes · 479 views

rico retweeted
Shenyang Deng ✈️ ICML2026 @DengShenyang24
1/n Please stop by👋. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NN optimization. Quick overview below 🧵
3 replies · 17 reposts · 105 likes · 15.3K views

rico retweeted
Paria Rashidinejad @paria_rd
Looped Transformers: the dream was right. But there was trouble in paradise. The loop made them unstable, expensive, and memory-hungry, with gains hard to scale.
So we asked: Can we reap the rewards without paying the loop tax?
Introducing Attractor Models for Language and Reasoning:
• A Backbone proposes an initial “guess” output embedding;
• An Attractor refines it: a fixed-point solver lets the model “think” before each token.
Implicit differentiation trains the model stably, with constant memory and without BPTT.
Training also revealed a surprising phenomenon: Equilibrium Internalization. Over the course of training, the Backbone learns to propose latents close to the equilibrium itself, making the Attractor almost unnecessary at inference.
Results:
• Pareto improvement on language modeling: up to 46.6% lower perplexity and 19.7% better downstream accuracy. A 770M Attractor Model beats a 1.3B Transformer, despite being trained on half as many tokens.
• Significant gains on hard reasoning tasks: a 27M Attractor Model trained on only 1K examples achieves 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard, while Transformers and frontier models like Claude and GPT o3 score 0%.
📝 arxiv.org/pdf/2605.12466 🧵 1/10
19 replies · 89 reposts · 591 likes · 63.6K views
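Editor's sketch: the "Backbone proposes, Attractor refines" loop from the post can be pictured with a scalar fixed-point iteration. This is a toy under stated assumptions, not the paper's model: the map `attractor_step` is a made-up contraction with fixed point 2.0, the two starting values stand in for a poor and a good Backbone guess, and the implicit-differentiation training the post mentions is not shown.

```python
def solve_fixed_point(f, z0, tol=1e-6, max_iter=100):
    """Iterate z <- f(z) from z0 until successive iterates differ by less than tol.

    Returns the final iterate and the number of iterations used.
    """
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter


def attractor_step(z):
    """Hypothetical contraction standing in for the Attractor; its fixed point is 2.0."""
    return 0.5 * z + 1.0


# A poor initial guess versus a guess already close to the equilibrium.
z_cold, steps_cold = solve_fixed_point(attractor_step, z0=0.0)
z_warm, steps_warm = solve_fixed_point(attractor_step, z0=1.999)
print(z_cold, steps_cold)
print(z_warm, steps_warm)
```

Both runs converge to the same equilibrium, but the run that starts near it needs fewer iterations. That is one way to read the post's remark that a Backbone proposing latents close to the equilibrium makes the Attractor almost unnecessary at inference.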

rico retweeted
Zihao Zhao @zhaokevin1012
Excited to share that our paper "A Fully First-Order Layer for Differentiable Optimization" was accepted to #ICML2026 as a spotlight! We propose FFOLayer, a differentiable optimization layer that makes the backward pass fully first-order. Try our code: github.com/GT-KOALA/FFOLa… (1/n)
2 replies · 31 reposts · 168 likes · 14.7K views