Pollux9437
@Orange41324306

1.3K posts

Agent @yipitdata, RL Infra and pretrain research at night, sometimes photography and poetry. writer account @orange97648

Santa Clara, CA · Joined January 2020
1.8K Following · 981 Followers
Pollux9437 reposted
Zhepei Wei @weizhepei
😢RLVR is powerful but expensive 🤯Imagine using <20% RLVR training while achieving 100% performance? Sounds surprising? We show that minimal RLVR training is enough to know where training is going, and predict future ckpts at no training cost! 📃tinyurl.com/minimal-rlvr 🧵[1/n]
Pollux9437 reposted
Zhuokai Zhao @zhuokaiz
Qwen3, GLM-5, and MiMo all use on-policy distillation in post-training. Thinking Machines also wrote it up as a cheap alternative to RL. But in practice it is surprisingly brittle to make work — much more so than SFT or RL. Three recent papers [1, 2, 3] helped me make sense of why. The mechanism is consistent across the failure modes they describe, and it's worth understanding before running another OPD experiment.

OPD looks like "match the teacher distribution." But in practice, the update is driven by a very small set of next-token choices at each generation step. Mostly just the handful of tokens that both the student and teacher think are plausible as the next token. Once that small set breaks, OPD breaks.

The real object OPD is learning on

At every generation step, the model has a huge vocabulary. 150K tokens, maybe more. But almost all of the probability mass sits on a tiny number of tokens. One paper [1] shows that the overlapping high-probability tokens between teacher and student carry around 97–99% of the total probability mass. So although OPD is written as reverse-KL over a full vocabulary, most of the useful learning signal comes from a tiny local menu of next-token options.

Here is what I mean by "the handful of tokens." For a given prefix like: "Let's solve this step by step. First, we…" the student may think the next token should be one of: "need", "can", "have", "find", "know", "compute", … The teacher has its own version of this menu.

OPD works when these two menus mostly overlap, and when the teacher puts higher probability on better choices inside that menu. It fails for a few different reasons — the menus don't overlap, the menu drifts somewhere bad mid-training, we only look at one item on the menu instead of the whole thing, or the per-position learning signals end up pulling the model in inconsistent directions.

Four things can go wrong. The first two are about whether the menu is in good shape. The third is about whether we look at the whole menu or just one item from it. The fourth is about whether the signals across positions combine into a useful update.

1. The student and teacher are thinking in different "languages"

A stronger teacher does not necessarily make a better OPD teacher. Li et al. [1] shows this very clearly: a 7B teacher can outperform a 1.5B model on benchmarks, but still fail to improve the 1.5B student through OPD.

Why? Because benchmark accuracy measures final answers. OPD trains on next-token probabilities. A stronger model may solve the same problem through a different reasoning path: different intermediate steps, different phrasing, different proof structure, different local token choices. So when the student writes its own partial solution, the teacher may not assign useful probability to the student's next natural steps. The teacher is better overall, but not necessarily helpful on the student's current path.

The most interesting experiment is the "reverse distillation", where they take a 1.5B model that was improved by RL — a student that has already moved beyond its original base behavior — and try to distill it back using two teachers: the original pre-RL 1.5B model and a larger 7B model from the same family. Both teachers pull the RL-improved student backward. The student loses its RL gains and regresses toward the older behavior.

This sounds surprising at first. But the explanation is simple: OPD does not know that the student's RL behavior is better unless the teacher's token probabilities support it.
If the teacher still prefers the old reasoning pattern, OPD will train the student back toward that pattern. So the RL gains disappear not because the teacher is "weak" in benchmark terms, but because the teacher is giving token-level supervision for a behavior the student has already moved past.

Benchmark gap does not tell you whether OPD will work. Token-level compatibility does.

2. Repetition becomes locally rewarding

Even if OPD starts well, it can still collapse. The most striking failure mode is when training looks fine for a while, then within roughly 30 steps the model starts producing much longer outputs, stops terminating, repetition spikes, and accuracy collapses.

The mechanism is counterintuitive at first but makes sense once you see it. In sampled-token OPD, the reward for a token is roughly the teacher's log-probability minus the student's log-probability on that token. So if the teacher gives a token much higher probability than the student does, that token receives a large positive signal.

Now imagine the student starts repeating itself. In practice this looks less like coherent sentences repeating and more like degenerate loops — something like "wait, wait, wait, wait, wait" filling the rest of the context. This prefix is bad globally. But locally, it is very predictable. A strong teacher is often very confident about predictable text. Once the loop has gone on for a while, the teacher can assign high probability to the next repeated token. The student may be less confident than the teacher. So the repeated token gets a large positive log-ratio. That means OPD accidentally rewards continuing the repetition.

Before repetition starts, repeated tokens are rare, so they don't matter much. But once repetition appears, those tokens become frequent. And because they also receive large positive advantages, they start dominating the update. Luo et al. [2] measure repeated tokens getting 4 to 9 times larger advantages than normal tokens after collapse.

Then the loop reinforces itself: more repetition → more predictable prefix → higher teacher confidence → larger positive signal on repeated tokens → even more repetition.

This is different from the usual length-bias issue in RL. It's more specific — a broken prefix creates locally high-reward repeated tokens, and OPD faithfully amplifies them.

3. We often only look at one item on the menu

The clean objective would compare the teacher and student distributions over multiple possible next tokens. But many public OPD recipes — including the ones used industrially — use a cheaper version: let the student generate one token, then ask whether the teacher assigned that exact token higher or lower probability than the student did. If the teacher's probability is higher, push the student toward it. If lower, push the student away.

That is the sampled-token log-ratio. It is cheap because you only score the token the student actually sampled. You do not need to compare the full vocabulary.

There is a real reason for this design choice. Full sequence-level reverse-KL is noisy for long generations because an early token update gets entangled with many future rewards. Token-level OPD avoids that by giving each token its own local feedback. That gives much better variance scaling with length [3] — worst-case variance grows as O(T²) for token-level instead of O(T⁴) for sequence-level. So for long reasoning traces, token-level feedback is attractive.

The problem is that "one sampled token" is a very noisy view of the teacher's actual next-token preference.
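To make the sampled-token objective concrete, here is a minimal sketch of the per-token signal described above (plain Python; the function and variable names are mine, not from any of the three papers):

```python
import math

def sampled_token_signal(student_logprobs, teacher_logprobs):
    """Per-position OPD signal on the tokens the student actually sampled.

    student_logprobs[t]: log p_student(x_t | prefix) for the sampled token x_t
    teacher_logprobs[t]: log p_teacher(x_t | prefix) for the same token

    The signal is the teacher-minus-student log-ratio: positive when the
    teacher likes the sampled token more than the student does, negative
    when the student is the more confident one.
    """
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# A position where the student is more confident than the teacher gets a
# negative signal. The repetition trap from section 2 is the opposite case:
# a very predictable looped prefix makes the teacher far more confident than
# the student, so the looped tokens get large positive signal.
student = [math.log(p) for p in (0.60, 0.30, 0.35)]
teacher = [math.log(p) for p in (0.20, 0.85, 0.90)]
print(sampled_token_signal(student, teacher))  # [negative, positive, positive]
```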
At a given step, the teacher may have a whole cluster of reasonable next tokens. But sampled-token OPD only checks the one token the student happened to pick. This creates three problems.

First, the student samples tokens from its own distribution, so on most positions the student's probability exceeds the teacher's and the log-ratio is negative. The reward is computed as teacher minus student, and the student is picking tokens where its own log-prob is near its highest — meaning the subtraction is almost always against the student's strongest values. Positive signal only shows up when the student happens to sample a token the teacher likes even more than the student does, which is the minority case.

Second, if the student drifts into weird prefixes, the teacher's local probabilities may no longer reflect global quality.

Third, tokenization and special-token differences can create fake disagreements. The student and teacher may represent the same text with different token boundaries, so a single-token comparison can look terrible even when the underlying string is fine.

The fix proposed in [3] is simple: don't guess the teacher's local preference from one sampled token. Instead, take the teacher's top-k next tokens, renormalize both teacher and student probabilities over that set, and compute reverse-KL there. It's still cheap — you only need the top-k logits, not the whole vocabulary. But it changes the supervision from "did the teacher like this one sampled token?" to "among the teacher's plausible next tokens, does the student put probability mass in the same places?" That is a much better local learning signal (a minimal sketch of this top-k variant appears after section 4 below). They also add top-p sampling during rollout, so the student is less likely to wander into extremely low-probability prefixes, and mask special tokens to avoid fake tokenization mismatches.

4. Even good token signals may not add up

This is the least developed of the four failure modes, and possibly the most important. Li et al. [1] compares a successful OPD setup against a failing one and finds something strange. The failing teacher's per-token advantages are actually larger than the working teacher's, but the gradient norms are smaller. And the failing teacher's sequence-level reward can still distinguish correct from incorrect rollouts, comparable to the working case. So the reward signal is globally informative. It just does not produce useful gradients.

What seems to be going on is that OPD computes a learning signal at every token position in the rollout, and then sums them into one gradient update. Each position's contribution is a vector in parameter space pointing in some direction. If the per-position vectors mostly point in the same direction, they add up to a big coherent push. But if they point in different directions, they partially cancel when summed, and the model barely moves even though each individual position had something to say.

The empirical fingerprint is consistent with the second case: large per-token advantages but small gradient norms after summing — individually strong signals that cancel out when combined. The successful teacher shows the opposite pattern: smaller per-token advantages, larger gradient norms after summing — weaker individual signals that reinforce each other when combined.

The paper that raises this leaves it as a hypothesis, but the empirical fingerprint — large per-token advantages, small gradient norms, informative sequence-level reward — is specific enough that it should be testable.
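Here is the promised sketch of the top-k fix from section 3: restrict both distributions to the teacher's top-k next tokens, renormalize, and compute reverse-KL there (plain Python over toy dicts; the exact normalization, masking, and choice of k in [3] may differ):

```python
import math

def topk_reverse_kl(student_probs, teacher_probs, k=8):
    """Reverse KL restricted to the teacher's top-k next tokens.

    student_probs, teacher_probs: dicts mapping token -> probability for one
    generation step. Instead of scoring only the single token the student
    sampled, compare the two distributions on the teacher's plausible set.
    """
    topk = sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:k]
    t_mass = sum(teacher_probs[tok] for tok in topk)
    s_mass = sum(student_probs.get(tok, 1e-12) for tok in topk)
    kl = 0.0
    for tok in topk:
        q = student_probs.get(tok, 1e-12) / s_mass   # renormalized student
        p = teacher_probs[tok] / t_mass              # renormalized teacher
        kl += q * math.log(q / p)                    # reverse KL: KL(student || teacher)
    return kl

# Toy step: teacher and student mostly agree on the plausible "menu",
# so the loss is small; disagreement inside the menu makes it grow.
teacher = {"need": 0.5, "can": 0.3, "have": 0.15, "xyz": 0.05}
student = {"need": 0.4, "can": 0.2, "have": 0.1, "banana": 0.3}
print(topk_reverse_kl(student, teacher, k=3))
```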
What this explains

A lot of practical OPD fixes start to look related.

- SFT cold start helps because it moves the student closer to the teacher's reasoning style before OPD begins.
- Teacher-aligned prompts help because they put the student in regions where the teacher gives more reliable feedback.
- KL regularization helps because it prevents the student from drifting too quickly into weird generations.
- Mixture distillation helps because it keeps some clean reference trajectories in training, so the rollout distribution does not become fully self-generated garbage.
- Top-k matching helps because it stops pretending one sampled token is enough to represent the teacher's local preference.

These look like different tricks. But they are all trying to protect the same thing: the small set of plausible next tokens where teacher and student can actually communicate.

The ceiling, and the real open question

The more interesting part is long-horizon reasoning. The deeper the student gets into its own generated solution, the more likely the prefix is something the teacher would not have written. And once the teacher is judging prefixes outside its own natural distribution, its token probabilities become less reliable. One paper [1] shows this directly: teacher continuation advantage drops sharply as the student prefix gets longer, from +0.37 at 1K prefix to +0.02 at 16K.

That is a bad sign for long-CoT and agentic OPD, because those are exactly the settings where the student spends many steps inside its own partially-generated world. OPD works best when the teacher and student stay close enough that teacher probabilities remain meaningful. Long-horizon agentic training pushes in the opposite direction.

This is the open question I would most want to see investigated. The failure modes in sections 1-3 are diagnosable and have proposed fixes. Section 4 is a hypothesis about local gradient structure that should be testable. But the long-horizon ceiling is different — it is about whether OPD's core assumption (the teacher knows something useful about the student's next token) can hold at all when the student is operating many steps inside its own generated world.

My current takeaway

OPD isn't really a full-vocabulary distribution-matching problem. It's a fragile communication protocol between teacher and student through a tiny local menu of next-token choices. When that menu overlaps, stays clean, and produces gradients that add up, OPD works. When it doesn't, OPD quietly trains the student in the wrong direction.

References
[1] Li et al. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arxiv.org/abs/2604.13016
[2] Luo et al. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. arxiv.org/abs/2604.08527
[3] Fu et al. Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. arxiv.org/abs/2603.25562
Pollux9437 reposted
机器之心 JIQIZHIXIN
Wow, fixing one simple parameter could stop your AI training from collapsing! UCL, Shanghai Jiao Tong University, and HKUST (Guangzhou) present HölderPO. Instead of summing token probabilities in a fixed way, HölderPO uses a flexible averaging trick controlled by a single “p” knob. Large p amplifies rare but important signals; small p keeps things stable. The paper also schedules p to change over time. It beats standard GRPO by 7.2% on math benchmarks (54.9% average) and hits 93.8% success rate on ALFWorld — with better stability and convergence.
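The post doesn't spell out the objective, but the name points at a Hölder (power) mean, so here is what a single "p" knob does on a generic power mean of per-token terms. This is illustrative only; what HölderPO actually aggregates, and how it schedules p, is not given here:

```python
def holder_mean(values, p):
    """Hölder (power) mean of positive values: (mean(v_i ** p)) ** (1/p).

    p = 1 is the ordinary average; larger p weights the largest entries more
    heavily; smaller p weights the smaller entries more heavily and damps
    the influence of outliers.
    """
    n = len(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

# Per-token terms with one outlier; watch how the knob moves the aggregate.
terms = [0.2, 0.3, 0.25, 4.0]
print(holder_mean(terms, p=1.0))   # plain average, ~1.19
print(holder_mean(terms, p=4.0))   # ~2.83, dominated by the large outlier
print(holder_mean(terms, p=0.5))   # ~0.76, closer to the typical terms
```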
Pollux9437 reposted
elvis @omarsar0
Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment? That is the idea behind Lighthouse Attention.

The method wraps ordinary SDPA with a hierarchical, gradient-free selection layer that compresses and decompresses queries, keys, and values symmetrically, preserving left-to-right causality. Crucially, it can be removed near the end of training in a short recovery phase, so the deployed model still runs vanilla attention with no architectural cost at inference. Preliminary LLM experiments report faster total training time and lower final loss than full-attention baselines.

Why does it matter? Most efficient-attention work either changes the deployment-time architecture or pays a quality tax to do so. A training-only wrapper that survives a clean recovery phase sidesteps both. If it scales, this becomes an important training-time speedup for long-context pretraining.

Paper: arxiv.org/abs/2605.06554

Learn to build effective AI agents in our academy: academy.dair.ai
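The post doesn't describe the selection layer in enough detail to reproduce it, so the sketch below is not Lighthouse Attention. It is just one generic way to get compressed keys and values for the distant past while preserving left-to-right causality (block-wise mean pooling, single head, PyTorch), to make the general idea concrete:

```python
import torch
import torch.nn.functional as F

def block_pooled_causal_attention(q, k, v, block=64):
    """Illustrative subquadratic-style attention: full causal attention inside
    each block, plus attention to mean-pooled summaries of earlier blocks.
    Purely a generic sketch, not the paper's method.

    q, k, v: (T, d) single-head tensors, with T divisible by `block`.
    """
    T, d = q.shape
    nb = T // block
    out = torch.zeros_like(q)
    # Pooled (compressed) keys/values: one summary vector per block.
    k_pool = k.view(nb, block, d).mean(dim=1)   # (nb, d)
    v_pool = v.view(nb, block, d).mean(dim=1)   # (nb, d)
    for b in range(nb):
        lo, hi = b * block, (b + 1) * block
        q_b = q[lo:hi]                                  # (block, d)
        # Visible context: summaries of strictly earlier blocks (fully in the
        # past, so causality is preserved) plus the raw tokens of this block.
        k_vis = torch.cat([k_pool[:b], k[lo:hi]], dim=0)
        v_vis = torch.cat([v_pool[:b], v[lo:hi]], dim=0)
        scores = q_b @ k_vis.T / d ** 0.5               # (block, b + block)
        # Causal mask applies only to the raw in-block positions.
        mask = torch.zeros(block, b + block, dtype=torch.bool)
        in_block = torch.triu(torch.ones(block, block, dtype=torch.bool), diagonal=1)
        mask[:, b:] = in_block
        scores = scores.masked_fill(mask, float("-inf"))
        out[lo:hi] = F.softmax(scores, dim=-1) @ v_vis
    return out

# Tiny smoke test with random tensors.
q = torch.randn(256, 32)
k = torch.randn(256, 32)
v = torch.randn(256, 32)
print(block_pooled_causal_attention(q, k, v, block=64).shape)  # torch.Size([256, 32])
```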
Pollux9437 reposted
Kevin Li @kevin_x_li
Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages huggingface.co/datasets/Alien…
Pollux9437 @Orange41324306
Lately I feel more and more strongly that AI's capabilities have already surpassed those of many employees of ordinary ability. Without curiosity and a spirit of exploration, a person can easily degrade into an "AI proxy": ask them a question and they turn around and ask the AI, then forward the AI's output. Press them for their own judgment and conclusions, and they just forward more tokens. If you don't reply, or don't tell them the next step, the problem stays stuck right there.
Pollux9437 reposted
Nous Research @NousResearch
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
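The bag phase is concrete enough to sketch. Below is one minimal, illustrative reading of "average the bag's embeddings on the input side, score the next bag with a modified cross-entropy on the output side" in PyTorch; `embed` and `logits_fn` are stand-ins for the model's embedding table and transformer-plus-unembedding, and the actual TST loss may well differ:

```python
import torch
import torch.nn.functional as F

def bag_inputs_and_loss(token_ids, embed, logits_fn, bag=4):
    """Sketch of a 'bag' training step under a simple reading of the post.

    Each contiguous bag of `bag` tokens is read as the mean of its embeddings,
    and the next bag is scored with a cross-entropy that treats every token in
    that bag as a target (one plausible 'modified cross-entropy').
    """
    T = (len(token_ids) // bag) * bag
    ids = torch.tensor(token_ids[:T]).view(-1, bag)    # (num_bags, bag)
    bag_embs = embed(ids).mean(dim=1)                  # (num_bags, d): averaged inputs
    logits = logits_fn(bag_embs[:-1])                  # predict bag t+1 from bag t
    targets = ids[1:]                                  # (num_bags - 1, bag)
    log_probs = F.log_softmax(logits, dim=-1)
    # Average negative log-likelihood over the tokens of the next bag.
    return -log_probs.gather(-1, targets).mean()

# Toy usage with a random embedding table and a linear stand-in 'model'.
vocab, d = 1000, 64
embed = torch.nn.Embedding(vocab, d)
logits_fn = torch.nn.Linear(d, vocab)
ids = torch.randint(0, vocab, (128,)).tolist()
print(bag_inputs_and_loss(ids, embed, logits_fn, bag=4))
```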
Pollux9437 reposted
ChengSong Huang @ChengsongH31219
"How do you self-improve a model on open-ended tasks where you can't take a majority vote?" I got asked this in nearly every research interview I did last year. None of my answers felt clean. So we built something that doesn't need a vote, a verifier, or a judge. Meet G-Zero. 👇 paper: arxiv.org/abs/2605.09959 huggingface: huggingface.co/papers/2605.09… code: github.com/Chengsong-Huan… All experiments are done via api by @thinkymachines (1/n)
Pollux9437 @Orange41324306
26/05: noting that I bought an RTX 6000 Pro.
Pollux9437 reposted
wh @nrehiew_
New blog post! Wrote about how SFT, RL, OPD relate to generalization and catastrophic forgetting :)
x.com/i/article/2053…

Pollux9437 reposted
varun @varunneal
@Pluralis I really like this paper: openreview.net/forum?id=DuNf2… Also proposes a form of "arch warmup" where decoder blocks are frozen at zero, which they claim is more effective for controlling stability than QK-Norm!
Pollux9437 reposted
Tomasz Limisiewicz @TomLimi
We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
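For reference, the controlled quantity is easy to measure. A minimal sketch, where `tokenize` is any callable mapping a string to a list of token ids (the tokenizers in the comments are hypothetical):

```python
def bytes_per_token(text, tokenize):
    """Compression rate in the sense used above: UTF-8 bytes per token."""
    return len(text.encode("utf-8")) / len(tokenize(text))

# Hypothetical usage: a coarser tokenizer packs more bytes into each token.
# bytes_per_token(sample_text, coarse_tokenizer)  # e.g. ~4.5 bytes/token
# bytes_per_token(sample_text, fine_tokenizer)    # e.g. ~2.5 bytes/token
```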
Pollux9437 reposted
Pollux9437 @Orange41324306
Chatting with friends, I suddenly realized I'm already living (even) the life of their dreams: doing some pretraining and testing architectures, writing kernels and improving infra, rather than holding a researcher/scientist job title and doing data cleaning. Sudden euphoria.
Pollux9437 reposted
Underfox @Underfox3
This paper proposes RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. arxiv.org/pdf/2604.27085