Chenning Li

14 posts

@chenningli1117

Joined November 2025
85 Following · 102 Followers
Chenning Li retweeted
Xiuyu Li @sheriyuo
These three papers are indeed the most recent research on OPD. Thank you for summarizing and sharing them. Your write-up is very detailed and provides great insights. The development of OPD nowadays is truly astonishing — just look at how many papers have already been collected in Awesome OPD. Before diving into the latest research on OPD, it is definitely essential to be familiar with these follow-ups. 49 Core OPD Papers, 113 new arXivs in 2026🤯 Awesome OPD: github.com/chrisliu298/aw…
Zhuokai Zhao @zhuokaiz

Qwen3, GLM-5, and MiMo all use on-policy distillation (OPD) in post-training. Thinking Machines also wrote it up as a cheap alternative to RL. But in practice it is surprisingly brittle to make work, much more so than SFT or RL.

Three recent papers [1, 2, 3] helped me make sense of why. The mechanism is consistent across the failure modes they describe, and it's worth understanding before running another OPD experiment.

OPD looks like "match the teacher distribution." But in practice, the update is driven by a very small set of next-token choices at each generation step: mostly just the handful of tokens that both the student and teacher think are plausible as the next token. Once that small set breaks, OPD breaks.

The real object OPD is learning on

At every generation step, the model has a huge vocabulary: 150K tokens, maybe more. But almost all of the probability mass sits on a tiny number of tokens. One paper [1] shows that the overlapping high-probability tokens between teacher and student carry around 97–99% of the total probability mass. So although OPD is written as reverse-KL over a full vocabulary, most of the useful learning signal comes from a tiny local menu of next-token options.

Here is what I mean by "the handful of tokens." For a given prefix like:

"Let's solve this step by step. First, we…"

the student may think the next token should be one of:

"need", "can", "have", "find", "know", "compute", …

The teacher has its own version of this menu. OPD works when these two menus mostly overlap, and when the teacher puts higher probability on better choices inside that menu. It fails for a few different reasons: the menus don't overlap, the menu drifts somewhere bad mid-training, we only look at one item on the menu instead of the whole thing, or the per-position learning signals end up pulling the model in inconsistent directions.

Four things can go wrong. The first two are about whether the menu is in good shape. The third is about whether we look at the whole menu or just one item from it. The fourth is about whether the signals across positions combine into a useful update.

1. The student and teacher are thinking in different "languages"

A stronger teacher does not necessarily make a better OPD teacher. Li et al. [1] show this very clearly: a 7B teacher can outperform a 1.5B model on benchmarks, but still fail to improve the 1.5B student through OPD.

Why? Because benchmark accuracy measures final answers. OPD trains on next-token probabilities. A stronger model may solve the same problem through a different reasoning path: different intermediate steps, different phrasing, different proof structure, different local token choices. So when the student writes its own partial solution, the teacher may not assign useful probability to the student's next natural steps. The teacher is better overall, but not necessarily helpful on the student's current path.

The most interesting experiment is the "reverse distillation" one. They take a 1.5B model that was improved by RL (a student that has already moved beyond its original base behavior) and try to distill it back using two teachers: the original pre-RL 1.5B model and a larger 7B model from the same family. Both teachers pull the RL-improved student backward. The student loses its RL gains and regresses toward the older behavior.

This sounds surprising at first. But the explanation is simple: OPD does not know that the student's RL behavior is better unless the teacher's token probabilities support it. If the teacher still prefers the old reasoning pattern, OPD will train the student back toward that pattern. So the RL gains disappear not because the teacher is "weak" in benchmark terms, but because the teacher is giving token-level supervision for a behavior the student has already moved past.

Benchmark gap does not tell you whether OPD will work. Token-level compatibility does.

2. Repetition becomes locally rewarding

Even if OPD starts well, it can still collapse. The most striking failure mode is when training looks fine for a while, then within roughly 30 steps the model starts producing much longer outputs, stops terminating, repetition spikes, and accuracy collapses.

The mechanism is counterintuitive at first but makes sense once you see it. In sampled-token OPD, the reward for a token is roughly the teacher's log-probability minus the student's log-probability on that token. So if the teacher gives a token much higher probability than the student does, that token receives a large positive signal.

Now imagine the student starts repeating itself. In practice this looks less like coherent sentences repeating and more like degenerate loops, something like "wait, wait, wait, wait, wait" filling the rest of the context. This prefix is bad globally. But locally, it is very predictable. A strong teacher is often very confident about predictable text. Once the loop has gone on for a while, the teacher can assign high probability to the next repeated token. The student may be less confident than the teacher. So the repeated token gets a large positive log-ratio. That means OPD accidentally rewards continuing the repetition.

Before repetition starts, repeated tokens are rare, so they don't matter much. But once repetition appears, those tokens become frequent. And because they also receive large positive advantages, they start dominating the update. Luo et al. [2] measure repeated tokens getting 4 to 9 times larger advantage than normal tokens after collapse. Then the loop reinforces itself:

more repetition → more predictable prefix → higher teacher confidence → larger positive signal on repeated tokens → even more repetition

This is different from the usual length-bias issue in RL. It's more specific: a broken prefix creates locally high-reward repeated tokens, and OPD faithfully amplifies them.

3. We often only look at one item on the menu

The clean objective would compare the teacher and student distributions over multiple possible next tokens. But many public OPD recipes, including the ones used industrially, use a cheaper version: let the student generate one token, then ask whether the teacher assigned this exact token higher or lower probability than the student did. If the teacher's probability is higher, push the student toward it. If lower, push the student away.

That is the sampled-token log-ratio. It is cheap because you only score the token the student actually sampled. You do not need to compare the full vocabulary.

There is a real reason for this design choice. Full sequence-level reverse-KL is noisy for long generations because an early token update gets entangled with many future rewards. Token-level OPD avoids that by giving each token its own local feedback. That gives much better variance scaling with length [3]: worst-case variance grows as O(T²) for token-level instead of O(T⁴) for sequence-level. So for long reasoning traces, token-level feedback is attractive.

The problem is that "one sampled token" is a very noisy view of the teacher's actual next-token preference. At a given step, the teacher may have a whole cluster of reasonable next tokens. But sampled-token OPD only checks the one token the student happened to pick. This creates three problems.

First, the student samples tokens from its own distribution, so on most positions the student's probability exceeds the teacher's and the log-ratio is negative. The reward is computed as teacher minus student, and the student is picking tokens where its own log-prob is near its highest, meaning the subtraction is almost always against the student's strongest values. Positive signal only shows up when the student happens to sample a token the teacher likes even more than the student does, which is the minority case.

Second, if the student drifts into weird prefixes, the teacher's local probabilities may no longer reflect global quality.

Third, tokenization and special-token differences can create fake disagreements. The student and teacher may represent the same text with different token boundaries, so a single-token comparison can look terrible even when the underlying string is fine.

The fix proposed in [3] is simple: don't guess the teacher's local preference from one sampled token. Instead, take the teacher's top-k next tokens, renormalize both teacher and student probabilities over that set, and compute reverse-KL there. It's still cheap, since you only need the top-k logits, not the whole vocabulary. But it changes the supervision from "did the teacher like this one sampled token?" to "among the teacher's plausible next tokens, does the student put probability mass in the same places?" That is a much better local learning signal.

They also add top-p sampling during rollout, so the student is less likely to wander into extremely low-probability prefixes, and mask special tokens to avoid fake tokenization mismatches.

4. Even good token signals may not add up

This is the least developed of the four failure modes, and possibly the most important.

Li et al. [1] compare a successful OPD setup against a failing one and find something strange. The failing teacher's per-token advantages are actually larger than the working teacher's, but the gradient norms are smaller. And the failing teacher's sequence-level reward can still distinguish correct from incorrect rollouts, comparable to the working case. So the reward signal is globally informative. It just does not produce useful gradients.

What's going on is that OPD computes a learning signal at every token position in the rollout, and then sums them into one gradient update. Each position's contribution is a vector in parameter space pointing in some direction. If the per-position vectors mostly point in the same direction, they add up to a big coherent push. But if they point in different directions, they partially cancel when summed, and the model barely moves even though each individual position had something to say.

The empirical fingerprint is consistent with the second case: large per-token advantages but small gradient norms after summing, that is, individually strong signals that cancel out when combined. The successful teacher shows the opposite pattern: smaller per-token advantages and larger gradient norms after summing, that is, weaker individual signals that reinforce each other when combined.

The paper that raises this leaves it as a hypothesis, but the empirical fingerprint (large per-token advantages, small gradient norms, informative sequence-level reward) is specific enough that it should be testable.

What this explains

A lot of practical OPD fixes start to look related.

SFT cold start helps because it moves the student closer to the teacher's reasoning style before OPD begins.

Teacher-aligned prompts help because they put the student in regions where the teacher gives more reliable feedback.

KL regularization helps because it prevents the student from drifting too quickly into weird generations.

Mixture distillation helps because it keeps some clean reference trajectories in training, so the rollout distribution does not become fully self-generated garbage.

Top-k matching helps because it stops pretending one sampled token is enough to represent the teacher's local preference.

These look like different tricks. But they are all trying to protect the same thing: the small set of plausible next tokens where teacher and student can actually communicate.

The ceiling, and the real open question

The more interesting part is long-horizon reasoning. The deeper the student gets into its own generated solution, the more likely the prefix is something the teacher would not have written. And once the teacher is judging prefixes outside its own natural distribution, its token probabilities become less reliable.

One paper [1] shows this directly: teacher continuation advantage drops sharply as the student prefix gets longer, from +0.37 at 1K prefix to +0.02 at 16K. That is a bad sign for long-CoT and agentic OPD, because those are exactly the settings where the student spends many steps inside its own partially-generated world.

OPD works best when the teacher and student stay close enough that teacher probabilities remain meaningful. Long-horizon agentic training pushes in the opposite direction.

This is the open question I would most want to see investigated. The failure modes in sections 1-3 are diagnosable and have proposed fixes. Section 4 is a hypothesis about local gradient structure that should be testable. But the long-horizon ceiling is different: it is about whether OPD's core assumption (the teacher knows something useful about the student's next token) can hold at all when the student is operating many steps inside its own generated world.

My current takeaway

OPD isn't really a full-vocabulary distribution-matching problem. It's a fragile communication protocol between teacher and student through a tiny local menu of next-token choices. When that menu overlaps, stays clean, and produces gradients that add up, OPD works. When it doesn't, OPD quietly trains the student in the wrong direction.

References

[1] Li et al. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arxiv.org/abs/2604.13016
[2] Luo et al. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. arxiv.org/abs/2604.08527
[3] Fu et al. Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. arxiv.org/abs/2603.25562
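To make the contrast in section 3 of the post concrete, here is a minimal sketch of the two per-position signals it describes: the sampled-token log-ratio and reverse-KL over the teacher's renormalized top-k tokens. The function names, the 4-token toy vocabulary, and all probabilities are made up for illustration; this is not the implementation from any of the cited papers.

```python
import math

def sampled_token_log_ratio(teacher_probs, student_probs, token):
    """Sampled-token signal: log p_teacher(token) - log p_student(token)."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

def topk_reverse_kl(teacher_probs, student_probs, k):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens,
    with both distributions renormalized over that set."""
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]
    t_mass = sum(teacher_probs[i] for i in top)
    s_mass = sum(student_probs[i] for i in top)
    kl = 0.0
    for i in top:
        t = teacher_probs[i] / t_mass
        s = student_probs[i] / s_mass
        if s > 0.0:  # a zero-probability term contributes nothing
            kl += s * (math.log(s) - math.log(t))
    return kl

# Toy next-token distributions over a 4-token vocabulary (hypothetical numbers).
teacher = [0.50, 0.30, 0.15, 0.05]
student = [0.30, 0.50, 0.15, 0.05]

# The student usually samples its own favourite (token 1), where it is more
# confident than the teacher, so the sampled-token signal is negative.
ratio_student_pick = sampled_token_log_ratio(teacher, student, 1)

# Positive signal appears only when it happens to sample the teacher's favourite.
ratio_teacher_pick = sampled_token_log_ratio(teacher, student, 0)

# Top-k matching scores the whole local menu in one number instead.
menu_kl = topk_reverse_kl(teacher, student, k=3)

# Repetition loop from section 2: the teacher is 0.99 sure of the repeated
# token, the student only 0.60, so repeating earns a positive log-ratio.
loop_ratio = math.log(0.99) - math.log(0.60)

print(ratio_student_pick, ratio_teacher_pick, menu_kl, loop_ratio)
```

The sampled-token signal depends entirely on which token the student happened to draw, while the top-k quantity is the same for every draw at that position, which is the sense in which it is a less noisy view of the teacher's preference.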

0
18
147
14.6K
Tianyi Zhang @mycharmspace
Today is my last day at xAI. I joined xAI a year ago and had the pleasure of leading the search and factuality post-training team. Over time, we developed so many recipe and engineering co-optimizations, making Grok the best AI for search and real-time agents. I am also particularly proud of working with a small group of talented people delivering the recent iterations of the instant mode of Grok, the one I personally liked and used the most. My thanks to all the friends and teammates for their support and help over the past year. They are among the brightest minds I’ve met in my career. I am sure the team will continue the mission to make Grok better and understand the universe.
84
10
647
83.1K
Chenning Li retweeted
Stephen Xie @stephenx_
Longer chain-of-thought = slower inference, more context rot, and ballooning compute. So what if the model could decide for itself when to go parallel? Our new BAIR blog breaks down Adaptive Parallel Reasoning (APR) — the next paradigm in inference-time scaling. 🧵
14
49
455
40.6K
Chenning Li retweeted
xAI @xai
Grok 4.3 is now live on the xAI API. It’s our fastest, most intelligent model to date. It tops the @ArtificialAnlys leaderboards in agentic tool calling and instruction following, and ranks #1 in @ValsAI enterprise domains like case law and corporate finance. Grok 4.3 supports a 1 million token context window and is priced at $1.25/m input and $2.50/m output. Create an API key and start building: console.x.ai/team/default/a…
911
1.1K
10.1K
89.5M
Chenning Li @chenningli1117
@stephenx_ Impressed how you made the office agent artifacts so beautiful. 🐐
1
0
4
206
Chenning Li @chenningli1117
Glad to contribute to the office agent and instruction following work streams.
Artificial Analysis @ArtificialAnlys

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20.

The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing cost to run the benchmark suite.

Key Takeaways:

➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level.

➤ Large increase in real-world agentic task performance: The largest single benchmark improvement is on GDPval-AA, where Grok 4.3 scores an Elo of 1500, up 321 points from Grok 4.20 0309 v2’s score of 1179, surpassing Gemini 3.1 Pro Preview, Muse Spark, GPT-5.4 mini (xhigh), and Kimi K2.5. Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17% against GPT-5.5 (xhigh) under the standard Elo formula.

➤ Grok 4.3 performs strongly on instruction following and agentic customer support tasks. It gains 5 points on 𝜏²-Bench Telecom to reach 98%, in line with GLM-5.1. Grok 4.3 maintains an 81% IFBench score from Grok 4.20 0309 v2.

➤ Grok 4.3 gains 8 points on AA-Omniscience Accuracy, but at the cost of an 8-point drop in AA-Omniscience Non-Hallucination Rate, so Grok 4.20 0309 v2 still leads AA-Omniscience Non-Hallucination Rate, followed by MiMo-V2.5-Pro, in line with Grok 4.3.

Congratulations to @xAI and @elonmusk on the impressive release!
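The ~17% expected win rate quoted above can be checked against the standard Elo expected-score formula, E = 1 / (1 + 10^(-d/400)), where d is the rating difference from the player's point of view. A minimal sketch (the function name is illustrative and not taken from the post):

```python
def elo_expected_score(rating_diff: float) -> float:
    """Expected score for a player rated `rating_diff` points above the opponent
    (negative when the player is rated lower)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# Grok 4.3 trails GPT-5.5 (xhigh) by 276 Elo points on GDPval-AA.
win_rate = elo_expected_score(-276)
print(f"{win_rate:.1%}")  # roughly 17%
```

The same function also shows the symmetry of the formula: a 276-point lead corresponds to the complementary expected score of roughly 83%.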

3
0
11
341
Yihe Deng @Yihe__Deng
Last day at xAI. For a new grad, the past six months have been an irreplaceable experience. I feel fortunate to have made the decision to join this journey with xAI, and grateful for how much I was able to learn here in such a short, dense period of time. I'm proud of what we built, and what the multimodal team continues to build. I have deep faith in this team. No matter where I go next, I'll always look forward to seeing what my friends here pull off and bring into the world. I'm especially grateful to my captains along the way -- people I look up to, trust deeply, and who placed trust in my potential. I truly appreciate all the friends I met here, and the time we spent building together. And thanks xAI for the opportunity, and for giving me the space to learn, contribute, and grow. In the end, the greatest treasure is indeed the journey itself: the problems worth solving, and the people worth building with. Now, it's time to step into the uncertainty of what comes next.
49
6
550
35.8K
Haotian Liu @imhaotian
I left xAI earlier this week. It was a difficult decision. The past two years have been an intense, fun, and deeply rewarding journey, and I accomplished things I could not have imagined two years ago. Thank you @elonmusk for the opportunity and for everything I learned at xAI. Thank you @Guodzh for the trust you placed in me and for all the days and late nights we worked through together. And thank you to the entire Omni / Imagine team: thank you for your trust, and for growing together with me. It has been an honor, and I am incredibly proud of what we achieved together. I feel fortunate to have had the chance to work with all of you. At xAI, everything feels possible. I had the chance to work with and learn from some of the most exceptional people I have ever met. I was able to explore across domains: from pretraining to post-training, from language models to multimodal, from perception to generation. Joining xAI was one of the best decisions I have ever made. @grok imagine is special to me. Building video generation models, where I started with almost zero prior knowledge, from 0 to No.1, as an IC and as a lead, alongside an extraordinary team, and shipping it as a great product used by millions, all within 6 months, at age 28: I feel proud. But now it’s time for me to move on. I’m burnt out, and I know my happiness is no longer maximized in my current state. It is sad to say goodbye, but it is just the right time for a change. Best wishes to the Imagine team, you are absolutely the best, and you deserve the best. I will cherish all our memories for the rest of my life. For now, I’m taking a break and giving myself time to figure out what comes next. Posted from Hawaii.
162
57
1.9K
179.8K
Chenning Li @chenningli1117
@Guodzh Good luck, Guodong! Glad to work with you.
0
0
1
838
Guodong Zhang @Guodzh
Last day at xAI. Wild journey past three years but excited about next chapter. Thanks all for the love and support yesterday. So many friends made along the way and I will miss you all!
237
63
2.5K
659.2K
Chenning Li retweeted
Elon Musk @elonmusk
The Grok 4.2 release candidate (public beta) is now available for use. You need to select it specifically. Critical feedback is appreciated. Unlike prior versions of Grok, 4.2 is able to learn rapidly, so there will be improvements every week with release notes.
3.5K
3.4K
41.8K
23.5M
Chenning Li @chenningli1117
@Yuhu_ai_ All the best, Tony! Glad to work with you.
0
0
13
3.5K
Yuhuai (Tony) Wu @Yuhu_ai_
I resigned from xAI today. This company - and the family we became - will stay with me forever. I will deeply miss the people, the warrooms, and all those battles we have fought together. It's time for my next chapter. It is an era full of possibilities: a small team armed with AIs can move mountains and redefine what's possible. Thank you to the entire xAI family. Onward. 🚀 And to Elon @elonmusk - thank you for believing in the mission and for the ride of a lifetime.
739
371
9.3K
3.6M
Chenning Li @chenningli1117
@hongyuan_mei All the best Hongyuan! Glad to work with you and your team!🫡
1
0
3
218
Hongyuan Mei @hongyuan_mei
Today, xAI joins SpaceX and begins a new chapter. I’m starting one too, after an intense and unforgettable ride at xAI. In a remarkably short time, we pushed Grok 4 to be among the most intelligent AIs of 2025. We founded the AI Experts team in the xAI Reasoning org, which I led until today. Our fast-growing team shipped Grok 4.1 Fast—powering Grok in Tesla—and drove deep collaboration across xAI, Tesla, SpaceX, and beyond. The war rooms were intense—full of focus, laughter, stress, joy, and growth. I’m confident the team will carry the mission forward: to understand the universe and extend xAI’s impact to the stars. As for me, it’s time to give back—to my family, and to myself. There are many ideas I couldn’t explore before due to priorities; now it’s finally time to work through them in my own war room. I’m excited about what I may bring back into the world next.
Hongyuan Mei @hongyuan_mei

Working at @xai means you don’t just think about impact on Earth—you imagine @grok reaching Mars and beyond, alongside @SpaceX. Discussing what we can achieve together is genuinely thrilling. The horizon keeps expanding. If this excites you, join us. job-boards.greenhouse.io/xai/jobs/48000…

48
21
545
51.7K
Chenning Li retweeted
Alex Pan @aypan_17
We're hiring for the safety team at xAI! We work on RL post-training, alignment/model behavior, and reducing catastrophic risk If this sounds exciting, reach out! (1/3)
68
62
1.1K
92.6K