Jeonghye Kim (@beanie0__0) - Twitter Profile

Pinned Tweet

My 9-month internship journey at Microsoft Research Asia comes to an end. Two keywords throughout: self-distillation and exploration in LLM post-training. What I took away from this journey: self-distillation plays qualitatively different, even opposite, roles depending on the type of LLM reasoning.

[1] In my first project, we used online self-distillation with text feedback for long-horizon agent tasks, enabling efficient exploration and significant performance gains (up to 128%!). Since self-distillation proved so effective in agent settings, extending it to single-turn math reasoning felt like a natural next step. But across various implementations and models, we observed only a brief initial improvement followed by a decline in performance. Hmm… why? The answer wasn't obvious, and I spent much of the remaining months digging into this gap.

[2] It came down to one question: when reasoning goes off track, how does the model even know? In world-Bayesian reasoning, where the model interacts with an external environment, the environment provides the answer. In self-Bayesian reasoning, where reasoning is purely internal, there is no such signal. The model's sporadic epistemic verbalization, "wait, is this right?", becomes virtually the only error-detection mechanism enabling robust reasoning.

[3] This is why self-distillation plays opposite roles in the two settings. In world-Bayesian reasoning, self-distillation internalizes external world knowledge into model parameters. A positive role. But in self-Bayesian reasoning, the teacher already knows the solution, so its trajectory has no reason to contain expressions of uncertainty. The student learns from this and loses the ability to externalize uncertainty. The very mechanism essential for robust reasoning disappears.

[4] This points to the reverse: in self-Bayesian reasoning, encouraging epistemic verbalization rather than suppressing it is what drives meaningful exploration.

👉 Papers:
[1] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization (ICLR26, 25.09)
[2] Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty (Initial release: 26.03, Revised: 26.05)
[3] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (26.03)
[4] Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR (26.05)

Lots of exciting results still to share. I believe real-world LLM reasoning requires a good combination of world-Bayesian and self-Bayesian approaches. Looking forward to the discussions! Huge thanks to my colleagues at MSRA for all the discussions and support along the way💚 I feel lucky to have worked with this team.


Jeonghye Kim retweeted

Minki Kang@mkkang_1133·1d

🚀 Releasing ✨AXPO✨ an RL method to lift agentic reasoning models past their next scaling tier. Be it math, perception, or search, AXPO fixes the structural blind spot 'just add tools' recipes leave untouched. 8B beats 4x larger 32B baseline on Pass@4. from NVIDIA 🧵 (1/7)


Jeonghye Kim@beanie0__0·1d

Thank you for the valuable insight! I also agree that the teacher’s role should be to encourage autonomous and meaningful exploration by the student. Our recent work explores a similar direction. In complex reasoning tasks like math, we found that "reversing" the self-distillation signal, that is, reinforcing tokens where the student diverges from the teacher yet still succeeds, encourages valuable exploration and leads to substantial gains. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR (x.com/beanie0__0/sta…) Instead of passively overwriting the student’s choices, RLRT treats disagreement as a discovery signal, amplifying self-driven reasoning rather than suppressing it.


Omar Khattab@lateinteraction·1d

extremely informal rant: on-policy distillation is so awkward and frankly just super overrated. why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting. imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus. after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at. or imagine trying to teach someone how to drive to the nearest Target. so you throw them into a car and ask them to do that, and you just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere, you're just offering 1-step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at! in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this to work in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases. but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation is conceptually a much more powerful direction. the teacher's job is to actually instruct and to diverge from student's likely actions to the (smallest) necessary extent for success.


Jeonghye Kim@beanie0__0·2d

@ericxyuan Thank you for hosting me! I’m really excited to be part of the team😆


Eric Xingdi Yuan@ericxyuan·2d

Welcome to the Froggy team, looking forward to working with you!

Jeonghye Kim@beanie0__0

I’ll continue my journey at Microsoft Research Montreal starting this June. I will be working with @Zhengyan_Shi on post-training for coding agents. I’m looking forward to working with this amazing team this summer! 💪


Jeonghye Kim@beanie0__0·2d

I’ll continue my journey at Microsoft Research Montreal starting this June. I will be working with @Zhengyan_Shi on post-training for coding agents. I’m looking forward to working with this amazing team this summer! 💪



Jeonghye Kim@beanie0__0·2d

[4] Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR (26.05) TL;DR: RLRT reverses self-distillation on correct rollouts by amplifying tokens where the student diverged from the teacher yet still reached the correct answer, treating these as self-driven reasoning, yielding valuable exploration that consistently outperforms both self-distillation and exploration baselines across Qwen3 models. Paper: arxiv.org/abs/2605.10781 (Code is coming soon) Related post: x.com/beanie0__0/sta…
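The idea in the TL;DR above (upweight tokens on correct rollouts where the student departs from the teacher) can be illustrated with a toy per-token weighting. This is only a sketch under stated assumptions, not the paper's objective: the function name `divergence_weights`, the `alpha` scale, and the use of a clipped log-probability gap as the "divergence" measure are all hypothetical choices made for illustration.

```python
def divergence_weights(student_logps, teacher_logps, correct, alpha=0.5):
    """Toy per-token weights for one rollout (hypothetical, not the RLRT loss).

    student_logps / teacher_logps: log-probabilities each model assigns to the
    tokens the student actually sampled. `correct` says whether the rollout
    reached the right final answer.
    """
    if not correct:
        # Assumption: incorrect rollouts are left unweighted in this sketch.
        return [1.0] * len(student_logps)
    # On correct rollouts, add a bonus where the student was more confident
    # in its own token than the teacher was (student diverged, still succeeded).
    return [
        1.0 + alpha * max(0.0, s - t)
        for s, t in zip(student_logps, teacher_logps)
    ]


student = [-0.25, -0.5, -1.0]
teacher = [-0.25, -2.5, -0.5]
print(divergence_weights(student, teacher, correct=True))
```

Only the second token, where the teacher assigned much lower probability than the student, receives a bonus here; tokens where the two agree, or where the teacher was more confident, keep weight 1.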


Jeonghye Kim@beanie0__0·2d

[3] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (26.03) TL;DR: Self-distillation in LLM reasoning (sometimes) degrades OOD performance by suppressing epistemic verbalization, as richer teacher conditioning produces overly confident traces that discard uncertainty signals crucial for generalization, with the effect modulated by task coverage: beneficial under narrow task distributions but harmful as diversity grows. Paper: arxiv.org/abs/2603.24472 Blog: beanie00.notion.site/why-does-self-… Code: github.com/beanie00/self-… Related post: x.com/beanie0__0/sta…



Jeonghye Kim retweeted

Woogyeol Jin@wg_jin02·3d

🚀 On-Policy Distillation (OPD) has gained attention for its efficiency over RLVR, thanks to its dense supervision signal. While reverse KL-based OPD effectively captures the teacher's dominant modes, it has a limitation. What's the problem? In reasoning tasks, high-entropy tokens, where the teacher hesitates, mark decision points where multiple valid reasoning paths diverge. OPD fails to transfer the teacher effectively at these positions. ✨ We introduce EOPD (Entropy-Aware On-Policy Distillation), which addresses this by augmenting OPD with a forward KL term on high-entropy tokens. 📄 Paper: arxiv.org/abs/2603.07079 💻 Code: github.com/WLS04/EOPD
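To make the described loss structure concrete, here is a minimal sketch of a reverse-KL distillation loss with an extra forward-KL term applied only at positions where the teacher's entropy is high. It is an illustration under assumptions, not the released EOPD code: the names `entropy_aware_loss`, `entropy_threshold`, and `fkl_weight`, the hard threshold, and the plain averaging over positions are all hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy_aware_loss(student_dists, teacher_dists,
                       entropy_threshold=0.5, fkl_weight=1.0):
    """Average per-position loss: reverse KL everywhere, plus forward KL
    at positions where the teacher's entropy exceeds the threshold."""
    total = 0.0
    for s, t in zip(student_dists, teacher_dists):
        loss = kl(s, t)  # reverse KL: KL(student || teacher), mode-seeking
        if entropy(t) > entropy_threshold:
            loss += fkl_weight * kl(t, s)  # forward KL: KL(teacher || student)
        total += loss
    return total / len(student_dists)


student = [0.90, 0.05, 0.05]
hesitant_teacher = [0.45, 0.45, 0.10]   # high entropy: two plausible tokens
confident_teacher = [0.98, 0.01, 0.01]  # low entropy: one dominant token

print(entropy_aware_loss([student], [hesitant_teacher]))
print(entropy_aware_loss([student], [confident_teacher]))
```

With the hesitant teacher, the forward-KL term penalizes the student for putting almost no mass on the teacher's second plausible token, which reverse KL alone largely ignores. With the confident teacher, the loss reduces to plain reverse KL.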


Jeonghye Kim retweeted

Yifan Yang@Yif_Yang·4d

🚀 Introducing SkillOpt — an optimizer for agent skills. Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter. Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization. SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability. We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings. Train the skill, not the model. 🛠️🤖 🌐 aka.ms/skillopt 📄 huggingface.co/papers/2605.23…


Jeonghye Kim retweeted

Shuyao Tim Xu@TimXu222575·20 May

Hot take: OPSD/SDFT/SDPO works if and only if off-policy context distillation works in the same setup. In both algorithms, what the hint is and where it is placed matter the most. For example, placing the hint in the system prompt or a system reminder is better than in the user prompt, as it leaks less


Jeonghye Kim@beanie0__0·20 May

Exciting to see AntiSD, which shares a similar motivation & analysis with RLRT, flipping the OPSD signal for big efficiency & performance gains. The two papers were posted to arXiv just one day apart 👀 It's truly time to reverse self-distillation in math!! 🔄 More on RLRT here 👇 x.com/beanie0__0/sta…

DailyPapers@HuggingPapers

Anti-Self-Distillation for Reasoning RL

Invert the divergence. Preserving deliberation tokens like "Wait" and "Maybe" instead of template parroting leads to 2-10x faster convergence and +11.5 points on AIME/HMMT across 4B-30B models.


Jeonghye Kim@beanie0__0·19 May

Great to see RL with self-distillation (w/ text feedback) in agent setups being scaled to a production Cursor model! If you're interested in this regime, I highly recommend "Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization" (ICLR'26). In multi-turn agents interacting with external environments, it shows how agents can distill self-generated textual tips during RL training to correct past failures and explore more efficiently, achieving up to a 128.6% performance improvement🚀 📄 Paper: arxiv.org/abs/2602.23008 📝 Blog: agent-lightning.github.io/posts/empo2/ 💻 Code: github.com/microsoft/agen…

Cursor@cursor_ai

Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

