Ruoyu Sun

130 posts

Ruoyu Sun

@RuoyuSun_UI

Associate Prof at CUHK-Shenzhen. Prev: assistant prof @UofIllinois; postdoc @Stanford; visitor @AIatMeta. Works on optimization for machine learning, DL, and LLMs.

Shenzhen, China · Joined December 2010
603 Following · 1.2K Followers
Pinned Tweet
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
We’re excited to share our work "A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning". An earlier version of this work has been on arXiv for a few months; we added more experiments and revised it under this new title. The recipe is simple: the model samples its own responses at low temperature, learns from them with ordinary SFT training, and repeats. No reward. No verifier. No fancy objective beyond standard SFT. On Qwen2.5-Math-7B, mean Pass@1 over 6 math benchmarks improves 22.7 → 39.5. Mean Pass@32 also improves 61.0 → 67.9, suggesting that this simple reward-free procedure unlocks more of the model’s existing reasoning potential. See the updated paper directly at: github.com/ElementQi/SePT… The arXiv link is: arxiv.org/abs/2510.18814 The updated version will appear on arXiv shortly. @Phanron_xli
Ruoyu Sun tweet media
English
7
11
91
9.3K
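The Pass@1 and Pass@32 metrics quoted in the tweet can be estimated with the standard unbiased pass@k estimator commonly used for reasoning and code benchmarks. A minimal sketch (the function name `pass_at_k` is a hypothetical helper, not from the paper's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn (without replacement) from n total, of which c are
    correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 32 samples per problem, 8 of them correct:
print(pass_at_k(32, 8, 1))  # → 0.25
```

Averaging this quantity over problems (with n ≥ k samples each) gives a lower-variance estimate than naively taking one batch of k samples per problem.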
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
Great piece. It made me think about a small self-improvement loop we’ve been studying in post-training. In our recent SePT work, the model trains on its own low-temperature reasoning traces over many rounds of updates, so there is already a small recursive loop there. What we are trying to understand now is why it works. Your post also made me think about how understanding the mechanism in small loops like this might eventually help us think about harder self-improvement problems, like parallel agents, task allocation, and coordination.
English
0
0
0
22
Nathan Lambert
Nathan Lambert@natolambert·
I've been grappling with why I obviously see self-improvement with AI models being real but fast take-off being fake. I present Lossy Self Improvement as a way to capture the curse of complexity & diminishing returns in a world of self-improvement. interconnects.ai/p/lossy-self-i…
English
14
27
249
51.7K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
SePT is built on top of verl, making it easy to plug into existing post-training pipelines. The core loop is simple: low-temperature generation, then standard SFT. If you're already using verl, you can adapt SePT in minutes. 🚀 Code: github.com/ElementQi/SePT
English
0
1
1
94
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
ICML Reviewer kindly suggested we cite a paper posted on arXiv in March 2026 as related work. For reference, the submission deadline of ICML was January 28, 2026. A fair request, assuming authors are now expected to survey not just the literature, but also the future.
English
12
29
699
48.2K
Deep Insight Labs
Deep Insight Labs@DeepInsightLabs·
@RuoyuSun_UI Is the reviewer affiliated with the authors of the paper you were requested to cite? Suspicious move...
English
1
0
1
182
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@antoniolupetti Thank you for sharing our work, Antonio. We've just updated the paper with more experiments. While the arXiv update is processing, you can find the latest version on GitHub: github.com/ElementQi/SePT…; see discussions and links here: x.com/RuoyuSun_UI/st…
Ruoyu Sun@RuoyuSun_UI

[Quoted tweet: the pinned SePT announcement above]
English
1
0
4
860
Antonio Lupetti
Antonio Lupetti@antoniolupetti·
"LLM Reasoning: Surprising effectiveness of self-tuning without rewards". Instead of forcing behavior with reinforcement learning, this approach fine-tunes the model on its own generated data, reinforcing what it already learned during pretraining. So much high-quality research lately, it’s hard to keep up.😅 arxiv.org/abs/2510.18814
English
7
29
224
16.9K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
Our method, SePT, follows a simple iterative recipe: Sample questions q → Generate low-temperature responses o → Update the model via standard SFT → Repeat using the updated model. Note the "online refresh" ingredient: by interleaving generation and training, the self-generated data is "refreshed" by the latest model.
Ruoyu Sun tweet media
English
0
1
2
252
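The recipe above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' verl-based implementation: `generate` and `sft_update` are placeholder stubs standing in for real decoding and fine-tuning calls.

```python
import random

def generate(model, q, temperature):
    # Placeholder: a real call would decode a reasoning trace at low temperature.
    return f"trace({q}, T={temperature})"

def sft_update(model, pairs):
    # Placeholder: a real call would run standard SFT on the (q, o) pairs.
    # Here we just track how many update rounds / examples the model has seen.
    return {"step": model["step"] + 1, "seen": model["seen"] + len(pairs)}

def self_training_loop(model, questions, rounds=4, temperature=0.3, batch_size=2):
    for _ in range(rounds):
        batch = random.sample(questions, batch_size)               # 1) sample questions q
        traces = [generate(model, q, temperature) for q in batch]  # 2) low-temp responses o
        model = sft_update(model, list(zip(batch, traces)))        # 3) standard SFT, no reward
        # 4) repeat: the *updated* model generates the next batch ("online refresh")
    return model

m = self_training_loop({"step": 0, "seen": 0}, ["q1", "q2", "q3"], rounds=3)
print(m)  # → {'step': 3, 'seen': 6}
```

The key design choice mirrored here is the online refresh: generation in each round uses the model returned by the previous round's update, rather than a fixed base model producing all data up front.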
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@shyyhs We observed that output diversity dropped a bit, but it stayed fairly close to GRPO's.
English
1
0
2
142
Haiyue Song
Haiyue Song@shyyhs·
@RuoyuSun_UI Thank you for the great paper! 🤩 I've done similar self-distillation experiments and found that output diversity drops. Have you observed a similar phenomenon?
English
1
0
1
239
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
Yes — closely related and we're glad to see this direction getting more attention. Two differences in our work (SePT): 1) Domain: we focus on math reasoning; Apple's paper focuses on coding. 2) Online Refresh: We use "online" interleaving of generation and training. After each update on self-generated responses, the updated model is used to generate the next batch of responses. We also tried an offline variant, where the base model generates all training data at once; overall it helps, but not as much as the online version.
English
1
0
3
91
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@ghosh_satanu22 a bit more context: the reviewer is negatively judging our submission, with the main reason being "missing comparison to related works". In addition, this review is the only negative one out of four reviews
English
0
0
1
93
Satanu Ghosh
Satanu Ghosh@ghosh_satanu22·
@RuoyuSun_UI This is a common problem. It is okay to suggest a new paper, but to negatively judge a submission because of it is unacceptable.
English
1
0
10
1.4K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@nrehiew_ We’ve also been exploring a closely related idea on math reasoning tasks, using online updates. We just shared an updated version here, in case you're interested: x.com/RuoyuSun_UI/st…
Ruoyu Sun@RuoyuSun_UI

[Quoted tweet: the pinned SePT announcement above]
English
0
0
1
118
wh
wh@nrehiew_·
Notes on this nice and straightforward experimental paper from Apple on Self-Distillation
wh tweet media
English
2
11
134
10.8K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@BoWang87 Really interesting result! We’ve also been exploring a closely related reward-free self-training direction for math reasoning, and just shared our updated version here: x.com/RuoyuSun_UI/st…
Ruoyu Sun@RuoyuSun_UI

[Quoted tweet: the pinned SePT announcement above]
English
0
0
1
412
Bo Wang
Bo Wang@BoWang87·
Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD reshapes distributions in a context-dependent way, suppressing distractors at "locks" while keeping diversity alive at "forks". The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick; it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd
Bo Wang tweet media
English
54
201
1.7K
502.9K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@stevelaskaridis We told the reviewer that this work appeared after the ICML submission deadline, but we did not press the issue much further in the reply. We raised the issue in a letter to the AC, since in this case only the AC could possibly help.
English
0
0
3
882
Steve Laskaridis
Steve Laskaridis@stevelaskaridis·
@RuoyuSun_UI Out of curiosity, what did you respond? Sometimes diplomatic replies (which I absolutely understand) prevent the community from holding reviewers accountable.
English
1
0
1
1.3K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
This is the kind of situation I had imagined for a long time, but had never actually heard of in a real case. Hard to believe it really happens! I think conference organizers could prepare a detailed guideline, e.g. "don't ask authors to cite your own arXiv paper; don't ask them to cite future papers", and use automatic checking, somewhat like the service provided to ICML authors this year.
English
1
0
1
261
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@sheriyuo A bit more context: the reviewer is asking us not only to cite that paper, but also to add comparison experiments. Our paper has been available online about half a year longer. Yes, there are many irresponsible reviewers, and this is just another example of that.
English
0
0
9
946
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@ahatamiz1 Do you mean citing a paper that was still under review and not publicly available? If so, that sounds quite awkward.
English
1
0
0
856
Ali Hatamizadeh
Ali Hatamizadeh@ahatamiz1·
@RuoyuSun_UI At least you have a viable path to satisfy this demand. We were previously asked to cite our own submission.. like the paper that was under review.
English
2
0
10
2K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@NoSyu That's sad for the community. This should be added to the review guidelines. BTW: ICLR 2024 had a clear guideline for related work, but I did not find one for ICLR 2025. I believe similar guidelines should be set up for future conferences.
English
0
0
1
447
JinYeong Bak
JinYeong Bak@NoSyu·
@RuoyuSun_UI A similar issue occurred with our ICLR submission. In the meta-review, the AC said that our work lacked novelty due to a related arXiv paper (dated October 12, 2023). But, our ICLR submission was made earlier, on September 22, 2023. Result? rejection...
English
1
0
1
806