Ruoyu Sun

130 posts

Ruoyu Sun

@RuoyuSun_UI

Associate Prof at CUHK-Shenzhen. Prev: assistant prof @UofIllinois; postdoc @Stanford; visitor @AIatMeta. Works on optimization for machine learning, DL, and LLMs.

Shenzhen, China · Joined December 2010
603 Following · 1.2K Followers
Pinned Tweet
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
We’re excited to share our work "A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning". An earlier version of this work has been on arXiv for a few months; we added more experiments and revised it under this new title. The recipe is simple: the model samples its own responses at low temperature, learns from them with ordinary SFT training, and repeats. No reward. No verifier. No fancy objective beyond standard SFT. On Qwen2.5-Math-7B, mean Pass@1 over 6 math benchmarks improves 22.7 → 39.5. Mean Pass@32 also improves 61.0 → 67.9, suggesting that this simple reward-free procedure unlocks more of the model’s existing reasoning potential. See the updated paper directly at: github.com/ElementQi/SePT… The arXiv link is: arxiv.org/abs/2510.18814 The updated version will appear on arXiv shortly. @Phanron_xli
Ruoyu Sun tweet media
English
7
11
91
9.3K
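The Pass@1 and Pass@32 metrics quoted in the tweet can be estimated with the standard unbiased pass@k estimator commonly used for reasoning and code benchmarks. A minimal sketch (the function name `pass_at_k` is a hypothetical helper, not from the paper's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn (without replacement) from n total, of which c are
    correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 32 samples per problem, 8 of them correct:
print(pass_at_k(32, 8, 1))  # → 0.25
```

Averaging this quantity over problems (with n ≥ k samples each) gives a lower-variance estimate than naively taking one batch of k samples per problem.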
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
Great piece. It made me think about a small self-improvement loop we’ve been studying in post-training. In our recent SePT work, the model trains on its own low-temperature reasoning traces over many rounds of updates, so there is already a small recursive loop there. What we are trying to understand now is why it works. Your post also made me think about how understanding the mechanism in small loops like this might eventually help us think about harder self-improvement problems, like parallel agents, task allocation, and coordination.
English
0
0
0
22
Nathan Lambert
Nathan Lambert@natolambert·
I've been grappling with why I obviously see self-improvement with AI models being real but fast take-off being fake. I present Lossy Self Improvement as a way to capture the curse of complexity & diminishing returns in a world of self-improvement. interconnects.ai/p/lossy-self-i…
English
14
27
249
51.7K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
SePT is built on top of verl, making it easy to plug into existing post-training pipelines. The core loop is simple: low-temperature generation, then standard SFT. If you're already using verl, you can adapt SePT in minutes. 🚀 Code: github.com/ElementQi/SePT
English
0
1
1
94
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
ICML Reviewer kindly suggested we cite a paper posted on arXiv in March 2026 as related work. For reference, the submission deadline of ICML was January 28, 2026. A fair request, assuming authors are now expected to survey not just the literature, but also the future.
English
12
29
699
48.2K
Deep Insight Labs
Deep Insight Labs@DeepInsightLabs·
@RuoyuSun_UI Is the reviewer affiliated with the authors of the paper you were requested to cite? Suspicious move...
English
1
0
1
182
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@antoniolupetti Thank you for sharing our work, Antonio. We've just updated the paper with more experiments. While the arXiv update is processing, you can find the latest version on GitHub: github.com/ElementQi/SePT…; see discussions and links here: x.com/RuoyuSun_UI/st…
Ruoyu Sun@RuoyuSun_UI

[Quoted tweet: the pinned SePT announcement above]
English
1
0
4
860
Antonio Lupetti
Antonio Lupetti@antoniolupetti·
"LLM Reasoning: Surprising effectiveness of self-tuning without rewards". Instead of forcing behavior with reinforcement learning, this approach fine-tunes the model on its own generated data, reinforcing what it already learned during pretraining. So much high-quality research lately, it’s hard to keep up.😅 arxiv.org/abs/2510.18814
English
7
29
224
16.9K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
Our method, SePT, follows a simple iterative recipe: Sample questions q → Generate low-temperature responses o → Update the model via standard SFT → Repeat using the updated model. Note the "online refresh" ingredient: by interleaving generation and training, the self-generated data is "refreshed" by the latest model.
Ruoyu Sun tweet media
English
0
1
2
252
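The recipe above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' verl-based implementation: `generate` and `sft_update` are placeholder stubs standing in for real decoding and fine-tuning calls.

```python
import random

def generate(model, q, temperature):
    # Placeholder: a real call would decode a reasoning trace at low temperature.
    return f"trace({q}, T={temperature})"

def sft_update(model, pairs):
    # Placeholder: a real call would run standard SFT on the (q, o) pairs.
    # Here we just track how many update rounds / examples the model has seen.
    return {"step": model["step"] + 1, "seen": model["seen"] + len(pairs)}

def self_training_loop(model, questions, rounds=4, temperature=0.3, batch_size=2):
    for _ in range(rounds):
        batch = random.sample(questions, batch_size)               # 1) sample questions q
        traces = [generate(model, q, temperature) for q in batch]  # 2) low-temp responses o
        model = sft_update(model, list(zip(batch, traces)))        # 3) standard SFT, no reward
        # 4) repeat: the *updated* model generates the next batch ("online refresh")
    return model

m = self_training_loop({"step": 0, "seen": 0}, ["q1", "q2", "q3"], rounds=3)
print(m)  # → {'step': 3, 'seen': 6}
```

The key design choice mirrored here is the online refresh: generation in each round uses the model returned by the previous round's update, rather than a fixed base model producing all data up front.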
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@shyyhs We observed that output diversity dropped a bit, but it stayed fairly close to GRPO's.
English
1
0
2
142
Haiyue Song
Haiyue Song@shyyhs·
@RuoyuSun_UI Thank you for the great paper! 🤩 I've done similar self-distillation experiments and found that output diversity drops. Have you observed a similar phenomenon?
English
1
0
1
239
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
Yes — closely related and we're glad to see this direction getting more attention. Two differences in our work (SePT): 1) Domain: we focus on math reasoning; Apple's paper focuses on coding. 2) Online Refresh: We use "online" interleaving of generation and training. After each update on self-generated responses, the updated model is used to generate the next batch of responses. We also tried an offline variant, where the base model generates all training data at once; overall it helps, but not as much as the online version.
English
1
0
3
91
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@ghosh_satanu22 a bit more context: the reviewer is negatively judging our submission, with the main reason being "missing comparison to related works". In addition, this review is the only negative one out of four reviews
English
0
0
1
93
Satanu Ghosh
Satanu Ghosh@ghosh_satanu22·
@RuoyuSun_UI This is a common problem. It is okay to suggest a new paper, but to negatively judge a submission because of it is unacceptable.
English
1
0
10
1.4K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@nrehiew_ We’ve also been exploring a closely related idea on math reasoning tasks, using online updates. We just shared an updated version here, in case you're interested: x.com/RuoyuSun_UI/st…
Ruoyu Sun@RuoyuSun_UI

[Quoted tweet: the pinned SePT announcement above]
English
0
0
1
118
wh
wh@nrehiew_·
Notes on this nice and straightforward experimental paper from Apple on Self-Distillation
wh tweet media
English
2
11
134
10.8K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@BoWang87 Really interesting result! We’ve also been exploring a closely related reward-free self-training direction for math reasoning, and just shared our updated version here: x.com/RuoyuSun_UI/st…
Ruoyu Sun@RuoyuSun_UI

[Quoted tweet: the pinned SePT announcement above]
English
0
0
1
412
Bo Wang
Bo Wang@BoWang87·
Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD reshapes distributions in a context-dependent way, suppressing distractors at "locks" while keeping diversity alive at "forks". The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick; it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd
Bo Wang tweet media
English
54
201
1.7K
502.9K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@stevelaskaridis We told the reviewer that this work appeared after the ICML submission deadline, but we did not press the issue much further in the reply. We raised the issue in a letter to the AC, since in this case only the AC could possibly help.
English
0
0
3
882
Steve Laskaridis
Steve Laskaridis@stevelaskaridis·
@RuoyuSun_UI Out of curiosity, what did you respond? Sometimes diplomatic replies (which I absolutely understand) prevent the community from holding reviewers accountable.
English
1
0
1
1.3K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
This is the kind of situation I had imagined for a long time, but had never actually heard of in a real case. Hard to believe it really happens! I think conference organizers could prepare a detailed guideline, e.g. "don't ask authors to cite your own arXiv paper; don't ask them to cite future papers", and use automatic checking, somewhat like the service provided to ICML authors this year.
English
1
0
1
261
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@sheriyuo A bit more context: the reviewer is asking us not only to cite that paper, but also to add comparison experiments. Our paper has been available online about half a year longer. Yes, there are many irresponsible reviewers, and this is just another example of that.
English
0
0
9
946
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@ahatamiz1 Do you mean citing a paper that was still under review and not publicly available? If so, that sounds quite awkward.
English
1
0
0
856
Ali Hatamizadeh
Ali Hatamizadeh@ahatamiz1·
@RuoyuSun_UI At least you have a viable path to satisfy this demand. We were previously asked to cite our own submission.. like the paper that was under review.
English
2
0
10
2K
Ruoyu Sun
Ruoyu Sun@RuoyuSun_UI·
@NoSyu That's sad for the community. This should be added to the review guidelines. BTW: ICLR 2024 had a clear guideline for related work, but I did not find one for ICLR 2025. I believe similar guidelines should be set up for future conferences.
English
0
0
1
447
JinYeong Bak
JinYeong Bak@NoSyu·
@RuoyuSun_UI A similar issue occurred with our ICLR submission. In the meta-review, the AC said that our work lacked novelty due to a related arXiv paper (dated October 12, 2023). But, our ICLR submission was made earlier, on September 22, 2023. Result? rejection...
English
1
0
1
806