Penghui Qi

167 posts

Penghui Qi
@QPHutu

Senior Research Engineer @SeaAIL, PhD student @NUSingapore. Working on RL, LLM Reasoning, and MLSys.

Joined August 2022
185 Following · 1.1K Followers

Pinned Tweet
Penghui Qi
Penghui Qi@QPHutu·
This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary. 💔 Check out our paper for more details 👇
Penghui Qi
Penghui Qi@QPHutu·
I was reading the Megatron code again recently for a MoE project. It has improved a lot in both efficiency and code quality: so many known optimization techniques are integrated, while the code remains readable. Awesome work! 👍
Ethan He@EthanHe_42

My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there to seriously train mixture of experts at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. arxiv.org/abs/2603.07685

Penghui Qi retweeted
Yi Wu
Yi Wu@jxwuyi·
AReaL v1.0 released: Effortless #RL to make your #OpenClaw self-evolve 🚀:
•🛠️ One-click agentic RL for any existing agent
•📈 Open-source SOTA on tau2-bench
•💎 A new PyTorch-native 5D-Parallel Engine Archon
•🤖 A full #opencode recipe
GitHub: github.com/inclusionAI/AR…
Penghui Qi
Penghui Qi@QPHutu·
Many thanks for this awesome summary. It’s really encouraging that you love it 😄
Alex Weers@a_weers

Today's paper was "Rethinking the Trust Region in LLM RL". It is a clean theoretical and empirical case for why PPO's (and GRPO's) ratio clipping is fundamentally mismatched to LLM vocabularies, and what to do about it. (Yesterday we saw that DAPO proposes clip-higher; today we see another solution.)

The core problem: PPO clips based on the probability ratio of the sampled token. But this is a noisy single-sample Monte Carlo estimate of the true policy divergence. For large vocabularies this breaks in both directions:
- A rare token going from 0.001 to 0.003 produces ratio 3.0 and gets hard-clipped, even though the actual divergence is negligible. This over-penalizes exploration tokens and slows learning.
- A dominant token dropping from 0.99 to 0.80 gives ratio 0.81 and stays inside the clip. But you just moved 0.19 probability mass, a potentially catastrophic shift that goes unpenalized.

The training/inference engine mismatch makes it worse: even with identical parameters, the probability ratio is highly volatile for low-probability tokens between frameworks, while total variation (TV) divergence stays stable.

Their fix (DPPO): replace ratio clipping with a mask conditioned on a direct estimate of policy divergence (TV or KL). Full divergence over the vocabulary is too expensive, but they show that a binary approximation (just compare the sampled token's probability under both policies) or a top-K approximation both work well empirically.

Three useful empirical takeaways from their stability analysis:
1. A trust region is essential even at tiny learning rates (1e-6). Without it, training/inference mismatch accumulates and training collapses.
2. The trust region must be anchored to the rollout policy, not a recomputed on-policy distribution. Decoupled objectives that use recomputed anchors fail.
3. Instability comes from a tiny fraction (less than 0.5%) of updates on negative samples that push the policy far outside the trust region. A minimal mask blocking just these is enough to stabilize training.

DPPO consistently outperforms GRPO-ClipHigher and CISPO baselines across five model configs (Qwen 3B, 8B, MoE, with/without R3, LoRA), often matching or beating R3-enhanced baselines without using rollout router replay at all. It also has a long appendix that I still have to work through :)

Overall, highly recommended if you are interested in RL for LLMs and want to know what is going on in the trust regions. Great and important paper from @QPHutu, @NickZhou523786, @zzlccc, @TianyuPang1, @duchao0726, and @mavenlin

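The two failure modes and the binary TV estimate described in the summary above can be checked with a few lines. This is a minimal numeric sketch; `ppo_clip_active`, `binary_tv`, and `eps` are illustrative names, not the paper's implementation.

```python
def ppo_clip_active(p_new, p_old, eps=0.2):
    """PPO-style check: is the sampled-token ratio outside [1 - eps, 1 + eps]?"""
    ratio = p_new / p_old
    return ratio < 1 - eps or ratio > 1 + eps

def binary_tv(p_new, p_old):
    """Binary TV approximation: compare only the sampled token's probability
    under both policies (a lower bound on the true TV divergence)."""
    return abs(p_new - p_old)

# Rare token 0.001 -> 0.003: ratio 3.0 is hard-clipped by PPO,
# yet the divergence estimate is a negligible 0.002.
assert ppo_clip_active(0.003, 0.001)
assert binary_tv(0.003, 0.001) < 0.01

# Dominant token 0.99 -> 0.80: ratio ~0.81 sits inside the PPO clip range,
# yet 0.19 of probability mass just moved.
assert not ppo_clip_active(0.80, 0.99)
assert binary_tv(0.80, 0.99) > 0.15
```

The same numbers the summary quotes: ratio clipping fires on the harmless update and stays silent on the catastrophic one, while the absolute-difference estimate ranks them correctly.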
Penghui Qi
Penghui Qi@QPHutu·
@ericssunLeon @tongyx361 Hi Leon, many thanks for your great efforts. 🙏 I just reviewed the PR and left some comments (mainly on typos and default hyperparameters). Overall it looks nice! 👍
Penghui Qi
Penghui Qi@QPHutu·
The code change is super simple. 😀 Refer to this PR in case you want to implement it yourself instead of using the latest verl 👇 github.com/verl-project/v…
Penghui Qi
Penghui Qi@QPHutu·
@ericssunLeon We use the same importance sampling (pi_theta / mu_theta'), only changing the mask. The mask is based on (pi_theta / pi_theta'). It is similar to miniRL, so the training-inference mismatch is also corrected. code: github.com/sail-sg/Stable… (#L1268)
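A minimal sketch of the objective this reply describes, assuming per-token sampled probabilities from the three distributions are available. The function name `dppo_masked_objective` and the binary-TV mask with threshold `delta` (0.15, from the TV range mentioned in the thread) are illustrative, not the linked implementation.

```python
def dppo_masked_objective(p_theta, p_theta_old, p_mu, advantages, delta=0.15):
    """Sketch of a masked, importance-weighted objective (per sampled token).

    p_theta:     probs under the current training policy (pi_theta)
    p_theta_old: probs under the training policy at rollout time (pi_theta')
    p_mu:        probs under the inference/rollout engine (mu_theta')
    """
    total = 0.0
    for pt, po, pm, adv in zip(p_theta, p_theta_old, p_mu, advantages):
        # Importance weight against the inference engine corrects the
        # training/inference mismatch: pi_theta / mu_theta'.
        is_weight = pt / pm
        # The trust-region mask is anchored to the training policy at
        # rollout time, here via the binary TV estimate |pi - pi'|.
        mask = 1.0 if abs(pt - po) <= delta else 0.0
        total += mask * is_weight * adv
    # Negated so that minimizing the loss maximizes the masked objective.
    return -total / len(p_theta)
```

With the numbers from the summary upthread, the rare-token update (0.001 → 0.003) passes the mask while the dominant-token shift (0.99 → 0.80) is blocked.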
Leon
Leon@ericssunLeon·
@QPHutu quick question about the DPPO-KL-Recompute experiments: are you correcting for the training-inference distribution mismatch via importance sampling (pi_theta' / mu_theta')? couldn't spot this in the source code
Penghui Qi
Penghui Qi@QPHutu·
Sad to lose a fantastic daily collaborator at Sea AI Lab, but happy for Zichen as he starts his new journey. Thank you for everything, and wish you all the best on your next adventure! 🚀✨
Zichen Liu@zzlccc

Thrilled to share that I’ve joined @GoogleDeepMind to work on Gemini post-training! I feel incredibly fortunate to be cooking on this sunny island under @YiTayML's leadership, within @quocleix's broader organization. Looking forward to enjoying RL research and pushing the frontiers of Gemini alongside such a brilliant team!

Penghui Qi
Penghui Qi@QPHutu·
@OptionsGod_lgd Actually the ratio is much more sensitive than the absolute difference. That's why clip-higher was proposed. For the range of delta, we use 0.05 for all KL experiments. For TV, 0.15~0.20 generally works well.
Brace(Hanyang) Zhao
Brace(Hanyang) Zhao@OptionsGod_lgd·
@QPHutu I guess it is much easier to choose a proper clip ratio to start with, like 0.2 in common practice, than this absolute difference delta? Any idea of the range of delta, and is delta sensitive?
Penghui Qi
Penghui Qi@QPHutu·
@danisht273 It is an easier and tighter bound than KL. Empirically they perform similarly.
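The "easier and tighter bound" relationship can be made precise with Pinsker's inequality, a standard fact (not notation from the paper): TV is directly controlled by KL, bounded in [0, 1], and estimable from the sampled token's probabilities alone.

```latex
\mathrm{TV}(\pi,\pi')
\;=\; \frac{1}{2}\sum_{y\in\mathcal{V}}\bigl|\pi(y\mid s)-\pi'(y\mid s)\bigr|
\;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}\bigl(\pi\,\|\,\pi'\bigr)}
```

So any KL trust region implies a TV trust region, while TV stays finite even where KL blows up on near-zero probabilities.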
dánish
dánish@danisht273·
@QPHutu Congrats! This might be a dumb question but why TV divergence specifically over KL or others in DPPO? Is it just easier bounds or has it shown better empirical behavior?
Penghui Qi
Penghui Qi@QPHutu·
@Massimo26472949 Thanks for your insightful comments. I think I totally agree with your points.
Massimiliano Brighindi
Massimiliano Brighindi@Massimo26472949·
This is a solid paper, but the core issue is still being framed one level too low. What DPPO fixes is not "PPO's trust region." It fixes a *proxy mismatch*.

PPO constrains a ratio: pi(y|s) / mu(y|s). But the actual object we care about is *distributional movement* in policy space. The ratio is only a heuristic surrogate, and a bad one in high-dimensional, heavy-tailed token spaces:
- it over-penalizes rare tokens,
- under-penalizes dominant modes,
- and becomes unstable exactly where LLMs concentrate probability mass.

DPPO's contribution is recognizing that:
- trust regions should be defined in divergence space (TV, not ratios),
- masking should follow *actual policy displacement*, not local likelihood spikes.

However, the deeper implication is this: LLM RL instability is not primarily an optimization problem. It is a *geometry-of-representation* problem. Once logits already live in a low-dimensional, anisotropic subspace:
- ratio clipping fights symptoms,
- divergence-aware masking aligns with the true constraint surface.

So DPPO works better because it respects the *pre-existing structure of the policy manifold*, not because it found a better hyperparameterization. In that sense, this paper is consistent with:
- logit-linear behavior,
- token preference at initialization,
- cross-model subspace overlap.

The trust region was never wrong. We were just measuring it in the wrong coordinates.
Penghui Qi
Penghui Qi@QPHutu·
@Frank37004246 A fun fact: we are the authors of Dr. GRPO. So, guess what we are using. We never claimed Dr. GRPO as a new algorithm; it is just a "done right" variant. Similarly, DAPO is just a recipe. Actually, we use Dr. GRPO with clip-higher, so please check the details before criticizing.
FrankStain
FrankStain@Frank37004246·
@QPHutu The problem with most of these papers is that they keep comparing against the original GRPO rather than the improved versions that followed, which are still very simple, like DAPO, Dr GRPO. I don't find that an honest assessment of their approach
Penghui Qi
Penghui Qi@QPHutu·
@YouJiacheng Thanks for your insightful question. You are absolutely right: it would be perfect if it were an upper bound. Here we mainly want to highlight their relationship; the word "principled" may not be suitable. For the top-K TV approximation, it can be upper-bounded by the tail probability.
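For reference, the reason the binary estimate is a lower bound (the point acknowledged in this reply) is standard algebra, not the paper's notation: merging the vocabulary into the two events {y, not y} can only decrease total variation, and since probabilities sum to one, the merged TV collapses to a single absolute difference.

```latex
\mathrm{TV}(\pi,\pi')
= \frac{1}{2}\sum_{v\in\mathcal{V}}\bigl|\pi(v)-\pi'(v)\bigr|
\;\ge\; \frac{1}{2}\Bigl(\bigl|\pi(y)-\pi'(y)\bigr|
      + \Bigl|\sum_{v\neq y}\bigl(\pi(v)-\pi'(v)\bigr)\Bigr|\Bigr)
= \bigl|\pi(y)-\pi'(y)\bigr|,
```

using \(\sum_{v\neq y}(\pi(v)-\pi'(v)) = -(\pi(y)-\pi'(y))\).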
You Jiacheng
You Jiacheng@YouJiacheng·
I'm confused. The proposed masking criterion is a *lower bound* of the true TV divergence. How can constraining a *lower bound* of the true TV ensure updates stay within a theoretically grounded trust region??? Shouldn't it be an *upper bound* to achieve this?
Penghui Qi@QPHutu

This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in LLM regime due to the large, long-tailed vocabulary.💔 Checkout our paper for more details👇
