Penghui Qi

167 posts

@QPHutu

Senior Research Engineer @SeaAIL PhD student @NUSingapore Working on RL, LLM Reasoning, and MLSys.

Joined August 2022

185 Following · 1.1K Followers

Pinned Tweet

Penghui Qi@QPHutu·5 Feb

This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary.💔 Check out our paper for more details👇

542

45.3K

Penghui Qi@QPHutu·6d

Happy to see our Dr.GRPO and DPPO on the list 👇 It's a really nice blog post for a quick review of RL algorithms for LLM reasoning. Worth a read 👍

Alex Weers@a_weers

Finally finished! If you're interested in an overview of recent methods in reinforcement learning for reasoning LLMs, check out this blog post: aweers.de/blog/2026/rl-f… It summarizes ten methods, tries to highlight differences and trends, and has a collection of open problems

4.1K

Penghui Qi@QPHutu·11 Mar

Another great talent of ours from Sea AI Lab🫡 Congrats, Chao bro🎉

Chao Du@duchao0726

Understanding the real world is key to building advanced AI systems. Excited to join @amilabs at launch with a brilliant team to make it happen!

2.1K

Penghui Qi@QPHutu·10 Mar

Finally announced. Congrats 🎉

Min Lin@mavenlin

The most exciting breakthroughs in intelligence are yet to come. I’m super excited to start this journey with mes amis to make them happen together.

5.2K

Penghui Qi@QPHutu·10 Mar

I was reading the Megatron code again recently for an MoE project. It has improved a lot in both efficiency and code quality: many known optimization techniques are integrated while the code remains readable. Awesome work! 👍

Ethan He@EthanHe_42

My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there to seriously train mixture of experts at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. arxiv.org/abs/2603.07685

3.8K

Penghui Qi retweeted

Yi Wu@jxwuyi·4 Mar

AReaL v1.0 released: Effortless #RL to make your #OpenClaw self-evolve 🚀:
• 🛠️ One-click agentic RL for any existing agent
• 📈 Open-source SOTA on tau2-bench
• 💎 A new PyTorch-native 5D-Parallel Engine Archon
• 🤖 A full #opencode recipe
GitHub: github.com/inclusionAI/AR…

144

58.6K

Penghui Qi@QPHutu·3 Mar

Many thanks for this awesome summary. It’s really encouraging that you love it 😄

Alex Weers@a_weers

Today's paper was "Rethinking the Trust Region in LLM RL". It is a clean theoretical and empirical case for why PPO's (and GRPO's) ratio clipping is fundamentally mismatched to LLM vocabularies, and what to do about it. (Yesterday we saw that DAPO proposes to clip-higher, today we see another solution)

The core problem: PPO clips based on the probability ratio of the sampled token. But this is a noisy single-sample Monte Carlo estimate of the true policy divergence. For large vocabularies this breaks in both directions:
- A rare token going from 0.001 to 0.003 produces ratio 3.0, gets hard clipped, even though the actual divergence is negligible. This over-penalizes exploration tokens and slows learning.
- A dominant token dropping from 0.99 to 0.80 gives ratio 0.81, and stays inside the clip. But you just moved 0.19 probability mass, a potentially catastrophic shift that goes unpenalized.

The training/inference engine mismatch makes it worse: even with identical parameters, the probability ratio is highly volatile for low-prob tokens between frameworks, while total variation (TV) divergence stays stable.

Their fix (DPPO): replace ratio clipping with a mask conditioned on a direct estimate of policy divergence (TV or KL). Full divergence over the vocab is too expensive, but they show a binary approximation (just compare the sampled token's prob under both policies) or top-K approximation both work well empirically.

Three useful empirical takeaways from their stability analysis:
1. Trust region is essential even at tiny learning rates (1e-6). Without it, training/inference mismatch accumulates and training collapses.
2. Trust region must be anchored to the rollout policy, not a recomputed on-policy distribution. Decoupled objectives that use recomputed anchors fail.
3. Instability comes from a tiny fraction (less than 0.5%) of updates on negative samples that push the policy far outside the trust region. A minimal mask blocking just these is enough to stabilize training.

DPPO consistently outperforms GRPO-ClipHigher and CISPO baselines across five model configs (Qwen 3B, 8B, MoE, with/without R3, LoRA), often matching or beating R3-enhanced baselines without using rollout router replay at all.

It also has a long appendix that I still have to work through :)

Overall highly recommend it if you are interested in RL for LLMs and want to know what is going on in the trust regions. Great and important paper from @QPHutu, @NickZhou523786, @zzlccc, @TianyuPang1, @duchao0726, and @mavenlin
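The two numeric cases in this summary can be checked with a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names are made up here, `eps=0.2` is the commonly used PPO clip range, `delta=0.15` is taken from the TV range mentioned elsewhere in this thread, and the advantage sign (which real PPO clipping also depends on) is ignored.

```python
def ratio_clipped(p_old, p_new, eps=0.2):
    """True if the sampled token's probability ratio leaves [1 - eps, 1 + eps]."""
    ratio = p_new / p_old
    return ratio < 1.0 - eps or ratio > 1.0 + eps


def binary_tv(p_old, p_new):
    """TV distance after collapsing the vocabulary to {sampled token, everything else}."""
    return abs(p_new - p_old)


def tv_masked(p_old, p_new, delta=0.15):
    """True if the binary TV estimate exceeds the threshold delta."""
    return binary_tv(p_old, p_new) > delta


# The two cases from the summary: (prob. under rollout policy, prob. under current policy).
cases = {"rare token": (0.001, 0.003), "dominant token": (0.99, 0.80)}
for name, (p_old, p_new) in cases.items():
    print(
        f"{name}: ratio={p_new / p_old:.2f}, "
        f"ratio_clipped={ratio_clipped(p_old, p_new)}, "
        f"binary_tv={binary_tv(p_old, p_new):.3f}, "
        f"tv_masked={tv_masked(p_old, p_new)}"
    )
```

Under these assumed thresholds the two rules disagree on both cases: the rare token is clipped by the ratio rule but passes the TV rule, while the dominant token passes the ratio rule but is masked by the TV rule.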

2.1K

Penghui Qi@QPHutu·27 Feb

@ericssunLeon @tongyx361 Hi Leon, many thanks for your great efforts. 🙏 I just reviewed the PR and left some comments (mainly on typos and default hyperparameters). Overall it looks nice!👍

Leon@ericssunLeon·27 Feb

@QPHutu @tongyx361 this is the PR in TRL if someone from your team is interested in having a look github.com/huggingface/tr…

Penghui Qi@QPHutu·26 Feb

DPPO is now supported in verl! 🎇 Big thanks to @tongyx361 for the quick review🙏 Check out this example for a quick start👇 github.com/verl-project/v…

5.5K

Penghui Qi@QPHutu·26 Feb

The code change is super simple. 😀 Refer to this PR in case you want to implement it yourself instead of using the latest verl 👇 github.com/verl-project/v…

218

Penghui Qi@QPHutu·19 Feb

Adding a perturbation to make the RL training more robust. Love this interesting idea!

Chenlu Ye@ye_chenlu

1/5 Happy CNY🎊 Still bothered by RL off-policy instability in LLM? Introducing a new way💡Adaptive Layerwise Perturbation (ALP)💡, a simple but robust fix that outperforms GRPO/MIS/Bypass, achieves better stability (KL, entropy) and exploration! 🔗 Blog: beneficial-curiosity-d98.notion.site/Adaptive-Layer…

865

Penghui Qi@QPHutu·15 Feb

@ericssunLeon We use the same importance sampling (pi_theta / mu_theta'), only changing the mask. The mask is based on (pi_theta / pi_theta'). It is similar to miniRL, so the training-inference mismatch is also corrected. code: github.com/sail-sg/Stable… (#L1268)
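The split described in this reply (importance weight against the rollout policy, mask against the recomputed policy) might be sketched numerically as below. Everything here is an assumption for illustration: the function name, the per-token averaging, and the use of a binary-TV test for the mask are not taken from the linked code, and the real criterion may also depend on the advantage sign.

```python
def masked_is_objective(p_cur, p_rollout, p_recomputed, advantages, delta=0.15):
    """Average importance-weighted advantage over tokens that pass the mask.

    p_cur        : pi_theta(a)    probability under the policy being updated
    p_rollout    : mu_theta'(a)   probability reported by the inference engine
    p_recomputed : pi_theta'(a)   probability recomputed by the training engine
    """
    total, kept = 0.0, 0
    for pc, pr, pp, adv in zip(p_cur, p_rollout, p_recomputed, advantages):
        weight = pc / pr        # importance weight pi_theta / mu_theta'
        drift = abs(pc - pp)    # stand-in divergence between pi_theta and pi_theta'
        if drift <= delta:      # token stays inside the (assumed) trust region
            total += weight * adv
            kept += 1
    return total / len(p_cur), kept


objective, kept = masked_is_objective(
    p_cur=[0.5, 0.9],
    p_rollout=[0.4, 0.95],
    p_recomputed=[0.45, 0.6],
    advantages=[1.0, -1.0],
)
print(objective, kept)
```

In this toy input the second token has drifted by 0.3 from the recomputed policy, so it is masked out and only the first token contributes, weighted by its ratio to the rollout probability.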

Leon@ericssunLeon·15 Feb

@QPHutu quick question about the DPPO-KL-Recompute experiments: are you correcting for the training-inference distribution mismatch via importance sampling (pi_theta' / mu_theta')? couldn't spot this in the source code

117

Penghui Qi@QPHutu·13 Feb

Interesting and impressive!

Yi Tay@YiTayML

Introducing Aletheia, a math research agent powered by an advanced version of Gemini Deep Think that produces publishable math research (two papers, one completely automatic and another with human-AI collaboration) and solved multiple open Erdős problems. 😀🔥 Paper link below! 👇

1.4K

Penghui Qi@QPHutu·10 Feb

Sad to lose a fantastic daily collaborator at Sea AI Lab, but happy for Zichen and his new journey. Thank you for everything, and I wish you all the best on your next adventure! 🚀✨

Zichen Liu@zzlccc

Thrilled to share that I’ve joined @GoogleDeepMind to work on Gemini post-training! I feel incredibly fortunate to be cooking on this sunny island under @YiTayML's leadership, within @quocleix's broader organization. Looking forward to enjoying RL research and pushing the frontiers of Gemini alongside such a brilliant team!

2.6K

Penghui Qi@QPHutu·9 Feb

Thanks for this quick evaluation 🫡 DPPO is really a cool alternative to GRPO and its variants. Looking forward to more evaluations from the community 🚀

Fanqing Meng@FanqingMengAI

I have evaluated this, and it's really good! (The dark blue one is DPPO.)

2.5K

Penghui Qi@QPHutu·9 Feb

@OptionsGod_lgd Actually, the ratio is much more sensitive than the absolute difference. That's why clip-higher was proposed. For the range of delta, we use 0.05 for all KL experiments. For TV, 0.15~0.20 generally works well.
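A small sketch of why the ratio is the more sensitive quantity: the same absolute shift in probability gives very different ratios depending on where the token starts. The helper names, the KL direction (rollout policy first), and the two-outcome "binary" form of the KL are assumptions for illustration; only the threshold values 0.05 (KL) and 0.15 (TV) come from the reply above.

```python
import math

DELTA_KL = 0.05  # threshold quoted for the KL experiments
DELTA_TV = 0.15  # lower end of the quoted TV range


def ratio_for_shift(p_old, shift):
    """Probability ratio produced by adding a fixed absolute shift to p_old."""
    return (p_old + shift) / p_old


def binary_kl(p_old, p_new):
    """KL between two-outcome distributions {sampled token, everything else}."""
    return p_old * math.log(p_old / p_new) + (1.0 - p_old) * math.log(
        (1.0 - p_old) / (1.0 - p_new)
    )


# The same +0.01 shift looks tiny or huge in ratio terms depending on p_old.
for p_old in (0.001, 0.01, 0.1, 0.5):
    p_new = p_old + 0.01
    print(
        f"p_old={p_old}: ratio={ratio_for_shift(p_old, 0.01):.2f}, "
        f"abs_diff={abs(p_new - p_old):.3f}, binary_kl={binary_kl(p_old, p_new):.5f}"
    )
```

In this toy sweep the absolute difference is 0.01 in every row and stays far below both thresholds, while the ratio ranges from about 1.02 to about 11.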

202

Brace(Hanyang) Zhao@OptionsGod_lgd·5 Feb

@QPHutu I guess it is much easier to choose a proper clip ratio to start with, like the 0.2 used in common practice, than this absolute-difference delta? Any idea of the range of delta, and is training sensitive to delta?

582

Penghui Qi@QPHutu·9 Feb

@danisht273 It is an easier and tighter bound than KL. Empirically they perform similarly.

dánish@danisht273·5 Feb

@QPHutu Congrats! This might be a dumb question but why TV divergence specifically over KL or others in DPPO? Is it just easier bounds or has it shown better empirical behavior?

520

Penghui Qi@QPHutu·9 Feb

@Massimo26472949 Thanks for your insightful comments. I think I totally agree with your points.

114

Massimiliano Brighindi@Massimo26472949·7 Feb

This is a solid paper, but the core issue is still being framed one level too low.

What DPPO fixes is not “PPO’s trust region.” It fixes a *proxy mismatch*.

PPO constrains a ratio:
pi(y|s) / mu(y|s)

But the actual object we care about is *distributional movement* in policy space. The ratio is only a heuristic surrogate, and a bad one in high-dimensional, heavy-tailed token spaces:
- it over-penalizes rare tokens,
- under-penalizes dominant modes,
- and becomes unstable exactly where LLMs concentrate probability mass.

DPPO’s contribution is recognizing that:
- trust regions should be defined in divergence space (TV, not ratios),
- masking should follow *actual policy displacement*, not local likelihood spikes.

However, the deeper implication is this: LLM RL instability is not primarily an optimization problem. It is a *geometry-of-representation* problem.

Once logits already live in a low-dimensional, anisotropic subspace:
- ratio clipping fights symptoms,
- divergence-aware masking aligns with the true constraint surface.

So DPPO works better because it respects the *pre-existing structure of the policy manifold*, not because it found a better hyperparameterization.

In that sense, this paper is consistent with:
- logit-linear behavior,
- token preference at initialization,
- cross-model subspace overlap.

The trust region was never wrong. We were just measuring it in the wrong coordinates.

193

Penghui Qi@QPHutu·9 Feb

@Frank37004246 A fun fact: we are the authors of Dr.GRPO. So, guess what we are using. We never claimed Dr.GRPO was a new algorithm; it is just a "done right" variant. Similarly, DAPO is just a recipe. We actually use Dr.GRPO with clip-higher, so please check the details before criticizing.

116

FrankStain@Frank37004246·8 Feb

@QPHutu The problem with most of these papers is that they keep comparing against the original GRPO rather than the improved versions that followed, which are still very simple, like DAPO, Dr GRPO. I don't find that an honest assessment of their approach

Penghui Qi@QPHutu·5 Feb

@YouJiacheng Thanks for your insightful question. You are absolutely right, it would be perfect if it were an upper bound. Here we mainly want to highlight their relationship; the word “principle” may not be suitable. For the TopK TV approximation, it can be upper bounded via the tail probability.

382

You Jiacheng@YouJiacheng·5 Feb

I'm confused. The proposed masking criterion is a *lower bound* of the true TV divergence. How can constraining a *lower bound* of the true TV ensure that updates stay within a theoretically grounded trust region??? Shouldn't it be an *upper bound* to achieve this?
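The bound relationships raised in this exchange can be checked on a toy vocabulary. This is a hedged sketch: the distributions are invented, and the top-K construction used here (keep the K most likely tokens under the first distribution and merge the rest into one tail bucket) is an assumption that may differ from the paper's. Under that construction the estimate is a lower bound on the true TV, and replacing the tail term with the summed tail masses gives an upper bound.

```python
def total_variation(p, q):
    """Exact TV distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def binary_tv(p, q, token):
    """Lower bound on TV: compare only the sampled token's probability."""
    return abs(p[token] - q[token])


def topk_tv_bounds(p, q, k):
    """Return (lower, upper) bounds on TV using the k most likely tokens under p."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    head = sum(abs(p[i] - q[i]) for i in top)
    p_tail = max(0.0, 1.0 - sum(p[i] for i in top))
    q_tail = max(0.0, 1.0 - sum(q[i] for i in top))
    lower = 0.5 * (head + abs(p_tail - q_tail))  # tail merged into one bucket
    upper = 0.5 * (head + p_tail + q_tail)       # worst case: tails do not overlap
    return lower, upper


p = [0.70, 0.15, 0.08, 0.04, 0.03]  # hypothetical rollout distribution
q = [0.50, 0.30, 0.05, 0.10, 0.05]  # hypothetical updated distribution
lower, upper = topk_tv_bounds(p, q, k=2)
print(binary_tv(p, q, 0), lower, total_variation(p, q), upper)
```

The gap between the two top-K bounds equals the smaller of the two tail masses, so it shrinks as K grows and the tail gets lighter, which is one reading of the "tail probability" remark in the reply.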

6.3K

Penghui Qi@QPHutu·5 Feb

Exactly! 🫡 From this figure, we should use a "linear" adaptive threshold for ratio clipping, which is exactly our DPPO-Binary-TV variant!

Zichen Liu@zzlccc

Enjoyed working on this! Viewing from another angle: PPO ratio clipping uses a fixed threshold for all tokens, which creates a bias toward clipping low-prob tokens more often. Instead, we can use an 𝐚𝐝𝐚𝐩𝐭𝐢𝐯𝐞 𝐜𝐥𝐢𝐩𝐩𝐢𝐧𝐠 𝐭𝐡𝐫𝐞𝐬𝐡𝐨𝐥𝐝—which amounts to DPPO!
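One way to read the "adaptive threshold" remark, as a sketch rather than the paper's own derivation: write μ(a) for the rollout probability of the sampled token a, π(a) for its probability under the current policy, r for the PPO ratio, and δ for the threshold on the absolute probability change. These symbols are assumed here and do not come from the thread. Dividing the binary-TV constraint by μ(a) > 0 turns it into a ratio-clipping rule whose width depends on the token:

```latex
\[
|\pi(a) - \mu(a)| \le \delta
\;\Longleftrightarrow\;
|r - 1| \le \frac{\delta}{\mu(a)}
\;\Longleftrightarrow\;
1 - \frac{\delta}{\mu(a)} \;\le\; r \;\le\; 1 + \frac{\delta}{\mu(a)},
\qquad r = \frac{\pi(a)}{\mu(a)}.
\]
```

With δ = 0.15, for example, a token with μ(a) = 0.001 could reach a ratio of 151 before being masked, while a token with μ(a) = 0.99 is held to ratios above roughly 0.85. A fixed ratio threshold applies the same band to both, which is why it clips low-probability tokens more often.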

1.7K
