Anthony GX-Chen

96 posts

@AntChen_

PhD student @CILVRatNYU. Student Researcher @GoogleDeepMind || Prev: @Meta, @Mila_Quebec, @mcgillu || RL, ML, Neuroscience.

Manhattan, NY · Joined August 2017
315 Following · 483 Followers
Anthony GX-Chen retweeted
Ayush Jhaveri
Ayush Jhaveri@arhjhaveri·
Your AI Agent just formed a hypothesis. 💭 How does it validate it? Not by trying to prove itself wrong. Rather, it selectively seeks evidence that confirms what it already believes, often ending up with the wrong answer! Confirmation bias isn’t just human. We measure it in LLMs, and we show how to fix it! 🧵
Mark Goldstein
Mark Goldstein@marikgoldstein·
Replica trick, Gaussian Integral trick, Hutchinson's trick, Reparameterisation tricks, Log derivative trick, ...
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@ahatamiz1 @jeanfrancois287 Nice write-up :) You may enjoy our paper, which discusses many of the same problems and makes use of the odds ratio to fix them. Re: entropy reg, perhaps it's easier to think of it also as KL reg, but to a uniform prior?
Anthony GX-Chen@AntChen_

RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]

Ali Hatamizadeh
Ali Hatamizadeh@ahatamiz1·
🧵 In RLVR, KL is often solving the wrong problem. In RLHF, KL regularization makes sense. In RLVR, the thing you usually want is exploration.

1/ In RLHF, a KL penalty makes sense. You optimize something like:

J(θ) = E[R_φ(y|x)] - β KL(π_θ || π_ref)

where R_φ is a learned reward model. That reward is a proxy. If you optimize too hard against a proxy, you get reward hacking. The KL term keeps the policy near the reference model.

2/ RLVR is different. In RLVR, the reward is often verifiable:
* math answer is right or wrong
* code passes tests or fails
* a proof checker accepts or rejects

So the main problem is usually not "the model is exploiting a learned reward model." The main problem is often much simpler: the policy collapses onto a narrow set of successful templates and stops exploring. That is an entropy problem, not automatically a "stay close to the reference model" problem.

3/ The KL penalty builds in a bias toward whatever the reference model already liked. For the KL-regularized objective, the optimal policy has the form:

π*(y|x) = (1 / Z(x)) π_ref(y|x) exp(R(y|x) / β), with Z(x) = Σ_y π_ref(y|x) exp(R(y|x) / β)

That means reward is not the only thing that matters; reference probability matters too. If a reasoning strategy gets high reward but the reference model assigns it tiny probability, the KL-regularized optimum still pushes against it.

4/ The bias shows up directly in the odds ratio. For two responses, novel and familiar:

log [ π*(y_novel|x) / π*(y_familiar|x) ] = (R_novel - R_familiar) / β + log [ π_ref(y_novel|x) / π_ref(y_familiar|x) ]

That second term is a fixed prior bias toward the familiar strategy. So the novel strategy only wins if its reward advantage overcomes the reference model's preference for the familiar one. This does not make novelty literally impossible, but it does make it systematically harder, sometimes much harder.

5/ Another common confusion: KL regularization and entropy regularization are not the same thing.

KL(π || π_ref) = -H(π) - E_π[log π_ref]

So KL mixes together two very different forces:
* keep policy entropy high
* keep the policy similar to the reference

The first one helps avoid collapse. The second one anchors the policy to old behavior. If collapse is the real failure mode, then directly rewarding entropy is a cleaner fix.

6/ With an entropy bonus, the objective becomes:

J(θ) = E_π[R(y|x)] + α H(π)

The corresponding optimum is:

π*(y|x) ∝ exp(R(y|x) / α)

Now there is no reference anchor. The policy can spread mass across any high-reward region, including strategies the base model did not strongly prefer. That is exactly what you want if you are trying to discover better reasoning, not preserve old reasoning.

7/ Important things to be aware of: RLVR rewards are often more grounded than RLHF rewards, but they are not always perfect. Tests can be incomplete. Graders can be brittle. Format rewards can distort behavior. So this is not "KL is always wrong." It is that in many RLVR settings, KL is solving the wrong primary problem.

8/ The practical question is: what are you actually trying to prevent? If the concern is proxy exploitation of a learned reward model, KL to a reference can make sense. If the concern is entropy collapse in verifiable-reward training, an entropy bonus is more directly targeted. Those are different regimes; they should not default to the same regularizer.

9/ Takeaway: in RLVR, drop the KL, add an entropy bonus, clip your gradients, and let the model think.
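The two closed-form optima in the thread are easy to check numerically. A minimal sketch with made-up toy rewards and reference probabilities (all values illustrative, not from any paper):

```python
import numpy as np

# Toy setup: four candidate responses. Two solve the task (reward 1),
# but the reference model strongly prefers only one of them.
rewards = np.array([1.0, 1.0, 0.0, 0.0])      # verifiable 0/1 rewards
pi_ref  = np.array([0.70, 0.05, 0.15, 0.10])  # reference model probabilities

beta = 1.0   # KL coefficient
alpha = 1.0  # entropy coefficient

# KL-regularized optimum: pi*(y) ∝ pi_ref(y) * exp(R(y) / beta)
kl_opt = pi_ref * np.exp(rewards / beta)
kl_opt /= kl_opt.sum()

# Entropy-regularized optimum: pi*(y) ∝ exp(R(y) / alpha), no reference anchor
ent_opt = np.exp(rewards / alpha)
ent_opt /= ent_opt.sum()

print("KL-reg optimum:     ", kl_opt.round(3))   # the novel solver (index 1) stays rare
print("entropy-reg optimum:", ent_opt.round(3))  # both solvers get equal mass
```

Even with equal reward, the KL-regularized optimum keeps most of its mass on the reference-preferred solver, while the entropy-regularized optimum splits mass equally across both solvers.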
Anthony GX-Chen retweeted
Jocelyn Shen
Jocelyn Shen@jocelynjshen·
Excited to share our #CHI2026 paper “Texterial: A Text-as-Material Interaction Paradigm for LLM-Mediated Writing” (done during internship at Microsoft Research) We imagine interacting with LLMs by treating text as a material like plants/clay. 📃arxiv.org/pdf/2603.00452 🧵[1/n]
Anthony GX-Chen retweeted
Ankit Anand
Ankit Anand@ankit_s_anand·
Hi everyone, we are hiring PhD student researchers in the field of "Search/RL with LLMs", ideally for discovery. Please respond in this form if you are interested. Please don't reach out directly as I may not be able to reply individually forms.gle/CxY4VzQRdJacLX…
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@ye_chenlu Really neat! IIUC you add Gaussian noise to the per-layer activations. How do you update the noise scale parameter (re: sigma is learnable)?
Chenlu Ye
Chenlu Ye@ye_chenlu·
3/5 🤯 ALP in one line: ONE unified ratio: perturbed policy / inference policy Key: add learnable perturbations to each transformer layer (train-time only; inference policy unchanged). Theory: smooth sharp objective + tighten tail mismatch → stability + better exploration.
Chenlu Ye
Chenlu Ye@ye_chenlu·
1/5 Happy CNY🎊 Still bothered by RL off-policy instability in LLM? Introducing a new way💡Adaptive Layerwise Perturbation (ALP)💡, a simple but robust fix that outperforms GRPO/MIS/Bypass, achieves better stability (KL, entropy) and exploration! 🔗 Blog: beneficial-curiosity-d98.notion.site/Adaptive-Layer…
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@kimbochen Intuition: small differences in reward + π_ref drive big differences in probability between samples (in the target dist). Referencing another sample lets us cancel out these differences to give equal probs. We pick a ref sample with high reward and high π_ref (offline or within batch)
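A rough numerical sketch of why referencing another sample helps (hypothetical log-probs and rewards, not the paper's exact estimator): the unnormalized target weight π_ref(y)·exp(R(y)/β) spans many orders of magnitude across samples, but measuring every weight relative to one well-chosen reference sample cancels the shared scale.

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi_ref = rng.normal(-5.0, 2.0, size=8)  # hypothetical reference log-probs
rewards    = rng.uniform(0.0, 1.0, size=8)  # hypothetical per-sample rewards
beta = 0.01                                 # low KL regularization

# Log of the unnormalized target weight: log pi_ref(y) + R(y)/beta
log_w = log_pi_ref + rewards / beta

# Naively exponentiating is hopeless: the spread is enormous at low beta
print("log-weight spread:", log_w.max() - log_w.min())

# Reference trick: pick one sample with high reward AND high pi_ref,
# then compute every other sample's weight relative to it
ref = np.argmax(log_w)
ratios = np.exp(log_w - log_w[ref])  # all ratios <= 1 and finite
print("relative weights:", ratios.round(3))
```

The absolute weights would over/underflow, but the ratios to the anchor sample are well-scaled and directly usable.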
Kimbo
Kimbo@kimbochen·
@AntChen_ What’s the intuition behind referencing a different sample, and how do you pick what sample to use? Great work, thanks for sharing
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@siddarthv66 @Teknium Interesting! Curious to hear more about your intuition in the non-asymptotic regime. IIUC, are you saying token entropy tends to stay high for longer on tasks the base model has not seen in pre-training? (I didn't quite understand "data-contaminated")
Siddarth Venkatraman
Siddarth Venkatraman@siddarthv66·
> The policy can (will) collapse to be fully deterministic asymptotically

In theory, yes. In practice, absolutely not. It has been observed that entropy collapse isn't really guaranteed at all, especially with large sequence lengths. Entropy collapse tends to happen in tasks that are close to data-contaminated with the base model. In fact, adding KL reg can actually destabilize training due to the added variance. Here's an example run on the Sokoban reasoning-gym env with Qwen-4B and 16k context: entropy behaves quite erratically during training while reward also improves. Sequence length remains fairly static throughout.
Anthony GX-Chen retweeted
Jeff Guo
Jeff Guo@JeffGuo__·
Check out work led by @AntChen_ that introduces MARA to mitigate mode collapse in KL/entropy-regularized RL! Focusing on molecular design, there is generally a trade-off between generating high reward samples and sample diversity. (1/2)
Anthony GX-Chen@AntChen_

RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]

Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@ParshinShojaee Good Q! If we don't regularize, the optimal RL solution is degenerate. You can place all mass on a single optimal action or split it arbitrarily among several So in practice the optimizer will almost always prefer collapsing to a deterministic solution
Parshin Shojaee
Parshin Shojaee@ParshinShojaee·
@AntChen_ but we still see the entropy collapse behavior in recent RLVR pipelines that mostly do not use KL?
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@Teknium Entropy reg = KL reg to a uniform prior, so we'll still have this exponential difference in probabilities behaviour x.com/AntChen_/statu… Separately, if we don't regularize at all, the optimal solution is degenerate. The policy can (will) collapse to be fully deterministic
Anthony GX-Chen@AntChen_

First, linear differences in rewards -> exponential differences in probabilities. With low KL reg (e.g. 1e-3), we effectively have a single solution (no diversity). N.B. Entropy regularization has this problem as well. [6/n]

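The "linear differences in rewards -> exponential differences in probabilities" point in the quoted tweet is easy to see numerically. A toy sketch with two responses whose rewards differ by only 0.01 (with KL reg to a uniform prior, the target is π*(y) ∝ exp(R(y)/β), since the constant π_ref factor drops out):

```python
import numpy as np

rewards = np.array([1.00, 0.99])  # nearly identical rewards

# Target distribution of entropy/KL-to-uniform regularized RL:
# pi*(y) ∝ exp(R(y) / beta)
for beta in [1.0, 0.1, 1e-3]:
    logits = rewards / beta
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    print(f"beta={beta:g}  p={p.round(6)}")
```

At β = 1e-3 the 0.01 reward gap becomes an exp(10) ≈ 22000x probability ratio, so the target distribution is effectively a single solution.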
Teknium (e/λ)
Teknium (e/λ)@Teknium·
@AntChen_ Nobody even uses KL constraints in modern GRPO/RLVR though - so... its just entropy regularization?
Anthony GX-Chen retweeted
Jatin Prakash
Jatin Prakash@bicycleman15·
Check out our new work on understanding diversity collapse in RL, and how to fix it in a principled way in under 2-3 lines of code! 🤔 One new thing I learnt: the intuition of reverse/forward KL being mode-seeking/mass-covering depends a LOT on the proposal distribution being optimized!
Anthony GX-Chen@AntChen_

RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]

Anthony GX-Chen
Anthony GX-Chen@AntChen_·
That’s it! You should try this simple change to get free diversity. Lots of future work opens up: by viewing RL through its target distribution, we can get more principled insights on everything from exploration to entropy collapse. It’s all just distribution matching. [14/n]