Anthony GX-Chen

96 posts

@AntChen_

PhD student @CILVRatNYU. Student Researcher @GoogleDeepMind || Prev: @Meta, @Mila_Quebec, @mcgillu || RL, ML, Neuroscience.

Manhattan, NY · Joined August 2017
315 Following · 483 Followers
Anthony GX-Chen retweeted
Ayush Jhaveri
Ayush Jhaveri@arhjhaveri·
Your AI Agent just formed a hypothesis. 💭 How does it validate it? Not by trying to prove itself wrong. Rather, it selectively seeks evidence that confirms what it already believes, often ending up with the wrong answer! Confirmation bias isn’t just human. We measure it in LLMs, and we show how to fix it! 🧵
Mark Goldstein
Mark Goldstein@marikgoldstein·
Replica trick, Gaussian Integral trick, Hutchinson's trick, Reparameterisation tricks, Log derivative trick, ...
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@ahatamiz1 @jeanfrancois287 Nice write-up :) You may enjoy our paper, which discusses many of the same problems and makes use of the odds ratio to fix them. Re: entropy reg, perhaps it's easier to think of it also as KL reg, but to a uniform prior?
Anthony GX-Chen@AntChen_

RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]

Ali Hatamizadeh
Ali Hatamizadeh@ahatamiz1·
🧵 In RLVR, KL is often solving the wrong problem. In RLHF, KL regularization makes sense. In RLVR, the thing you usually want is exploration.

1/ In RLHF, a KL penalty makes sense. You optimize something like:

J(θ) = E[R_φ(y|x)] - β KL(π_θ || π_ref)

where R_φ is a learned reward model. That reward is a proxy. If you optimize too hard against a proxy, you get reward hacking. The KL term keeps the policy near the reference model.

2/ RLVR is different. In RLVR, the reward is often verifiable:
* math answer is right or wrong
* code passes tests or fails
* a proof checker accepts or rejects

So the main problem is usually not "the model is exploiting a learned reward model." The main problem is often much simpler: the policy collapses onto a narrow set of successful templates and stops exploring. That is an entropy problem, not automatically a "stay close to the reference model" problem.

3/ The KL penalty builds in a bias toward whatever the reference model already liked. For the KL-regularized objective, the optimal policy has the form:

π*(y|x) = (1 / Z(x)) π_ref(y|x) exp(R(y|x) / β), with Z(x) = Σ_y π_ref(y|x) exp(R(y|x) / β)

That means reward is not the only thing that matters; reference probability matters too. If a reasoning strategy gets high reward but the reference model assigns it tiny probability, the KL-regularized optimum still pushes against it.

4/ The bias shows up directly in the odds ratio. For two responses, novel and familiar:

log [ π*(y_novel|x) / π*(y_familiar|x) ] = (R_novel - R_familiar) / β + log [ π_ref(y_novel|x) / π_ref(y_familiar|x) ]

That second term is a fixed prior bias toward the familiar strategy. So the novel strategy only wins if its reward advantage overcomes the reference model's preference for the familiar one. This does not make novelty literally impossible, but it does make it systematically harder, sometimes much harder.

5/ Another common confusion: KL regularization and entropy regularization are not the same thing.

KL(π || π_ref) = -H(π) - E_π[log π_ref]

So KL mixes together two very different forces:
* keep policy entropy high
* keep the policy similar to the reference

The first one helps avoid collapse. The second one anchors the policy to old behavior. If collapse is the real failure mode, then directly rewarding entropy is a cleaner fix.

6/ With an entropy bonus, the objective becomes:

J(θ) = E_π[R(y|x)] + α H(π)

The corresponding optimum is:

π*(y|x) ∝ exp(R(y|x) / α)

Now there is no reference anchor. The policy can spread mass across any high-reward region, including strategies the base model did not strongly prefer. That is exactly what you want if you are trying to discover better reasoning, not preserve old reasoning.

7/ Important things to be aware of: RLVR rewards are often more grounded than RLHF rewards, but they are not always perfect. Tests can be incomplete. Graders can be brittle. Format rewards can distort behavior. So this is not "KL is always wrong." It is that in many RLVR settings, KL is solving the wrong primary problem.

8/ The practical question is: what are you actually trying to prevent? If the concern is proxy exploitation of a learned reward model, KL to a reference can make sense. If the concern is entropy collapse in verifiable-reward training, an entropy bonus is more directly targeted. Those are different regimes; they should not default to the same regularizer.

9/ Takeaway: in RLVR, drop the KL, add an entropy bonus, clip your gradients, and let the model think.
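The two closed-form optima in the thread are easy to check numerically. A minimal sketch with made-up toy rewards and reference probabilities (all values illustrative, not from any paper):

```python
import numpy as np

# Toy setup: four candidate responses. Two solve the task (reward 1),
# but the reference model strongly prefers only one of them.
rewards = np.array([1.0, 1.0, 0.0, 0.0])      # verifiable 0/1 rewards
pi_ref  = np.array([0.70, 0.05, 0.15, 0.10])  # reference model probabilities

beta = 1.0   # KL coefficient
alpha = 1.0  # entropy coefficient

# KL-regularized optimum: pi*(y) ∝ pi_ref(y) * exp(R(y) / beta)
kl_opt = pi_ref * np.exp(rewards / beta)
kl_opt /= kl_opt.sum()

# Entropy-regularized optimum: pi*(y) ∝ exp(R(y) / alpha), no reference anchor
ent_opt = np.exp(rewards / alpha)
ent_opt /= ent_opt.sum()

print("KL-reg optimum:     ", kl_opt.round(3))   # the novel solver (index 1) stays rare
print("entropy-reg optimum:", ent_opt.round(3))  # both solvers get equal mass
```

Even with equal reward, the KL-regularized optimum keeps most of its mass on the reference-preferred solver, while the entropy-regularized optimum splits mass equally across both solvers.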
Anthony GX-Chen retweeted
Jocelyn Shen
Jocelyn Shen@jocelynjshen·
Excited to share our #CHI2026 paper “Texterial: A Text-as-Material Interaction Paradigm for LLM-Mediated Writing” (done during internship at Microsoft Research) We imagine interacting with LLMs by treating text as a material like plants/clay. 📃arxiv.org/pdf/2603.00452 🧵[1/n]
Anthony GX-Chen retweeted
Ankit Anand
Ankit Anand@ankit_s_anand·
Hi everyone, we are hiring PhD student researchers in the field of "Search/RL with LLMs", ideally for discovery. Please respond in this form if you are interested. Please don't reach out directly as I may not be able to reply individually forms.gle/CxY4VzQRdJacLX…
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@ye_chenlu Really neat! IIUC you add Gaussian noise to the per-layer activations. How do you update the noise scale parameter (re: sigma is learnable)?
Chenlu Ye
Chenlu Ye@ye_chenlu·
3/5 🤯 ALP in one line: ONE unified ratio: perturbed policy / inference policy Key: add learnable perturbations to each transformer layer (train-time only; inference policy unchanged). Theory: smooth sharp objective + tighten tail mismatch → stability + better exploration.
Chenlu Ye
Chenlu Ye@ye_chenlu·
1/5 Happy CNY🎊 Still bothered by RL off-policy instability in LLM? Introducing a new way💡Adaptive Layerwise Perturbation (ALP)💡, a simple but robust fix that outperforms GRPO/MIS/Bypass, achieves better stability (KL, entropy) and exploration! 🔗 Blog: beneficial-curiosity-d98.notion.site/Adaptive-Layer…
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@kimbochen Intuition: small differences in reward + π_ref drive big differences in probability between samples (in the target dist). Referencing another sample lets us cancel out these differences to give equal probs. We pick a ref sample with high reward and high π_ref (offline or within batch)
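A rough numerical sketch of why referencing another sample helps (hypothetical log-probs and rewards, not the paper's exact estimator): the unnormalized target weight π_ref(y)·exp(R(y)/β) spans many orders of magnitude across samples, but measuring every weight relative to one well-chosen reference sample cancels the shared scale.

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi_ref = rng.normal(-5.0, 2.0, size=8)  # hypothetical reference log-probs
rewards    = rng.uniform(0.0, 1.0, size=8)  # hypothetical per-sample rewards
beta = 0.01                                 # low KL regularization

# Log of the unnormalized target weight: log pi_ref(y) + R(y)/beta
log_w = log_pi_ref + rewards / beta

# Naively exponentiating is hopeless: the spread is enormous at low beta
print("log-weight spread:", log_w.max() - log_w.min())

# Reference trick: pick one sample with high reward AND high pi_ref,
# then compute every other sample's weight relative to it
ref = np.argmax(log_w)
ratios = np.exp(log_w - log_w[ref])  # all ratios <= 1 and finite
print("relative weights:", ratios.round(3))
```

The absolute weights would over/underflow, but the ratios to the anchor sample are well-scaled and directly usable.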
Kimbo
Kimbo@kimbochen·
@AntChen_ What’s the intuition behind referencing a different sample, and how do you pick what sample to use? Great work, thanks for sharing
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@siddarthv66 @Teknium Interesting! Curious to hear more about your intuition in the non-asymptotic regime. IIUC, are you saying token entropy tends to stay high for longer on tasks the base model has not seen in pre-training? (I didn't quite understand "data-contaminated")
Siddarth Venkatraman
Siddarth Venkatraman@siddarthv66·
> The policy can (will) collapse to be fully deterministic asymptotically

In theory, yes. In practice, absolutely not. It has been observed that entropy collapse isn't really guaranteed at all, especially with large sequence lengths. Entropy collapse tends to happen in tasks that are close to data-contaminated with the base model. In fact, adding KL reg can actually destabilize training due to the added variance. Here's an example run on the Sokoban reasoning-gym env with Qwen-4B and 16k context: entropy behaves quite erratically during training while reward also improves. Sequence length remains fairly static throughout.
Anthony GX-Chen retweeted
Jeff Guo
Jeff Guo@JeffGuo__·
Check out work led by @AntChen_ that introduces MARA to mitigate mode collapse in KL/entropy-regularized RL! Focusing on molecular design, there is generally a trade-off between generating high reward samples and sample diversity. (1/2)
Anthony GX-Chen@AntChen_

RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]

Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@ParshinShojaee Good Q! If we don't regularize, the optimal RL solution is degenerate. You can place all mass on a single optimal action or split it arbitrarily among several So in practice the optimizer will almost always prefer collapsing to a deterministic solution
Parshin Shojaee
Parshin Shojaee@ParshinShojaee·
@AntChen_ but we still see the entropy collapse behavior in recent RLVR pipelines that mostly do not use KL?
Anthony GX-Chen
Anthony GX-Chen@AntChen_·
@Teknium Entropy reg = KL reg to a uniform prior, so we'll still have this exponential difference in probabilities behaviour x.com/AntChen_/statu… Separately, if we don't regularize at all, the optimal solution is degenerate. The policy can (will) collapse to be fully deterministic
Anthony GX-Chen@AntChen_

First, linear differences in rewards -> exponential differences in probabilities. With low KL reg (e.g. 1e-3), we effectively have a single solution (no diversity). N.B. Entropy regularization has this problem as well. [6/n]

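The "linear differences in rewards -> exponential differences in probabilities" point in the quoted tweet is easy to see numerically. A toy sketch with two responses whose rewards differ by only 0.01 (with KL reg to a uniform prior, the target is π*(y) ∝ exp(R(y)/β), since the constant π_ref factor drops out):

```python
import numpy as np

rewards = np.array([1.00, 0.99])  # nearly identical rewards

# Target distribution of entropy/KL-to-uniform regularized RL:
# pi*(y) ∝ exp(R(y) / beta)
for beta in [1.0, 0.1, 1e-3]:
    logits = rewards / beta
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    print(f"beta={beta:g}  p={p.round(6)}")
```

At β = 1e-3 the 0.01 reward gap becomes an exp(10) ≈ 22000x probability ratio, so the target distribution is effectively a single solution.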
Teknium (e/λ)
Teknium (e/λ)@Teknium·
@AntChen_ Nobody even uses KL constraints in modern GRPO/RLVR though - so... its just entropy regularization?
Anthony GX-Chen retweeted
Jatin Prakash
Jatin Prakash@bicycleman15·
Check out our new work on understanding diversity collapse in RL, and how to fix it in a principled way in under 2-3 lines of code! 🤔 One new thing I learnt: the intuition of reverse/forward KL being mode-seeking/mass-covering depends a LOT on the proposal distribution being optimized!
Anthony GX-Chen@AntChen_

RL causing diversity collapse in generative models (e.g. LLMs) is *not* a training failure. It’s what KL / entropy-regularized RL is provably *designed* to do. The good news: we have a simple, principled fix. Accepted to #ICLR2026 🧵🔽 [1/n]

Anthony GX-Chen
Anthony GX-Chen@AntChen_·
That’s it! You should try this simple change to get free diversity. Lots of future work opens up: by viewing RL through its target distribution, we can get more principled insights on everything from exploration to entropy collapse. It’s all just distribution matching. [14/n]