Yunhao (Robin) Tang

128 posts

@robinphysics

Interested in training and RL • @AnthropicAI • Ex reasoning @MistralAI • Llama RL @MetaAI • Gemini post-training and Deep RL research @DeepMind • PhD @Columbia

Joined November 2018

749 Following · 1.6K Followers

Pinned Tweet

Yunhao (Robin) Tang@robinphysics·12 Jun

Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
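
A small tabular check can make this claim concrete. The sketch below is not from the post: it assumes a three-outcome softmax policy and the simplest unbiased estimate, log pi(x) - log ref(x) with x sampled from pi, which estimates KL(pi || ref). The logits and helper names are hypothetical. Expectations are computed exactly by summing over outcomes, and the sample is held fixed when differentiating, which is what happens when the estimate is used directly as a loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score_weighted(pi, weight):
    """Exact E_{x~pi}[weight(x) * grad_theta log pi(x)] for softmax logits theta."""
    n = len(pi)
    grad = [0.0] * n
    for x in range(n):
        for i in range(n):
            score_i = (1.0 if i == x else 0.0) - pi[i]  # d log pi(x) / d theta_i
            grad[i] += pi[x] * weight(x) * score_i
    return grad

pi = softmax([0.5, -0.2, 0.1])    # hypothetical policy logits
ref = softmax([0.0, 0.3, -0.4])   # hypothetical reference logits

# Differentiating the estimate log pi(x) - log ref(x) with x held fixed gives
# grad log pi(x), i.e. weight 1 on the score function.
grad_of_estimate = expected_score_weighted(pi, lambda x: 1.0)

# Gradient of KL(pi || ref) itself: E_pi[log(pi/ref) * grad log pi].
grad_of_kl = expected_score_weighted(pi, lambda x: math.log(pi[x] / ref[x]))

print("E[grad of estimate]:", grad_of_estimate)
print("grad of KL         :", grad_of_kl)
```

In this toy setting the gradient of the estimate averages to zero while the KL gradient does not, so minimizing the estimate as a loss applies no KL pressure in expectation.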

Yunhao (Robin) Tang@robinphysics·12 Jun

@canaesseth @xidulu Indeed! Sorry the post misses the nuance in this regard. It was mostly referring to some very specific recent RL implementations.

Christian A. Naesseth@canaesseth·12 Jun

@xidulu @robinphysics If you are using the "reparameterization trick" you'll be fine as in that setting the expectation of the gradient of the unbiased loss is the gradient of the KL. The original post misses a bit of nuance in that the statement is sometimes true.

Yunhao (Robin) Tang@robinphysics·12 Jun

@y0b1byte Thanks so much for the kind words!

yobibyte@y0b1byte·12 Jun

arxiv.org/pdf/2506.09477

yobibyte@y0b1byte·12 Jun

another good one!

Yunhao (Robin) Tang@robinphysics·12 Jun

Taking the k3 estimate as an example (from John's popular blogpost joschu.net/blog/kl-approx…). Contrary to popular practice, differentiating the estimate as a loss ends up enforcing the reverse-KL, but only incidentally. See more details: arxiv.org/pdf/2506.09477
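
The k3 case can be checked numerically on a toy problem. The sketch below is an illustration, not the paper's code: it assumes a three-outcome softmax policy with hypothetical logits, takes k3 = (r - 1) - log r with r = ref(x) / pi(x) and x sampled from pi, and compares the exact expected gradient of k3 (sample held fixed) against finite-difference gradients of both KL directions. Reading "reverse-KL" as KL(ref || pi) is an assumption about the post's terminology.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi_ * math.log(pi_ / qi_) for pi_, qi_ in zip(p, q))

def finite_diff(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function of the logits."""
    grads = []
    for i in range(len(theta)):
        up = list(theta); up[i] += eps
        dn = list(theta); dn[i] -= eps
        grads.append((f(up) - f(dn)) / (2 * eps))
    return grads

def expected_k3_loss_grad(pi, ref):
    """Exact E_{x~pi}[grad_theta k3(x)] with the sample x held fixed.

    For r = ref(x)/pi(x), d k3 / d theta = (1 - r) * grad log pi(x).
    """
    n = len(pi)
    grad = [0.0] * n
    for x in range(n):
        r = ref[x] / pi[x]
        for i in range(n):
            score_i = (1.0 if i == x else 0.0) - pi[i]
            grad[i] += pi[x] * (1.0 - r) * score_i
    return grad

theta = [0.5, -0.2, 0.1]          # hypothetical policy logits
ref = softmax([0.0, 0.3, -0.4])   # hypothetical reference logits
pi = softmax(theta)

grad_k3 = expected_k3_loss_grad(pi, ref)
grad_kl_pi_ref = finite_diff(lambda t: kl(softmax(t), ref), theta)  # KL(pi || ref)
grad_kl_ref_pi = finite_diff(lambda t: kl(ref, softmax(t)), theta)  # KL(ref || pi)

print("E[grad k3]         :", grad_k3)
print("grad KL(pi || ref) :", grad_kl_pi_ref)
print("grad KL(ref || pi) :", grad_kl_ref_pi)
```

In this example the expected k3 gradient matches the gradient of KL(ref || pi) and differs from that of KL(pi || ref), even though k3 is an estimate of the latter.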

Yunhao (Robin) Tang retweeted

Mistral AI@MistralAI·10 Jun

Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.

Yunhao (Robin) Tang@robinphysics·7 Jun

It was refreshing to see the impact that small algorithmic changes have on the system performance. While the “double-sided” PPO/GRPO clipping is dominant in the literature, we argue that a single-sided clipping akin to IMPALA fits the design of distributed training more.
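
The post does not spell out either objective, so the following is only a sketch of the contrast using standard textbook forms: the PPO clipped surrogate, which bounds the probability ratio on both sides, and an IMPALA-style truncated importance weight, which only caps it from above. The names `eps` and `rho_bar` and both function names are illustrative assumptions; the actual LlamaRL loss may differ.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Double-sided clipping: the ratio is clipped to [1 - eps, 1 + eps] and the
    pessimistic (smaller) of the clipped and unclipped terms is kept."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def truncated_is_weight(ratio, rho_bar=1.0):
    """Single-sided truncation: the importance weight is capped above at rho_bar
    and left untouched when it is small."""
    return min(ratio, rho_bar)

# ratio = pi_new(a|s) / pi_behaviour(a|s); values here are made up.
for ratio in (0.1, 1.0, 3.0):
    print(ratio,
          ppo_clipped_objective(ratio, advantage=1.0),
          truncated_is_weight(ratio, rho_bar=2.0))
```

The single-sided form only limits how much a stale, over-weighted sample can contribute, which is one plausible reason it suits asynchronous actors and learners.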

Yunhao (Robin) Tang@robinphysics·7 Jun

Introducing LlamaRL, a distributed RL framework for training LLMs at scale. LlamaRL is highly modular, PyTorch-native, customizes optimization of actors/learners to max out throughput, and adjusts for systemic off-policyness to stabilize training arxiv.org/pdf/2505.24034

Yunhao (Robin) Tang@robinphysics·4 Aug

@c_voelcker @charlinelelan Many thanks @c_voelcker for the kind words! Very glad that our past investigation can be of help to your exciting new study here, look forward to reading it in more detail!

Claas Voelcker@c_voelcker·2 Aug

Big shout-out to @robinphysics and @charlinelelan whose amazing works on analyzing training dynamics in RL were hugely inspirational and foundational to our study

Claas Voelcker@c_voelcker

#RepresentationLearning can help training strong RL agents on a variety of tasks. But which feature learning method should you pick? State reconstruction (like Dreamer), or latent self prediction (like SPR)? openreview.net/forum?id=izAJ8… With @tylerkastnr @igilitschenski @sologen

Yunhao (Robin) Tang retweeted

Zac Kenton@ZacKenton1·8 Jul

Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself. Does this work? It’s complicated: 🧵👇

Yunhao (Robin) Tang@robinphysics·18 Jun

@PandaAshwinee Thanks! Sorry completely missed the reply here... Indeed H2 is quite surprising. I think it's mainly bc contrastive losses don't work well w/ offline data. That is pi(y_w) / pi(y_l) can increase while both pi(y_w) and pi(y_l) are low. If we change to Bo2, H2 is less prominent.
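
The point about contrastive losses can be illustrated with made-up numbers. The sketch below assumes a generic pairwise logistic loss on the log-probability margin, chosen for illustration and not taken from the paper; the probabilities are hypothetical. It shows the ratio pi(y_w) / pi(y_l) rising and the loss falling while both responses become less likely.

```python
import math

def pairwise_logistic_loss(p_w, p_l):
    """Illustrative contrastive loss: depends only on log pi(y_w) - log pi(y_l)."""
    margin = math.log(p_w) - math.log(p_l)
    return math.log(1.0 + math.exp(-margin))

# Hypothetical probabilities of the preferred (w) and dispreferred (l) response.
before = {"w": 0.10, "l": 0.05}
after = {"w": 0.02, "l": 0.002}

ratio_before = before["w"] / before["l"]
ratio_after = after["w"] / after["l"]
loss_before = pairwise_logistic_loss(before["w"], before["l"])
loss_after = pairwise_logistic_loss(after["w"], after["l"])

print("ratio:", ratio_before, "->", ratio_after)
print("loss :", loss_before, "->", loss_after)
print("pi(y_w):", before["w"], "->", after["w"])
```

Because the loss sees only the margin, it keeps improving even as the preferred response loses most of its probability mass.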

Ashwinee Panda@PandaAshwinee·27 May

@robinphysics Interesting work, nice to see the evaluation of the 5 hypotheses. I'm a bit surprised to see the results of Hypothesis 2. Any more details to share regarding that?

Yunhao (Robin) Tang@robinphysics·27 May

Online interaction is probably a defining property of RL. But with the rise of offline algo, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis test the causes of the perf gap between online and offline alignment. arxiv.org/pdf/2405.08448… Details in🧵

Yunhao (Robin) Tang@robinphysics·27 May

Thanks @_akhaliq for promoting our work! Unlike regular RL where golden r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate causes of the perf gap between online vs. offline.

AK@_akhaliq

Understanding the performance gap between online and offline alignment algorithms Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need

Yunhao (Robin) Tang@robinphysics·27 May

The findings ought to be taken with a grain of salt due to limitations in our experimental setups. But hopefully this investigation contributes to a better understanding of RLHF practices. Finally, very grateful to my collaborators @GoogleDeepMind on this fun project!

Yunhao (Robin) Tang@robinphysics·27 May

Some takeaways:
- There is something more to online than wider coverage of response generation
- Offline training improves policy in a much more implicit way than online (discriminative vs. generative abilities)
- The gap persists across wider variants of algos and network sizes
