Yunhao (Robin) Tang
@robinphysics

Interested in training and RL • @AnthropicAI • Ex reasoning @MistralAI • Llama RL @MetaAI • Gemini post-training and Deep RL research @DeepMind • PhD @Columbia

Joined November 2018
749 Following · 1.6K Followers
Pinned Tweet
Yunhao (Robin) Tang @robinphysics·
Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
[image]
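To make the failure mode concrete, here is a minimal sketch (an editor's illustration, not the author's code or the derivation in the attached image): for a small categorical policy we can take expectations over the full support exactly, so the expected gradient of each "estimator used as `kl_loss`" can be compared directly against the gradient of the exact KL(pi_theta || pi_ref).

```python
import torch

# Editor's sketch. For a small categorical policy pi_theta, expectations can
# be computed exactly, so "expected gradient of the estimator-as-loss"
# vs. "gradient of the exact reverse KL" can be compared directly.

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)        # policy logits
logq = torch.log_softmax(torch.randn(5), dim=-1)  # fixed reference log-probs

def grad_of(loss):
    return torch.autograd.grad(loss, theta)[0]

# Exact reverse KL(pi_theta || pi_ref) and its true gradient (the target).
logp = torch.log_softmax(theta, dim=-1)
grad_true = grad_of(torch.sum(logp.exp() * (logp - logq)))

# Detached sampling weights: an exact stand-in for the average over
# Monte Carlo samples drawn from pi_theta.
p_det = torch.log_softmax(theta, dim=-1).exp().detach()

# Naive k1: the unbiased estimate (logp - logq) used directly as kl_loss.
logp = torch.log_softmax(theta, dim=-1)
grad_k1 = grad_of(torch.sum(p_det * (logp - logq)))   # ~zero vector!

# Naive k3: exp(logq - logp) - (logq - logp) - 1 as kl_loss. Unbiased for
# the KL *value*, but its expected gradient equals the gradient of the
# forward KL(pi_ref || pi_theta), not of KL(pi_theta || pi_ref).
logp = torch.log_softmax(theta, dim=-1)
r = logq - logp
grad_k3 = grad_of(torch.sum(p_det * (r.exp() - r - 1)))

# Score-function surrogate: detach the k1 estimate and use it as a weight
# on logp; the expected gradient then matches grad_true.
logp = torch.log_softmax(theta, dim=-1)
grad_sf = grad_of(torch.sum(p_det * (logp - logq).detach() * logp))

print(grad_true)  # target
print(grad_k1)    # ~0: applies no KL pressure at all
print(grad_k3)    # nonzero, but a different objective's gradient
print(grad_sf)    # matches grad_true
```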
Yunhao (Robin) Tang @robinphysics·
@canaesseth @xidulu Indeed! Sorry the post misses the nuance in this regard. It was mostly referring to some very specific recent RL implementations.
Christian A. Naesseth @canaesseth·
@xidulu @robinphysics If you are using the "reparameterization trick" you'll be fine as in that setting the expectation of the gradient of the unbiased loss is the gradient of the KL. The original post misses a bit of nuance in that the statement is sometimes true.
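A hedged toy example of this caveat (an editor's sketch, not from the thread): for a Gaussian policy, `rsample` routes the dependence on the parameters through the sample itself, so the gradient of the Monte Carlo KL estimate is unbiased for the KL gradient.

```python
import torch

# Editor's toy example of the reparameterization caveat. With rsample, the
# sample z carries the dependence on (mu, log_std), so differentiating the
# Monte Carlo KL estimate gives an unbiased estimate of the KL gradient.

torch.manual_seed(0)
mu = torch.tensor(0.5, requires_grad=True)
log_std = torch.tensor(-0.3, requires_grad=True)
ref = torch.distributions.Normal(0.0, 1.0)

# Exact KL(N(mu, std) || N(0, 1)) and its gradient.
pi = torch.distributions.Normal(mu, log_std.exp())
g_mu, g_std = torch.autograd.grad(
    torch.distributions.kl_divergence(pi, ref), (mu, log_std))

# Monte Carlo estimate with pathwise (reparameterized) samples.
pi = torch.distributions.Normal(mu, log_std.exp())
z = pi.rsample((100_000,))
kl_mc = (pi.log_prob(z) - ref.log_prob(z)).mean()
g_mu_mc, g_std_mc = torch.autograd.grad(kl_mc, (mu, log_std))

print(g_mu.item(), g_mu_mc.item())    # agree up to MC noise
print(g_std.item(), g_std_mc.item())
```

LLM policies sample discrete tokens, where no pathwise gradient exists, which is why the distinction in the original post bites in RLHF-style implementations.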
yobibyte @y0b1byte·
another good one!
[image]
Yunhao (Robin) Tang retweeted
Mistral AI @MistralAI·
Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
[image]
Yunhao (Robin) Tang @robinphysics·
It was refreshing to see the impact that small algorithmic changes have on system performance. While the “double-sided” PPO/GRPO clipping is dominant in the literature, we argue that single-sided clipping akin to IMPALA better fits the design of distributed training.
[image]
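A minimal sketch of the contrast (an editor's illustration; function and variable names are assumptions, not LlamaRL's implementation):

```python
import torch

# Editor's sketch of the two objectives contrasted in the tweet. Names and
# signatures are illustrative, not taken from LlamaRL.

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    # Double-sided PPO/GRPO clipping: the ratio is confined to
    # [1 - eps, 1 + eps] on both sides via the pessimistic min.
    ratio = (logp - logp_old).exp()
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

def one_sided_loss(logp, logp_behavior, adv, rho_bar=1.0):
    # IMPALA-style single-sided clipping: the importance weight is only
    # truncated from above (as in V-trace's rho truncation), which
    # tolerates the stale, off-policy generations that asynchronous
    # distributed training naturally produces.
    rho = (logp - logp_behavior).exp().detach().clamp(max=rho_bar)
    return -(rho * adv * logp).mean()
```

One appeal of the one-sided form is that it only caps the importance weight from above to bound variance from stale behavior policies, rather than introducing the two-sided gradient dead-zones of PPO-style clipping.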
Yunhao (Robin) Tang @robinphysics·
Introducing LlamaRL, a distributed RL framework for training LLMs at scale. LlamaRL is highly modular, PyTorch-native, customizes the optimization of actors/learners to max out throughput, and adjusts for systemic off-policyness to stabilize training arxiv.org/pdf/2505.24034
[image]
Claas Voelcker @c_voelcker·
Big shout-out to @robinphysics and @charlinelelan whose amazing works on analyzing training dynamics in RL were hugely inspirational and foundational to our study
Claas Voelcker @c_voelcker

#RepresentationLearning can help train strong RL agents on a variety of tasks. But which feature learning method should you pick? State reconstruction (like Dreamer), or latent self-prediction (like SPR)? openreview.net/forum?id=izAJ8… With @tylerkastnr @igilitschenski @sologen
Yunhao (Robin) Tang retweeted
Zac Kenton @ZacKenton1·
Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself. Does this work? It’s complicated: 🧵👇
[image]
Yunhao (Robin) Tang @robinphysics·
@PandaAshwinee Thanks! Sorry, I completely missed the reply here... Indeed H2 is quite surprising. I think it's mainly because contrastive losses don't work well with offline data. That is, pi(y_w) / pi(y_l) can increase while both pi(y_w) and pi(y_l) are low. If we change to Bo2, H2 is less prominent.
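A toy numeric illustration of that failure mode (editor's numbers, not from the paper): a contrastive loss only constrains the gap between the two log-probs, so the ratio can keep growing while both probabilities decay.

```python
import torch

# Editor's toy numbers illustrating the reply above: under an offline
# contrastive (DPO-style) objective, pi(y_w)/pi(y_l) can keep increasing
# while both pi(y_w) and pi(y_l) shrink, because the loss depends only on
# the difference of log-probs.

logp_w = torch.tensor([-2.0, -4.0, -6.0])  # log pi(y_w) over training
logp_l = torch.tensor([-2.5, -6.0, -9.5])  # log pi(y_l) over training

print((logp_w - logp_l).exp())  # ~[1.65, 7.39, 33.12]: ratio keeps rising...
print(logp_w.exp())             # ~[0.135, 0.018, 0.002]: ...while pi(y_w) decays
```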
Ashwinee Panda @PandaAshwinee·
@robinphysics Interesting work, nice to see the evaluation of the 5 hypotheses. I'm a bit surprised to see the results of Hypothesis 2. Any more details to share regarding that?
Yunhao (Robin) Tang @robinphysics·
Online interaction is probably a defining property of RL. But with the rise of offline algos, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis-test the causes of the perf gap between online and offline alignment. arxiv.org/pdf/2405.08448… Details in🧵
[image]
Yunhao (Robin) Tang @robinphysics·
Thanks @_akhaliq for promoting our work! Unlike regular RL where golden r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate the causes of the perf gap between online vs. offline.
AK @_akhaliq

Understanding the performance gap between online and offline alignment algorithms

Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need…
Yunhao (Robin) Tang @robinphysics·
The findings ought to be taken with a grain of salt due to limitations in our experimental setups. But hopefully this investigation contributes to a better understanding of RLHF practices. Finally, very grateful to my collaborators @GoogleDeepMind on this fun project!
Yunhao (Robin) Tang @robinphysics·
Some takeaways:
- There is something more to online than wider coverage of response generation
- Offline training improves the policy in a much more implicit way than online (discriminative vs. generative abilities)
- The gap persists across wider variants of algos and network sizes