Lili

119 posts


@lchen915

Ph.D. student @mldcmu. Previously undergrad @berkeley_ai

Joined February 2021
458 Following · 1.5K Followers
Pinned Tweet
Lili @lchen915 ·
When LLMs don’t do what we want, we often tell them exactly what/how to change. Ideally, models could learn from this feedback, which is much richer and denser than scalar rewards used for RL. In our new paper, we study how to expand the capabilities of RL via text feedback:
Yuda Song@yus167

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

2 replies · 21 reposts · 164 likes · 15.5K views
Lili retweeted
Konwoo Kim @konwookim ·
for data-constrained pre-training, synth data isn’t just benchmaxxing, it lowers loss on the real data distribution as we generate more tokens

for even better scaling, treat synth gens as forming one long 𝗺𝗲𝗴𝗮𝗱𝗼𝗰: 1.8x data efficiency with larger gains under more compute
Konwoo Kim tweet media
8 replies · 58 reposts · 363 likes · 95.7K views
Lili retweeted
alphaXiv @askalphaxiv ·
"Expanding the Capabilities of RL via Text Feedback (RLTF)" RL for LLMs has a wall that it struggles to get past. As a single scalar reward is too low-bandwidth to tell the model what was wrong, learning would stall on harder but useful tasks. LLM needs more specific guidance. This paper proposes RLTF, where you can replace RLHF's scalar reward with rich text critiques during training, distills those critiques into better first-try answers. So by letting the model see a critique and retry during training, then training it to bake those fixes into its first response, basically learn to spot the mistake it would be critiqued for, RLTF turns this feedback into a skill, especially on tasks where a single reward number can't explain what went wrong.
alphaXiv tweet media
7 replies · 25 reposts · 146 likes · 6.8K views
Lili retweeted
Tinker @tinkerapi ·
Standard RL is limited by sparse feedback, but distillation requires rollouts from a teacher. @yus167’s team found a happy medium by training on text feedback from judges, which the model then learns to internalize and predict. x.com/yus167/status/…
Yuda Song @yus167
1 reply · 8 reposts · 51 likes · 4.7K views
Lili retweeted
Joan Cabezas @josancamon19 ·
alright, so you can get more than 1 bit from RL:
1. instead of a single-turn rollout, do a 2-turn one: try → get critique → try again.
2. use the 2nd try for training, conditioning on the input (same as first try)
3. compute CE on the critique as an auxiliary loss

methods:
- Self Distillation = applies (1) and (2)
- Feedback Modeling = no 2nd-rollout conditioning trick; applies (3) as well

findings:
- FM > SD in math and reasoning tasks.
- SD > FM in creative writing. Will def try on HammingBench

results:
- Knights & Knaves: GRPO=0.352 → SD=0.802, FM=0.880
- Shortest Path: GRPO=0.384 → SD=0.830, FM=0.905
- MATH500: DAPO=0.523 → SD=0.548, FM=0.567
- AIME24: DAPO=0.025 → SD=0.088 (best), FM=0.083

Very fun read :) @yus167 @lchen915 @FahimTajwar10 @MunosRemi @pathak2206 @Zanette_ai
Yuda Song @yus167
1 reply · 12 reposts · 116 likes · 11.2K views
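The two-turn rollout in the breakdown above can be sketched in a few lines of Python. Everything here is a toy stand-in (the helper names, the judge, and the correctness flags are hypothetical); a real setup would sample from an LLM and a judge model rather than seeded randomness.

```python
import random

# Toy stand-ins for an LLM and a judge; names and behavior are hypothetical.
def model_generate(prompt, seed=0):
    random.seed(seed)
    return {"text": f"attempt({prompt})", "correct": random.random() < 0.3}

def model_retry(prompt, first_try, feedback, seed=0):
    # Conditioned on the critique, the retry is more likely to be correct.
    random.seed(seed)
    return {"text": f"retry({prompt})", "correct": random.random() < 0.7}

def judge(prompt, attempt):
    # Stand-in for rich text feedback ("step 3 is wrong", "make it formal", ...).
    return "looks good" if attempt["correct"] else "step 3 is wrong"

def two_turn_rollout(prompt, seed=0):
    first = model_generate(prompt, seed)                    # 1. try
    feedback = judge(prompt, first)                         #    get critique
    second = model_retry(prompt, first, feedback, seed + 1) #    try again
    return {
        # 2. Self Distillation target: the retry, paired with the original
        #    input only, so first tries improve without feedback at test time.
        "sd_example": (prompt, second["text"]),
        # 3. Feedback Modeling target: predict the critique itself
        #    (auxiliary cross-entropy loss).
        "fm_example": ((prompt, first["text"]), feedback),
    }

rollout = two_turn_rollout("shortest path from A to D")
```

Self Distillation would train on `sd_example` and Feedback Modeling adds the CE loss on `fm_example`; which pairs feed which loss is my reading of the thread, not the paper's exact recipe.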
Lili retweeted
Fahim Tajwar @FahimTajwar10 ·
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n
15 replies · 162 reposts · 802 likes · 202.7K views
Lili retweeted
Andrea Zanette @Zanette_ai ·
RL from scalar rewards is inefficient. Our work shows how to leverage the text feedback that's already abundant in human-AI interaction. Two simple methods, Self Distillation and Feedback Modeling, deliver strong gains. I am very excited to see where this paradigm goes!
Yuda Song @yus167
0 replies · 14 reposts · 144 likes · 12.5K views
Lili @lchen915 ·
@Nik__V__ Thanks! Definitely self-distillation is an exciting direction overall. I think the OPSD setting is different in that the teacher is conditioned on the ground-truth solution and not feedback on the model's own generations. We have a more detailed discussion in our related work :)
0 replies · 0 reposts · 3 likes · 123 views
Lili retweeted
Gokul Swamy @g_k_swamy ·
A really thoughtful exploration of how we can get more than one (1) bit of feedback per rollout in RL! In my view at least, methods for going beyond the bottleneck of a single scalar reward are on the critical path to meaningfully better interactive learning.
Yuda Song @yus167
2 replies · 7 reposts · 104 likes · 9K views
Lili @lchen915 ·
1) Self Distillation - use the feedback-conditioned policy as the “teacher”, mimic it with RL
2) Feedback Modeling - learn to mimic the feedback itself via SFT (world modeling)

We discuss in detail many theoretical & empirical considerations for both methods in our paper :)
0 replies · 0 reposts · 13 likes · 427 views
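Read literally, the two methods reduce to two cross-entropy objectives. A minimal sketch, assuming token log-probs are already available; the function names and shapes are mine, not the paper's:

```python
def cross_entropy(token_logprobs):
    """Mean negative log-likelihood over a token sequence."""
    return -sum(token_logprobs) / len(token_logprobs)

def self_distillation_loss(student_logprobs_of_retry):
    # SD: the feedback-conditioned policy acts as the "teacher"; the student,
    # conditioned on the prompt alone, is trained toward the teacher's retry.
    return cross_entropy(student_logprobs_of_retry)

def feedback_modeling_loss(logprobs_of_feedback_tokens):
    # FM: SFT on the feedback itself -- predict the critique given the prompt
    # and the model's own attempt (a form of world modeling).
    return cross_entropy(logprobs_of_feedback_tokens)

sd = self_distillation_loss([-0.1, -0.3])   # ~0.2, up to float error
fm = feedback_modeling_loss([-0.2, -0.4])   # ~0.3, up to float error
```

The tweet says SD mimics the teacher "with RL"; plain CE on teacher samples is the simplest distillation reading of that, so treat the exact objective here as an assumption.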
Lili @lchen915 ·
Importantly, we want to incorporate this feedback into the LLM weights so that it learns to get it right without any help. (Feedback is available at training time but not during inference). We present a formalization of this setup, RL from Text Feedback, and two methods:
1 reply · 0 reposts · 13 likes · 490 views
Lili retweeted
Andrea Zanette @Zanette_ai ·
I’m recruiting several PhD students at Carnegie Mellon University! If you’re interested in LLM reasoning, agents, or diffusion language models, consider applying to the CMU ECE PhD program. Applications are due Dec 15. ece.cmu.edu/admissions/gra…
11 replies · 104 reposts · 467 likes · 62.9K views
Lili retweeted
Rohan Choudhury @rchoudhury997 ·
Excited to release our new preprint - we introduce Adaptive Patch Transformers (APT), a method to speed up vision transformers by using multiple different patch sizes within the same image!
10 replies · 29 reposts · 232 likes · 29.7K views
Lili retweeted
Yuda Song @yus167 ·
🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)
Yuda Song tweet media
2 replies · 40 reposts · 140 likes · 29.8K views