Lili

119 posts


@lchen915

Ph.D. student @mldcmu. Previously undergrad @berkeley_ai

Joined February 2021
458 Following · 1.5K Followers
Pinned Tweet
Lili @lchen915 ·
When LLMs don’t do what we want, we often tell them exactly what/how to change. Ideally, models could learn from this feedback, which is much richer and denser than scalar rewards used for RL. In our new paper, we study how to expand the capabilities of RL via text feedback:
Yuda Song@yus167

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

2 replies · 21 reposts · 164 likes · 15.5K views
Lili retweeted
Konwoo Kim @konwookim ·
for data-constrained pre-training, synth data isn’t just benchmaxxing, it lowers loss on the real data distribution as we generate more tokens

for even better scaling, treat synth gens as forming one long 𝗺𝗲𝗴𝗮𝗱𝗼𝗰: 1.8x data efficiency with larger gains under more compute
Konwoo Kim tweet media
8 replies · 58 reposts · 363 likes · 95.7K views
Lili retweeted
alphaXiv @askalphaxiv ·
"Expanding the Capabilities of RL via Text Feedback (RLTF)" RL for LLMs has a wall that it struggles to get past. As a single scalar reward is too low-bandwidth to tell the model what was wrong, learning would stall on harder but useful tasks. LLM needs more specific guidance. This paper proposes RLTF, where you can replace RLHF's scalar reward with rich text critiques during training, distills those critiques into better first-try answers. So by letting the model see a critique and retry during training, then training it to bake those fixes into its first response, basically learn to spot the mistake it would be critiqued for, RLTF turns this feedback into a skill, especially on tasks where a single reward number can't explain what went wrong.
alphaXiv tweet media
7 replies · 25 reposts · 146 likes · 6.8K views
Lili retweeted
Tinker @tinkerapi ·
Standard RL is limited by sparse feedback, but distillation requires rollouts from a teacher. @yus167’s team found a happy medium by training on text feedback from judges, which the model then learns to internalize and predict. x.com/yus167/status/…
Yuda Song @yus167
1 reply · 8 reposts · 51 likes · 4.7K views
Lili retweeted
Joan Cabezas @josancamon19 ·
alright, so you can get more than 1 bit from RL:
1. instead of a single-turn rollout, do a 2-turn one: try → get critique → try again.
2. use the 2nd try for training, conditioning on the input (same as first try)
3. compute CE on the critique as an auxiliary loss

methods:
- Self Distillation = applies (1) and (2)
- Feedback Modeling = no 2nd-rollout conditioning trick; applies (3) as well

findings:
- FM > SD in math and reasoning tasks.
- SD > FM in creative writing. Will def try on HammingBench

results:
- Knights & Knaves: GRPO=0.352 → SD=0.802, FM=0.880
- Shortest Path: GRPO=0.384 → SD=0.830, FM=0.905
- MATH500: DAPO=0.523 → SD=0.548, FM=0.567
- AIME24: DAPO=0.025 → SD=0.088 (best), FM=0.083

Very fun read :) @yus167 @lchen915 @FahimTajwar10 @MunosRemi @pathak2206 @Zanette_ai
Yuda Song @yus167
1 reply · 12 reposts · 116 likes · 11.2K views
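The two-turn rollout in the breakdown above can be sketched in a few lines of Python. Everything here is a toy stand-in (the helper names, the judge, and the correctness flags are hypothetical); a real setup would sample from an LLM and a judge model rather than seeded randomness.

```python
import random

# Toy stand-ins for an LLM and a judge; names and behavior are hypothetical.
def model_generate(prompt, seed=0):
    random.seed(seed)
    return {"text": f"attempt({prompt})", "correct": random.random() < 0.3}

def model_retry(prompt, first_try, feedback, seed=0):
    # Conditioned on the critique, the retry is more likely to be correct.
    random.seed(seed)
    return {"text": f"retry({prompt})", "correct": random.random() < 0.7}

def judge(prompt, attempt):
    # Stand-in for rich text feedback ("step 3 is wrong", "make it formal", ...).
    return "looks good" if attempt["correct"] else "step 3 is wrong"

def two_turn_rollout(prompt, seed=0):
    first = model_generate(prompt, seed)                    # 1. try
    feedback = judge(prompt, first)                         #    get critique
    second = model_retry(prompt, first, feedback, seed + 1) #    try again
    return {
        # 2. Self Distillation target: the retry, paired with the original
        #    input only, so first tries improve without feedback at test time.
        "sd_example": (prompt, second["text"]),
        # 3. Feedback Modeling target: predict the critique itself
        #    (auxiliary cross-entropy loss).
        "fm_example": ((prompt, first["text"]), feedback),
    }

rollout = two_turn_rollout("shortest path from A to D")
```

Self Distillation would train on `sd_example` and Feedback Modeling adds the CE loss on `fm_example`; which pairs feed which loss is my reading of the thread, not the paper's exact recipe.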
Lili retweeted
Fahim Tajwar @FahimTajwar10 ·
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n
15 replies · 162 reposts · 802 likes · 202.7K views
Lili retweeted
Andrea Zanette @Zanette_ai ·
RL from scalar rewards is inefficient. Our work shows how to leverage the text feedback that's already abundant in human-AI interaction. Two simple methods, Self Distillation and Feedback Modeling, deliver strong gains. I am very excited to see where this paradigm goes!
Yuda Song @yus167
0 replies · 14 reposts · 144 likes · 12.5K views
Lili @lchen915 ·
@Nik__V__ Thanks! Definitely self-distillation is an exciting direction overall. I think the OPSD setting is different in that the teacher is conditioned on the ground-truth solution and not feedback on the model's own generations. We have a more detailed discussion in our related work :)
0 replies · 0 reposts · 3 likes · 123 views
Lili retweeted
Gokul Swamy @g_k_swamy ·
A really thoughtful exploration of how we can get more than one (1) bit of feedback per rollout in RL! In my view at least, methods for going beyond the bottleneck of a single scalar reward are on the critical path to meaningfully better interactive learning.
Yuda Song @yus167
2 replies · 7 reposts · 104 likes · 9K views
Lili @lchen915 ·
1) Self Distillation - use the feedback-conditioned policy as the “teacher”, mimic it with RL
2) Feedback Modeling - learn to mimic the feedback itself via SFT (world modeling)

We discuss in detail many theoretical & empirical considerations for both methods in our paper :)
0 replies · 0 reposts · 13 likes · 427 views
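Read literally, the two methods reduce to two cross-entropy objectives. A minimal sketch, assuming token log-probs are already available; the function names and shapes are mine, not the paper's:

```python
def cross_entropy(token_logprobs):
    """Mean negative log-likelihood over a token sequence."""
    return -sum(token_logprobs) / len(token_logprobs)

def self_distillation_loss(student_logprobs_of_retry):
    # SD: the feedback-conditioned policy acts as the "teacher"; the student,
    # conditioned on the prompt alone, is trained toward the teacher's retry.
    return cross_entropy(student_logprobs_of_retry)

def feedback_modeling_loss(logprobs_of_feedback_tokens):
    # FM: SFT on the feedback itself -- predict the critique given the prompt
    # and the model's own attempt (a form of world modeling).
    return cross_entropy(logprobs_of_feedback_tokens)

sd = self_distillation_loss([-0.1, -0.3])   # ~0.2, up to float error
fm = feedback_modeling_loss([-0.2, -0.4])   # ~0.3, up to float error
```

The tweet says SD mimics the teacher "with RL"; plain CE on teacher samples is the simplest distillation reading of that, so treat the exact objective here as an assumption.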
Lili @lchen915 ·
Importantly, we want to incorporate this feedback into the LLM weights so that it learns to get it right without any help. (Feedback is available at training time but not during inference). We present a formalization of this setup, RL from Text Feedback, and two methods:
1 reply · 0 reposts · 13 likes · 490 views
Lili retweeted
Andrea Zanette @Zanette_ai ·
I’m recruiting several PhD students at Carnegie Mellon University! If you’re interested in LLM reasoning, agents, or diffusion language models, consider applying to the CMU ECE PhD program. Applications are due Dec 15. ece.cmu.edu/admissions/gra…
11 replies · 104 reposts · 467 likes · 62.9K views
Lili retweeted
Rohan Choudhury @rchoudhury997 ·
Excited to release our new preprint - we introduce Adaptive Patch Transformers (APT), a method to speed up vision transformers by using multiple different patch sizes within the same image!
10 replies · 29 reposts · 232 likes · 29.7K views
Lili retweeted
Yuda Song @yus167 ·
🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)
Yuda Song tweet media
2 replies · 40 reposts · 140 likes · 29.8K views