Matthew Yang
@_matthewyang
MSML student @ CMU
42 posts
Joined August 2024
241 Following · 100 Followers

Pinned Tweet
Matthew Yang @_matthewyang
Almost nobody does proper credit assignment in RL-on-LLMs 💀 Learning only from the final outcome → punishes good steps 😭 → rewards bad steps 😭😭 🚨New Paper🚨 A new paradigm for credit assignment: LLMs identify their own mistakes ❌ and propose targeted fixes 🎯 🧵[1/n]
[image]
8 replies · 25 reposts · 193 likes · 11.1K views
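A minimal sketch of the contrast this thread draws, not the paper's implementation: outcome-only credit copies the final reward onto every step, while self-identified credit asks the model itself which steps were mistakes and assigns credit per step. The `judge_step` helper and the exact reward values are illustrative assumptions.

```python
from typing import List


def outcome_only_credit(steps: List[str], final_reward: float) -> List[float]:
    # Every step inherits the trajectory-level outcome: good steps in a failed
    # rollout get punished, bad steps in a lucky rollout get rewarded.
    return [final_reward for _ in steps]


def self_identified_credit(steps: List[str], final_reward: float, critic_llm) -> List[float]:
    # Hypothetical interface: the model critiques its own steps in natural language
    # and returns "ok" or "mistake" per step; credit is then assigned step by step.
    credits = []
    for step in steps:
        verdict = critic_llm.judge_step(step)          # assumed helper, not a real API
        if verdict == "mistake":
            credits.append(-1.0)                       # penalize the identified mistake...
        else:
            credits.append(max(final_reward, 0.0))     # ...without punishing good steps in a failed rollout
    return credits
```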
Matthew Yang retweeted
Ian Wu @ianwu97
1/How can we train LLMs to continually improve their reasoning over test horizons much longer than their training token budgets? Introducing Reasoning Cache (RC), an algorithm that trains LLMs to *extrapolate*.
[image]
5 replies · 30 reposts · 199 likes · 12.3K views
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
Can just a 4B model solve IMO-level proof problems at the level of much stronger LLMs like Gemini 3 Pro? Yes, if you can train the LLM to scale test-time compute well! We're very excited to release our 4B model "QED-Nano", built via an awesome open collab! Details below🧵⬇️
[image]
8 replies · 26 reposts · 168 likes · 21.7K views
Matthew Yang @_matthewyang
@rosmine We tried generating interventions with larger models, namely Qwen3-30B-A3B-Instruct (see Section 3) and Gemini 2.5 Pro (see the Appendix). We find that larger models tend to generate better interventions (row 6 vs. row 5).
[image]
0 replies · 0 reposts · 2 likes · 126 views
Rosmine @rosmine
@matthewyryang Did you try scaling up at all beyond 4B? Curious whether larger models get more improvement, or if the boost decreases with size.
1 reply · 0 reposts · 1 like · 122 views
Matthew Yang @_matthewyang
@dhruvbhatia0 No, it does not, because we run standard online RL after we SFT on the interventions.
1 reply · 0 reposts · 2 likes · 60 views
dhruv bhatia @dhruvbhatia0
@matthewyryang Does this stunt the model's ability to correctly generate the critical tokens without seeing the reference solution, compared to just pure RL?
1 reply · 0 reposts · 1 like · 213 views
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
🚨🚨New paper: Scaling RL to complex tasks shows credit assignment is a bottleneck. But the standard recipe of fitting a PRM + optimizing it is too inefficient to solve it ❌ Our idea: use asymmetries in an LLM to let it do its own credit assignment, in natural language, w/o PRMs! 🧵⬇️
[image]
7 replies · 28 reposts · 210 likes · 12.1K views
Matthew Yang retweeted
Jack Bai @jackbot_cs
🚨 New Paper Alert 🚨 💥 SFT on hard tasks given a reference solution is usually too off-policy, which can cause training to crash. 🐌 On-policy RL on these hard tasks is more stable but sample-inefficient. 😈 Today, we introduce Intervention Training (InT), an algorithm that avoids the shortcomings of both. A thread 🧵 1/n
[GIF]
5 replies · 30 reposts · 186 likes · 13.4K views
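A minimal sketch of the two-stage recipe described in this tweet and in the replies above (SFT on targeted interventions, then standard online RL). The trainer functions, the `intervener` object, and the `splice` helper are assumed names for illustration, not the released InT code.

```python
def splice(rollout_steps, fix):
    # Keep the on-policy prefix up to the failing step, then append the targeted fix
    # (rather than replacing the whole rollout with an off-policy reference solution).
    return rollout_steps[: fix.position] + [fix.text]


def intervention_training(policy, problems, intervener, sft_steps=1_000, rl_steps=10_000):
    # Stage 1: collect on-policy rollouts, have a (possibly larger) model propose a
    # targeted fix where each rollout goes wrong, and SFT on the corrected traces.
    sft_data = []
    for problem in problems:
        rollout = policy.generate(problem)               # assumed policy API
        fix = intervener.propose_fix(problem, rollout)   # assumed intervention API
        if fix is not None:
            sft_data.append(splice(rollout, fix))
    policy = train_sft(policy, sft_data, steps=sft_steps)        # hypothetical trainer

    # Stage 2: standard online RL on the final-outcome reward, starting from the
    # SFT'd policy, so the model learns to produce the critical tokens on its own.
    policy = train_online_rl(policy, problems, steps=rl_steps)   # hypothetical trainer
    return policy
```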
Matthew Yang retweeted
Jack Bai @jackbot_cs
😈 Today, we introduce WebGym, the largest-to-date open-source RL environment for web-agent training: 300k tasks and a rollout framework optimized for the rollout speed of web environments. We reveal the effects of the essential scaling directions we observe with WebGym. 1/n
13 replies · 37 reposts · 377 likes · 43.3K views
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
🚨🚨New blog post led by CMU students: Want to know why LLM RL training plateaus on hard problems & why scaling compute may not help? And how to fix this issue? Turns out it stems from a coupling of poor exploration & optimization. Classical ways to explore don't work, but ours does! 🧵⬇️
[GIF]
6 replies · 44 reposts · 254 likes · 31.2K views
Matthew Yang retweeted
Andrew Zhao @_AndrewZhao
paper of the day
[image]
15 replies · 34 reposts · 576 likes · 86.1K views
Matthew Yang retweeted
Gautam Kamath @thegautamkamath
Anyone who's done a PhD knows the feeling
[image]
3 replies · 5 reposts · 127 likes · 7.5K views
Matthew Yang retweeted
Zheyuan Hu @real_ZheyuanHu
Introducing RaC: a data collection protocol that boosts data efficiency by 10x compared to some of the best imitation-learning results. Key idea: scale recovery & correction data systematically => policies can reset + retry while acting (consistent self-correction) => better performance. 🧵0/N
11 replies · 38 reposts · 209 likes · 19.6K views
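A minimal sketch of what "recovery & correction" data could look like as a data structure; these classes and labels are assumptions for illustration, not the RaC release. The point is that recovery and correction segments are kept in the training set rather than filtered out as mistakes, so the policy sees reset-and-retry behaviour at training time.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Segment:
    observations: list
    actions: list
    kind: Literal["progress", "recovery", "correction"]  # assumed labeling scheme


@dataclass
class Episode:
    segments: List[Segment]


def training_pairs(episodes: List[Episode]):
    # Yield (observation, action) pairs from all segment types, including the
    # deliberately collected recoveries and corrections.
    for episode in episodes:
        for segment in episode.segments:
            yield from zip(segment.observations, segment.actions)
```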
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
🚨🚨New paper on core RL: a way to train value functions via flow matching to scale compute! No text/images, but a flow directly on a scalar Q-value. This unlocks the benefits of iterative compute and test-time scaling for value prediction & SOTA results on everything we tried. 🧵⬇️
[image]
11 replies · 83 reposts · 706 likes · 71K views
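A minimal sketch, assuming PyTorch, of the general idea named in the tweet: instead of regressing Q(s, a) directly, train a conditional velocity model on a 1-D target (the scalar return) with a flow-matching loss, then integrate the learned flow at inference so more integration steps buy more test-time compute. The architecture and names here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn


class ScalarFlow(nn.Module):
    """Velocity model v_theta(q_t, t, cond) for a 1-D flow over Q-values."""

    def __init__(self, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + 2, hidden), nn.SiLU(),  # inputs: q_t, t, (s, a) embedding
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_t, t, cond):
        return self.net(torch.cat([q_t, t, cond], dim=-1))


def flow_matching_loss(model, cond, q_target):
    # Linear path from noise q0 ~ N(0, 1) to the scalar return q1 = q_target;
    # the target velocity along this path is simply (q1 - q0).
    q0 = torch.randn_like(q_target)
    t = torch.rand_like(q_target)
    q_t = (1 - t) * q0 + t * q_target
    v_pred = model(q_t, t, cond)
    return ((v_pred - (q_target - q0)) ** 2).mean()


@torch.no_grad()
def predict_q(model, cond, steps: int = 16):
    # Euler integration of the learned velocity field; more steps = more iterative
    # compute at test time (the "test-time scaling" knob for value prediction).
    q = torch.randn(cond.shape[0], 1, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full_like(q, i * dt)
        q = q + dt * model(q, t, cond)
    return q
```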
Matthew Yang retweeted