Matthew Yang

42 posts

@_matthewyang

MSML student @ CMU

Joined August 2024

241 Following · 100 Followers

Pinned Tweet

Matthew Yang@_matthewyang·21 Jan

Almost nobody does proper credit assignment in RL-on-LLMs 💀 Learning only from the final outcome → punishes good steps 😭 → rewards bad steps 😭😭 🚨New Paper🚨 A new paradigm for credit assignment: LLMs identify their own mistakes ❌ and propose targeted fixes 🎯 🧵[1/n]

Matthew Yang Retweeted

Ian Wu@ianwu97·5 Feb

1/ How can we train LLMs to continually improve their reasoning over test horizons much longer than their training token budgets? Introducing Reasoning Cache (RC), an algorithm that trains LLMs to *extrapolate*.

Matthew Yang Retweeted

Aviral Kumar@aviral_kumar2·13 Feb

Can just a 4B model solve IMO-level proof problems at the level of much stronger LLMs like Gemini 3 Pro? Yes, if you can train the LLM to scale test-time compute well! We're very excited to release our 4B model "QED-Nano", built via an awesome open collab! Details below🧵⬇️

Matthew Yang@_matthewyang·23 Jan

@rosmine We tried generating interventions with larger models, namely Qwen3-30B-A3B-Instruct (see section 3) and Gemini 2.5 Pro (see Appendix). We find that larger models tend to generate better interventions (row 6 vs. row 5).

Rosmine@rosmine·23 Jan

@matthewyryang Did you try scaling up at all beyond 4B? Curious if larger models can get more improvement, or if the boost decreases with size

Matthew Yang@_matthewyang·23 Jan

@dhruvbhatia0 No, it does not, because we run standard online RL after we SFT on the interventions

dhruv bhatia@dhruvbhatia0·23 Jan

@matthewyryang Does this stunt the model's ability to correctly generate the critical tokens without seeing the reference solution, compared to just pure RL?

Matthew Yang Retweeted

Aviral Kumar@aviral_kumar2·22 Jan

🚨🚨New paper: Scaling RL to complex tasks shows credit assignment is a bottleneck. But the standard way of fitting a PRM + optimizing it is too inefficient to solve it ❌ Our idea: use asymmetries in an LLM to let it do its own credit assignment, in natural language w/o PRMs! 🧵⬇️

Matthew Yang Retweeted

Jack Bai@jackbot_cs·22 Jan

🚨 New Paper Alert 🚨 💥 SFT on hard tasks given reference solution is usually too off-policy, which can cause the training to crash. 🐌 On-policy RL on these hard tasks introduces low sample efficiency, although more stable. 😈 Today, we introduce Intervention Training (InT), an algorithm that avoids shortcomings of both sides. A thread 🧵 1/n

Matthew Yang@_matthewyang·21 Jan

Thank you to my amazing set of collaborators @jackbot_cs @ianwu97 @geneyang4 @setlur_amrith @aviral_kumar2 for making this happen!!! 🙏🙏🙏 And grateful to end my master’s journey with this project ⛵️🌅😎 🧵[7/n]

Matthew Yang@_matthewyang·21 Jan

website: intervention-training.github.io paper: arxiv.org/abs/2601.14209 code: github.com/intervention-t… 🧵[6/n]

Matthew Yang Retweeted

Jack Bai@jackbot_cs·9 Jan

😈 Today, we introduce WebGym, the largest-to-date open-source RL environment for web agent training that contains 300k tasks and a rollout framework optimized specifically for web environments' rollout speed. We reveal the effects of essential scaling directions we observe with WebGym. 1/n

Matthew Yang Retweeted

Aviral Kumar@aviral_kumar2·26 Nov

🚨🚨New blog post led by CMU students: Want to know why LLM RL training plateaus on hard problems & scaling compute may not help? And how to fix this issue? Turns out it stems from a coupling of poor exploration & optimization. Classical ways to explore don't work, but ours does! 🧵⬇️

Matthew Yang Retweeted

Andrew Zhao@_AndrewZhao·30 Sep

paper of the day

Matthew Yang Retweeted

Gautam Kamath@thegautamkamath·14 Sep

Anyone who's done a PhD knows the feeling

Matthew Yang Retweeted

Zheyuan Hu@real_ZheyuanHu·10 Sep

Introducing RaC: A data collection protocol that boosts data efficiency by 10x compared to some of the best imitation results. Key idea: scale recovery & correction data systematically => policies can reset+retry when acting (consistent self-correct) => better performance. 🧵0/N

Matthew Yang Retweeted

Aviral Kumar@aviral_kumar2·9 Sep

🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️

Matthew Yang Retweeted

Amrith Setlur@setlur_amrith·5 Sep

Nice to see ideas in our e3 paper (arxiv.org/pdf/2506.09026): chaining asymmetries to learn meta-behaviors, also work on didactic tasks!

Lifan Yuan@lifan__yuan

🧩New blog: From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones Do LLMs learn new skills through RL, or just activate existing patterns? Answer: RL teaches the powerful meta-skill of composition when properly incentivized. 🔗:husky-morocco-f72.notion.site/From-f-x-and-g…
