Matthew Yang
@_matthewyang
MSML student @ CMU
42 posts
Joined August 2024
241 Following · 100 Followers

Pinned Tweet
Matthew Yang @_matthewyang
Almost nobody does proper credit assignment in RL-on-LLMs 💀 Learning only from the final outcome → punishes good steps 😭 → rewards bad steps 😭😭 🚨New Paper🚨 A new paradigm for credit assignment: LLMs identify their own mistakes ❌ and propose targeted fixes 🎯 🧵[1/n]
[image]
8 replies · 25 reposts · 193 likes · 11.1K views
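A minimal sketch of the contrast this thread draws, not the paper's implementation: outcome-only credit copies the final reward onto every step, while self-identified credit asks the model itself which steps were mistakes and assigns credit per step. The `judge_step` helper and the exact reward values are illustrative assumptions.

```python
from typing import List


def outcome_only_credit(steps: List[str], final_reward: float) -> List[float]:
    # Every step inherits the trajectory-level outcome: good steps in a failed
    # rollout get punished, bad steps in a lucky rollout get rewarded.
    return [final_reward for _ in steps]


def self_identified_credit(steps: List[str], final_reward: float, critic_llm) -> List[float]:
    # Hypothetical interface: the model critiques its own steps in natural language
    # and returns "ok" or "mistake" per step; credit is then assigned step by step.
    credits = []
    for step in steps:
        verdict = critic_llm.judge_step(step)          # assumed helper, not a real API
        if verdict == "mistake":
            credits.append(-1.0)                       # penalize the identified mistake...
        else:
            credits.append(max(final_reward, 0.0))     # ...without punishing good steps in a failed rollout
    return credits
```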
Matthew Yang retweeted
Ian Wu @ianwu97
1/How can we train LLMs to continually improve their reasoning over test horizons much longer than their training token budgets? Introducing Reasoning Cache (RC), an algorithm that trains LLMs to *extrapolate*.
[image]
5 replies · 30 reposts · 199 likes · 12.3K views
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
Can just a 4B model solve IMO-level proof problems at the level of much stronger LLMs like Gemini 3 Pro? Yes, if you can train the LLM to scale test-time compute well! We're very excited to release our 4B model "QED-Nano", built via an awesome open collab! Details below🧵⬇️
[image]
8 replies · 26 reposts · 168 likes · 21.7K views
Matthew Yang @_matthewyang
@rosmine We tried generating interventions with larger models, namely Qwen3-30B-A3B-Instruct (see Section 3) and Gemini 2.5 Pro (see the Appendix). We find that larger models tend to generate better interventions (row 6 vs. row 5).
[image]
0 replies · 0 reposts · 2 likes · 126 views
Rosmine @rosmine
@matthewyryang Did you try scaling up at all beyond 4B? Curious whether larger models get more improvement, or if the boost decreases with size.
1 reply · 0 reposts · 1 like · 122 views
Matthew Yang @_matthewyang
@dhruvbhatia0 No, it does not, because we run standard online RL after we SFT on the interventions.
1 reply · 0 reposts · 2 likes · 60 views
dhruv bhatia @dhruvbhatia0
@matthewyryang Does this stunt the model's ability to correctly generate the critical tokens without seeing the reference solution, compared to just pure RL?
1 reply · 0 reposts · 1 like · 213 views
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
🚨🚨New paper: Scaling RL to complex tasks shows credit assignment is a bottleneck. But the standard recipe of fitting a PRM + optimizing it is too inefficient to solve it ❌ Our idea: use asymmetries in an LLM to let it do its own credit assignment, in natural language, w/o PRMs! 🧵⬇️
[image]
7 replies · 28 reposts · 210 likes · 12.1K views
Matthew Yang retweeted
Jack Bai @jackbot_cs
🚨 New Paper Alert 🚨 💥 SFT on hard tasks given a reference solution is usually too off-policy, which can cause training to crash. 🐌 On-policy RL on these hard tasks is more stable but sample-inefficient. 😈 Today, we introduce Intervention Training (InT), an algorithm that avoids the shortcomings of both. A thread 🧵 1/n
[GIF]
5 replies · 30 reposts · 186 likes · 13.4K views
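A minimal sketch of the two-stage recipe described in this tweet and in the replies above (SFT on targeted interventions, then standard online RL). The trainer functions, the `intervener` object, and the `splice` helper are assumed names for illustration, not the released InT code.

```python
def splice(rollout_steps, fix):
    # Keep the on-policy prefix up to the failing step, then append the targeted fix
    # (rather than replacing the whole rollout with an off-policy reference solution).
    return rollout_steps[: fix.position] + [fix.text]


def intervention_training(policy, problems, intervener, sft_steps=1_000, rl_steps=10_000):
    # Stage 1: collect on-policy rollouts, have a (possibly larger) model propose a
    # targeted fix where each rollout goes wrong, and SFT on the corrected traces.
    sft_data = []
    for problem in problems:
        rollout = policy.generate(problem)               # assumed policy API
        fix = intervener.propose_fix(problem, rollout)   # assumed intervention API
        if fix is not None:
            sft_data.append(splice(rollout, fix))
    policy = train_sft(policy, sft_data, steps=sft_steps)        # hypothetical trainer

    # Stage 2: standard online RL on the final-outcome reward, starting from the
    # SFT'd policy, so the model learns to produce the critical tokens on its own.
    policy = train_online_rl(policy, problems, steps=rl_steps)   # hypothetical trainer
    return policy
```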
Matthew Yang retweeted
Jack Bai @jackbot_cs
😈 Today, we introduce WebGym, the largest-to-date open-source RL environment for web-agent training: 300k tasks and a rollout framework optimized for the rollout speed of web environments. We reveal the effects of the essential scaling directions we observe with WebGym. 1/n
13 replies · 37 reposts · 377 likes · 43.3K views
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
🚨🚨New blog post led by CMU students: Want to know why LLM RL training plateaus on hard problems & why scaling compute may not help? And how to fix this issue? Turns out it stems from a coupling of poor exploration & optimization. Classical ways to explore don't work, but ours does! 🧵⬇️
[GIF]
6 replies · 44 reposts · 254 likes · 31.2K views
Matthew Yang retweeted
Andrew Zhao @_AndrewZhao
paper of the day
[image]
15 replies · 34 reposts · 576 likes · 86.1K views
Matthew Yang retweeted
Gautam Kamath @thegautamkamath
Anyone who's done a PhD knows the feeling
[image]
3 replies · 5 reposts · 127 likes · 7.5K views
Matthew Yang retweeted
Zheyuan Hu @real_ZheyuanHu
Introducing RaC: a data collection protocol that boosts data efficiency by 10x compared to some of the best imitation-learning results. Key idea: scale recovery & correction data systematically => policies can reset + retry while acting (consistent self-correction) => better performance. 🧵0/N
11 replies · 38 reposts · 209 likes · 19.6K views
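A minimal sketch of what "recovery & correction" data could look like as a data structure; these classes and labels are assumptions for illustration, not the RaC release. The point is that recovery and correction segments are kept in the training set rather than filtered out as mistakes, so the policy sees reset-and-retry behaviour at training time.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Segment:
    observations: list
    actions: list
    kind: Literal["progress", "recovery", "correction"]  # assumed labeling scheme


@dataclass
class Episode:
    segments: List[Segment]


def training_pairs(episodes: List[Episode]):
    # Yield (observation, action) pairs from all segment types, including the
    # deliberately collected recoveries and corrections.
    for episode in episodes:
        for segment in episode.segments:
            yield from zip(segment.observations, segment.actions)
```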
Matthew Yang retweeted
Aviral Kumar @aviral_kumar2
🚨🚨New paper on core RL: a way to train value functions via flow matching to scale compute! No text/images, but a flow directly on a scalar Q-value. This unlocks the benefits of iterative compute and test-time scaling for value prediction & SOTA results on everything we tried. 🧵⬇️
[image]
11 replies · 83 reposts · 706 likes · 71K views
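A minimal sketch, assuming PyTorch, of the general idea named in the tweet: instead of regressing Q(s, a) directly, train a conditional velocity model on a 1-D target (the scalar return) with a flow-matching loss, then integrate the learned flow at inference so more integration steps buy more test-time compute. The architecture and names here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn


class ScalarFlow(nn.Module):
    """Velocity model v_theta(q_t, t, cond) for a 1-D flow over Q-values."""

    def __init__(self, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + 2, hidden), nn.SiLU(),  # inputs: q_t, t, (s, a) embedding
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_t, t, cond):
        return self.net(torch.cat([q_t, t, cond], dim=-1))


def flow_matching_loss(model, cond, q_target):
    # Linear path from noise q0 ~ N(0, 1) to the scalar return q1 = q_target;
    # the target velocity along this path is simply (q1 - q0).
    q0 = torch.randn_like(q_target)
    t = torch.rand_like(q_target)
    q_t = (1 - t) * q0 + t * q_target
    v_pred = model(q_t, t, cond)
    return ((v_pred - (q_target - q0)) ** 2).mean()


@torch.no_grad()
def predict_q(model, cond, steps: int = 16):
    # Euler integration of the learned velocity field; more steps = more iterative
    # compute at test time (the "test-time scaling" knob for value prediction).
    q = torch.randn(cond.shape[0], 1, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full_like(q, i * dt)
        q = q + dt * model(q, t, cond)
    return q
```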
Matthew Yang retweeted