Ian Wu

54 posts

@ianwu97

ML PhD @mldcmu

Pittsburgh · Joined December 2020
212 Following · 255 Followers
Pinned Tweet
Ian Wu @ianwu97
1/How can we train LLMs to continually improve their reasoning over test horizons much longer than their training token budgets? Introducing Reasoning Cache (RC), an algorithm that trains LLMs to *extrapolate*.
[image]
5 replies · 30 reposts · 199 likes · 12.3K views
Ian Wu @ianwu97
@eliebakouch We studied something similar here: arxiv.org/pdf/2602.03773 Even the 4B Qwen model is good enough to self-summarize with just prompting. So yeah, doesn’t surprise me that you can just do RL directly on bigger/better models like Composer
0 replies · 0 reposts · 2 likes · 74 views
Ian Wu @ianwu97
@_lewtun Looking at this in more detail, it seems they train to generate better summaries, whereas we train to use summaries. I’m guessing the former is needed for agentic/code stuff where there is a lot of context but maybe less important for math/science.
0 replies · 0 reposts · 1 like · 87 views
Ian Wu @ianwu97
@_lewtun Great to see more work in this direction. We really need to move beyond just autoregressive CoT for long horizon tasks.
0 replies · 0 reposts · 0 likes · 79 views
Ian Wu reposted
Jack Bai @jackbot_cs
😈 Today, Microsoft open-sources WebGym: the task set, code, a bunch of visualization tools, and guiding documentation. WebGym is an RL environment with the *first* open-source implementation of a fully asynchronous rollout system designed for multi-step, vision-supported web agentic trajectory collection, which is *4x-5x* faster than existing synchronous implementations. This release comes with *300k* realistic web agentic tasks with a comprehensive evaluation rubric and pipeline, together with annotations on difficulty and domain. 🧵 1/6
2 replies · 10 reposts · 50 likes · 4K views
Ian Wu @ianwu97
@jburnmurdoch Horrifying. This problem also predates 08/09, which from what I can tell is when Britain began lagging behind its peers. Curious to hear your explanation for this.
1 reply · 0 reposts · 1 like · 1.4K views
John Burn-Murdoch @jburnmurdoch
I’ve shown another lens on this same thing previously: the UK’s top income decile has had a very rough couple of decades (and the top half as a whole has done much worse than the bottom half). Who works in top-paying jobs? Graduates. That’s the erosion of the premium right there
[image]
26 replies · 101 reposts · 785 likes · 362.4K views
John Burn-Murdoch @jburnmurdoch
The real sign British education has failed is the number of people responding to this chart with "that’s what happens when too many people go to university." HE has expanded in all of these countries, and in every one apart from the UK that expansion didn’t erode the graduate earnings premium.
Stefan Schubert @StefanFSchubert

Whereas the graduate premium has increased in most rich countries, it has plummeted in Britain since 1997. Earnings for British graduates have shrunk (next pic). ->

92 replies · 519 reposts · 2.8K likes · 536.9K views
Ian Wu @ianwu97
@barrowjoseph A lot of untapped potential with these smaller models. I guess most teams with the resources + motivation + expertise stick with training bigger models since they are more likely to get eye-catching results that way.
1 reply · 0 reposts · 1 like · 39 views
Joe Barrow @barrowjoseph
QED-Nano is crazy to me. The idea that you can post-train a 4B parameter model to do well at *real, challenging* tasks violates my prior assumptions. My coworkers may be tired of me sharing it already.
Ian Wu @ianwu97

We post-trained a 4B parameter model to write Olympiad-level math proofs! Our tiny model, *QED-Nano*, outperforms gpt-oss-120b, and approaches the performance of Gemini 3 Pro when paired with test-time scaffolding. Check out the blogpost: huggingface.co/spaces/lm-prov…

1 reply · 2 reposts · 20 likes · 1.7K views
Ian Wu @ianwu97
We post-trained a 4B parameter model to write Olympiad-level math proofs! Our tiny model, *QED-Nano*, outperforms gpt-oss-120b, and approaches the performance of Gemini 3 Pro when paired with test-time scaffolding. Check out the blogpost: huggingface.co/spaces/lm-prov…
[image]
2 replies · 13 reposts · 97 likes · 7.1K views
Ian Wu @ianwu97
@siddarthv66 @_lewtun Yep! We discuss this for verifiable rewards in the original RC paper as well btw. RC trained model + RSA at test time to scale parallel compute -> big gains. RC trained model + RC decoding + RSA at test time -> even bigger gains. arxiv.org/pdf/2602.03773
1 reply · 0 reposts · 4 likes · 121 views
Ian Wu @ianwu97
11/Does RC training teach a generalizable skill for using guidance? To test this, we use RCT-4B within two existing test-time scaffolds that utilize other forms of guidance (aggregation, verification feedback, etc.). RCT-4B achieves large gains, even without specific adaptations.
[image]
1 reply · 1 repost · 6 likes · 494 views