Locke Cai

22 posts

@couplefire12

CS & Math @ MIT | ML Research Intern @ https://t.co/Nli5KHCIzI

Joined August 2023
154 Following · 482 Followers
Pinned Tweet
Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
24 replies · 78 reposts · 611 likes · 177K views
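To make the adversarial game concrete, here is a minimal sketch of one training round, assuming the setup described in the thread. All names (`policy_answer`, `critic_verdict`, `policy_reward`) are hypothetical placeholders with dummy logic, not RARO's actual code; in the paper both players are LLMs trained with RL.

```python
import random

# Hypothetical sketch of one RARO-style round: a policy proposes an
# answer, a critic tries to tell it apart from the expert demo, and
# the policy is rewarded for fooling the critic. No verifier needed.

def policy_answer(question: str) -> str:
    """Policy player: proposes an answer (dummy stand-in)."""
    return f"a policy-written answer to: {question}"

def critic_verdict(question: str, expert: str, model: str) -> str:
    """Critic player: guesses which answer is the expert's.
    Returns 'expert' (correct guess), 'model' (fooled: it took the
    model's answer for the expert's), or 'tie' (dummy stand-in)."""
    return random.choice(["expert", "model", "tie"])

def policy_reward(verdict: str) -> float:
    """Policy reward: 1 when the critic is fooled, 0 when caught,
    and an intermediate value on ties to keep the signal stable."""
    return {"model": 1.0, "tie": 0.5, "expert": 0.0}[verdict]

def training_round(question: str, expert_answer: str) -> float:
    model_answer = policy_answer(question)
    verdict = critic_verdict(question, expert_answer, model_answer)
    return policy_reward(verdict)

print(training_round("Write a haiku about autumn.", "An expert haiku..."))
```

The point of the structure: the only supervision is the demonstration itself, and the verifier's role is played by a trainable opponent.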
Locke Cai retweeted
Daniel Tan @DanielCHTan97
pretty cool paper on learning to reason from demonstrations: arxiv.org/abs/2511.21667. tl;dr: instead of SFT'ing on expert demonstrations, try doing inverse RL to learn a critic. This lets you do RL on domains without 'natural' verifiers.
7 replies · 24 reposts · 252 likes · 25.1K views
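One way to read the "no natural verifiers" point: a rule-based verifier and a learned critic expose the same reward interface, so the surrounding RL loop doesn't care which one it gets. A hedged sketch with hypothetical names (the actual RARO critic is pairwise/relativistic rather than a pointwise scorer):

```python
from typing import Callable

# Reward hook: RL-for-reasoning code only needs *some* scorer.
Scorer = Callable[[str, str], float]

def rl_reward(question: str, answer: str, scorer: Scorer) -> float:
    return scorer(question, answer)

# Verifiable domain (e.g. math): an exact-match rule works.
def math_verifier(question: str, answer: str) -> float:
    return 1.0 if answer.strip() == "42" else 0.0

# Non-verifiable domain (e.g. poetry): plug in a critic learned from
# expert demonstrations via inverse RL (dummy score stands in here).
def learned_critic(question: str, answer: str) -> float:
    return min(1.0, len(answer) / 100)

print(rl_reward("6 * 7 = ?", "42", math_verifier))        # 1.0
print(rl_reward("Write a sonnet.", "Shall I compare...", learned_critic))
```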
Locke Cai @couplefire12
@jackcai1206 Thanks for the question! I believe scaling RARO could unlock new emergent behaviors in non-verifiable tasks, so one cool direction to explore is scaling RARO in:
1. Larger SOTA models
2. Higher reasoning budget
3. More difficult and diverse datasets
0 replies · 0 reposts · 3 likes · 337 views
Rohan Paul @rohanpaul_ai
This paper shows a Large Language Model (LLM) can learn strong reasoning from expert examples without a verifier. On Countdown, it reaches 54.4% accuracy without a verifier, versus 40.7% from supervised fine-tuning.

A verifier is a checker that says right or wrong, but many tasks, like writing, do not have one. Supervised fine-tuning mostly copies answers from the dataset, so the model does not practice fixing its own mistakes mid-solution.

RARO learns a reward from expert question-and-answer pairs, meaning it tries to infer what makes an answer look expert-level. It trains a critic that compares an expert answer and a model answer, then picks expert, model, or tie. The model learns by trial and error, getting higher reward when the critic is fooled, while tie rewards keep feedback stable. At test time the critic can rank multiple sampled answers and pick the best, so this works even on open-ended tasks.

Paper link: arxiv.org/abs/2511.21667
Paper title: "Escaping the Verifier: Learning to Reason via Demonstrations"
14 replies · 33 reposts · 203 likes · 11.6K views
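The test-time use at the end of this summary (the critic ranking multiple sampled answers) could look roughly like the sketch below. `pairwise_prefer` is a hypothetical stand-in for the trained critic, and the single-elimination scan is just one simple way to turn pairwise judgments into a best-of-n pick; the paper may aggregate differently.

```python
import random
from typing import Callable, List

def pairwise_prefer(question: str, a: str, b: str) -> str:
    """Hypothetical trained critic: returns 'a', 'b', or 'tie'
    for which answer looks more expert (dummy logic here)."""
    return random.choice(["a", "b", "tie"])

def best_of_n(question: str, candidates: List[str],
              prefer: Callable[[str, str, str], str]) -> str:
    """Keep a running champion; a challenger replaces it only when
    the critic prefers the challenger outright."""
    champion = candidates[0]
    for challenger in candidates[1:]:
        if prefer(question, challenger, champion) == "a":
            champion = challenger
    return champion

samples = [f"sampled answer #{i}" for i in range(8)]
print(best_of_n("An open-ended writing prompt", samples, pairwise_prefer))
```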
alphaXiv @askalphaxiv
RL on LLMs with no verifiers & environments... but how? This paper trains LLMs to reason without using verifiers or human preferences, all by turning expert demos into an adversarial game. Models can then learn self-correcting reasoning, even rivaling verifier-based RL methods!
11 replies · 41 reposts · 230 likes · 28.6K views
Locke Cai retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
This is super cool: basically GANs for LLM post-training. The policy tries to mimic expert answers; the critic tries to identify the expert answer vs the policy answer. I'm curious to try this out on some medical tasks... I had similar ideas about 2 years ago which I was discussing in EleutherAI, but I was trying to apply it to RLHF and didn't pursue it any further... skill issue on my part I guess lol

Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
12 replies · 16 reposts · 273 likes · 26.4K views
Locke Cai retweeted
Christina ¨̮ @luoluo
someone made self-play work! RARO is a cool way to tackle the unstable adversarial training setup: a shared weight update for both generator and discriminator, with regularized entropy, to "distill" the implicit expert policy by minimizing the reverse KL. Long live GANs lol

Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
13 replies · 20 reposts · 258 likes · 41.3K views
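For readers unfamiliar with the "reverse KL" phrasing: in GAN-style setups, rewarding the generator for fooling an (idealized) optimal discriminator is related to minimizing the reverse KL from the policy to the expert distribution, which is mode-seeking and so "distills" the expert rather than averaging over it. A hedged sketch of the quantity being referred to, glossing over RARO's specifics (ties, shared weights, entropy regularization):

```latex
% Reverse KL from policy \pi to expert distribution p_E:
% mode-seeking, so \pi concentrates on answers the expert could
% plausibly have written rather than covering all of them.
\mathrm{KL}\left(\pi \,\middle\|\, p_E\right)
  = \mathbb{E}_{y \sim \pi}\!\left[\log \frac{\pi(y)}{p_E(y)}\right]
```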
Locke Cai @couplefire12
@antoine_mln Thanks for sharing this; it looks really relevant to our setup! Always glad to see cool RL theory papers.
0 replies · 0 reposts · 4 likes · 740 views
Antoine Moulin @antoine_mln
@couplefire12 Nice! Regarding your note on the sample complexity, you might be interested in arxiv.org/abs/2505.19946, where we analyze a related algorithm. For contextual bandits (your setup), Alg2 would be close to your Alg1: the critic update is ~the same, and we use mirror descent instead of GRPO.
1 reply · 0 reposts · 14 likes · 1.4K views
Locke Cai @couplefire12
@Teknium Thanks for the question! Our poems are directly sourced from huggingface.co/datasets/jnb66…, and we annotate each poem with a prompt via GPT-5. We split the dataset into train/val/test, and we benchmark on test using GPT-5 as judge.
0 replies · 0 reposts · 7 likes · 1.2K views
Locke Cai @couplefire12
@guy_dar1 In theory yes, if we train for a large number of epochs. However, in our experiments, even when we trained for ~8 epochs, we didn't observe memorization. But thanks for bringing this up; we do plan on conducting further experiments on how dataset size and epoch count affect performance.
1 reply · 0 reposts · 4 likes · 121 views
Guy Dar @guy_dar1
@couplefire12 Thanks for the response! But if it trains on the same expert answers, it will eventually learn them verbatim, right? Am I missing something?
1 reply · 0 reposts · 3 likes · 118 views
Locke Cai @couplefire12
@guy_dar1 Thanks for the question! While the relativistic critic sees both an expert and a policy answer, the order of the two answers is randomized, so it's incentivized to reason & compare instead of memorizing.
1 reply · 0 reposts · 4 likes · 662 views
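A tiny illustration of why the randomized ordering blocks slot memorization (hypothetical code, not the paper's): since the expert answer lands in either position with equal probability, a critic that memorizes "position A is the expert" can do no better than chance and is forced to actually compare content.

```python
import random

def present_pair(expert: str, model: str):
    """Show the two answers to the critic in random order; also
    return which slot actually holds the expert answer."""
    if random.random() < 0.5:
        return (expert, model), "A"
    return (model, expert), "B"

# A degenerate critic that always guesses slot "A" scores ~50%:
trials = 10_000
hits = sum(present_pair("expert ans", "model ans")[1] == "A"
           for _ in range(trials))
print(f"slot-memorizing critic accuracy: {hits / trials:.1%}")
```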
Guy Dar @guy_dar1
@couplefire12 Really cool! How do you avoid memorization though?
1 reply · 0 reposts · 3 likes · 807 views
Locke Cai @couplefire12
Woke up to some amazing feedback, thanks everyone!! @provilkov and I are working hard to release a plug-and-play RARO repo soon — what domains do you want to see supported? If you have specific model/dataset requests, let us know in the announcement thread! 👇

Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
1 reply · 0 reposts · 18 likes · 1.1K views
Locke Cai @couplefire12
@provilkov @m_ryabinin @togethercompute @MIT Update: we are working hard to release plug-and-play code for everyone soon. If you have any specific model/dataset requests, let us know here! Thanks, everyone, for your support!
1 reply · 0 reposts · 17 likes · 1.9K views