Locke Cai

22 posts

@couplefire12

CS & Math @ MIT | ML Research Intern @ https://t.co/Nli5KHCIzI

Joined August 2023
154 Following · 482 Followers
Pinned Tweet
Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
24 replies · 78 reposts · 611 likes · 177K views
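To make the adversarial game concrete, here is a minimal sketch of one training round, assuming the setup described in the thread. All names (`policy_answer`, `critic_verdict`, `policy_reward`) are hypothetical placeholders with dummy logic, not RARO's actual code; in the paper both players are LLMs trained with RL.

```python
import random

# Hypothetical sketch of one RARO-style round: a policy proposes an
# answer, a critic tries to tell it apart from the expert demo, and
# the policy is rewarded for fooling the critic. No verifier needed.

def policy_answer(question: str) -> str:
    """Policy player: proposes an answer (dummy stand-in)."""
    return f"a policy-written answer to: {question}"

def critic_verdict(question: str, expert: str, model: str) -> str:
    """Critic player: guesses which answer is the expert's.
    Returns 'expert' (correct guess), 'model' (fooled: it took the
    model's answer for the expert's), or 'tie' (dummy stand-in)."""
    return random.choice(["expert", "model", "tie"])

def policy_reward(verdict: str) -> float:
    """Policy reward: 1 when the critic is fooled, 0 when caught,
    and an intermediate value on ties to keep the signal stable."""
    return {"model": 1.0, "tie": 0.5, "expert": 0.0}[verdict]

def training_round(question: str, expert_answer: str) -> float:
    model_answer = policy_answer(question)
    verdict = critic_verdict(question, expert_answer, model_answer)
    return policy_reward(verdict)

print(training_round("Write a haiku about autumn.", "An expert haiku..."))
```

The point of the structure: the only supervision is the demonstration itself, and the verifier's role is played by a trainable opponent.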
Locke Cai retweeted
Daniel Tan @DanielCHTan97
pretty cool paper on learning to reason from demonstrations: arxiv.org/abs/2511.21667. tl;dr: instead of SFT'ing on expert demonstrations, try doing inverse RL to learn a critic. This lets you do RL on domains without 'natural' verifiers.
7 replies · 24 reposts · 252 likes · 25.1K views
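One way to read the "no natural verifiers" point: a rule-based verifier and a learned critic expose the same reward interface, so the surrounding RL loop doesn't care which one it gets. A hedged sketch with hypothetical names (the actual RARO critic is pairwise/relativistic rather than a pointwise scorer):

```python
from typing import Callable

# Reward hook: RL-for-reasoning code only needs *some* scorer.
Scorer = Callable[[str, str], float]

def rl_reward(question: str, answer: str, scorer: Scorer) -> float:
    return scorer(question, answer)

# Verifiable domain (e.g. math): an exact-match rule works.
def math_verifier(question: str, answer: str) -> float:
    return 1.0 if answer.strip() == "42" else 0.0

# Non-verifiable domain (e.g. poetry): plug in a critic learned from
# expert demonstrations via inverse RL (dummy score stands in here).
def learned_critic(question: str, answer: str) -> float:
    return min(1.0, len(answer) / 100)

print(rl_reward("6 * 7 = ?", "42", math_verifier))        # 1.0
print(rl_reward("Write a sonnet.", "Shall I compare...", learned_critic))
```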
Locke Cai @couplefire12
@jackcai1206 Thanks for the question! I believe scaling RARO could unlock new emergent behaviors in non-verifiable tasks, so one cool direction to explore is scaling RARO in:
1. Larger SOTA models
2. Higher reasoning budget
3. More difficult and diverse datasets
0 replies · 0 reposts · 3 likes · 337 views
Rohan Paul @rohanpaul_ai
This paper shows a Large Language Model (LLM) can learn strong reasoning from expert examples without a verifier. On Countdown, it reaches 54.4% accuracy without a verifier, versus 40.7% from supervised fine-tuning.

A verifier is a checker that says right or wrong, but many tasks, like writing, do not have one. Supervised fine-tuning mostly copies answers from the dataset, so the model does not practice fixing its own mistakes mid-solution.

RARO learns a reward from expert question-and-answer pairs, meaning it tries to infer what makes an answer look expert-level. It trains a critic that compares an expert answer and a model answer, then picks expert, model, or tie. The model learns by trial and error, getting higher reward when the critic is fooled, while tie rewards keep feedback stable. At test time the critic can rank multiple sampled answers and pick the best, so this works even on open-ended tasks.

Paper link: arxiv.org/abs/2511.21667
Paper title: "Escaping the Verifier: Learning to Reason via Demonstrations"
14 replies · 33 reposts · 203 likes · 11.6K views
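The test-time use at the end of this summary (the critic ranking multiple sampled answers) could look roughly like the sketch below. `pairwise_prefer` is a hypothetical stand-in for the trained critic, and the single-elimination scan is just one simple way to turn pairwise judgments into a best-of-n pick; the paper may aggregate differently.

```python
import random
from typing import Callable, List

def pairwise_prefer(question: str, a: str, b: str) -> str:
    """Hypothetical trained critic: returns 'a', 'b', or 'tie'
    for which answer looks more expert (dummy logic here)."""
    return random.choice(["a", "b", "tie"])

def best_of_n(question: str, candidates: List[str],
              prefer: Callable[[str, str, str], str]) -> str:
    """Keep a running champion; a challenger replaces it only when
    the critic prefers the challenger outright."""
    champion = candidates[0]
    for challenger in candidates[1:]:
        if prefer(question, challenger, champion) == "a":
            champion = challenger
    return champion

samples = [f"sampled answer #{i}" for i in range(8)]
print(best_of_n("An open-ended writing prompt", samples, pairwise_prefer))
```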
alphaXiv @askalphaxiv
RL on LLMs with no verifiers & environments... but how? This paper trains LLMs to reason without using verifiers or human preferences, all by turning expert demos into an adversarial game. Models can then learn self-correcting reasoning, even rivaling verifier-based RL methods!
11 replies · 41 reposts · 230 likes · 28.6K views
Locke Cai retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
This is super cool: basically GANs for LLM post-training. The policy tries to mimic expert answers; the critic tries to identify the expert answer vs the policy answer. I'm curious to try this out on some medical tasks... I had similar ideas about 2 years ago which I was discussing in EleutherAI, but I was trying to apply it to RLHF and didn't pursue it any further... skill issue on my part I guess lol

Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
12 replies · 16 reposts · 273 likes · 26.4K views
Locke Cai retweeted
Christina ¨̮ @luoluo
someone made self-play work! RARO is a cool way to tackle the unstable adversarial training setup: a shared weight update for both generator and discriminator, with regularized entropy, to "distill" the implicit expert policy by minimizing the reverse KL. Long live GANs lol

Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
13 replies · 20 reposts · 258 likes · 41.3K views
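For readers unfamiliar with the "reverse KL" phrasing: in GAN-style setups, rewarding the generator for fooling an (idealized) optimal discriminator is related to minimizing the reverse KL from the policy to the expert distribution, which is mode-seeking and so "distills" the expert rather than averaging over it. A hedged sketch of the quantity being referred to, glossing over RARO's specifics (ties, shared weights, entropy regularization):

```latex
% Reverse KL from policy \pi to expert distribution p_E:
% mode-seeking, so \pi concentrates on answers the expert could
% plausibly have written rather than covering all of them.
\mathrm{KL}\left(\pi \,\middle\|\, p_E\right)
  = \mathbb{E}_{y \sim \pi}\!\left[\log \frac{\pi(y)}{p_E(y)}\right]
```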
Locke Cai @couplefire12
@antoine_mln Thanks for sharing this; it looks really relevant to our setup! Always glad to see cool RL theory papers.
0 replies · 0 reposts · 4 likes · 740 views
Antoine Moulin @antoine_mln
@couplefire12 Nice! Regarding your note on the sample complexity, you might be interested in arxiv.org/abs/2505.19946, where we analyze a related algorithm. For contextual bandits (your setup), Alg2 would be close to your Alg1: the critic update is ~the same, and we use mirror descent instead of GRPO.
1 reply · 0 reposts · 14 likes · 1.4K views
Locke Cai @couplefire12
@Teknium Thanks for the question! Our poems are directly sourced from huggingface.co/datasets/jnb66…, and we annotate each poem with a prompt via GPT-5. We split the dataset into train/val/test, and we benchmark on test using GPT-5 as judge.
0 replies · 0 reposts · 7 likes · 1.2K views
Locke Cai @couplefire12
@guy_dar1 In theory yes, if we train for a large number of epochs. However, in our experiments, even when we trained for ~8 epochs, we didn't observe memorization. But thanks for bringing this up; we do plan on conducting further experiments on how dataset size and epoch count affect performance.
1 reply · 0 reposts · 4 likes · 121 views
Guy Dar @guy_dar1
@couplefire12 Thanks for the response! But if it trains on the same expert answers, it will eventually learn them verbatim, right? Am I missing something?
1 reply · 0 reposts · 3 likes · 118 views
Locke Cai @couplefire12
@guy_dar1 Thanks for the question! While the relativistic critic sees both an expert and a policy answer, the order of the two answers is randomized, so it's incentivized to reason & compare instead of memorizing.
1 reply · 0 reposts · 4 likes · 662 views
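A tiny illustration of why the randomized ordering blocks slot memorization (hypothetical code, not the paper's): since the expert answer lands in either position with equal probability, a critic that memorizes "position A is the expert" can do no better than chance and is forced to actually compare content.

```python
import random

def present_pair(expert: str, model: str):
    """Show the two answers to the critic in random order; also
    return which slot actually holds the expert answer."""
    if random.random() < 0.5:
        return (expert, model), "A"
    return (model, expert), "B"

# A degenerate critic that always guesses slot "A" scores ~50%:
trials = 10_000
hits = sum(present_pair("expert ans", "model ans")[1] == "A"
           for _ in range(trials))
print(f"slot-memorizing critic accuracy: {hits / trials:.1%}")
```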
Guy Dar @guy_dar1
@couplefire12 Really cool! How do you avoid memorization though?
1 reply · 0 reposts · 3 likes · 807 views
Locke Cai @couplefire12
Woke up to some amazing feedback, thanks everyone!! @provilkov and I are working hard to release a plug-and-play RARO repo soon — what domains do you want to see supported? If you have specific model/dataset requests, let us know in the announcement thread! 👇

Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
1 reply · 0 reposts · 18 likes · 1.1K views
Locke Cai @couplefire12
@provilkov @m_ryabinin @togethercompute @MIT Update: we are working hard to release plug-and-play code for everyone soon. If you have any specific model/dataset requests, let us know here! Thanks, everyone, for your support!
1 reply · 0 reposts · 17 likes · 1.9K views