Ximing Lu

200 posts


@GXiming

PhD @uwcse @uwnlp.

Santa Clara, CA · Joined February 2018
264 Following · 1.6K Followers
Pinned Tweet
Ximing Lu
Ximing Lu@GXiming·
There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐
English
13
66
394
107.9K
Ximing Lu retweeted
Hao Zhang
Hao Zhang@HaoZhang3438830·
Excited to introduce ProRL Agent: Rollout-as-a-Service for RL training of multi-turn LLM agents! 🚀 As we move toward complex agentic tasks, rollout infrastructure is often a bottleneck. We're decoupling I/O-heavy rollouts from GPU training via a unified HTTP API.

Why ProRL Agent?
- Decoupled & Scalable: treats rollout as a service, allowing near-linear throughput scaling.
- System-Level Optimization: includes load balancing and automated sandbox cleanup for high stability.
- Integrated: now part of NVIDIA NeMo Gym to help researchers scale RL pipelines faster.

The Results 📈 On SWE-bench-Verified, we saw significant gains: +8.4 on Qwen3-8B and +8.2 on Qwen3-14B, with proven success across STEM, Math, and General Coding agents.

Check out the research and open-source code:
📄 Paper: arxiv.org/pdf/2603.18815
💻 Repo: github.com/NVIDIA-NeMo/Pr…

Huge thanks to the team and NVIDIA for the support! 👏
English
4
20
136
27.5K
Ximing Lu
Ximing Lu@GXiming·
We’re open-sourcing the data and model behind Golden Goose 🦢✨. Check them out and see how we turn unverifiable internet text 🌐 into large-scale RLVR tasks 😎. 📊 GooseReason-0.7M: huggingface.co/datasets/nvidi… 🤖 GooseReason-4B-Instruct: huggingface.co/nvidia/Nemotro…
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
3
34
266
34K
Ximing Lu
Ximing Lu@GXiming·
@profcelsofontes thanks! In our paper, we found the Golden Goose method to be very effective in the cybersecurity domain as well. The legal domain would be a nice use case to explore too!
English
0
0
0
33
Prof Celso Fontes
Prof Celso Fontes@profcelsofontes·
@GXiming nice! I have a question: is this method only for creating math/programming tasks, or can I use it for the legal domain too?
English
1
0
0
37
Ximing Lu
Ximing Lu@GXiming·
Thanks for the questions! We jointly train with ProRL data (primarily non-MCQ) and GooseReason data (MCQ) during RL, while most downstream tasks we evaluate are non-MCQ (e.g., math and coding problems). Injecting MCQ-style GooseReason data yields noticeable gains over using ProRL data alone, demonstrating generalizability and transferability. For MegaScience, we exclude the math domain and use only the science portion to construct STEM RLVR tasks, which contains many hard-to-verify instances (e.g., chemical formulas, free-form or open-ended QA).
English
0
0
1
19
Run-Ze Fan
Run-Ze Fan@Vfrz525_·
Great to see MegaScience used for RL training! Quick question: why does using only multiple-choice supervision improve computational tasks? Are math problems converted to MC as well? Why not stick with original verifiable RL signals like R1?
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
1
0
6
701
Ximing Lu
Ximing Lu@GXiming·
Thanks for the questions! The source corpora used in this work (e.g., AoPS-Instruct, rStar-Coder, MegaScience) have already undergone filtering and decontamination during their original curation. We further leverage hard-to-verify items in these corpora (e.g., math theorem proofs, free-form or open-ended QA) to synthesize RLVR tasks.
English
0
0
0
54
Yongrui Su
Yongrui Su@ysu_ChatData·
@GXiming Cool idea. How do you keep the synthesized tasks from drifting or just inheriting web noise? Are you using a verifier model or a held out eval to measure real transfer, and any safeguards against leaking answer cues from the generator into the reward?
English
1
0
0
104
Ximing Lu
Ximing Lu@GXiming·
@AndilesAnthony Thanks for the suggestion! We experiment with varying the data while keeping the ProRL recipe fixed in this paper. We’re happy to ablate the effects of different RL algorithms in the next version.
English
0
0
0
57
Finna
Finna@AndilesAnthony·
I think SDPO could be applied by using the ground-truth feedback to form a feedback-conditioned self-teacher, then distilling via KL. For each GooseReason example x with correct option i*, treat the feedback as f = i* (or reveal the correct span y*). Define the teacher as the same model conditioned on feedback, π_T(· | x, f) := π_θ̄(· | x, f), and train the student policy π_θ(· | x) by minimizing

L(θ) = E_x KL( π_θ̄(· | x, f) || π_θ(· | x) ), where KL(p || q) = Σ_i p(i) log(p(i)/q(i)).

If π_θ̄(· | x, f) collapses to a delta on i*, this reduces to standard cross-entropy: L(θ) = E_x[ −log π_θ(i* | x) ]. But if the feedback-conditioned teacher assigns graded probability mass across options (reflecting near-miss distractors), the student gets a much denser learning signal than the original 0/1 reward, without needing an external LLM judge.

Have you tried GooseReason with feedback-conditioned KL distillation (SDPO-style) and compared it head-to-head against GRPO-style RL on the same synthesized distribution? See: arxiv.org/abs/2601.20802
English
3
0
0
395
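The feedback-conditioned KL objective sketched in the reply above can be illustrated numerically. This is a toy sketch, not code from any paper or release: the distributions and function names are hypothetical, chosen only to show that the delta-teacher case reduces exactly to cross-entropy while a graded teacher gives a smaller (denser) per-example loss.

```python
# Toy sketch of feedback-conditioned KL distillation over MCQ options.
# All distributions here are made up for illustration.
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); zero-mass terms are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(teacher_probs, student_probs):
    """Per-example student loss: KL(teacher || student)."""
    return kl(teacher_probs, student_probs)

# 4-option MCQ, correct option i* = 2. A feedback-conditioned teacher
# keeps graded mass on a near-miss distractor (option 1):
teacher = [0.02, 0.08, 0.85, 0.05]
student = [0.25, 0.25, 0.25, 0.25]   # untrained uniform student
dense = distill_loss(teacher, student)

# If the teacher collapses to a delta on i*, the loss reduces to
# standard cross-entropy -log p_student(i*) = log 4 here:
delta_teacher = [0.0, 0.0, 1.0, 0.0]
ce = distill_loss(delta_teacher, student)
assert abs(ce - math.log(4)) < 1e-12
assert dense < ce  # graded teacher => strictly smaller loss vs. uniform student
```

Against a uniform student, KL(teacher || uniform) = log 4 − H(teacher), so any teacher with nonzero entropy yields a smaller loss than the delta/cross-entropy case, which is the "denser signal" point in the tweet.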
Ximing Lu
Ximing Lu@GXiming·
@xyzzysasaki Thanks for the question! Most downstream tasks we evaluate are not multiple-choice (e.g., math and coding problems), yet injecting our MCQ-style GooseReason data yields noticeable gains, demonstrating generalizability and transferability.
English
1
0
1
102
Hiroshi Sasaki
Hiroshi Sasaki@xyzzysasaki·
Great point — finite RLVR data saturation is a real bottleneck. Golden Goose’s MCQ-style synthesis is an interesting way to tap into reasoning-rich corpora. One question I’m curious about: how well does this transfer beyond multiple-choice formats into truly open-ended hard exploration? In SlimeTree-RLM, we instead treat failure traces themselves as privileged guidance via slot-memory routing — preventing repeated collapse without relying on synthetic distractors. Would love to compare notes.
English
1
0
2
381
Ximing Lu
Ximing Lu@GXiming·
@soldni Thank you so much for the pointer! We’ll include it in the related work section in our next revision.
English
1
0
1
152
Ximing Lu retweeted
David Acuna
David Acuna@davidjesusacu·
There is growing excitement about scaling multimodal reasoning 👁️ 🧠 But how do we synthesize visual problems & traces at scale to bootstrap SFT and keep RLVR scaling? Introducing Long Grounded Thoughts (LGT) -A sneak peek of our vision-centric reasoning data factory. 🏭🧠
English
1
7
51
12.9K
Ximing Lu retweeted
Rohan Paul
Rohan Paul@rohanpaul_ai·
Neat technique that creates lots of cheap, auto-graded "reasoning practice" from normal text, so reinforcement learning does not run out of useful problems so fast. "Golden Goose," a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text, is a way to train AI using books or articles that are full of explanations, even when you cannot easily check whether every explanation step is "true" with a normal checker.

Most "training with rewards" setups need a clean yes-or-no score, like "the math answer matches" or "the code passes tests," so a lot of good learning text gets ignored. Golden Goose turns a normal explanation into a simple quiz by hiding a key middle part of the reasoning and asking the model to pick the missing chunk from 4 or 5 choices. The correct choice is literally the chunk that was removed, so scoring is easy and automatic, which makes it "verifiable" without needing a math proof checker or a code runner.

They used this trick to build GooseReason-0.7M, 0.7M training questions across math, coding, and science from sources that were previously hard to use for this kind of training. They show that strong models can "saturate" on a fixed RLVR dataset, meaning more training stops helping and can even hurt, and adding fresh GooseReason data makes progress continue again. On Qwen-4B-Instruct, the added data flips a prior -0.79% drop into a +2.27% gain across 15 benchmarks, and in cybersecurity 180K extra tasks give a +4.44% gain across 3 security benchmarks after 100 RL steps. Verification becomes cheap, while data quality depends on the generator LLM and filtering.
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
6
7
49
8.7K
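The hide-a-chunk-and-quiz construction described in the tweet above can be sketched in a few lines. This is a hypothetical illustration only: the function names, the toy trace, and the hand-written distractors are mine (distractor generation is normally done by an LLM), not the paper's actual pipeline.

```python
# Sketch: turn a reasoning trace into an auto-gradable MCQ by masking
# one key step; the removed step becomes the verifiable answer.
import random

def make_mcq(trace_steps, mask_idx, distractors, rng):
    """Replace one reasoning step with [MASK]; the removed step is the key."""
    question = list(trace_steps)
    answer = question[mask_idx]
    question[mask_idx] = "[MASK]"
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "question": "\n".join(question),
        "options": options,
        "answer_idx": options.index(answer),  # cheap, exact verifier
    }

rng = random.Random(0)
trace = [
    "Let x + y = 10 and x - y = 4.",
    "Adding the equations gives 2x = 14, so x = 7.",
    "Then y = 10 - 7 = 3.",
]
task = make_mcq(trace, mask_idx=1, distractors=[
    "Adding the equations gives 2x = 6, so x = 3.",
    "Subtracting gives 2y = 14, so y = 7.",
    "Adding the equations gives 2y = 6, so y = 3.",
], rng=rng)

# Reward is a plain 0/1 match against answer_idx -- no proof checker
# or code runner needed.
def reward(predicted_idx, task):
    return 1.0 if predicted_idx == task["answer_idx"] else 0.0
```

The key property the tweet highlights is visible here: correctness is checked by index equality against the removed span, so any unverifiable explanation text yields a verifiable reward.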
Ximing Lu retweeted
Rosinality
Rosinality@rosinality·
Building MCQ infilling tasks from corpora.
English
1
2
45
2.6K
Ximing Lu retweeted
Hyunwoo Kim
Hyunwoo Kim@hyunw_kim·
🚨New paper to level up your 🦞#Clawdbot ?! Bots are now posting your sensitive info in real time. But privacy research is a desert with no data to train better models. That's about to change Enter 🏝️Privasis, the oasis where you can train strong privacy-forward AI with scale✨
English
3
20
93
21.7K
Ximing Lu retweeted
Joan Cabezas
Joan Cabezas@josancamon19·
such a cool and simple technique 🪿✨ 6 steps to synthesize unlimited RLVR data:
1. take a textbook or other reasoning-rich but unverifiable data
2. identify a contiguous span of crucial reasoning steps
3. replace those with [MASK]
4. treat the removed content as ground truth
5. generate diverse distractors that are plausible yet incorrect
6. assemble multiple-choice fill-in-the-blank questions

highlights:
- SOTA across 15 different benchmarks like MATH and GPQA-Diamond
- GooseReason-0.7M, synthesized tasks, especially in STEM
- synthesized from SFT* datasets like rStar-Coder and MegaScience
- GooseReason-Cyber, 180K tasks taken from FineWeb-specific crawls
- generalizes to open-ended benchmarks (considering the data is purely MCQ, this is pretty cool)
- tasks filtered by difficulty: sample n=16 attempts, and if all are correct, remove the task
- takes models known to be RL-saturated and still manages to improve their scores consistently
- STEM-domain RLVR data is very scarce compared to math and code
- before this, there was little to no cybersec RLVR data; to what other domains does this apply?

congrats to the authors @GXiming @davidjesusacu @jaehunjung_com @di_zhang_fdu @shizhediao @ShaokunZhang1 @BrandoCui @MJLiu6666 @hyunw_kim @rajammanabrolu @doyend @YejinChoinka @jankautz

looking forward to seeing if this can be applied to RL environments for more complex workflows! 👀
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
0
3
38
4K
Ximing Lu retweeted
马东锡 NLP
马东锡 NLP@dongxi_nlp·
Golden Goose is an elegant way to scale up RLVR; I really like it! While reading the paper, I kept thinking how much this resembles masked language modeling from pretraining. Whether it's MLM or next-token prediction, match-the-corpus can of course be viewed as a verifiable task. Golden Goose uses this idea to build RLVR data simply and cheaply, because the cheapest verifier is already in the corpus itself.
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
Chinese (translated above)
3
6
55
10.6K
Ximing Lu retweeted
Ximing Lu
Ximing Lu@GXiming·
Training Qwen-4B-Instruct on GooseReason-Cyber for just 100 RL steps yields a +4.44% absolute gain across 3 cybersecurity benchmarks, setting a 🌟new cybersecurity SoTA🌟—surpassing a 7B domain-specialized model with extensive domain-specific pre- and post-training. 🧵8/8
English
1
3
12
1.8K
Ximing Lu retweeted
Ximing Lu
Ximing Lu@GXiming·
Finally, we deploy Golden Goose 🦢 in the wild 🌲and synthesize RLVR data for 🛡️ cybersecurity—a specialized domain with no open-source RLVR data. Using cybersecurity-related web scrapes primarily from FineWeb, we constructed 🤖GooseReason-Cyber🤖 with 180K RLVR examples. 🧵7/8
English
1
2
7
1.5K