Ximing Lu

200 posts


@GXiming

PhD @uwcse @uwnlp.

Santa Clara, CA · Joined February 2018
264 Following · 1.6K Followers
Pinned Tweet
Ximing Lu
Ximing Lu@GXiming·
There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐
English
13
66
394
107.9K
Ximing Lu retweeted
Hao Zhang
Hao Zhang@HaoZhang3438830·
Excited to introduce ProRL Agent: Rollout-as-a-Service for RL training of multi-turn LLM agents! 🚀 As we move toward complex agentic tasks, rollout infrastructure is often a bottleneck. We're decoupling I/O-heavy rollouts from GPU training via a unified HTTP API.

Why ProRL Agent?
- Decoupled & Scalable: treats rollout as a service, allowing near-linear throughput scaling.
- System-Level Optimization: includes load balancing and automated sandbox cleanup for high stability.
- Integrated: now part of NVIDIA NeMo Gym to help researchers scale RL pipelines faster.

The Results 📈 On SWE-bench-Verified, we saw significant gains: +8.4 on Qwen3-8B and +8.2 on Qwen3-14B, with proven success across STEM, Math, and General Coding agents.

Check out the research and open-source code:
📄 Paper: arxiv.org/pdf/2603.18815
💻 Repo: github.com/NVIDIA-NeMo/Pr…

Huge thanks to the team and NVIDIA for the support! 👏
English
4
20
136
27.5K
Ximing Lu
Ximing Lu@GXiming·
We’re open-sourcing the data and model behind Golden Goose 🦢✨. Check them out and see how we turn unverifiable internet text 🌐 into large-scale RLVR tasks 😎. 📊 GooseReason-0.7M: huggingface.co/datasets/nvidi… 🤖 GooseReason-4B-Instruct: huggingface.co/nvidia/Nemotro…
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
3
34
266
34K
Ximing Lu
Ximing Lu@GXiming·
@profcelsofontes thanks! In our paper, we found the Golden Goose method to be very effective in the cybersecurity domain as well. The legal domain would be a nice use case to explore too!
English
0
0
0
33
Prof Celso Fontes
Prof Celso Fontes@profcelsofontes·
@GXiming nice! I have a question: is this method only for creating math/programming tasks, or can I use it for the legal domain too?
English
1
0
0
37
Ximing Lu
Ximing Lu@GXiming·
Thanks for the questions! We jointly train with ProRL data (primarily non-MCQ) and GooseReason data (MCQ) during RL, while most downstream tasks we evaluate are non-MCQ (e.g., math and coding problems). Injecting MCQ-style GooseReason data yields noticeable gains over using ProRL data alone, demonstrating generalizability and transferability. For MegaScience, we exclude the math domain and use only the science portion to construct STEM RLVR tasks, which contains many hard-to-verify instances (e.g., chemical formulas, free-form or open-ended QA).
English
0
0
1
19
Run-Ze Fan
Run-Ze Fan@Vfrz525_·
Great to see MegaScience used for RL training! Quick question: why does using only multiple-choice supervision improve computational tasks? Are math problems converted to MC as well? Why not stick with original verifiable RL signals like R1?
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
1
0
6
701
Ximing Lu
Ximing Lu@GXiming·
Thanks for the questions! The source corpora used in this work (e.g., AoPS-Instruct, rStar-Coder, MegaScience) have already undergone filtering and decontamination during their original curation. We further leverage hard-to-verify items in these corpora (e.g., math theorem proofs, free-form or open-ended QA) to synthesize RLVR tasks.
English
0
0
0
54
Yongrui Su
Yongrui Su@ysu_ChatData·
@GXiming Cool idea. How do you keep the synthesized tasks from drifting or just inheriting web noise? Are you using a verifier model or a held out eval to measure real transfer, and any safeguards against leaking answer cues from the generator into the reward?
English
1
0
0
104
Ximing Lu
Ximing Lu@GXiming·
@AndilesAnthony Thanks for the suggestion! We experiment with varying the data while keeping the ProRL recipe fixed in this paper. We’re happy to ablate the effects of different RL algorithms in the next version.
English
0
0
0
57
Finna
Finna@AndilesAnthony·
I think SDPO could be applied by using the ground-truth feedback to form a feedback-conditioned self-teacher, then distilling via KL. For each GooseReason example x with correct option i*, treat the feedback as f = i* (or reveal the correct span y*). Define the teacher as the same model conditioned on feedback, π_T(· | x, f) := π_θ̄(· | x, f), and train the student policy π_θ(· | x) by minimizing

L(θ) = E_x KL( π_θ̄(· | x, f) || π_θ(· | x) ), where KL(p || q) = Σ_i p(i) log(p(i)/q(i)).

If π_θ̄(· | x, f) collapses to a delta on i*, this reduces to standard cross-entropy: L(θ) = E_x[ −log π_θ(i* | x) ]. But if the feedback-conditioned teacher assigns graded probability mass across options (reflecting near-miss distractors), the student gets a much denser learning signal than the original 0/1 reward, without needing an external LLM judge.

Have you tried GooseReason with feedback-conditioned KL distillation (SDPO-style) and compared it head-to-head against GRPO-style RL on the same synthesized distribution? See: arxiv.org/abs/2601.20802
English
3
0
0
395
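The feedback-conditioned KL objective sketched in the reply above can be illustrated numerically. This is a toy sketch, not code from any paper or release: the distributions and function names are hypothetical, chosen only to show that the delta-teacher case reduces exactly to cross-entropy while a graded teacher gives a smaller (denser) per-example loss.

```python
# Toy sketch of feedback-conditioned KL distillation over MCQ options.
# All distributions here are made up for illustration.
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); zero-mass terms are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(teacher_probs, student_probs):
    """Per-example student loss: KL(teacher || student)."""
    return kl(teacher_probs, student_probs)

# 4-option MCQ, correct option i* = 2. A feedback-conditioned teacher
# keeps graded mass on a near-miss distractor (option 1):
teacher = [0.02, 0.08, 0.85, 0.05]
student = [0.25, 0.25, 0.25, 0.25]   # untrained uniform student
dense = distill_loss(teacher, student)

# If the teacher collapses to a delta on i*, the loss reduces to
# standard cross-entropy -log p_student(i*) = log 4 here:
delta_teacher = [0.0, 0.0, 1.0, 0.0]
ce = distill_loss(delta_teacher, student)
assert abs(ce - math.log(4)) < 1e-12
assert dense < ce  # graded teacher => strictly smaller loss vs. uniform student
```

Against a uniform student, KL(teacher || uniform) = log 4 − H(teacher), so any teacher with nonzero entropy yields a smaller loss than the delta/cross-entropy case, which is the "denser signal" point in the tweet.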
Ximing Lu
Ximing Lu@GXiming·
@xyzzysasaki Thanks for the question! Most downstream tasks we evaluate are not multiple-choice (e.g., math and coding problems), yet injecting our MCQ-style GooseReason data yields noticeable gains, demonstrating generalizability and transferability.
English
1
0
1
102
Hiroshi Sasaki
Hiroshi Sasaki@xyzzysasaki·
Great point — finite RLVR data saturation is a real bottleneck. Golden Goose’s MCQ-style synthesis is an interesting way to tap into reasoning-rich corpora. One question I’m curious about: how well does this transfer beyond multiple-choice formats into truly open-ended hard exploration? In SlimeTree-RLM, we instead treat failure traces themselves as privileged guidance via slot-memory routing — preventing repeated collapse without relying on synthetic distractors. Would love to compare notes.
English
1
0
2
381
Ximing Lu
Ximing Lu@GXiming·
@soldni Thank you so much for the pointer! We’ll include it in the related work section in our next revision.
English
1
0
1
152
Ximing Lu retweeted
David Acuna
David Acuna@davidjesusacu·
There is growing excitement about scaling multimodal reasoning 👁️ 🧠 But how do we synthesize visual problems & traces at scale to bootstrap SFT and keep RLVR scaling? Introducing Long Grounded Thoughts (LGT) -A sneak peek of our vision-centric reasoning data factory. 🏭🧠
English
1
7
51
12.9K
Ximing Lu retweeted
Rohan Paul
Rohan Paul@rohanpaul_ai·
Neat technique that creates lots of cheap, auto-graded "reasoning practice" from normal text, so reinforcement learning does not run out of useful problems so fast. "Golden Goose," a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text, is a way to train AI using books or articles that are full of explanations, even when you cannot easily check whether every explanation step is "true" with a normal checker.

Most "training with rewards" setups need a clean yes-or-no score, like "the math answer matches" or "the code passes tests," so a lot of good learning text gets ignored. Golden Goose turns a normal explanation into a simple quiz by hiding a key middle part of the reasoning and asking the model to pick the missing chunk from 4 or 5 choices. The correct choice is literally the chunk that was removed, so scoring is easy and automatic, which makes it "verifiable" without needing a math proof checker or a code runner.

They used this trick to build GooseReason-0.7M, 0.7M training questions across math, coding, and science from sources that were previously hard to use for this kind of training. They show that strong models can "saturate" on a fixed RLVR dataset, meaning more training stops helping and can even hurt, and adding fresh GooseReason data makes progress continue again. On Qwen-4B-Instruct, the added data flips a prior -0.79% drop into a +2.27% gain across 15 benchmarks, and in cybersecurity 180K extra tasks give a +4.44% gain across 3 security benchmarks after 100 RL steps. Verification becomes cheap, while data quality depends on the generator LLM and filtering.
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
6
7
49
8.7K
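The hide-a-chunk-and-quiz construction described in the tweet above can be sketched in a few lines. This is a hypothetical illustration only: the function names, the toy trace, and the hand-written distractors are mine (distractor generation is normally done by an LLM), not the paper's actual pipeline.

```python
# Sketch: turn a reasoning trace into an auto-gradable MCQ by masking
# one key step; the removed step becomes the verifiable answer.
import random

def make_mcq(trace_steps, mask_idx, distractors, rng):
    """Replace one reasoning step with [MASK]; the removed step is the key."""
    question = list(trace_steps)
    answer = question[mask_idx]
    question[mask_idx] = "[MASK]"
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "question": "\n".join(question),
        "options": options,
        "answer_idx": options.index(answer),  # cheap, exact verifier
    }

rng = random.Random(0)
trace = [
    "Let x + y = 10 and x - y = 4.",
    "Adding the equations gives 2x = 14, so x = 7.",
    "Then y = 10 - 7 = 3.",
]
task = make_mcq(trace, mask_idx=1, distractors=[
    "Adding the equations gives 2x = 6, so x = 3.",
    "Subtracting gives 2y = 14, so y = 7.",
    "Adding the equations gives 2y = 6, so y = 3.",
], rng=rng)

# Reward is a plain 0/1 match against answer_idx -- no proof checker
# or code runner needed.
def reward(predicted_idx, task):
    return 1.0 if predicted_idx == task["answer_idx"] else 0.0
```

The key property the tweet highlights is visible here: correctness is checked by index equality against the removed span, so any unverifiable explanation text yields a verifiable reward.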
Ximing Lu retweeted
Rosinality
Rosinality@rosinality·
Building MCQ infilling tasks from corpora.
English
1
2
45
2.6K
Ximing Lu retweeted
Hyunwoo Kim
Hyunwoo Kim@hyunw_kim·
🚨New paper to level up your 🦞#Clawdbot ?! Bots are now posting your sensitive info in real time. But privacy research is a desert with no data to train better models. That's about to change Enter 🏝️Privasis, the oasis where you can train strong privacy-forward AI with scale✨
English
3
20
93
21.7K
Ximing Lu retweeted
Joan Cabezas
Joan Cabezas@josancamon19·
such a cool and simple technique 🪿✨ 6 steps to synthesize unlimited RLVR data:
1. take a textbook or other reasoning-rich but unverifiable data
2. identify a contiguous span of crucial reasoning steps
3. replace those with [MASK]
4. treat the removed content as ground truth
5. generate diverse distractors that are plausible yet incorrect
6. assemble multiple-choice fill-in-the-blank questions

highlights:
- SOTA across 15 different benchmarks like MATH and GPQA-Diamond
- GooseReason-0.7M, synthesized tasks, especially in STEM
- synthesized from SFT* datasets like rStar-Coder and MegaScience
- GooseReason-Cyber, 180K tasks taken from FineWeb-specific crawls
- generalizes to open-ended benchmarks (considering the data is purely MCQ, this is pretty cool)
- tasks filtered by difficulty: sample n=16 attempts, and if all are correct, remove the task
- takes models known to be RL-saturated and still manages to improve their scores consistently
- STEM-domain RLVR data is very scarce compared to math and code
- before this, there was little to no cybersec RLVR data; to what other domains does this apply?

congrats to the authors @GXiming @davidjesusacu @jaehunjung_com @di_zhang_fdu @shizhediao @ShaokunZhang1 @BrandoCui @MJLiu6666 @hyunw_kim @rajammanabrolu @doyend @YejinChoinka @jankautz

looking forward to seeing if this can be applied to RL environments for more complex workflows! 👀
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
English
0
3
38
4K
Ximing Lu retweeted
马东锡 NLP
马东锡 NLP@dongxi_nlp·
Golden Goose is an elegant way to scale up RLVR; I really like it! While reading the paper, I kept thinking how much this resembles masked language modeling from pretraining. Whether it's MLM or next-token prediction, match-the-corpus can of course be viewed as a verifiable task. Golden Goose uses this idea to build RLVR data simply and cheaply, because the cheapest verifier is already in the corpus itself.
[Quoted tweet: @GXiming's Golden Goose announcement, pinned above]
Chinese (translated above)
3
6
55
10.6K
Ximing Lu retweeted
Ximing Lu
Ximing Lu@GXiming·
Training Qwen-4B-Instruct on GooseReason-Cyber for just 100 RL steps yields a +4.44% absolute gain across 3 cybersecurity benchmarks, setting a 🌟new cybersecurity SoTA🌟—surpassing a 7B domain-specialized model with extensive domain-specific pre- and post-training. 🧵8/8
English
1
3
12
1.8K
Ximing Lu retweeted
Ximing Lu
Ximing Lu@GXiming·
Finally, we deploy Golden Goose 🦢 in the wild 🌲and synthesize RLVR data for 🛡️ cybersecurity—a specialized domain with no open-source RLVR data. Using cybersecurity-related web scrapes primarily from FineWeb, we constructed 🤖GooseReason-Cyber🤖 with 180K RLVR examples. 🧵7/8
English
1
2
7
1.5K