Lin Gui

18 posts

Lin Gui

@ybnbxb

PhD @Uchicago

Joined March 2020
129 Following · 71 Followers
Lin Gui retweeted
Zhuokai Zhao@zhuokaiz·
On-policy RL has driven the biggest leaps in training coding agents. Extending it to machine learning engineering agents should be a natural next step. But it almost never works. What I mean is, the recipe is right there — standard trajectory-wise GRPO, the same recipe that worked for SWE. However, the problem is that one rollout step on an MLE task may take hours because the agent has to actually train a model on a real dataset at every step (preprocessing, fitting, inference, scoring). So even with the N rollouts in a group running in parallel, a single GRPO run may still take days. Every MLE agent paper I've read has retreated to SFT or offline proxy rewards for exactly this reason, giving up the exploration benefits of on-policy learning.

That's why I'm excited about our new paper, SandMLE, which fixes this with a move that sounds almost too reckless to work. The instinct when on-policy RL is too slow is to engineer around it — async rollouts so the trainer doesn't sit idle waiting for slow environments, off-policy or step-wise proxies to avoid running full trajectories at all. But when we profiled where the time was going, the bottleneck had nothing to do with the algorithm. Unlike SWE, where execution latency comes from compilation and test logic, MLE latency is overwhelmingly driven by the size of the dataset the ML pipeline has to chew through.

Therefore, rather than downsampling existing data (which corrupts evaluation), we built a multi-agent pipeline that procedurally generates diverse synthetic MLE environments from a small seed set. Specifically, we extract the structural DNA of seed tasks (modality, label cardinality, distribution shape), mutate them into new domains (e.g., repurposing animal classification into road damage detection), inject realistic noise, embed deterministic hidden rules connecting features to labels, and construct full evaluation sandboxes with progressive milestone thresholds. Each task is constrained to only 50–200 training samples. The execution speedup is dramatic — average per-step latency drops over 13×, which makes trajectory-wise GRPO go from infeasible to routine.

We also designed a dense, milestone-based reward to address the sparse credit assignment problem in long-horizon MLE. The ablation shows this matters — under a sparse reward, the 30B model's medal rate drops from 27.3% to 13.6% and valid submission collapses from 100% to 86.4%.

Results across Qwen3-8B, 14B, and 30B-A3B on MLE-bench are consistently strong — 66.9% better performance in medal rate over SFT baselines. It is worth noting that the SFT baselines are not weak — we trained them on high-quality Claude-4.5-Sonnet trajectories. But SandMLE still delivers much larger gains, suggesting that direct environment interaction does teach capabilities that imitation alone does not (as expected).

The most convincing evidence to me that the model's intrinsic capability improves is the framework-agnostic generalization. We trained exclusively with ReAct, but the gains transfer to AIDE, AIRA, and MLE-Agent scaffolds at evaluation time — up to 32.4% better performance in HumanRank on MLE-Dojo. The SFT models, by contrast, are brittle when moved to unfamiliar scaffolds. The 30B SFT model collapses to a 17.7% valid submission rate on MLE-Dojo with MLE-Agent, while the 30B SandMLE model achieves 83.9%. SandMLE is teaching genuine engineering reasoning, not scaffold-specific patterns.

What I find most interesting beyond the specific result is that none of the hard parts of RL changed here. The algorithm is the same. The reward is conventional. We just shrunk the environment until on-policy learning became affordable. The field has largely treated environment design and RL algorithm design as separate concerns. SandMLE is a concrete case that the environment is itself the lever. When training is too expensive, the instinct is to build cleverer algorithms to tolerate it. However, often the better move is to reshape the environment so the simple algorithm just works.

Paper: arxiv.org/pdf/2604.04872
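For readers who want the shape of the training signal described in the post, here is a minimal, hypothetical sketch of a milestone-style dense reward feeding standard group-normalized GRPO advantages. It is not the paper's code; the thresholds, weights, and base credit are illustrative assumptions.

```python
import numpy as np

def milestone_reward(score, valid_submission, milestones=(0.5, 0.6, 0.7, 0.8)):
    """Hypothetical dense reward: base credit for a valid submission, plus
    partial credit for every progressive milestone threshold the pipeline clears."""
    if not valid_submission:
        return 0.0
    return 0.2 + 0.2 * sum(score >= m for m in milestones)

def grpo_advantages(rewards):
    """Trajectory-wise GRPO: normalize each rollout's reward against the
    mean and std of its group of N parallel rollouts on the same task."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: one group of four rollouts on a single synthetic MLE task
scores = [0.55, 0.72, 0.48, 0.81]
rewards = [milestone_reward(s, valid_submission=True) for s in scores]
print(grpo_advantages(rewards))
```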
Lin Gui retweeted
Yinjie Wang@YinjieW2024·
OpenClaw-RL Technical Report! Make your 🦞@openclaw stronger by just using it. We propose a method that combines the advantages of GRPO and OPD, and report evaluation results. The repo already has 1.7k stars, feel free to contribute! Come in and have fun~ @MengdiWang10 @LingYang_PU
Cameron R. Wolfe, Ph.D.@cwolferesearch·
I’m publishing a long-form overview of using rubrics for RL tomorrow. Here are all of the papers that it will cover. Am I missing anything?
[1] Gunjal, Anisha, et al. "Rubrics as rewards: Reinforcement learning beyond verifiable domains." arXiv preprint arXiv:2507.17746 (2025).
[2] Huang, Zenan, et al. "Reinforcement learning with rubric anchors." arXiv preprint arXiv:2508.12790 (2025).
[3] Liu, Tianci, et al. "OpenRubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment." arXiv preprint arXiv:2510.07743 (2025).
[4] Shao, Rulin, et al. "Dr Tulu: Reinforcement learning with evolving rubrics for deep research." arXiv preprint arXiv:2511.19399 (2025).
[5] Xu, Ran, et al. "Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training." arXiv preprint arXiv:2602.01511 (2026).
[6] Xu, Wenyuan, et al. "A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization." arXiv preprint arXiv:2504.04950 (2025).
[7] Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems 36 (2023): 46595-46623.
[8] Viswanathan, Vijay, et al. "Checklists are better than reward models for aligning language models." arXiv preprint arXiv:2507.18624 (2025).
[9] Mu, Tong, et al. "Rule based rewards for language model safety." Advances in Neural Information Processing Systems 37 (2024): 108877-108901.
[10] Gupta, Taneesh, et al. "CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling." Findings of the Association for Computational Linguistics: ACL 2025. 2025.
[11] Wu, Mian, et al. "RLAC: Reinforcement learning with adversarial critic for free-form generation tasks." arXiv preprint arXiv:2511.01758 (2025).
[12] Xie, Lipeng, et al. "Auto-Rubric: Learning to extract generalizable criteria for reward modeling." arXiv preprint arXiv:2510.17314 (2025).
[13] Bai, Yuntao, et al. "Constitutional AI: Harmlessness from AI feedback." arXiv preprint arXiv:2212.08073 (2022).
[14] Guan, Melody Y., et al. "Deliberative alignment: Reasoning enables safer language models." arXiv preprint arXiv:2412.16339 (2024).
[15] Liu, Yang, et al. "G-Eval: NLG evaluation using GPT-4 with better human alignment." arXiv preprint arXiv:2303.16634 (2023).
[16] Arora, Rahul K., et al. "HealthBench: Evaluating large language models towards improved human health." arXiv preprint arXiv:2505.08775 (2025).
[17] Deshpande, Kaustubh, et al. "MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs." Findings of the Association for Computational Linguistics: ACL 2025. 2025.
Lin Gui retweeted
Zihao Wang@wzihao12·
On-policy distillation with reverse KL as reward works great—IF you have access to teacher logits. But what if you don't? What if you want to distill from multiple teachers? Our solution: distill teacher guidance into rubrics, then do on-policy RL. Check out our work: arxiv.org/abs/2509.21500
Thinking Machines@thinkymachines

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-policy…
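For context on the teacher-logit baseline being contrasted above, here is a minimal sketch of a per-token reverse-KL signal that could serve as a dense on-policy distillation reward. The shapes and random tensors are placeholders; this is not the paper's or Thinking Machines' implementation.

```python
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits, teacher_logits):
    """Per-token KL(student || teacher), computable only when teacher logits
    are available. Negated, it acts as a dense distillation reward.
    Both inputs have shape (seq_len, vocab_size)."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)  # (seq_len,)

# Placeholder logits standing in for real model outputs
seq_len, vocab = 8, 32000
student = torch.randn(seq_len, vocab)
teacher = torch.randn(seq_len, vocab)
reward = -reverse_kl_per_token(student, teacher)  # higher = closer to teacher
print(reward.shape)
```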

Lin Gui retweeted
Chenghao Yang@chrome1996·
Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial. Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and cooling down, we encourage meaningful diversity early while preserving quality later. The result is a simple, plug-and-play method that improves sample efficiency and provides consistent gains in RLVR across various models and RL algorithms. EAD also boosts performance for inference-time scaling. 📜 Paper: arxiv.org/abs/2510.05251 🗳️ Upvote on Daily Papers: huggingface.co/papers/2510.05… 🌐 Website: yangalan123.github.io/ead_rlvr/ Code: github.com/yangalan123/EA… Great thanks to my amazing collaborators! @ybnbxb , @chenxiao_yang_ @zhuokaiz @victorveitch Also thanks for the great support from @DSI_UChicago @TTIC_Connect @Meta !
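A minimal sketch of the annealing idea (my own illustration, not the released EAD code): sample early tokens at high temperature, then cool down linearly. The schedule endpoints and the HuggingFace-style `model(...).logits` interface are assumptions.

```python
import torch
import torch.nn.functional as F

def annealed_temperature(step, max_steps, t_start=1.5, t_end=0.7):
    """Hypothetical linear schedule: hot early (explore semantic directions),
    cool later (preserve local quality)."""
    frac = min(step / max_steps, 1.0)
    return t_start + frac * (t_end - t_start)

@torch.no_grad()
def sample_with_annealing(model, input_ids, max_new_tokens=128):
    """Decoding loop in the spirit of Exploratory Annealed Decoding; assumes a
    causal LM whose forward pass returns logits of shape (1, seq, vocab)."""
    for step in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]
        temp = annealed_temperature(step, max_new_tokens)
        probs = F.softmax(logits / temp, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```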
Lin Gui retweeted
Bing Liu@vbingliu·
New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in the high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: less hacking, stronger post-training! arxiv.org/pdf/2509.21500
Lin Gui retweeted
David Reber@davidpreber·
🧵 RATE: Score Reward Models with Imperfect Rewrites of Rewrites 1/ How do you measure whether a reward model incentivizes helpfulness without accidentally measuring length, complexity, etc? Rewrites of rewrites give good counterfactuals, without needing to list all confounders!
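My rough reading of the trick, as a hedged sketch rather than the paper's estimator: compare the reward of a rewrite-of-rewrite against that of a single rewrite, so both sides of the contrast carry the same rewriting artifacts. The `rewrite` and `reward_model` callables here are placeholders, and the exact contrast in the paper may differ.

```python
def rate_style_effect(responses, reward_model, rewrite):
    """Hedged sketch of the rewrite-of-rewrite idea. `rewrite(x, flip)` stands in
    for an LLM edit that toggles the target attribute; comparing the double
    rewrite (attribute restored) with the single rewrite (attribute flipped)
    keeps rewriting artifacts on both sides of the comparison."""
    deltas = []
    for x in responses:
        x_flip = rewrite(x, flip=True)        # attribute flipped, artifacts introduced
        x_back = rewrite(x_flip, flip=False)  # attribute restored, artifacts again
        deltas.append(reward_model(x_back) - reward_model(x_flip))
    return sum(deltas) / len(deltas)
```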
Lin Gui retweeted
Yibo Jiang@yibophd·
Are LLMs just doing next token predictions? It is believed that if an LLM can accurately predict the next tokens in a Wikipedia entry, it essentially "learns" the information. But do pre-trained LLMs actually need to understand context sentences to solve this task? The answer is no!
Lin Gui retweeted
Victor Veitch 🔸@victorveitch·
LLM best-of-n sampling works great in practice---but why? Turns out: it's the best possible policy for maximizing win rate over the base model! Then: we use this to get a truly sweet alignment scheme: easy tweaks, huge gains w @ybnbxb @ggarbacea arxiv.org/abs/2406.00832
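For reference, best-of-n itself is tiny; this sketch uses placeholder `generate` and `reward_model` callables rather than any specific library API.

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Best-of-n sampling: draw n candidates from the base model and keep the
    one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_model(prompt, y))
```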
Lin Gui retweeted
Zihao Wang@wzihao12·
Prompt engineering is a dark art. To understand its limitations, we formalize "what the user intended" through latent concepts. Then we show direct model controls by algebraic operations on suitably chosen representations! github.com/zihao12/concep… w @ybnbxb @jeffNegrea @victorveitch
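As a generic illustration of "algebraic operations on representations" (not the paper's construction), one can project a representation off one concept direction and move it along another; the direction vectors here are hypothetical.

```python
import numpy as np

def swap_concept(rep, v_old, v_new):
    """Generic vector-arithmetic illustration: remove the component of `rep`
    along a hypothetical old-concept direction and re-add it along a new one."""
    v_old = v_old / np.linalg.norm(v_old)
    v_new = v_new / np.linalg.norm(v_new)
    coeff = rep @ v_old
    return rep - coeff * v_old + coeff * v_new
```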
Lin Gui@ybnbxb·
Our method turns out to be robust in the sense that low bias and valid uncertainty quantification can be expected even when the conditional outcome models are misestimated.
Lin Gui@ybnbxb·
How can we use modern language models to understand the causal effects of text? E.g., does being polite in email cause faster responses? This one weird trick lets you do robust causal effect estimation with text data! arxiv.org/pdf/2210.00079… w @victorveitch
Lin Gui@ybnbxb·
Solution 2: we learn a representation of the text that satisfies both unconfoundedness and overlap via supervised representation learning and use it as the confounding part. We then apply non-parametric models for the propensities and the double-ML method to obtain the final estimator.
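For concreteness, the generic doubly robust (AIPW) score that double-ML estimators build on looks like the sketch below. This is the textbook form, not the estimand-specific estimator from the paper, and it presumes the propensities are fit on the learned representation so that overlap holds.

```python
import numpy as np

def aipw_ate(T, Y, e_hat, mu1_hat, mu0_hat):
    """Textbook doubly robust (AIPW) estimate of an average treatment effect.
    T: binary treatment; Y: outcomes; e_hat: estimated propensities P(T=1 | representation);
    mu1_hat / mu0_hat: estimated conditional outcomes under treatment / control.
    The estimator tolerates slow rates in either nuisance as long as 0 < e_hat < 1."""
    T, Y = np.asarray(T, dtype=float), np.asarray(Y, dtype=float)
    psi = (mu1_hat - mu0_hat
           + T * (Y - mu1_hat) / e_hat
           - (1 - T) * (Y - mu0_hat) / (1 - e_hat))
    return psi.mean()
```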
Lin Gui@ybnbxb·
Problem 2: overlap violation means we can't use double-ML methods out of the box. Can we still get robustness to slow rates from the language model?
Lin Gui@ybnbxb·
Problem: what even is the causal estimand when we don't have overlap? Solution: we pick a well-defined controlled direct effect that captures the meaning of "causal effect of an attribute of text". It is estimable under some mild assumptions. (See Sections 2 and 3 of the paper.)
Lin Gui@ybnbxb·
Q: Why does overlap violation happen? A: An attribute of the text is determined by the text itself. So if we directly use the text as the confounding part, P(Treatment=1 | text) is always exactly 0 or 1, and overlap fails.