Lin Gui

18 posts

Lin Gui

@ybnbxb

PhD @Uchicago

Joined March 2020
129 Following · 71 Followers
Lin Gui retweeted
Zhuokai Zhao@zhuokaiz·
On-policy RL has driven the biggest leaps in training coding agents. Extending it to machine learning engineering agents should be a natural next step. But it almost never works. What I mean is, the recipe is right there — standard trajectory-wise GRPO, the same recipe that worked for SWE. However, the problem is that one rollout step on an MLE task may take hours because the agent has to actually train a model on a real dataset at every step (preprocessing, fitting, inference, scoring). So even with the N rollouts in a group running in parallel, a single GRPO run may still take days. Every MLE agent paper I've read has retreated to SFT or offline proxy rewards for exactly this reason, giving up the exploration benefits of on-policy learning.

That's why I'm excited about our new paper, SandMLE, which fixes this with a move that sounds almost too reckless to work. The instinct when on-policy RL is too slow is to engineer around it — async rollouts so the trainer doesn't sit idle waiting for slow environments, off-policy or step-wise proxies to avoid running full trajectories at all. But when we profiled where the time was going, the bottleneck had nothing to do with the algorithm. Unlike SWE, where execution latency comes from compilation and test logic, MLE latency is overwhelmingly driven by the size of the dataset the ML pipeline has to chew through.

Therefore, rather than downsampling existing data (which corrupts evaluation), we built a multi-agent pipeline that procedurally generates diverse synthetic MLE environments from a small seed set. Specifically, we extract the structural DNA of seed tasks (modality, label cardinality, distribution shape), mutate them into new domains (e.g., repurposing animal classification into road damage detection), inject realistic noise, embed deterministic hidden rules connecting features to labels, and construct full evaluation sandboxes with progressive milestone thresholds. Each task is constrained to only 50–200 training samples. The execution speedup is dramatic — average per-step latency drops over 13×, which makes trajectory-wise GRPO go from infeasible to routine.

We also designed a dense, milestone-based reward to address the sparse credit assignment problem in long-horizon MLE. The ablation shows this matters — under a sparse reward, the 30B model's medal rate drops from 27.3% to 13.6% and valid submission collapses from 100% to 86.4%.

Results across Qwen3-8B, 14B, and 30B-A3B on MLE-bench are consistently strong — 66.9% better performance in medal rate over SFT baselines. It is worth noting that the SFT baselines are not weak — we trained them on high-quality Claude-4.5-Sonnet trajectories. But SandMLE still delivers much larger gains, suggesting that direct environment interaction does teach capabilities that imitation alone does not (as expected).

The most convincing evidence to me that the model's intrinsic capability improves is the framework-agnostic generalization. We trained exclusively with ReAct, but the gains transfer to AIDE, AIRA, and MLE-Agent scaffolds at evaluation time — up to 32.4% better performance in HumanRank on MLE-Dojo. The SFT models, by contrast, are brittle when moved to unfamiliar scaffolds. The 30B SFT model collapses to a 17.7% valid submission rate on MLE-Dojo with MLE-Agent, while the 30B SandMLE model achieves 83.9%. SandMLE is teaching genuine engineering reasoning, not scaffold-specific patterns.

What I find most interesting beyond the specific result is that none of the hard parts of RL changed here. The algorithm is the same. The reward is conventional. We just shrunk the environment until on-policy learning became affordable. The field has largely treated environment design and RL algorithm design as separate concerns. SandMLE is a concrete case that the environment is itself the lever. When training is too expensive, the instinct is to build cleverer algorithms to tolerate it. However, often the better move is to reshape the environment so the simple algorithm just works.

Paper: arxiv.org/pdf/2604.04872
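For readers who want the shape of the training signal described in the post, here is a minimal, hypothetical sketch of a milestone-style dense reward feeding standard group-normalized GRPO advantages. It is not the paper's code; the thresholds, weights, and base credit are illustrative assumptions.

```python
import numpy as np

def milestone_reward(score, valid_submission, milestones=(0.5, 0.6, 0.7, 0.8)):
    """Hypothetical dense reward: base credit for a valid submission, plus
    partial credit for every progressive milestone threshold the pipeline clears."""
    if not valid_submission:
        return 0.0
    return 0.2 + 0.2 * sum(score >= m for m in milestones)

def grpo_advantages(rewards):
    """Trajectory-wise GRPO: normalize each rollout's reward against the
    mean and std of its group of N parallel rollouts on the same task."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: one group of four rollouts on a single synthetic MLE task
scores = [0.55, 0.72, 0.48, 0.81]
rewards = [milestone_reward(s, valid_submission=True) for s in scores]
print(grpo_advantages(rewards))
```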
Lin Gui retweeted
Yinjie Wang@YinjieW2024·
OpenClaw-RL Technical Report! Make your 🦞@openclaw stronger by just using it. We propose a method that combines the advantages of GRPO and OPD, and report evaluation results. The repo already has 1.7k stars, feel free to contribute! Come in and have fun~ @MengdiWang10 @LingYang_PU
Cameron R. Wolfe, Ph.D.@cwolferesearch·
I’m publishing a long-form overview of using rubrics for RL tomorrow. Here are all of the papers that it will cover. Am I missing anything?
[1] Gunjal, Anisha, et al. "Rubrics as rewards: Reinforcement learning beyond verifiable domains." arXiv preprint arXiv:2507.17746 (2025).
[2] Huang, Zenan, et al. "Reinforcement learning with rubric anchors." arXiv preprint arXiv:2508.12790 (2025).
[3] Liu, Tianci, et al. "OpenRubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment." arXiv preprint arXiv:2510.07743 (2025).
[4] Shao, Rulin, et al. "Dr Tulu: Reinforcement learning with evolving rubrics for deep research." arXiv preprint arXiv:2511.19399 (2025).
[5] Xu, Ran, et al. "Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training." arXiv preprint arXiv:2602.01511 (2026).
[6] Xu, Wenyuan, et al. "A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization." arXiv preprint arXiv:2504.04950 (2025).
[7] Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems 36 (2023): 46595-46623.
[8] Viswanathan, Vijay, et al. "Checklists are better than reward models for aligning language models." arXiv preprint arXiv:2507.18624 (2025).
[9] Mu, Tong, et al. "Rule based rewards for language model safety." Advances in Neural Information Processing Systems 37 (2024): 108877-108901.
[10] Gupta, Taneesh, et al. "CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling." Findings of the Association for Computational Linguistics: ACL 2025. 2025.
[11] Wu, Mian, et al. "RLAC: Reinforcement learning with adversarial critic for free-form generation tasks." arXiv preprint arXiv:2511.01758 (2025).
[12] Xie, Lipeng, et al. "Auto-Rubric: Learning to extract generalizable criteria for reward modeling." arXiv preprint arXiv:2510.17314 (2025).
[13] Bai, Yuntao, et al. "Constitutional AI: Harmlessness from AI feedback." arXiv preprint arXiv:2212.08073 (2022).
[14] Guan, Melody Y., et al. "Deliberative alignment: Reasoning enables safer language models." arXiv preprint arXiv:2412.16339 (2024).
[15] Liu, Yang, et al. "G-Eval: NLG evaluation using GPT-4 with better human alignment." arXiv preprint arXiv:2303.16634 (2023).
[16] Arora, Rahul K., et al. "HealthBench: Evaluating large language models towards improved human health." arXiv preprint arXiv:2505.08775 (2025).
[17] Deshpande, Kaustubh, et al. "MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs." Findings of the Association for Computational Linguistics: ACL 2025. 2025.
Lin Gui retweeted
Zihao Wang@wzihao12·
On-policy distillation with reverse KL as reward works great—IF you have access to teacher logits. But what if you don't? What if you want to distill from multiple teachers? Our solution: distill teacher guidance into rubrics, then do on-policy RL. Check out our work: arxiv.org/abs/2509.21500
Thinking Machines@thinkymachines

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-policy…
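For context on the teacher-logit baseline being contrasted above, here is a minimal sketch of a per-token reverse-KL signal that could serve as a dense on-policy distillation reward. The shapes and random tensors are placeholders; this is not the paper's or Thinking Machines' implementation.

```python
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits, teacher_logits):
    """Per-token KL(student || teacher), computable only when teacher logits
    are available. Negated, it acts as a dense distillation reward.
    Both inputs have shape (seq_len, vocab_size)."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)  # (seq_len,)

# Placeholder logits standing in for real model outputs
seq_len, vocab = 8, 32000
student = torch.randn(seq_len, vocab)
teacher = torch.randn(seq_len, vocab)
reward = -reverse_kl_per_token(student, teacher)  # higher = closer to teacher
print(reward.shape)
```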

Lin Gui retweeted
Chenghao Yang@chrome1996·
Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial. Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and cooling down, we encourage meaningful diversity early while preserving quality later. The result is a simple, plug-and-play method that improves sample efficiency and provides consistent gains in RLVR across various models and RL algorithms. EAD also boosts performance for inference-time scaling. 📜 Paper: arxiv.org/abs/2510.05251 🗳️ Upvote on Daily Papers: huggingface.co/papers/2510.05… 🌐 Website: yangalan123.github.io/ead_rlvr/ Code: github.com/yangalan123/EA… Great thanks to my amazing collaborators! @ybnbxb , @chenxiao_yang_ @zhuokaiz @victorveitch Also thanks for the great support from @DSI_UChicago @TTIC_Connect @Meta !
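A minimal sketch of the annealing idea (my own illustration, not the released EAD code): sample early tokens at high temperature, then cool down linearly. The schedule endpoints and the HuggingFace-style `model(...).logits` interface are assumptions.

```python
import torch
import torch.nn.functional as F

def annealed_temperature(step, max_steps, t_start=1.5, t_end=0.7):
    """Hypothetical linear schedule: hot early (explore semantic directions),
    cool later (preserve local quality)."""
    frac = min(step / max_steps, 1.0)
    return t_start + frac * (t_end - t_start)

@torch.no_grad()
def sample_with_annealing(model, input_ids, max_new_tokens=128):
    """Decoding loop in the spirit of Exploratory Annealed Decoding; assumes a
    causal LM whose forward pass returns logits of shape (1, seq, vocab)."""
    for step in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]
        temp = annealed_temperature(step, max_new_tokens)
        probs = F.softmax(logits / temp, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```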
Lin Gui retweeted
Bing Liu@vbingliu·
New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in the high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: less hacking, stronger post-training! arxiv.org/pdf/2509.21500
Lin Gui retweeted
David Reber@davidpreber·
🧵 RATE: Score Reward Models with Imperfect Rewrites of Rewrites 1/ How do you measure whether a reward model incentivizes helpfulness without accidentally measuring length, complexity, etc? Rewrites of rewrites give good counterfactuals, without needing to list all confounders!
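My rough reading of the trick, as a hedged sketch rather than the paper's estimator: compare the reward of a rewrite-of-rewrite against that of a single rewrite, so both sides of the contrast carry the same rewriting artifacts. The `rewrite` and `reward_model` callables here are placeholders, and the exact contrast in the paper may differ.

```python
def rate_style_effect(responses, reward_model, rewrite):
    """Hedged sketch of the rewrite-of-rewrite idea. `rewrite(x, flip)` stands in
    for an LLM edit that toggles the target attribute; comparing the double
    rewrite (attribute restored) with the single rewrite (attribute flipped)
    keeps rewriting artifacts on both sides of the comparison."""
    deltas = []
    for x in responses:
        x_flip = rewrite(x, flip=True)        # attribute flipped, artifacts introduced
        x_back = rewrite(x_flip, flip=False)  # attribute restored, artifacts again
        deltas.append(reward_model(x_back) - reward_model(x_flip))
    return sum(deltas) / len(deltas)
```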
Lin Gui retweeted
Yibo Jiang@yibophd·
Are LLMs just doing next token predictions? It is believed that if an LLM can accurately predict the next tokens in a Wikipedia entry, it essentially "learns" the information. But do pre-trained LLMs actually need to understand context sentences to solve this task? The answer is no!
Lin Gui retweeted
Victor Veitch 🔸@victorveitch·
LLM best-of-n sampling works great in practice---but why? Turns out: it's the best possible policy for maximizing win rate over the base model! Then: we use this to get a truly sweet alignment scheme: easy tweaks, huge gains w @ybnbxb @ggarbacea arxiv.org/abs/2406.00832
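For reference, best-of-n itself is tiny; this sketch uses placeholder `generate` and `reward_model` callables rather than any specific library API.

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Best-of-n sampling: draw n candidates from the base model and keep the
    one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_model(prompt, y))
```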
Lin Gui retweeted
Zihao Wang@wzihao12·
Prompt engineering is a dark art. To understand its limitations, we formalize "what the user intended" through latent concepts. Then we show direct model controls by algebraic operations on suitably chosen representations! github.com/zihao12/concep… w @ybnbxb @jeffNegrea @victorveitch
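As a generic illustration of "algebraic operations on representations" (not the paper's construction), one can project a representation off one concept direction and move it along another; the direction vectors here are hypothetical.

```python
import numpy as np

def swap_concept(rep, v_old, v_new):
    """Generic vector-arithmetic illustration: remove the component of `rep`
    along a hypothetical old-concept direction and re-add it along a new one."""
    v_old = v_old / np.linalg.norm(v_old)
    v_new = v_new / np.linalg.norm(v_new)
    coeff = rep @ v_old
    return rep - coeff * v_old + coeff * v_new
```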
Lin Gui@ybnbxb·
Our method turns out to be robust in the sense that low bias and valid uncertainty quantification can be expected even when the conditional outcome models are misestimated.
Lin Gui@ybnbxb·
How can we use modern language models to understand the causal effects of text? E.g., does being polite in email cause faster responses? This one weird trick lets you do robust causal effect estimation with text data! arxiv.org/pdf/2210.00079… w @victorveitch
Lin Gui@ybnbxb·
Solution 2: we learn a representation of the text that satisfies both unconfoundedness and overlap via supervised representation learning and use it as the confounding part. We then apply non-parametric models for the propensities and the double-ML method to obtain the final estimator.
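For concreteness, the generic doubly robust (AIPW) score that double-ML estimators build on looks like the sketch below. This is the textbook form, not the estimand-specific estimator from the paper, and it presumes the propensities are fit on the learned representation so that overlap holds.

```python
import numpy as np

def aipw_ate(T, Y, e_hat, mu1_hat, mu0_hat):
    """Textbook doubly robust (AIPW) estimate of an average treatment effect.
    T: binary treatment; Y: outcomes; e_hat: estimated propensities P(T=1 | representation);
    mu1_hat / mu0_hat: estimated conditional outcomes under treatment / control.
    The estimator tolerates slow rates in either nuisance as long as 0 < e_hat < 1."""
    T, Y = np.asarray(T, dtype=float), np.asarray(Y, dtype=float)
    psi = (mu1_hat - mu0_hat
           + T * (Y - mu1_hat) / e_hat
           - (1 - T) * (Y - mu0_hat) / (1 - e_hat))
    return psi.mean()
```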
Lin Gui@ybnbxb·
Problem 2: overlap violation means we can't use double-ML methods out of the box. Can we still get robustness to slow rates from the language model?
Lin Gui@ybnbxb·
Problem: what even is the causal estimand when we don't have overlap? Solution: we pick a well-defined controlled direct effect that captures the meaning of "causal effect of an attribute of text". It is estimable under some mild assumptions. (See Sections 2 and 3 of the paper.)
Lin Gui@ybnbxb·
Q: Why does overlap violation happen? A: An attribute of the text is determined by the text itself. So if we directly use the text as the confounding part, P(Treatment=1 | text) is always exactly 0 or 1, and overlap fails.