

Haolin Chen
@HaolinChen11
Salesforce AI Research @SFResearch, Ph.D. in Applied Mathematics @ucdavis


How can we boost LLM agents’ generalizability to OOD tasks and environments? Check out CodeGym, our new framework for synthesizing LLM agent RL training environments. CodeGym automatically converts static coding problems into interactive, verifiable environments for reinforcement learning on multi-turn tool-use tasks. Training in CodeGym leads to strong OOD generalization: for example, a Qwen2.5-32B-Instruct model achieved an 8.7-point absolute accuracy gain on τ-Bench!

We’ve just released the paper, synthesis pipeline, and dataset:
📄 Paper: arxiv.org/abs/2509.17325
💻 Project: github.com/StigLidu/CodeG…
📊 Dataset: huggingface.co/datasets/Vanis…
📷 More details in the thread👇
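To make the conversion idea concrete, here is a minimal sketch of what turning a static coding problem into an interactive, verifiable environment could look like. All names (`CodingProblemEnv`, `step`, `reward`) are illustrative assumptions for this sketch, not the paper's actual API: the point is that helper operations become callable tools and the hidden check becomes a verifiable terminal reward.

```python
# Hedged sketch: a static coding problem re-cast as a multi-turn
# tool-use RL environment. Names here are hypothetical, not CodeGym's API.

class CodingProblemEnv:
    """Exposes a problem's helper operations as tools; rewards via a verifier."""

    def __init__(self, tools, check):
        self.tools = tools      # name -> callable, the agent's action space
        self.check = check      # verifier: True iff the final answer is correct
        self.history = []       # multi-turn interaction log

    def step(self, tool_name, *args):
        """One agent turn: invoke a tool and observe its result."""
        result = self.tools[tool_name](*args)
        self.history.append((tool_name, args, result))
        return result

    def reward(self, answer):
        """Verifiable terminal reward: 1.0 if the answer passes the check."""
        return 1.0 if self.check(answer) else 0.0


# Toy instance: "sum a list", solvable only through tool calls.
env = CodingProblemEnv(
    tools={"read_item": lambda xs, i: xs[i], "add": lambda a, b: a + b},
    check=lambda ans: ans == 6,
)
total = 0
for i in range(3):
    total = env.step("add", total, env.step("read_item", [1, 2, 3], i))
print(env.reward(total))  # 1.0 for a correct multi-turn solution
```

Because the reward comes from an automatic check rather than human labels, environments like this can be synthesized at scale from existing coding problems.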



🚀🚀🚀 Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

We at @SFResearch built an automated pipeline that converts raw web text into verifiable QA pairs, filtered and verified by LLMs, then used Group Relative Policy Optimization (GRPO) to train models directly on this reward-driven data.

The result: models trained on Webscale-RL outperform continual-pretraining and data-refinement baselines while using up to 100× fewer tokens. The gains are most pronounced on reasoning, math, and factual QA tasks.

Beyond benchmarks, the key shift is conceptual: RL is no longer just a post-training alignment trick; it is becoming a core optimization stage inside the LLM pretraining loop. This points toward a future of mid-training RL, where large-scale synthetic or automatically verified datasets provide structured reward signals long before human-feedback fine-tuning.

🧩 Webscale-RL hints at a new pretraining paradigm, one that learns not just from text, but from reward.

Paper: bit.ly/3IFuMhf
Code: bit.ly/42AVpdX
Data: bit.ly/4h5lVBS
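The text-to-RL-task conversion described above can be sketched as a small pipeline. This is an illustrative assumption of the shape of such a system, with the LLM generator and LLM judge stubbed out so the sketch runs end to end; function names (`generate_qa`, `verify_qa`, `grpo_reward`) are hypothetical, not the released code's API.

```python
# Hedged sketch of the web-text -> verifiable-QA idea. The real pipeline
# calls LLMs for generation and judging; here they are simple stubs.

def generate_qa(document, llm):
    """Ask an LLM to extract a question with a short, checkable answer."""
    return llm(f"Write one QA pair grounded in this text:\n{document}")

def verify_qa(document, qa, llm_judge):
    """Keep only pairs an LLM judge confirms are answerable from the text."""
    return llm_judge(document, qa)

def webscale_rl_pipeline(corpus, llm, llm_judge):
    """Raw web text -> filtered, verifiable QA pairs for RL training."""
    dataset = []
    for doc in corpus:
        qa = generate_qa(doc, llm)
        if verify_qa(doc, qa, llm_judge):
            dataset.append(qa)
    return dataset

def grpo_reward(answer, reference):
    """Verifiable reward for RL (e.g. GRPO): exact match to the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

# Stub "LLMs" so the sketch is self-contained and runnable.
corpus = ["The Eiffel Tower is 330 metres tall."]
stub_llm = lambda prompt: ("How tall is the Eiffel Tower?", "330 metres")
stub_judge = lambda doc, qa: qa[1] in doc  # answer must appear in the source
dataset = webscale_rl_pipeline(corpus, stub_llm, stub_judge)
print(len(dataset), grpo_reward("330 metres", dataset[0][1]))  # 1 1.0
```

The key property is that every surviving QA pair carries an automatically checkable answer, so web-scale text becomes reward-bearing training data without human annotation.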

This signals a training paradigm shift: compared with continual pretraining, turning the pretraining text into RL tasks is a more effective approach, with up to 100x token savings! Breakthrough work led by @ZhepengCen

Today, my team at @SFResearch released Webscale-RL, a data-synthesis pipeline + 1.1M RL tasks that turn any web text into RL environments.
🎯 Same performance as pretraining using only 1% of tokens. 100x cost savings!
HF🤗: huggingface.co/datasets/Sales…
Any questions, let us know!

📣 Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels 📣

RL for LLMs faces a critical data bottleneck: existing RL datasets are <10B tokens while pretraining uses >1T tokens. Our Webscale-RL pipeline solves this by automatically converting pretraining documents into 1.2M verifiable QA pairs across 9+ domains.

📄 Paper: bit.ly/3IFuMhf
💻 Code: bit.ly/42AVpdX
📊 Dataset: bit.ly/4h5lVBS

📈 Results: 100× more token-efficient than continual pretraining, with significant performance gains on MMLU-Pro, BIG-Bench, and mathematical reasoning benchmarks

Work by Zhepeng Cen (@zhepengcen), Haolin Chen (@HaolinChen11), Shiyu Wang (@shiyu04490786), Zuxin Liu (@LiuZuxin), Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong (@CaimingXiong), Huan Wang (@huan__wang), Weiran Yao (@iscreamnearby)

#FutureOfAI #EnterpriseAI #ReinforcementLearning #MachineLearning

🚀 Scaling RL to Pretraining Levels with Webscale-RL

RL for LLMs has been bottlenecked by tiny datasets (<10B tokens) vs. pretraining (>1T). Our Webscale-RL pipeline converts pretraining text into diverse, RL-ready QA data, scaling RL to pretraining levels! All code and datasets are open-source!
Paper: arxiv.org/abs/2510.06499

✨ Key features:
• Converts a web-scale corpus into millions of verifiable QA pairs
• Preserves pretraining-level diversity across 9 domains
• Trains models up to 100× more token-efficiently than continual pretraining
• Powers the Webscale-RL dataset (1.2M examples) for scalable RL

Also special thanks to my colleagues in Salesforce AI Research @SFResearch! @HaolinChen11, Shiyu, @LiuZuxin, @huan__wang, @CaimingXiong, @iscreamnearby

🚀 Introducing UserRL: a new framework to train agents that truly assist users through proactive interaction, not just chase static benchmark scores.
📄 Paper: arxiv.org/pdf/2509.19736
💻 Code: github.com/SalesforceAIRe…


Introducing APIGen-MT: our agentic pipeline for multi-turn synthetic data generation that produces high-quality training data for tuning AI agents! Try our open-sourced dataset today!
📊 Paper: bit.ly/44tORzx
🤗 Dataset: bit.ly/3GHuQM5

We used APIGen-MT to train our xLAM-2 model family, including xLAM-2-70b-fc-r, still #1 on the BFCL leaderboard with 78.2% accuracy, outperforming frontier models like GPT-4o and Claude 3.5 in function-calling tasks, especially in challenging multi-turn scenarios.

🤝 We're open-sourcing 5K high-quality trajectories and trained models to advance AI agent research.
🧠 xLAM Model Family: bit.ly/4jyj2tu
🔍 BFCL: bit.ly/3WIZdY3

🇨🇦🇨🇦🇨🇦 Welcome to Vancouver! 🇨🇦🇨🇦🇨🇦 13 paper links below! 👇

The @Salesforce AI Research team brought a baker's dozen of AI research advancements to #NeurIPS2024 this year, from revolutionizing multimodal agents and time series forecasting to tackling responsible AI evaluation and deployment!

🎯 Attending? Follow us for poster sessions & presentation schedules!
📚 Can't make it? We've curated our complete research collection being showcased this week: bookmark and dive into the work that interests you most!
----
⭐ Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
📄 arxiv.org/pdf/2406.14852
⭐ INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness
📄 arxiv.org/pdf/2407.02518
⭐ MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
📄 arxiv.org/pdf/2406.11271
⭐ APIGen: An Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
📄 arxiv.org/pdf/2406.18518
⭐ Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference
📄 openreview.net/pdf?id=C2xCLze…
⭐ Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
📄 bit.ly/3ZuDC5N
⭐ ThinK: Thinner Key Cache by Query-Driven Pruning
📄 arxiv.org/pdf/2407.21018
⭐ Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
📄 arxiv.org/pdf/2410.10469
⭐ GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
📄 arxiv.org/pdf/2410.10393
⭐ UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting
📄 arxiv.org/pdf/2406.04975
⭐ Consent in Crisis: The Rapid Decline of the AI Data Commons
📄 arxiv.org/pdf/2407.14933
⭐ OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
📄 arxiv.org/pdf/2404.07972
⭐ Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
📄 arxiv.org/pdf/2407.10956