


Yannis He

@yannis__he
AI PM at @scale_ai | Previously @vectorInst, @UofTRobotics, @BainandCompany




Excited to share that 4 papers from the @scale_AI research team have been accepted to ICML 2026 🎉 Building on our recent work at ICLR and ACL, this continues our push on eval and RL research grounded in real-world tasks.

🛠️ SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
arxiv.org/abs/2509.16941
A coding benchmark of enterprise-grade SWE tasks: 1,865 tasks across 41 repositories (public & private), focusing on long-horizon tasks that require multi-file reasoning and patches, and contamination-resistant by design. Key finding: performance drops sharply with task complexity, and the biggest gap is not syntax or APIs but deep codebase understanding and cross-file reasoning.

🔬 SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
arxiv.org/abs/2604.10718
We introduce a benchmark of 405 post-cutoff experimental results across 33 subdomains in physics, chemistry, and biology. Models perform near chance, and, more concerningly, they are severely miscalibrated, often expressing high confidence in incorrect predictions. This highlights a fundamental gap between knowledge retrieval and true scientific reasoning.

📋 Online Rubrics Elicitation from Pairwise Comparisons
arxiv.org/abs/2510.07284
Static rubrics break as models improve. We propose OnlineRubrics, a method that dynamically elicits new rubric criteria during RL training by contrasting policy outputs with those of a reference model. Instead of fixing the reward upfront, the rubric evolves with the model, capturing reward-hacking behaviors, missing dimensions like transparency and causal reasoning, and failure modes that only emerge mid-training. This points toward a more adaptive approach to evaluation and reward design (a minimal sketch of the loop is below).

🤖 Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
arxiv.org/abs/2512.14895
We propose On-policy Expert Corrections (OEC), a data-generation method that addresses covariate shift in multi-turn agent training. Instead of fine-tuning purely on offline expert trajectories, OEC starts rollouts with the student model and switches to the expert mid-trajectory, exposing the model to its own error states (also sketched below). On SWE-bench, OEC yields a 13-14% relative improvement over standard imitation learning across 7B and 32B model sizes.

More to come!
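For a concrete feel of the OnlineRubrics loop, here is a minimal Python sketch of the idea as described above: contrast policy and reference outputs, distill the contrast into new rubric criteria, and reward against the evolving rubric. Every name here (Rubric, StubModel, StubJudge, elicit_criteria, score) is a hypothetical stand-in for illustration, not the paper's API.

```python
"""Minimal sketch of an online rubric-elicitation step, under the
assumptions stated above; see arxiv.org/abs/2510.07284 for the method."""

from dataclasses import dataclass, field


@dataclass
class Rubric:
    criteria: list[str] = field(default_factory=list)

    def merge(self, proposed: list[str]) -> None:
        # Grow the rubric with unseen criteria, so failure modes that
        # only emerge mid-training become scoreable.
        for c in proposed:
            if c not in self.criteria:
                self.criteria.append(c)


class StubModel:
    """Placeholder for a policy or reference LM."""

    def __init__(self, style: str) -> None:
        self.style = style

    def generate(self, prompt: str) -> str:
        return f"[{self.style}] answer to: {prompt}"

    def update(self, prompt: str, output: str, reward: float) -> None:
        pass  # a real implementation takes an RL step here (e.g. PPO/GRPO)


class StubJudge:
    """Placeholder for an LLM judge."""

    def elicit_criteria(self, prompt: str, a: str, b: str) -> list[str]:
        # A real judge compares the two outputs, explains which is better
        # and why, and distills that explanation into candidate criteria.
        return ["states assumptions transparently"] if a != b else []

    def score(self, prompt: str, output: str, criteria: list[str]) -> float:
        # A real judge grades the output against each criterion.
        return 0.5


def online_rubrics_step(policy, reference, judge, rubric: Rubric, prompt: str) -> float:
    policy_out = policy.generate(prompt)
    reference_out = reference.generate(prompt)
    # Pairwise contrast elicits *new* criteria instead of fixing the
    # reward specification before training starts.
    rubric.merge(judge.elicit_criteria(prompt, policy_out, reference_out))
    # Reward the policy output under the current, evolving rubric.
    reward = judge.score(prompt, policy_out, rubric.criteria)
    policy.update(prompt, policy_out, reward)
    return reward


if __name__ == "__main__":
    rubric = Rubric()
    online_rubrics_step(StubModel("policy"), StubModel("reference"),
                        StubJudge(), rubric, "Explain covariate shift.")
    print(rubric.criteria)  # rubric has grown by one elicited criterion
```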
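Likewise, a hedged sketch of the OEC data-generation recipe: roll out with the student, hand control to the expert partway through, and keep the expert's continuation as imitation targets conditioned on states the student actually reaches. The random handoff point and training on only the expert segment are my assumptions; Agent, Env, and oec_trajectory are illustrative names, not the paper's code.

```python
"""Sketch of On-policy Expert Corrections (OEC) data generation, under the
assumptions stated above; see arxiv.org/abs/2512.14895 for the method."""

import random
from dataclasses import dataclass


@dataclass
class Step:
    observation: str
    action: str


class Agent:
    """Placeholder for a student or expert LM agent."""

    def __init__(self, name: str) -> None:
        self.name = name

    def act(self, observation: str) -> str:
        return f"{self.name}-action({observation})"


class Env:
    """Placeholder multi-turn environment (e.g. a SWE task)."""

    def __init__(self, task: str) -> None:
        self.state = task

    def step(self, action: str) -> str:
        self.state = f"{self.state}|{action}"
        return self.state


def oec_trajectory(student: Agent, expert: Agent, env: Env,
                   horizon: int = 8, switch_step: int | None = None) -> list[Step]:
    """Generate one OEC trajectory.

    The prefix comes from the student, so the expert's corrections are
    conditioned on the student's own error states; that on-policy prefix
    is what mitigates covariate shift relative to purely offline data.
    """
    if switch_step is None:
        switch_step = random.randrange(1, horizon)  # assumed: random handoff
    trajectory, obs = [], env.state
    for t in range(horizon):
        actor = student if t < switch_step else expert  # expert takes over
        action = actor.act(obs)
        trajectory.append(Step(obs, action))
        obs = env.step(action)
    # Keep only the expert segment as imitation targets (an assumption:
    # the student prefix could also be retained as context only).
    return trajectory[switch_step:]


if __name__ == "__main__":
    corrections = oec_trajectory(Agent("student"), Agent("expert"), Env("fix-bug"))
    print(f"{len(corrections)} expert-correction steps collected")
```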







Welcome to the home of all things @scale_AI research — focused on data, evaluation, safety, and post-training that moves frontier models forward. We’ll share benchmarks, insights, and work intended to be useful to the broader research community. labs.scale.com/?utm_source=hu…


The standard for frontier coding evals is changing as models mature. We now recommend reporting SWE-bench Pro, and we're sharing more detail on why we're no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we've found evidence it is now saturated due to test-design issues and contamination from public repositories. openai.com/index/why-we-n…

GPT-5.3-Codex is here!
- Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
- Mid-task steerability and live updates during tasks.
- Faster! Less than half the tokens of 5.2-Codex for the same tasks, and >25% faster per token!
- Good computer use.



New, much-needed benchmark from @scale_AI: SWE-Bench Pro
Includes:
- Multi-file edits
- 100+ lines changed on average
- Complex dependencies across large codebases
Current top model scores:
- GPT-5: 23.3%
- Claude Opus 4.1: 22.7%
- Others drop further (<15%)