Yannis He

15 posts

Yannis He

@yannis__he

AI PM at @scale_ai | Previously @vectorInst, @UofTRobotics, @BainandCompany

San Francisco, CA · Joined July 2022
60 Following · 25 Followers

Pinned Tweet
Yannis He @yannis__he
We launched SWE Atlas at Coding Agents Conference 2026, held at the Computer History Museum.

After the success of SWE-Bench Pro, I kept asking myself: how do we encourage the industry to build towards the rest of the coding ecosystem? SWE-Bench Pro measures whether AI can resolve GitHub issues, but software engineering is much more than that. What about understanding a codebase you've never seen before? Writing tests that actually catch bugs? Refactoring code without breaking things? These are the skills that take engineers years to develop. And we haven't really had a way to measure them.

So we built one. Today, we released SWE Atlas: a benchmark to assess how agents understand, validate, and improve software systems inside real repositories. It contains 3 types of tasks:
• Codebase QnA: deep code comprehension and reasoning (live now)
• Test Writing: writing meaningful tests that exercise real functionality (coming soon)
• Refactoring: restructuring code while preserving behavior (coming soon)

The Codebase QnA track focuses on comprehension: 124 tasks across real production repos. No code changes allowed. Just exploration, execution, and understanding. As of today, the top model scores in the low 30% range. There's a lot of headroom.

I'm excited to see what kind of splash we can make with SWE Atlas. We hope this serves as a stepping stone for the community to build the next era of coding agents.

SWE Atlas Leaderboard: scale.com/leaderboard/sw…
Full Dataset: huggingface.co/datasets/Scale…
0 replies · 1 repost · 8 likes · 238 views
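For readers who want to poke at the Codebase QnA tasks, here is a minimal loading sketch in Python. The dataset ID and split name are assumptions, since the Hugging Face URL in the tweet is truncated:

# Minimal sketch: load the SWE Atlas Codebase QnA tasks from Hugging Face.
# NOTE: the dataset ID and split below are hypothetical -- the URL in the
# tweet is truncated, so substitute the real path from the huggingface.co link.
from datasets import load_dataset

qna = load_dataset("ScaleAI/swe-atlas-codebase-qna", split="test")  # hypothetical ID

# Each record should be a comprehension question about a real repository,
# to be answered by exploring and running the code without modifying it.
print(len(qna), qna[0].keys())
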
Yannis He @yannis__he
SWE-Bench Pro got into ICML. Who's going to Korea? 👀
Bing Liu @vbingliu

Excited to share that 4 papers from the @scale_AI research team have been accepted to ICML 2026 🎉 Building on our recent work at ICLR and ACL, this continues our push on eval and RL research grounded in real-world tasks.

🛠️ SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
arxiv.org/abs/2509.16941
Coding benchmark on enterprise-grade SWE tasks. It contains 1,865 tasks across 41 repositories (public & private), focusing on long-horizon tasks that require multi-file reasoning and patches. The tasks are contamination resistant by design. Key finding: performance drops sharply with task complexity. The biggest gap is not syntax or APIs, but deep codebase understanding and cross-file reasoning.

🔬 SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
arxiv.org/abs/2604.10718
We introduce a benchmark of 405 post-cutoff experimental results across 33 subdomains in physics, chemistry, and biology. Models perform near chance, but more concerningly, they are severely miscalibrated, often expressing high confidence in incorrect predictions. This highlights a fundamental gap between knowledge retrieval and true scientific reasoning.

📋 Online Rubrics Elicitation from Pairwise Comparisons
arxiv.org/abs/2510.07284
Static rubrics break as models improve. We propose OnlineRubrics, a method that dynamically elicits new rubric criteria during RL training by contrasting policy outputs with a reference model. Instead of fixing the reward upfront, the rubric evolves with the model, capturing reward-hacking behaviors, missing dimensions like transparency and causal reasoning, and failure modes that only emerge mid-training. This points toward a more adaptive approach to evaluation and reward design.

🤖 Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
arxiv.org/abs/2512.14895
We propose On-policy Expert Corrections (OEC), a data generation method that addresses covariate shift in multi-turn agent training. Instead of fine-tuning purely on offline expert trajectories, OEC starts rollouts with the student model and switches to the expert mid-trajectory, exposing the model to its own error states. On SWE-bench, OEC yields a 13-14% relative improvement over standard imitation learning across 7B and 32B model sizes.

More to come!

1 reply · 1 repost · 2 likes · 143 views
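A minimal sketch of the OEC data-generation loop described in the quoted thread above. The student/expert/env interfaces here are hypothetical stand-ins, not the paper's actual API:

import random

def generate_oec_trajectory(student, expert, env, max_turns=30):
    # Sketch of On-policy Expert Corrections (OEC) data generation:
    # roll out with the student first, then let the expert take over
    # mid-trajectory, so imitation targets are conditioned on error
    # states the student actually reaches (mitigating covariate shift).
    # `student`, `expert`, and `env` are hypothetical stand-ins.
    switch_turn = random.randint(1, max_turns - 1)
    trajectory, obs = [], env.reset()
    for turn in range(max_turns):
        policy = student if turn < switch_turn else expert
        action = policy.act(obs)
        trajectory.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    # Keep only the expert-corrected suffix as supervision.
    return trajectory[switch_turn:]
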
Yannis He @yannis__he
Hosting an in-person meetup for researchers and builders working on coding agents, at @scale_AI HQ in SF. (Beautiful office, come see it.)

Lightning talks, food, drinks, and time to meet other builders and cook up ideas.

Got a recent research discovery or product demo to share? Reach out to be a speaker too. Same if you know someone who fits. (Plz add a note when connecting.) Speakers (and the people who nominated them) get a small gift.

Share if it hits your network. Space is limited.

Register: luma.com/sf-agentic-cod…
0 replies · 0 reposts · 3 likes · 76 views
Yannis He @yannis__he
Ever have trust issues with your AI? 💔 Does your agent ship plausible-but-wrong results and claim it crushed it? Does it gamble through missing info, ambiguity, and contradictions instead of asking?

No benchmark measures agent judgment. So we built one. HiL-Bench (Human-in-Loop Benchmark) from @scale_AI measures one thing: does your agent have the judgment to know WHEN and HOW to ask for help?

What we found:
1/ Frontier models solve 89% of tasks with full info, but performance collapses to 4% when realistic gaps are introduced, even with an "ask human" tool available.
2/ Each model fails differently. GPT executes confidently on wrong beliefs. Claude detects it's stuck but submits anyway. Gemini asks often but too broadly.
3/ Judgment is trainable. RL on our Ask-F1 reward improved the Qwen3-32B model by 17 points on SWE tasks and 28 points on SQL tasks.
4/ The skill generalizes. A model trained only on SQL improved on unseen SWE tasks. It didn't learn domain patterns. It learned to detect uncertainty and act on it.

Selective escalation, not full autonomy, is the defining skill of a reliable agent.

📝 Blog: scale.com/blog/hil
📊 Leaderboard: labs.scale.com/leaderboard/hil
📄 Paper: arxiv.org/abs/2604.09408
🤗 Data: huggingface.co/datasets/Scale…
💻 Code: github.com/hilbenchauthor…
0 replies · 0 reposts · 3 likes · 40 views
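The tweet names an "Ask-F1" reward without defining it. One plausible reading, sketched below, is F1 computed over the agent's ask/don't-ask decisions against labels marking where information was truly missing. This is a reconstruction, not HiL-Bench's published metric:

def ask_f1(asked: list[bool], should_ask: list[bool]) -> float:
    # Hypothetical reconstruction of an Ask-F1 style reward: treat
    # "asking the human" as a binary prediction per step and score it
    # with F1 against ground-truth labels of where info was missing.
    tp = sum(a and s for a, s in zip(asked, should_ask))      # asked when needed
    fp = sum(a and not s for a, s in zip(asked, should_ask))  # asked needlessly
    fn = sum(s and not a for a, s in zip(asked, should_ask))  # failed to ask
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: agent asked on steps 0 and 2; help was needed on steps 2 and 3.
print(ask_f1([True, False, True, False], [False, False, True, True]))  # 0.5

This kind of reward penalizes both gambling through gaps (missed asks) and spamming the human with questions (needless asks), matching the "WHEN and HOW to ask" framing.
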
Yannis He retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
725 replies · 1.2K reposts · 10.3K likes · 4.5M views
Yannis He @yannis__he
1/ Launching SWE Atlas - Test Writing, our 2nd leaderboard in @scale_AI's SWE Atlas evaluation suite for coding agents. 90 tasks. 11 production repos. Go, Python, C, TypeScript. Prompts written by engineers in natural language, intentionally underspecified. The agent figures out the rest.

2/ Every submission goes through 3 evaluation steps:
→ Manifest check: does the agent accurately describe what it wrote?
→ Mutation testing: tests must pass on original code, fail when relevant code is removed
→ Expert rubric grading: human-authored rubrics on comprehensiveness, placement, conventions
You can't game this with test spam.

3/ Findings:
→ Models don't fail by writing bad tests. They fail by missing the right ones.
→ Models are great at describing what they wrote. The hard part is knowing what to write.
→ Test placement is still a real challenge. Models put tests in new files or wrong directories instead of where they belong.

4/ Fully open source: github.com/scaleapi/SWE-A…
Leaderboard: labs.scale.com/leaderboard/sw…
0 replies · 0 reposts · 2 likes · 80 views
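A minimal sketch of the mutation-testing check described in step 2 above: a submitted suite only counts if it passes on the original code and fails once the code it targets is knocked out. The pytest-based harness and function names are illustrative, not the benchmark's actual implementation:

import subprocess

def run_tests(repo_dir: str) -> bool:
    # Return True if the submitted test suite passes in repo_dir.
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0

def mutation_check(original_repo: str, mutated_repo: str) -> bool:
    # Illustrative version of the tweet's mutation-testing step:
    # tests must PASS on the original code and FAIL on a copy where
    # the relevant code has been removed. A suite that passes in both
    # cases exercises nothing, so test spam earns no credit.
    passes_on_original = run_tests(original_repo)
    fails_on_mutant = not run_tests(mutated_repo)
    return passes_on_original and fails_on_mutant
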
Yannis He @yannis__he
Great to see @OpenAI pushing our SWE-Bench Pro forward. Yesterday we shared results on the private set (linked below), built from proprietary commercial codebases that have never been in any training data.

Curious how GPT-5.3-Codex stacks up there? Stay tuned.

Current ranking for private set ⬇️
x.com/yannis__he/sta…
Sam Altman @sama

GPT-5.3-Codex is here!
*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.

0 replies · 2 reposts · 4 likes · 437 views
Yannis He @yannis__he
We evaluated the coding capabilities of the latest models from @OpenAI, @AnthropicAI, and @GoogleDeepMind on @scale_AI's SWE-Bench Pro private set: a coding agent benchmark built exclusively from proprietary commercial codebases.

Each new model beats its predecessor:
- GPT-5.2: 23.8% (↑ from 14.9%)
- Claude Opus 4.5: 23.4% (↑ from 17.8%)
- Gemini 3 Pro: 18.0% (↑ from 10.1%)

But on public repos, these same models score 40-46%. This ~2x performance difference tells us something important: progress is real, and so is the generalization gap.

What drove these improvements, and what drives the next one? Better reasoning? More diverse training data? What coding capabilities do you think models should tackle next?

scale.com/leaderboard/sw…
0 replies · 2 reposts · 9 likes · 4.8K views
Yannis He @yannis__he
It was a fun journey to lead the creation of this benchmark.
Alexandr Wang @alexandr_wang

New, very needed benchmark from @scale_AI: SWE-Bench Pro

Includes:
- Multi-file edits
- 100+ lines changed on average
- Complex dependencies across large codebases

Current top model scores:
- GPT-5: 23.3%
- Claude Opus 4.1: 22.7%
- Others drop further (<15%)

0 replies · 0 reposts · 1 like · 78 views