Yannis He

15 posts

Yannis He

@yannis__he

AI PM at @scale_ai | Previously @vectorInst, @UofTRobotics, @BainandCompany

San Francisco, CA · Joined July 2022
60 Following · 25 Followers

Pinned Tweet
Yannis He @yannis__he
We launched SWE Atlas at Coding Agents Conference 2026, held at the Computer History Museum.

After the success of SWE-Bench Pro, I kept asking myself: how do we encourage the industry to build towards the rest of the coding ecosystem? SWE-Bench Pro measures whether AI can resolve GitHub issues, but software engineering is much more than that. What about understanding a codebase you've never seen before? Writing tests that actually catch bugs? Refactoring code without breaking things? These are the skills that take engineers years to develop. And we haven't really had a way to measure them.

So we built one. Today, we released SWE Atlas: a benchmark to assess how agents understand, validate, and improve software systems inside real repositories. It contains 3 types of tasks:
• Codebase QnA: deep code comprehension and reasoning (live now)
• Test Writing: writing meaningful tests that exercise real functionality (coming soon)
• Refactoring: restructuring code while preserving behavior (coming soon)

The Codebase QnA track focuses on comprehension: 124 tasks across real production repos. No code changes allowed. Just exploration, execution, and understanding. As of today, the top model scores in the low 30% range. There's a lot of headroom.

I'm excited to see what kind of splash we can make with SWE Atlas. We hope this serves as a stepping stone for the community to build the next era of coding agents.

SWE Atlas Leaderboard: scale.com/leaderboard/sw…
Full Dataset: huggingface.co/datasets/Scale…
0 replies · 1 repost · 8 likes · 238 views
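For readers who want to poke at the Codebase QnA tasks, here is a minimal loading sketch in Python. The dataset ID and split name are assumptions, since the Hugging Face URL in the tweet is truncated:

# Minimal sketch: load the SWE Atlas Codebase QnA tasks from Hugging Face.
# NOTE: the dataset ID and split below are hypothetical -- the URL in the
# tweet is truncated, so substitute the real path from the huggingface.co link.
from datasets import load_dataset

qna = load_dataset("ScaleAI/swe-atlas-codebase-qna", split="test")  # hypothetical ID

# Each record should be a comprehension question about a real repository,
# to be answered by exploring and running the code without modifying it.
print(len(qna), qna[0].keys())
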
Yannis He @yannis__he
SWE-Bench Pro got into ICML. Who's going to Korea? 👀
Bing Liu @vbingliu

Excited to share that 4 papers from the @scale_AI research team have been accepted to ICML 2026 🎉 Building on our recent work at ICLR and ACL, this continues our push on eval and RL research grounded in real-world tasks.

🛠️ SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
arxiv.org/abs/2509.16941
Coding benchmark on enterprise-grade SWE tasks. It contains 1,865 tasks across 41 repositories (public & private), focusing on long-horizon tasks that require multi-file reasoning and patches. The tasks are contamination resistant by design. Key finding: performance drops sharply with task complexity. The biggest gap is not syntax or APIs, but deep codebase understanding and cross-file reasoning.

🔬 SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
arxiv.org/abs/2604.10718
We introduce a benchmark of 405 post-cutoff experimental results across 33 subdomains in physics, chemistry, and biology. Models perform near chance, but more concerningly, they are severely miscalibrated, often expressing high confidence in incorrect predictions. This highlights a fundamental gap between knowledge retrieval and true scientific reasoning.

📋 Online Rubrics Elicitation from Pairwise Comparisons
arxiv.org/abs/2510.07284
Static rubrics break as models improve. We propose OnlineRubrics, a method that dynamically elicits new rubric criteria during RL training by contrasting policy outputs with a reference model. Instead of fixing the reward upfront, the rubric evolves with the model, capturing reward-hacking behaviors, missing dimensions like transparency and causal reasoning, and failure modes that only emerge mid-training. This points toward a more adaptive approach to evaluation and reward design.

🤖 Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
arxiv.org/abs/2512.14895
We propose On-policy Expert Corrections (OEC), a data generation method that addresses covariate shift in multi-turn agent training. Instead of fine-tuning purely on offline expert trajectories, OEC starts rollouts with the student model and switches to the expert mid-trajectory, exposing the model to its own error states. On SWE-bench, OEC yields a 13-14% relative improvement over standard imitation learning across 7B and 32B model sizes.

More to come!

1 reply · 1 repost · 2 likes · 143 views
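A minimal sketch of the OEC data-generation loop described in the quoted thread above. The student/expert/env interfaces here are hypothetical stand-ins, not the paper's actual API:

import random

def generate_oec_trajectory(student, expert, env, max_turns=30):
    # Sketch of On-policy Expert Corrections (OEC) data generation:
    # roll out with the student first, then let the expert take over
    # mid-trajectory, so imitation targets are conditioned on error
    # states the student actually reaches (mitigating covariate shift).
    # `student`, `expert`, and `env` are hypothetical stand-ins.
    switch_turn = random.randint(1, max_turns - 1)
    trajectory, obs = [], env.reset()
    for turn in range(max_turns):
        policy = student if turn < switch_turn else expert
        action = policy.act(obs)
        trajectory.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    # Keep only the expert-corrected suffix as supervision.
    return trajectory[switch_turn:]
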
Yannis He @yannis__he
Hosting an in-person meetup for researchers and builders working on coding agents, at @scale_AI HQ in SF. (Beautiful office, come see it.)

Lightning talks, food, drinks, and time to meet other builders and cook up ideas.

Got a recent research discovery or product demo to share? Reach out to be a speaker too. Same if you know someone who fits. (Plz add a note when connecting.) Speakers (and the people who nominated them) get a small gift.

Share if it hits your network. Space is limited.

Register: luma.com/sf-agentic-cod…
0 replies · 0 reposts · 3 likes · 76 views
Yannis He @yannis__he
Ever have trust issues with your AI? 💔 Does your agent ship plausible-but-wrong results and claim it crushed it? Does it gamble through missing info, ambiguity, and contradictions instead of asking?

No benchmark measures agent judgment. So we built one. HiL-Bench (Human-in-Loop Benchmark) from @scale_AI measures one thing: does your agent have the judgment to know WHEN and HOW to ask for help?

What we found:
1/ Frontier models solve 89% of tasks with full info, but performance collapses to 4% when realistic gaps are introduced, even with an "ask human" tool available.
2/ Each model fails differently. GPT executes confidently on wrong beliefs. Claude detects it's stuck but submits anyway. Gemini asks often but too broadly.
3/ Judgment is trainable. RL on our Ask-F1 reward improved the Qwen3-32B model by 17 points on SWE tasks and 28 points on SQL tasks.
4/ The skill generalizes. A model trained only on SQL improved on unseen SWE tasks. It didn't learn domain patterns. It learned to detect uncertainty and act on it.

Selective escalation, not full autonomy, is the defining skill of a reliable agent.

📝 Blog: scale.com/blog/hil
📊 Leaderboard: labs.scale.com/leaderboard/hil
📄 Paper: arxiv.org/abs/2604.09408
🤗 Data: huggingface.co/datasets/Scale…
💻 Code: github.com/hilbenchauthor…
0 replies · 0 reposts · 3 likes · 40 views
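The tweet names an "Ask-F1" reward without defining it. One plausible reading, sketched below, is F1 computed over the agent's ask/don't-ask decisions against labels marking where information was truly missing. This is a reconstruction, not HiL-Bench's published metric:

def ask_f1(asked: list[bool], should_ask: list[bool]) -> float:
    # Hypothetical reconstruction of an Ask-F1 style reward: treat
    # "asking the human" as a binary prediction per step and score it
    # with F1 against ground-truth labels of where info was missing.
    tp = sum(a and s for a, s in zip(asked, should_ask))      # asked when needed
    fp = sum(a and not s for a, s in zip(asked, should_ask))  # asked needlessly
    fn = sum(s and not a for a, s in zip(asked, should_ask))  # failed to ask
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: agent asked on steps 0 and 2; help was needed on steps 2 and 3.
print(ask_f1([True, False, True, False], [False, False, True, True]))  # 0.5

This kind of reward penalizes both gambling through gaps (missed asks) and spamming the human with questions (needless asks), matching the "WHEN and HOW to ask" framing.
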
Yannis He retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
725 replies · 1.2K reposts · 10.3K likes · 4.5M views
Yannis He @yannis__he
1/ Launching SWE Atlas - Test Writing, our 2nd leaderboard in @scale_AI's SWE Atlas evaluation suite for coding agents. 90 tasks. 11 production repos. Go, Python, C, TypeScript. Prompts written by engineers in natural language, intentionally underspecified. The agent figures out the rest.

2/ Every submission goes through 3 evaluation steps:
→ Manifest check: does the agent accurately describe what it wrote?
→ Mutation testing: tests must pass on original code, fail when relevant code is removed
→ Expert rubric grading: human-authored rubrics on comprehensiveness, placement, conventions
You can't game this with test spam.

3/ Findings:
→ Models don't fail by writing bad tests. They fail by missing the right ones.
→ Models are great at describing what they wrote. The hard part is knowing what to write.
→ Test placement is still a real challenge. Models put tests in new files or wrong directories instead of where they belong.

4/ Fully open source: github.com/scaleapi/SWE-A…
Leaderboard: labs.scale.com/leaderboard/sw…
0 replies · 0 reposts · 2 likes · 80 views
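A minimal sketch of the mutation-testing check described in step 2 above: a submitted suite only counts if it passes on the original code and fails once the code it targets is knocked out. The pytest-based harness and function names are illustrative, not the benchmark's actual implementation:

import subprocess

def run_tests(repo_dir: str) -> bool:
    # Return True if the submitted test suite passes in repo_dir.
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0

def mutation_check(original_repo: str, mutated_repo: str) -> bool:
    # Illustrative version of the tweet's mutation-testing step:
    # tests must PASS on the original code and FAIL on a copy where
    # the relevant code has been removed. A suite that passes in both
    # cases exercises nothing, so test spam earns no credit.
    passes_on_original = run_tests(original_repo)
    fails_on_mutant = not run_tests(mutated_repo)
    return passes_on_original and fails_on_mutant
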
Yannis He @yannis__he
Great to see @OpenAI pushing our SWE-Bench Pro forward. Yesterday we shared results on the private set (linked below), built from proprietary commercial codebases that have never been in any training data.

Curious how GPT-5.3-Codex stacks up there? Stay tuned.

Current ranking for private set ⬇️
x.com/yannis__he/sta…
Sam Altman @sama

GPT-5.3-Codex is here!
*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.

0 replies · 2 reposts · 4 likes · 437 views
Yannis He @yannis__he
We evaluated the coding capabilities of the latest models from @OpenAI, @AnthropicAI, and @GoogleDeepMind on @scale_AI's SWE-Bench Pro private set: a coding agent benchmark built exclusively from proprietary commercial codebases.

Each new model beats its predecessor:
- GPT-5.2: 23.8% (↑ from 14.9%)
- Claude Opus 4.5: 23.4% (↑ from 17.8%)
- Gemini 3 Pro: 18.0% (↑ from 10.1%)

But on public repos, these same models score 40-46%. This ~2x performance difference tells us something important: progress is real, and so is the generalization gap.

What drove these improvements, and what drives the next one? Better reasoning? More diverse training data? What coding capabilities do you think models should tackle next?

scale.com/leaderboard/sw…
0 replies · 2 reposts · 9 likes · 4.8K views
Yannis He @yannis__he
It was a fun journey to lead the creation of this benchmark.
Alexandr Wang @alexandr_wang

New, very needed benchmark from @scale_AI: SWE-Bench Pro

Includes:
- Multi-file edits
- 100+ lines changed on average
- Complex dependencies across large codebases

Current top model scores:
- GPT-5: 23.3%
- Claude Opus 4.1: 22.7%
- Others drop further (<15%)

0 replies · 0 reposts · 1 like · 78 views