Scale AI

2.2K posts

@scale_AI

making AI work

Joined July 2016
482 Following · 73.7K Followers
Scale AI @scale_AI
Hello from our new NYC office! 🗽 We’re officially settled in at One World Trade and excited to keep growing our global team. Want to be part of shaping the future of AI? Join us: scale.com/careers
6 replies · 1 repost · 85 likes · 7.1K views
Scale AI retweeted
Scale Labs @ScaleAILabs
Versioning, Rewards, and Observations (VeRO) is a new @scale_AI framework for studying a simple question: can coding agents improve other AI agents by editing their prompts, tools, and workflows? Instead of treating this as prompt engineering, VeRO frames agent optimization as a coding agent problem.
3 replies · 10 reposts · 44 likes · 4.2K views
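For readers who want the shape of the idea, here is a minimal sketch of an edit-evaluate loop in the spirit of what the tweet describes; the `AgentConfig` fields, `propose_edit`, and `evaluate` below are illustrative assumptions, not VeRO's actual interfaces.

```python
# Hypothetical sketch of treating agent optimization as a coding-agent
# problem: a coding agent proposes edits to a target agent's configuration
# (prompt, tools, workflow) and an edit is kept only if the target's
# benchmark score improves. All names and logic are illustrative.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentConfig:
    system_prompt: str
    tools: tuple[str, ...]

def evaluate(config: AgentConfig) -> float:
    """Toy stand-in for running the target agent on a held-out task suite."""
    score = 0.0
    if "cite your sources" in config.system_prompt:
        score += 0.5
    score += 0.1 * len(config.tools)
    return score

def propose_edit(config: AgentConfig, step: int) -> AgentConfig:
    """Toy stand-in for the coding agent; in a real system this would be
    an LLM reading traces and rewriting prompts/tools/workflows."""
    if step % 2 == 0:
        return replace(config, system_prompt=config.system_prompt + " Always cite your sources.")
    return replace(config, tools=config.tools + (f"tool_{step}",))

config = AgentConfig(system_prompt="You answer questions.", tools=("search",))
best = evaluate(config)
for step in range(4):
    candidate = propose_edit(config, step)
    score = evaluate(candidate)
    if score > best:  # hill-climb: keep only edits that improve the score
        config, best = candidate, score
print(config, best)
```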
Scale AI @scale_AI
Introducing SWE-Atlas. We built SWE-Atlas as the next evolution of SWE-Bench Pro, expanding agent evaluation beyond change accuracy to better reflect the real, interactive workflows that define software development.

Results for Codebase QnA, the first eval under SWE-Atlas, are now available. It measures how agents understand complex codebases through runtime analysis and multi-file reasoning. Top models score only ~30%. scale.com/blog/swe-atlas
18 replies · 56 reposts · 528 likes · 54.7K views
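A rough sketch of what a codebase-QA style check could look like, under stated assumptions: the task schema and the key-fact grading rule below are made up for illustration and are not the SWE-Atlas implementation (which involves runtime analysis, not string matching).

```python
# Illustrative harness for a codebase-QA eval: ask an agent a question
# that requires reading several files, then grade the free-text answer
# against reference key facts. Schema and grading rule are assumptions.
from dataclasses import dataclass

@dataclass
class QATask:
    question: str
    relevant_files: list[str]  # answering correctly requires all of these
    key_facts: list[str]       # facts a correct answer must mention

def grade(answer: str, task: QATask) -> float:
    """Fraction of reference key facts present in the answer."""
    answer_lower = answer.lower()
    hits = sum(fact.lower() in answer_lower for fact in task.key_facts)
    return hits / len(task.key_facts)

task = QATask(
    question="Which function invalidates the cache after a write?",
    relevant_files=["store/cache.py", "store/writer.py"],
    key_facts=["invalidate_entry", "Writer.flush"],
)
agent_answer = "Writer.flush calls invalidate_entry on the cache."
print(grade(agent_answer, task))  # 1.0
```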
Scale AI retweeted
Jason Droege @jdroege
AI agents are getting put to work on real tasks. Training them to do that well is harder than it looks.

For the past year at @scale_AI, we've been building environments where agents can practice real workflows - simulated worlds that mirror real software, real processes, and real decisions. They can fail, try again, and get better before they touch production. This process is how these agents eventually become reliable for the world's most important decisions.

Today we're officially launching Scale RL Environments. Nearly half of our new training projects already run through them. We're building more this year.

See what your agents can learn: scale.com/blog/rl-enviro…
3 replies · 6 reposts · 31 likes · 4.4K views
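As a sketch of the general pattern, a simulated-workflow environment can expose the familiar reset/step interface; the toy task, class name, and reward rule below are assumptions, whereas real Scale RL Environments would wrap actual software and processes.

```python
# Minimal gym-style interface for a simulated workflow environment.
# Everything here is illustrative: the agent observes a state, acts,
# receives a reward, and can reset to try again before touching production.
class TicketTriageEnv:
    """Toy environment: the agent must route a ticket to the right queue."""
    QUEUES = ["billing", "bug", "account"]

    def reset(self) -> str:
        self.ticket = "I was charged twice this month."
        self.done = False
        return self.ticket  # observation

    def step(self, action: str) -> tuple[str, float, bool]:
        assert not self.done
        self.done = True
        reward = 1.0 if action == "billing" else 0.0
        return "ticket routed", reward, self.done

env = TicketTriageEnv()
obs = env.reset()
_, reward, done = env.step("billing")  # fail, reset, and retry at will
print(reward, done)  # 1.0 True
```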
Scale AI retweeted
Bing Liu @vbingliu
Congrats to the @OpenAI team for topping the Audio MultiChallenge speech-to-speech (S2S) leaderboard with the latest gpt-realtime-1.5 release.

Audio MultiChallenge (Audio MC) is a benchmark released by @scale_AI to evaluate how end-to-end speech models handle real-world, multi-turn human conversations. The benchmark consists of 452 multi-turn dialogs from 47 speakers, covering four core capabilities:

1. Voice Editing
2. Instruction Retention
3. Inference Memory
4. Self-Coherence

Key observations on gpt-realtime-1.5:

→ S2S performance surpasses its S2T configuration, bridging the "modality gap" we observe in most existing models
→ Strong gains in Voice Editing, particularly in handling user hesitations and mid-utterance speech repairs
→ ~15% improvement in Instruction Retention / Instruction Following over the previous gpt-realtime model
→ Continued challenges in audio-cue inference and long-horizon memory

We hope Audio MultiChallenge continues to serve as a testbed for the AI research community to measure progress in natural, multi-turn spoken dialog systems.

📄 Paper: arxiv.org/pdf/2512.14865
📂 Dataset: huggingface.co/datasets/Scale…
📊 Speech-to-Speech (S2S) Leaderboard: scale.com/leaderboard/au…
📊 Speech-to-Text (S2T) Leaderboard: scale.com/leaderboard/au…
🎧 Hear our researchers discuss this benchmark and the frontier of audio model evals: youtube.com/watch?v=9O2_Ff…
Peter Bakkum @pbbakkum

gpt-realtime-1.5 is the best native audio model on the Scale AudioMultiChallenge benchmark -- this is a significant jump in capability by this measure. There are models that outperform it but they are reasoning models without native audio output.

2 replies · 6 reposts · 50 likes · 14.3K views
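To make the benchmark structure concrete, here is a hypothetical aggregation that rolls per-dialog pass/fail results up into the four capability scores listed above; the record format is an assumption for illustration, not the released dataset schema.

```python
# Hypothetical aggregation for a multi-turn audio benchmark: per-dialog
# pass/fail results grouped into per-capability scores. Illustrative only.
from collections import defaultdict

results = [  # (capability, passed) per evaluated dialog
    ("Voice Editing", True), ("Voice Editing", False),
    ("Instruction Retention", True), ("Inference Memory", False),
    ("Self-Coherence", True), ("Self-Coherence", True),
]

by_capability: dict[str, list[bool]] = defaultdict(list)
for capability, passed in results:
    by_capability[capability].append(passed)

for capability, outcomes in by_capability.items():
    print(f"{capability}: {sum(outcomes) / len(outcomes):.0%} "
          f"({len(outcomes)} dialogs)")
```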
Scale AI @scale_AI
Honored to see SWE-Bench Pro recognized as the new standard for frontier coding evals. 💜 We built it to address saturation and contamination in earlier benchmarks — raising the bar with a more rigorous measure of agents' problem-solving capabilities and a clearer view of real-world software engineering progress.
OpenAI Developers @OpenAIDevs

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. openai.com/index/why-we-n…

6 replies · 9 reposts · 42 likes · 5.5K views
Scale AI @scale_AI
Lots of new models have shipped recently. 👀 When the conversation turns from “what launched” to “how they perform,” the reference points are our leaderboards. Here’s a recap of the latest models evaluated ⬇️
5 replies · 4 reposts · 19 likes · 4.1K views
Scale AI retweeted
Sam Denton @samueldenton
2026 is the year agents learn when to ask for help.

Our research from @scale_AI introduces Long Horizon Augmented Workflows (LHAW), a synthetic data generation pipeline for creating underspecification on ANY dataset (yes, you can try this at home!) and evaluating how agents act 🧵
4 replies · 11 reposts · 77 likes · 8.1K views
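The core trick, generating underspecified variants of otherwise complete tasks, can be sketched in a few lines; the field names and the masking rule here are assumptions for illustration, not the LHAW pipeline itself.

```python
# Sketch of underspecification: take a fully specified task and drop
# required details, so a well-behaved agent must notice the gap and ask
# for help instead of guessing. Field names and rule are illustrative.
import random

def underspecify(task: dict, n_drop: int = 1, seed: int = 0) -> dict:
    """Return a copy of `task` with n_drop required fields removed."""
    rng = random.Random(seed)
    required = [k for k in task if k != "instruction"]
    dropped = rng.sample(required, k=min(n_drop, len(required)))
    return {k: v for k, v in task.items() if k not in dropped}

task = {
    "instruction": "Book a flight",
    "destination": "SFO",
    "date": "2026-03-14",
    "budget_usd": 450,
}
print(underspecify(task))
# The eval then checks whether the agent asks for the missing field.
```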
Scale AI retweeted
Calvin Zhang @calvincbzhang
New paper from @scale_AI & @MeridianAgent: SpreadsheetArena 📄

We evaluated 16 LLMs on end-to-end spreadsheet generation via 4,300+ blind pairwise votes. Crucially, we move beyond scalar Elo ratings to decompose the latent preference signal into functional, structural, and stylistic components. 🧵
Spreadsheet Arena @sheetarena

Spreadsheets have entered the arena! ⚔️ Announcing Spreadsheet Arena, the first research platform for human preference rankings on LLM-generated spreadsheets. The results? @AnthropicAI Claude Opus is on top, but the gap is tighter than you’d think. w/ @LTIatCMU, @Cornell, and @scale_ai. 🧵

2 replies · 4 reposts · 32 likes · 7.5K views
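For context on how arena-style ratings come out of blind pairwise votes, here is a small worked example fitting a Bradley-Terry model (the standard basis for Elo-style leaderboards) by gradient ascent; the vote data is made up, and this is not the SpreadsheetArena codebase.

```python
# Worked example: turn blind pairwise votes into ratings with the
# Bradley-Terry model, where P(i beats j) = sigmoid(r_i - r_j).
import math

votes = [  # (winner, loser) from blind pairwise comparisons (made up)
    ("model_a", "model_b"), ("model_a", "model_c"),
    ("model_b", "model_c"), ("model_a", "model_b"),
    ("model_c", "model_b"), ("model_b", "model_a"),
]
models = sorted({m for pair in votes for m in pair})
rating = {m: 0.0 for m in models}  # log-strengths, start equal

lr = 0.1
for _ in range(500):  # gradient ascent on the Bradley-Terry likelihood
    grad = {m: 0.0 for m in models}
    for winner, loser in votes:
        p_win = 1 / (1 + math.exp(rating[loser] - rating[winner]))
        grad[winner] += 1 - p_win
        grad[loser] -= 1 - p_win
    for m in models:
        rating[m] += lr * grad[m]

mean = sum(rating.values()) / len(models)
for m in models:  # report on the familiar Elo scale (400 / ln 10 per nat)
    print(m, round(1000 + (rating[m] - mean) * 400 / math.log(10)))
```

The decomposition the thread describes goes further, splitting this single scalar into functional, structural, and stylistic components.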
Scale AI @scale_AI
🎙️ In our latest Chain of Thought episode we unpack ResearchRubrics, our benchmark for evaluating deep research agent performance. We explore what meaningful agent evaluation looks like, where today’s agents still fall short, and why clearer evaluation frameworks are critical as agent use accelerates.
4 replies · 5 reposts · 17 likes · 2.3K views
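A minimal sketch of rubric-based grading, assuming weighted criteria checked against a research report; the criteria and the keyword checkers below stand in for LLM judges and are not the actual ResearchRubrics harness.

```python
# Illustrative rubric grading: score a report as the weighted fraction
# of criteria it satisfies. Checkers are toy stand-ins for LLM judges.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM judge

rubric = [
    Criterion("cites at least two sources", 2.0, lambda r: r.count("http") >= 2),
    Criterion("states its main conclusion", 1.0, lambda r: "conclusion" in r.lower()),
    Criterion("notes open questions", 1.0, lambda r: "open question" in r.lower()),
]

def score(report: str) -> float:
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(report))
    return earned / total

report = ("Conclusion: X holds. Sources: http://a.example http://b.example. "
          "One open question remains.")
print(score(report))  # 1.0
```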