Scale AI

2.2K posts

@scale_AI

making AI work

Joined July 2016
482 Following · 73.7K Followers
Scale AI @scale_AI
Hello from our new NYC office! 🗽 We’re officially settled in at One World Trade and excited to keep growing our global team. Want to be part of shaping the future of AI? Join us: scale.com/careers
6 replies · 1 repost · 85 likes · 7.1K views
Scale AI retweeted
Scale Labs @ScaleAILabs
Versioning, Rewards, and Observations (VeRO) is a new @scale_AI framework for studying a simple question: can coding agents improve other AI agents by editing their prompts, tools, and workflows? Instead of treating this as prompt engineering, VeRO frames agent optimization as a coding agent problem.
3 replies · 10 reposts · 44 likes · 4.2K views
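For readers who want the shape of the idea, here is a minimal sketch of an edit-evaluate loop in the spirit of what the tweet describes; the `AgentConfig` fields, `propose_edit`, and `evaluate` below are illustrative assumptions, not VeRO's actual interfaces.

```python
# Hypothetical sketch of treating agent optimization as a coding-agent
# problem: a coding agent proposes edits to a target agent's configuration
# (prompt, tools, workflow) and an edit is kept only if the target's
# benchmark score improves. All names and logic are illustrative.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentConfig:
    system_prompt: str
    tools: tuple[str, ...]

def evaluate(config: AgentConfig) -> float:
    """Toy stand-in for running the target agent on a held-out task suite."""
    score = 0.0
    if "cite your sources" in config.system_prompt:
        score += 0.5
    score += 0.1 * len(config.tools)
    return score

def propose_edit(config: AgentConfig, step: int) -> AgentConfig:
    """Toy stand-in for the coding agent; in a real system this would be
    an LLM reading traces and rewriting prompts/tools/workflows."""
    if step % 2 == 0:
        return replace(config, system_prompt=config.system_prompt + " Always cite your sources.")
    return replace(config, tools=config.tools + (f"tool_{step}",))

config = AgentConfig(system_prompt="You answer questions.", tools=("search",))
best = evaluate(config)
for step in range(4):
    candidate = propose_edit(config, step)
    score = evaluate(candidate)
    if score > best:  # hill-climb: keep only edits that improve the score
        config, best = candidate, score
print(config, best)
```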
Scale AI @scale_AI
Introducing SWE-Atlas. We built SWE-Atlas as the next evolution of SWE-Bench Pro, expanding agent evaluation beyond change accuracy to better reflect the real, interactive workflows that define software development.

Results for Codebase QnA, the first eval under SWE-Atlas, are now available. It measures how agents understand complex codebases through runtime analysis and multi-file reasoning. Top models score only ~30%. scale.com/blog/swe-atlas
18 replies · 56 reposts · 528 likes · 54.7K views
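A rough sketch of what a codebase-QA style check could look like, under stated assumptions: the task schema and the key-fact grading rule below are made up for illustration and are not the SWE-Atlas implementation (which involves runtime analysis, not string matching).

```python
# Illustrative harness for a codebase-QA eval: ask an agent a question
# that requires reading several files, then grade the free-text answer
# against reference key facts. Schema and grading rule are assumptions.
from dataclasses import dataclass

@dataclass
class QATask:
    question: str
    relevant_files: list[str]  # answering correctly requires all of these
    key_facts: list[str]       # facts a correct answer must mention

def grade(answer: str, task: QATask) -> float:
    """Fraction of reference key facts present in the answer."""
    answer_lower = answer.lower()
    hits = sum(fact.lower() in answer_lower for fact in task.key_facts)
    return hits / len(task.key_facts)

task = QATask(
    question="Which function invalidates the cache after a write?",
    relevant_files=["store/cache.py", "store/writer.py"],
    key_facts=["invalidate_entry", "Writer.flush"],
)
agent_answer = "Writer.flush calls invalidate_entry on the cache."
print(grade(agent_answer, task))  # 1.0
```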
Scale AI retweeted
Jason Droege @jdroege
AI agents are getting put to work on real tasks. Training them to do that well is harder than it looks.

For the past year at @scale_AI, we've been building environments where agents can practice real workflows - simulated worlds that mirror real software, real processes, and real decisions. They can fail, try again, and get better before they touch production. This process is how these agents eventually become reliable for the world's most important decisions.

Today we're officially launching Scale RL Environments. Nearly half of our new training projects already run through them. We're building more this year.

See what your agents can learn: scale.com/blog/rl-enviro…
3 replies · 6 reposts · 31 likes · 4.4K views
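As a sketch of the general pattern, a simulated-workflow environment can expose the familiar reset/step interface; the toy task, class name, and reward rule below are assumptions, whereas real Scale RL Environments would wrap actual software and processes.

```python
# Minimal gym-style interface for a simulated workflow environment.
# Everything here is illustrative: the agent observes a state, acts,
# receives a reward, and can reset to try again before touching production.
class TicketTriageEnv:
    """Toy environment: the agent must route a ticket to the right queue."""
    QUEUES = ["billing", "bug", "account"]

    def reset(self) -> str:
        self.ticket = "I was charged twice this month."
        self.done = False
        return self.ticket  # observation

    def step(self, action: str) -> tuple[str, float, bool]:
        assert not self.done
        self.done = True
        reward = 1.0 if action == "billing" else 0.0
        return "ticket routed", reward, self.done

env = TicketTriageEnv()
obs = env.reset()
_, reward, done = env.step("billing")  # fail, reset, and retry at will
print(reward, done)  # 1.0 True
```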
Scale AI retweeted
Bing Liu @vbingliu
Congrats to the @OpenAI team for topping the Audio MultiChallenge speech-to-speech (S2S) leaderboard with the latest gpt-realtime-1.5 release.

Audio MultiChallenge (Audio MC) is a benchmark released by @scale_AI to evaluate how end-to-end speech models handle real-world, multi-turn human conversations. The benchmark consists of 452 multi-turn dialogs from 47 speakers, covering four core capabilities:

1. Voice Editing
2. Instruction Retention
3. Inference Memory
4. Self-Coherence

Key observations on gpt-realtime-1.5:

→ S2S performance surpasses its S2T configuration, bridging the "modality gap" we observe in most existing models
→ Strong gains in Voice Editing, particularly in handling user hesitations and mid-utterance speech repairs
→ ~15% improvement in Instruction Retention / Instruction Following over the previous gpt-realtime model
→ Continued challenges in audio-cue inference and long-horizon memory

We hope Audio MultiChallenge continues to serve as a testbed for the AI research community to measure progress in natural, multi-turn spoken dialog systems.

📄 Paper: arxiv.org/pdf/2512.14865
📂 Dataset: huggingface.co/datasets/Scale…
📊 Speech-to-Speech (S2S) Leaderboard: scale.com/leaderboard/au…
📊 Speech-to-Text (S2T) Leaderboard: scale.com/leaderboard/au…
🎧 Hear our researchers discuss this benchmark and the frontier of audio model evals: youtube.com/watch?v=9O2_Ff…
Peter Bakkum @pbbakkum

gpt-realtime-1.5 is the best native audio model on the Scale AudioMultiChallenge benchmark -- this is a significant jump in capability by this measure. There are models that outperform it but they are reasoning models without native audio output.

2 replies · 6 reposts · 50 likes · 14.3K views
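To make the benchmark structure concrete, here is a hypothetical aggregation that rolls per-dialog pass/fail results up into the four capability scores listed above; the record format is an assumption for illustration, not the released dataset schema.

```python
# Hypothetical aggregation for a multi-turn audio benchmark: per-dialog
# pass/fail results grouped into per-capability scores. Illustrative only.
from collections import defaultdict

results = [  # (capability, passed) per evaluated dialog
    ("Voice Editing", True), ("Voice Editing", False),
    ("Instruction Retention", True), ("Inference Memory", False),
    ("Self-Coherence", True), ("Self-Coherence", True),
]

by_capability: dict[str, list[bool]] = defaultdict(list)
for capability, passed in results:
    by_capability[capability].append(passed)

for capability, outcomes in by_capability.items():
    print(f"{capability}: {sum(outcomes) / len(outcomes):.0%} "
          f"({len(outcomes)} dialogs)")
```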
Scale AI @scale_AI
Honored to see SWE-Bench Pro recognized as the new standard for frontier coding evals. 💜 We built it to address saturation and contamination in earlier benchmarks — raising the bar with a more rigorous measure of agents' problem-solving capabilities and a clearer view of real-world software engineering progress.
OpenAI Developers @OpenAIDevs

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. openai.com/index/why-we-n…

6 replies · 9 reposts · 42 likes · 5.5K views
Scale AI @scale_AI
Lots of new models have shipped recently. 👀 When the conversation turns from “what launched” to “how they perform,” the reference points are our leaderboards. Here’s a recap of the latest models evaluated ⬇️
5 replies · 4 reposts · 19 likes · 4.1K views
Scale AI retweeted
Sam Denton @samueldenton
2026 is the year agents learn when to ask for help.

Our research from @scale_AI introduces Long Horizon Augmented Workflows (LHAW), a synthetic data generation pipeline for creating underspecification on ANY dataset (yes, you can try this at home!) and evaluating how agents act 🧵
4 replies · 11 reposts · 77 likes · 8.1K views
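The core trick, generating underspecified variants of otherwise complete tasks, can be sketched in a few lines; the field names and the masking rule here are assumptions for illustration, not the LHAW pipeline itself.

```python
# Sketch of underspecification: take a fully specified task and drop
# required details, so a well-behaved agent must notice the gap and ask
# for help instead of guessing. Field names and rule are illustrative.
import random

def underspecify(task: dict, n_drop: int = 1, seed: int = 0) -> dict:
    """Return a copy of `task` with n_drop required fields removed."""
    rng = random.Random(seed)
    required = [k for k in task if k != "instruction"]
    dropped = rng.sample(required, k=min(n_drop, len(required)))
    return {k: v for k, v in task.items() if k not in dropped}

task = {
    "instruction": "Book a flight",
    "destination": "SFO",
    "date": "2026-03-14",
    "budget_usd": 450,
}
print(underspecify(task))
# The eval then checks whether the agent asks for the missing field.
```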
Scale AI retweeted
Calvin Zhang @calvincbzhang
New paper from @scale_AI & @MeridianAgent: SpreadsheetArena 📄

We evaluated 16 LLMs on end-to-end spreadsheet generation via 4,300+ blind pairwise votes. Crucially, we move beyond scalar Elo ratings to decompose the latent preference signal into functional, structural, and stylistic components. 🧵
Spreadsheet Arena @sheetarena

Spreadsheets have entered the arena! ⚔️ Announcing Spreadsheet Arena, the first research platform for human preference rankings on LLM-generated spreadsheets. The results? @AnthropicAI Claude Opus is on top, but the gap is tighter than you’d think. w/ @LTIatCMU, @Cornell, and @scale_ai. 🧵

2 replies · 4 reposts · 32 likes · 7.5K views
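For context on how arena-style ratings come out of blind pairwise votes, here is a small worked example fitting a Bradley-Terry model (the standard basis for Elo-style leaderboards) by gradient ascent; the vote data is made up, and this is not the SpreadsheetArena codebase.

```python
# Worked example: turn blind pairwise votes into ratings with the
# Bradley-Terry model, where P(i beats j) = sigmoid(r_i - r_j).
import math

votes = [  # (winner, loser) from blind pairwise comparisons (made up)
    ("model_a", "model_b"), ("model_a", "model_c"),
    ("model_b", "model_c"), ("model_a", "model_b"),
    ("model_c", "model_b"), ("model_b", "model_a"),
]
models = sorted({m for pair in votes for m in pair})
rating = {m: 0.0 for m in models}  # log-strengths, start equal

lr = 0.1
for _ in range(500):  # gradient ascent on the Bradley-Terry likelihood
    grad = {m: 0.0 for m in models}
    for winner, loser in votes:
        p_win = 1 / (1 + math.exp(rating[loser] - rating[winner]))
        grad[winner] += 1 - p_win
        grad[loser] -= 1 - p_win
    for m in models:
        rating[m] += lr * grad[m]

mean = sum(rating.values()) / len(models)
for m in models:  # report on the familiar Elo scale (400 / ln 10 per nat)
    print(m, round(1000 + (rating[m] - mean) * 400 / math.log(10)))
```

The decomposition the thread describes goes further, splitting this single scalar into functional, structural, and stylistic components.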
Scale AI @scale_AI
🎙️ In our latest Chain of Thought episode we unpack ResearchRubrics, our benchmark for evaluating deep research agent performance. We explore what meaningful agent evaluation looks like, where today’s agents still fall short, and why clearer evaluation frameworks are critical as agent use accelerates.
4 replies · 5 reposts · 17 likes · 2.3K views
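A minimal sketch of rubric-based grading, assuming weighted criteria checked against a research report; the criteria and the keyword checkers below stand in for LLM judges and are not the actual ResearchRubrics harness.

```python
# Illustrative rubric grading: score a report as the weighted fraction
# of criteria it satisfies. Checkers are toy stand-ins for LLM judges.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM judge

rubric = [
    Criterion("cites at least two sources", 2.0, lambda r: r.count("http") >= 2),
    Criterion("states its main conclusion", 1.0, lambda r: "conclusion" in r.lower()),
    Criterion("notes open questions", 1.0, lambda r: "open question" in r.lower()),
]

def score(report: str) -> float:
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(report))
    return earned / total

report = ("Conclusion: X holds. Sources: http://a.example http://b.example. "
          "One open question remains.")
print(score(report))  # 1.0
```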