Scale Labs
@ScaleAILabs

102 posts

welcome to the lab. from the researchers at @scale_AI

Joined October 2025
107 Following · 1.6K Followers
Scale Labs reposted
Scale AI @scale_AI
This month we turn 10. The hard work started in 2016, and it hasn’t stopped. Shortcuts are for losers. Winners welcome. scale.com/careers
7 replies · 23 reposts · 95 likes · 31.4K views
Scale Labs reposted
Scale AI @scale_AI
🚨 JUST IN: Scale AI milestone incoming. Stay tuned.
3 replies · 3 reposts · 43 likes · 4.9K views
Scale Labs @ScaleAILabs
Cool to see that two of the three SWE leaderboards in the new @ArtificialAnlys Coding Agent Index are ours: SWE Atlas-Codebase QnA and SWE-Bench Pro. We’re still in the early days of evaluating coding agents, and there’s a lot more frontier work ahead. Excited to keep pushing this space forward.
Artificial Analysis @ArtificialAnlys

Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform across 3 leading benchmarks, along with token usage, cost, and more.

When developers use AI to code, they’re choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance. The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:

➤ SWE-Bench-Pro-Hard-AA: 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2: 84 agentic terminal tasks from the Laude Institute, ranging from system administration and cryptography to machine learning. 5 tasks were filtered out due to environment incompatibility
➤ SWE-Atlas-QnA: 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers

Analysis of results:

➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open-weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open-weights result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: it scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot in Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task and GLM-5.1 in Claude Code $2.26/task. For both models, high token usage drove cost up; in GPT-5.5’s case, a relatively higher per-token price did as well.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost, and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: cache hit rates range from 80% to 96% across combinations. Provider routing, harness prompt structure, and cache behavior can materially change the economics of running the same model, given that cached inputs are typically priced at <50% of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. This reflects differences in average turns per task, token usage, and API serving speed: Opus 4.7 needed materially fewer turns per task than any other model, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open-weights results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showing substantial post-training gains.
This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.

1 reply · 1 repost · 16 likes · 1.9K views
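For context on the cost figures in the quoted thread, here is a minimal sketch of how per-task API cost can be estimated from token usage, cache hit rate, and per-token pricing. All the numbers and the 50%-of-input cache price are illustrative assumptions, not Artificial Analysis's actual methodology or any provider's real price list.

```python
# Rough per-task cost estimate from token usage and pricing.
# Every number below is a made-up illustration, not a measured value.

def cost_per_task(
    input_tokens: float,          # total input tokens consumed per task
    output_tokens: float,         # total output tokens generated per task
    cache_hit_rate: float,        # fraction of input tokens served from cache
    input_price_per_m: float,     # $ per 1M regular (uncached) input tokens
    output_price_per_m: float,    # $ per 1M output tokens
    cache_discount: float = 0.5,  # cached input price as a fraction of regular
) -> float:
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return (
        fresh * input_price_per_m
        + cached * input_price_per_m * cache_discount
        + output_tokens * output_price_per_m
    ) / 1e6

# Hypothetical run: 2.8M input tokens/task at an 80% cache hit rate.
print(f"${cost_per_task(2.8e6, 40_000, 0.80, 1.25, 10.0):.2f} per task")  # -> $2.50 per task
```

This makes the thread's cache point concrete: at high cache hit rates the effective input price falls well below the headline rate, so harness prompt structure and provider caching behavior materially change per-task economics even for the same model.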
Scale Labs @ScaleAILabs
Congrats to @thinkymachines on releasing TML-Interaction-Small and tying for the top spot on our Audio MC S2S leaderboard! 🥇 Their interaction model scores 43.4% APR, demonstrating impressive intelligence and long-context awareness compared to existing full-duplex models, without losing responsiveness in conversation.
Thinking Machines @thinkymachines

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…

6 replies · 29 reposts · 221 likes · 28.5K views
Scale Labs @ScaleAILabs
Today we’re releasing Refactoring, the final leaderboard of our SWE Atlas suite. This new leaderboard is the ultimate test of an agent's ability to restructure code without breaking the system. Claude Opus 4.7 with Claude Code takes the top spot 🥇
39 replies · 52 reposts · 675 likes · 103.8K views
Scale Labs @ScaleAILabs
We’ve been sharing a lot lately on where coding agents are headed — now we want to hear from the people building them. If you’re in San Francisco working on coding agents, come hang with us next Wednesday, May 13 at our SF HQ for food, drinks, and convos around all things agentic code. 🤝
3 replies · 4 reposts · 28 likes · 2.7K views
Scale Labs @ScaleAILabs
Congrats to @OpenAI for taking the top spot on our Audio MultiChallenge S2S leaderboard with the release of GPT-Realtime-2 🥇 GPT-Realtime-2 nearly doubles GPT-Realtime-1.5 on instruction retention, rising from 36.7% to 70.8% APR, and also stands out on voice editing, especially when users repair or revise what they are saying in real time – crucial for voice agent use cases. Excited to see the pace of progress as voice AI accelerates.
27 replies · 57 reposts · 617 likes · 73.3K views
Scale Labs @ScaleAILabs
One clear takeaway: even top agents can write functional refactors, but they often fail under rigorous professional evaluation – leaving behind dead code, failing to clean up artifacts, missing crucial call sites, or breaking on obscure edge cases. This highlights new avenues for research to build strong open models for the ML community. Full results: labs.scale.com/leaderboard/sw…
0 replies · 0 reposts · 12 likes · 2.9K views
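As a concrete, entirely hypothetical illustration of those failure modes, consider a rename refactor (parse_cfg -> load_config) that an agent half-finishes. The snippet imports cleanly and the happy path works, yet it exhibits two of the problems named above:

```python
# Hypothetical post-refactor state, not from any real SWE Atlas task.
import json

def load_config(path: str) -> dict:
    """New name after the refactor."""
    with open(path) as f:
        return json.load(f)

def parse_cfg(path: str) -> dict:
    """Dead code: the old helper was never deleted after the rename."""
    with open(path) as f:
        return json.load(f)

def start_server() -> None:
    cfg = load_config("server.json")    # call site correctly updated
    print("server config keys:", sorted(cfg))

def run_migration() -> None:
    cfg = parse_cfg("migrate.json")     # missed call site: still on the old name
    print("migration config keys:", sorted(cfg))
```

A rigorous evaluation flags this even though nothing crashes: the old helper should be gone and every call site moved to the new name, which is exactly the kind of cleanup the leaderboard is probing.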
Scale Labs @ScaleAILabs
Refactoring is hard, even for frontier agents. SWE Atlas refactors are 2× the size of SWE-Bench Pro and 30× the size of SWE-Bench Verified, measured by lines changed.
1 reply · 1 repost · 19 likes · 3.3K views
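For readers who want to sanity-check "size by lines changed" on their own patches, here is a minimal sketch that counts added and removed lines in a unified diff. It is a generic approximation; the thread doesn't specify Scale's exact counting methodology.

```python
# Count lines changed (added + removed) in a unified diff string.
# Generic approximation of "size by lines changed"; assumes unified diff format.

def lines_changed(diff_text: str) -> int:
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue                       # skip file headers
        if line.startswith(("+", "-")):
            changed += 1                   # real added/removed hunk lines
    return changed

example_diff = """\
--- a/app.py
+++ b/app.py
@@ -1,4 +1,4 @@
-def parse_cfg(path):
+def load_config(path):
     with open(path) as f:
-        return json.load(f)
+        return json.loads(f.read())
"""
print(lines_changed(example_diff))  # -> 4
```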
Scale Labs @ScaleAILabs
Coding agents won't stop at writing code. They’re evolving into systems that can navigate the full software engineering workflow end-to-end. And this goes far beyond SWE: coding agents are becoming the interface layer for how AI systems build tools, shape workflows, and interact with their environments, shaping how agents “think” and operate.

At the 2026 Coding Agents Conference, @yannis__he, who leads product for coding agent research and data, calls out the parts most miss:
- Where coding agents are actually going
- Why SWE-Bench-style evals are only the beginning
- Why coding will expand into everyone's day-to-day, even for non-developers

Watch and weigh in!
5 replies · 1 repost · 25 likes · 2.1K views
Scale Labs reposted
jade @jadechoghari
LLMs had the internet. Robotics doesn’t, which is why we’re bringing the @scale_AI data engine into the physical world. Robotics won’t be solved by blindly collecting 100x more demos. Raw trajectories need structure (intent, subgoals, failures, recoveries, edge cases, quality signals) or you just average mistakes. At @ScaleAILabs, we’re building that data flywheel, mapping the right data to the right capabilities and scaling what works globally!
5 replies · 22 reposts · 180 likes · 15.7K views
Scale Labs @ScaleAILabs
@Kimi_Moonshot Kimi K2.6 asks well, but doesn't ask much. When it asks, the question targets something that really does need clarification. It just doesn't ask often enough to resolve all the gaps. Asking right and asking enough are still separate skills. Consistency matters as much as capability.
3 replies · 1 repost · 14 likes · 2.5K views
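One way to make "asks well vs. asks enough" concrete is to score clarifying questions like a retrieval problem: precision over the questions asked, recall over the spec gaps that needed asking. This framing is our own illustration, not HiL-Bench's published metric.

```python
# Illustrative precision/recall framing for clarifying questions.
# Our own sketch; not HiL-Bench's actual scoring scheme.

def question_scores(asked: set[str], needed: set[str]) -> tuple[float, float]:
    """asked: gaps the agent asked about; needed: gaps actually in the spec."""
    on_target = asked & needed
    precision = len(on_target) / len(asked) if asked else 0.0
    recall = len(on_target) / len(needed) if needed else 1.0
    return precision, recall

# Hypothetical run: the spec hides 4 details; the agent asks about 1, correctly.
p, r = question_scores(
    asked={"timeout"},
    needed={"timeout", "retries", "auth", "locale"},
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=1.00, recall=0.25
```

Under this framing, the behavior described above looks like high precision with low recall: the questions that do get asked hit real gaps, but most gaps never get asked about.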
Scale Labs @ScaleAILabs
We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We just added GPT-5.5, Opus 4.7, and Kimi K2.6 to the leaderboard. Here’s what we’re seeing ⬇️🧵
31 replies · 68 reposts · 658 likes · 78.7K views
Scale Labs @ScaleAILabs
Great connecting with everyone at #ICLR2026 — appreciate all the conversations and time spent together!
1 reply · 1 repost · 25 likes · 1.4K views