Pinned Tweet
Scale AI
2.3K posts

Scale AI retweeted

The humans stay. That’s the idea behind @scale_ai's new brand campaign.
10 years of building AI has taught us something: the most important decisions belong to humans. The AI that works in decisions of consequence keeps humans at the center.
Going live in SF and NYC. Where to next? 👀



This month we turn 10.
The hard work started in 2016, and it hasn’t stopped.
Shortcuts are for losers. Winners welcome.
scale.com/careers
Scale AI retweeted

Proud to share @CDAODoW has expanded its enterprise agreement with Scale AI raising the ceiling from $100M to $500M.
This expansion reflects our continued commitment to accelerating the adoption of AI capabilities across the Pentagon to help America stay prepared, resilient, and strong.
scale.com/blog/Scale-ai-…
Scale AI retweeted

AI pretenders vs. AI contenders. It's those who still haven’t realized reliability is the product vs. those who can deliver reliability and outcomes. That's what the enterprise AI race comes down to. Here's a note I sent the Scale team this week.
Jason Droege @jdroege
Scale AI retweeted

We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask?
Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers.
We just added GPT-5.5, Opus 4.7, and Kimi K2.6 to the leaderboard.
Here’s what we’re seeing ⬇️🧵


Scale AI has acquired ICG Solutions, a defense technology firm specializing in real-time streaming data analytics.
This is another step forward in how we support the U.S. defense and intelligence community with AI systems built to serve America’s most important national security missions. scale.com/blog/scale-acq…

Paper: static.scale.com/uploads/67a153…
Data: huggingface.co/datasets/Scale…
Leaderboard: labs.scale.com/leaderboard/hil
Code & Harness: github.com/hilbenchauthor…

Key takeaway for model builders: capability and judgment are orthogonal axes.
Scaling SWE-Bench alone won't close this. Current post-training doesn’t penalize an agent for confidently solving the wrong problem. Ask-F1 is the first verifiable signal that does, and it transfers across domains.
The goal isn't full autonomy. It's selective escalation: agents that know what they don't know.

New @ScaleAILabs Research: Your AI agent just gave you an answer, but did it actually solve the problem, get lucky, or just sound right?
Today’s benchmarks can’t tell.
We built HiL-Bench (Human-in-Loop Benchmark) to test a critical skill: does your agent know what it’s missing and when to ask for clarification? 🧵

Scale AI retweeted

Breaking: @AIatMeta just released Muse Spark — now live across @ScaleAILabs leaderboards.
Here’s how it stacks up:
Tied for 🥇on SWE-Bench Pro
Tied for 🥇on HLE
Tied for 🥇on MCP Atlas
Tied for 🥇on PR Bench - Legal
Tied for 🥈on SWE Atlas Test Writing
🥈on PR Bench - Finance
🥉on SWE Atlas QnA

