Scale Labs

17 posts


@ScaleAILabs

welcome to the lab. from the researchers at @scale_AI

Joined October 2025
102 Following · 244 Followers
Pinned Tweet
Scale Labs @ScaleAILabs
Welcome to the home of all things @scale_AI research — focused on data, evaluation, safety, and post-training that moves frontier models forward. We’ll share benchmarks, insights, and work intended to be useful to the broader research community. labs.scale.com/?utm_source=hu…
Scale Labs retweeted
Salesforce AI Research @SFResearch
Reference-guided LLM judges can meaningfully close the gap between RLVR and RLHF in non-verifiable domains. 🧠 Paper: arxiv.org/abs/2602.16802

The core problem: Reinforcement Learning with Verifiable Rewards (RLVR) works well for math and code, where answers can be checked ✅. But for general alignment, where there's no ground-truth verifier, we still rely on reward models or LLM judges that evaluate without any reference point. Can high-quality reference outputs fill that gap? 🔍

The answer is yes. The team introduces RefEval, a reference-guided prompting strategy that explicitly grounds LLM judge decisions in a strong reference output. 📎 Across 11 open-source LLM judges and 5 datasets, RefEval achieves 79.1% average accuracy, outperforming both reference-free baselines and prior reference-based methods. Smaller models benefit most: Llama-3-8B gains +17.4 points over the vanilla baseline. 📈

Those improved judges then power a self-improvement loop: LLMs use their own reference-guided judgments to generate DPO training pairs, with no external human or AI feedback required. 🔄

Results: Llama-3-8B-Instruct hits 73.1% on AlpacaEval and 58.7% on Arena-Hard. Qwen2.5-7B reaches 70.0% and 74.1%. Average gains of +20pt over SFT distillation and +5pt over reference-free self-improvement, comparable to training with a dedicated fine-tuned reward model. 💡

Authors: Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty @JotyShafiq, and Arman Cohan, at @Yale, @Meta, and @scale_AI. #EnterpriseAI #FutureOfAI
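The mechanism is simple enough to sketch: show the judge a strong reference output alongside the two candidates, then convert its verdict into a preference pair for DPO. Below is a minimal, hypothetical Python sketch of that loop; the prompt wording, `judge` interface, and pair format are assumptions for illustration, not RefEval's actual implementation (see arxiv.org/abs/2602.16802).

```python
# Minimal sketch of reference-guided judging as described above.
# The prompt text and `judge` callable are illustrative assumptions,
# not RefEval's actual code.
from typing import Callable

def reference_guided_verdict(judge: Callable[[str], str],
                             instruction: str,
                             answer_a: str,
                             answer_b: str,
                             reference: str) -> str:
    """Compare two answers, grounding the judge in a strong reference
    output instead of asking it to evaluate from scratch."""
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"High-quality reference answer:\n{reference}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Using the reference as a guide, reply with exactly 'A' or 'B' "
        "for the better answer."
    )
    return judge(prompt).strip()

def make_dpo_pair(judge: Callable[[str], str],
                  instruction: str,
                  answer_a: str,
                  answer_b: str,
                  reference: str) -> dict:
    """Turn one reference-guided judgment into a (chosen, rejected)
    preference pair for DPO, with no external feedback required."""
    verdict = reference_guided_verdict(judge, instruction,
                                       answer_a, answer_b, reference)
    chosen, rejected = ((answer_a, answer_b) if verdict == "A"
                        else (answer_b, answer_a))
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```

The self-improvement loop then amounts to running `make_dpo_pair` over a model's own sampled outputs and training on the resulting pairs, which is why no external human or AI feedback is needed.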
Scale Labs @ScaleAILabs
Claude Opus 4.6 (@AnthropicAI) and Manus 1.6 (@ManusAI) are now on the Remote Labor Index (RLI) leaderboard, a benchmark we developed with @CAIS that measures how often agents can fully automate multi-step digital work tasks.

🥇 Claude Opus 4.6 (CoWork): 4.17
🥈 Claude Opus 4.5 Thinking: 3.75
🥉 Manus 1.6 (Max): 2.92

Which model should we evaluate next?
Scale Labs @ScaleAILabs
We’re adding new models all the time, so keep an eye on our leaderboard page for full results and detailed rankings across all @scale_AI benchmarks. Check them out here: labs.scale.com/leaderboard
Scale Labs @ScaleAILabs
On EnigmaEval, which evaluates difficult logic and puzzle-solving problems requiring multi-step reasoning, GPT-5.4 tied for first within error bars.

🥇 GPT-5-Pro — 18.75 ± 2.22
🥇 Gemini-3-Pro-Preview — 18.24 ± 2.20
🥇 GPT-5.4 (xHigh reasoning) — 15.96 ± 2.09
Scale Labs @ScaleAILabs
We’ve started evaluating GPT-5.4 (Codex CLI, xHigh reasoning) and it’s already leading on a number of @scale_AI benchmarks. On SWE-Atlas Codebase QnA, our newest benchmark for agentic coding systems working inside real software repositories, it unseated Claude Opus 4.6 for the top spot.

🥇 GPT-5.4 (Codex CLI, xHigh reasoning) — 35.48 ± 8.70
⬇️ Claude Opus 4.6 Thinking — 31.50 ± 8.62
⬇️ GPT-5.2 — 29.03 ± 8.53

Here’s how it’s performing across our benchmarks 👇🧵
Scale Labs @ScaleAILabs
Versioning, Rewards, and Observations (VeRO) is a new @scale_AI framework for studying a simple question: can coding agents improve other AI agents by editing their prompts, tools, and workflows? Instead of treating this as prompt engineering, VeRO frames agent optimization as a coding agent problem.
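One rough way to picture that framing: the target agent's prompt and tools become versioned state, a coding agent proposes edits to it, and observed rewards decide which versions survive. The Python sketch below is purely illustrative; the names, data structures, and greedy loop are assumptions drawn from the framework's title, not VeRO's actual API.

```python
# Hypothetical sketch of the loop suggested by VeRO's framing:
# versioning (keep a lineage of agent configs), rewards (score each
# version), observations (the coding agent sees history when editing).
# All names here are illustrative, not the actual VeRO interface.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentVersion:
    prompt: str
    tools: list[str]
    reward: Optional[float] = None

def optimize_agent(initial: AgentVersion,
                   propose_edit: Callable[[AgentVersion, list[AgentVersion]], AgentVersion],
                   evaluate: Callable[[AgentVersion], float],
                   steps: int = 10) -> AgentVersion:
    """Greedy hill-climb over agent versions: a coding agent proposes
    an edit, we observe its reward, and keep it only if it improves."""
    best = initial
    best.reward = evaluate(best)
    history: list[AgentVersion] = [best]
    for _ in range(steps):
        candidate = propose_edit(best, history)   # coding agent edits prompt/tools
        candidate.reward = evaluate(candidate)    # observation: benchmark reward
        history.append(candidate)                 # versioning: keep full lineage
        if candidate.reward > best.reward:
            best = candidate                      # accept improving edits only
    return best
```

Treating the whole agent configuration as editable, evaluable code is what distinguishes this framing from ad hoc prompt engineering: edits are tracked, scored, and revertible like any other change to a codebase.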