Mercor

216 posts

@mercor_ai

Mercor is at the intersection of labor markets and AI research. We connect human expertise with leading AI labs and enterprises to train frontier models.

San Francisco · Joined April 2021
26 Following · 16.8K Followers
Mercor @mercor_ai
Changing a harness, adjusting a token setting or even just switching model providers can significantly impact evals. That’s why we’re excited to contribute to this community effort. Read more about it here: evalevalai.com/projects/
Replies: 1 · Reposts: 0 · Likes: 2 · Views: 2K
Mercor @mercor_ai
We just submitted APEX-Agents, APEX-1 and ACE to @evaluatingevals on @huggingface, an OSS initiative to standardize evals and try to reduce the noise in benchmarking.
Replies: 5 · Reposts: 5 · Likes: 36 · Views: 12.6K
Mercor @mercor_ai
We evaluated @OpenAI GPT-5.4 mini and nano on APEX-Agents. With xhigh reasoning, mini scores 24.5% Pass@1. It outperforms lightweight models like Gemini 3.1 Flash Lite (12.8%) as well as midweight models like Sonnet 4.6 (23.7% Pass@1), at just ¼ the token cost.
OpenAI @OpenAI

GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…

Replies: 7 · Reposts: 9 · Likes: 174 · Views: 26.3K
Mercor @mercor_ai
Every task in APEX-Agents can be solved by a professional using typical office software, usually taking about 2 hours of work. As agents improve, they’ll solve more of these tasks and unlock greater economic value.
Replies: 2 · Reposts: 0 · Likes: 6 · Views: 1K
Mercor @mercor_ai
We’ve been tracking the number of tasks that have never been solved by any model on the APEX-Agents leaderboard. 346/480 tasks have been solved. 134 remain.
Replies: 6 · Reposts: 2 · Likes: 24 · Views: 3.2K
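A minimal sketch of how such a "never solved by any model" tally could be computed: take the union of the tasks each model has solved and subtract it from the full task set. The function name, task IDs, and model names below are hypothetical; the post does not describe how the leaderboard tracks this.

```python
from typing import Dict, Set


def never_solved(all_tasks: Set[str], solved_by_model: Dict[str, Set[str]]) -> Set[str]:
    """Return the tasks that no model has solved (hypothetical helper)."""
    # Union of every model's solved tasks; an empty dict yields an empty set.
    solved_by_any = set().union(*solved_by_model.values())
    return all_tasks - solved_by_any


# Toy data for illustration only, not real leaderboard results.
all_tasks = {f"task-{i}" for i in range(1, 7)}
solved_by_model = {
    "model-a": {"task-1", "task-2"},
    "model-b": {"task-2", "task-4"},
}

unsolved = never_solved(all_tasks, solved_by_model)
solved_count = len(all_tasks) - len(unsolved)
print(f"{solved_count}/{len(all_tasks)} tasks solved. {len(unsolved)} remain.")
```

With the figures in the post, the same subtraction gives 480 - 346 = 134 tasks remaining.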
Mercor @mercor_ai
Coordination is critical for APEX-Agents as it requires navigating spreadsheets, slide decks, emails, and tools that professionals rely on. But GPT 5.4 is not perfect. It sometimes finds the right answer and then second-guesses itself into the wrong one. It also gets distracted by irrelevant files after locating the correct document, and occasionally overthinks. These are exactly the kinds of errors a strong junior analyst would make.
Replies: 1 · Reposts: 0 · Likes: 4 · Views: 1.4K
Mercor @mercor_ai
We’ve been testing @OpenAI GPT 5.4 on APEX-Agents, our benchmark for agentic work in professional services. GPT 5.4 now tops the leaderboard:
Pass@1: 35.9% (+1.6pp)
Mean: 52.5% (+4.3pp)
Replies: 20 · Reposts: 14 · Likes: 113 · Views: 12.8K
Mercor reposted
Brendan (can/do) @BrendanFoody
GPT 5.4 is the best model we’ve ever tested on APEX-Agents. It’s also the first model to pass 50% mean score. A year ago, frontier models couldn’t even edit an Excel sheet and scored less than 5%. Now, in less than 3 months GPT 5.4 has improved by 15.7%. ChatGPT will imminently be better than the best consulting firm, better than the best investment bank, and better than the best law firm. Congrats @OpenAI on the release!
Replies: 65 · Reposts: 87 · Likes: 766 · Views: 90.6K
Mercor @mercor_ai
We are sponsoring & judging the OpenEnv Hackathon with @Meta, @PyTorch, and @cerebral_valley in SF on March 7–8. Teams will build RL environments and post-train base models to improve performance across select benchmarks. Our track is focused on training agents through structured task environments and improving performance on APEX-Agents. 🧵 for the link to register.
Replies: 2 · Reposts: 1 · Likes: 22 · Views: 3.6K
Mercor @mercor_ai
The International Math Olympiad is the most prestigious competition in mathematics. Getting there once is rare. Ilja, David, and Melek have been there 9 times combined. Now, they’re working with us to create, solve, and review Olympiad-level problems. It’s work that challenges and values their expertise: "Mercor gives you the opportunity to convert your knowledge into the most powerful technology of our century."
Replies: 1 · Reposts: 4 · Likes: 47 · Views: 6.1K