Mercor

216 posts

@mercor_ai

Mercor is at the intersection of labor markets and AI research. We connect human expertise with leading AI labs and enterprises to train frontier models.

San Francisco · Joined April 2021

26 Following · 16.8K Followers

Mercor@mercor_ai·2d

APEX-Agents leaderboard: mercor.com/apex/apex-agen… Download the APEX-Agents dataset: huggingface.co/datasets/merco… Try Archipelago, our open-source infra and eval service: github.com/Mercor-Intelli… Learn more in the benchmark technical report: arxiv.org/abs/2601.14242

1.2K

Mercor@mercor_ai·2d

Changing a harness, adjusting a token setting or even just switching model providers can significantly impact evals. That’s why we’re excited to contribute to this community effort. Read more about it here: evalevalai.com/projects/

Mercor@mercor_ai·2d

We just submitted APEX-Agents, APEX-1 and ACE to @evaluatingevals on @huggingface, an OSS initiative to standardize evals and try to reduce the noise in benchmarking.

12.6K

Mercor@mercor_ai·4d

We evalled @OpenAI GPT-5.4 mini and nano on APEX-Agents. With xhigh reasoning, mini scores 24.5% Pass@1. It outperforms other lightweight models like Gemini 3.1 Flash Lite (12.8%) as well as midweight models like Sonnet 4.6 (23.7% Pass@1), at just ¼ of the token cost.

OpenAI@OpenAI

GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…

174

26.3K

Mercor@mercor_ai·12 Mar

APEX-Agents leaderboard: mercor.com/apex/apex-agen… Download the APEX-Agents dataset for yourself: huggingface.co/datasets/merco… Try out Archipelago, our open-source infra and eval service: github.com/Mercor-Intelli… Learn more in the benchmark technical report: arxiv.org/abs/2601.14242

864

Mercor@mercor_ai·12 Mar

Every task in APEX-Agents can be solved by a professional using typical office software, usually taking about 2 hours of work. As agents improve, they’ll solve more of these tasks and unlock greater economic value.

Mercor@mercor_ai·12 Mar

We’ve been tracking the number of tasks that have never been solved by any model on the APEX-Agents leaderboard. 346/480 tasks have been solved. 134 remain.
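A minimal sketch of how such a tally could be computed, assuming each model's results are available as a set of solved task IDs. The function name, the data layout, and the toy model and task names are hypothetical and are not taken from the APEX-Agents dataset or Archipelago. Only the 346/480 figures come from the post.

```python
def unsolved_tasks(all_tasks, solved_by_model):
    """Return the tasks that no model has solved.

    all_tasks: iterable of task IDs (hypothetical layout).
    solved_by_model: dict mapping model name -> set of solved task IDs.
    """
    # A task counts as solved if at least one model solved it.
    solved_by_any = set().union(*solved_by_model.values())
    return set(all_tasks) - solved_by_any


# Toy data for illustration only, not real leaderboard results.
toy_tasks = ["t1", "t2", "t3", "t4"]
toy_results = {
    "model_a": {"t1", "t2"},
    "model_b": {"t2", "t3"},
}
toy_unsolved = unsolved_tasks(toy_tasks, toy_results)

# The figures quoted in the post: 346 of 480 tasks solved.
total, solved = 480, 346
remaining = total - solved
solved_share = round(100 * solved / total, 1)
```

With the figures from the post, `remaining` is 134 and `solved_share` is 72.1, so roughly 28% of tasks have not yet been solved by any model.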

3.2K

Mercor@mercor_ai·5 Mar

1.1K

Mercor@mercor_ai·5 Mar

Coordination is critical for APEX-Agents as it requires navigating spreadsheets, slide decks, emails, and tools that professionals rely on. But GPT 5.4 is not perfect. It sometimes finds the right answer and then second-guesses itself into the wrong one. It also gets distracted by irrelevant files after locating the correct document, and occasionally overthinks. These are exactly the kinds of errors a strong junior analyst would make.

1.4K

Mercor@mercor_ai·5 Mar

We’ve been testing @OpenAI GPT 5.4 on APEX-Agents, our benchmark for agentic work in professional services. GPT 5.4 now tops the leaderboard: Pass@1: 35.9% (+1.6pp) Mean: 52.5% (+4.3pp)

113

12.8K

Mercor reposted

Brendan (can/do)@BrendanFoody·5 Mar

GPT 5.4 is the best model we’ve ever tested on APEX-Agents. It’s also the first model to pass 50% mean score. A year ago, frontier models couldn’t even edit an Excel sheet and scored less than 5%. Now, in less than 3 months GPT 5.4 has improved by 15.7%. ChatGPT will imminently be better than the best consulting firm, better than the best investment bank, and better than the best law firm. Congrats @OpenAI on the release!

766

90.6K

Mercor reposted

Brendan (can/do)@BrendanFoody·4 Mar

x.com/i/article/2028…

157

35.9K

Mercor@mercor_ai·2 Mar

@Meta @PyTorch @cerebral_valley Register to join us at @SHACK15sf: cerebralvalley.ai/e/openenv-hack…

979

Mercor@mercor_ai·2 Mar

We are sponsoring & judging the OpenEnv Hackathon with @Meta, @PyTorch, and @cerebral_valley in SF on March 7–8. Teams will build RL environments and post-train base models to improve performance across select benchmarks. Our track is focused on training agents through structured task environments and improving performance on APEX-Agents. 🧵 for the link to register.

3.6K

Mercor@mercor_ai·27 Feb

We’re looking for more IMO competitors to join us: mercor.com/jobs/list_AAAB…

Mercor@mercor_ai·27 Feb

The International Math Olympiad is the most prestigious competition in mathematics. Getting there once is rare. Ilja, David, and Melek have been there 9 times, combined. Now, they’re working with us to create, solve, and review Olympiad level problems. It’s work that challenges and values their expertise: "Mercor gives you the opportunity to convert your knowledge into the most powerful technology of our century."

6.1K
