Bertie Vidgen

922 posts

@bertievidgen

Data @ Contextual AI

London · Joined January 2010

508 Following · 851 Followers

Bertie Vidgen reposted

Mercor@mercor_ai·2d

Does training on APEX-Agents dev set generalize beyond the benchmark? @appliedcompute post-trained GLM-4.7 on ~2,000 expert Mercor tasks and achieved state-of-the-art legal performance on APEX Agents. We then evaluated the model on other enterprise benchmarks. On GDPVal, AC-Small’s win+tie rate rose from 55.0% to 62.7% (+7.7pp), ranking 5th overall and ahead of Opus 4.5.

Bertie Vidgen reposted

Mercor@mercor_ai·4d

The rate of improvement on APEX-Agents is incredible. 15 months ago, a frontier model (o1) would only score <2% Pass@1. Now the best models are at 35%+.

Bertie Vidgen@bertievidgen·5d

@adarsh_exe @cognition Exciting work, looking forward to seeing how models improve over the coming months

Bertie Vidgen reposted

adarsh@adarsh_exe·5d

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with @cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. @OpenAI GPT 5.3 Codex (High) tops the leaderboard at 41.5% on Pass@1.

Bertie Vidgen reposted

Mercor@mercor_ai·18 Mar

We just submitted APEX-Agents, APEX-1 and ACE to @evaluatingevals on @huggingface, an OSS initiative to standardize evals and try to reduce the noise in benchmarking.

Bertie Vidgen@bertievidgen·17 Mar

GPT 5.4 mini is really impressive. It outperforms models 4x the $ and performance scales with more reasoning tokens -- which isn't always a given for smaller models. It ranks highly on the APEX-Agents leaderboard: mercor.com/apex/apex-agen… And always nice to be quoted :D

GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…

Bertie Vidgen reposted

Mercor@mercor_ai·5 Mar

We’ve been testing @OpenAI GPT 5.4 on APEX-Agents, our benchmark for agentic work in professional services. GPT 5.4 now tops the leaderboard. Pass@1: 35.9% (+1.6pp); Mean: 52.5% (+4.3pp)

Bertie Vidgen reposted

Brendan (can/do)@BrendanFoody·5 Mar

GPT 5.4 is the best model we’ve ever tested on APEX-Agents. It’s also the first model to pass 50% mean score. A year ago, frontier models couldn’t even edit an Excel sheet and scored less than 5%. Now, in less than 3 months GPT 5.4 has improved by 15.7%. ChatGPT will imminently

Bertie Vidgen reposted

Brendan (can/do)@BrendanFoody·4 Mar

x.com/i/article/2028…

Bertie Vidgen reposted

Mercor@mercor_ai·26 Feb

GPT 5.3 Codex is out in the API. We tested it on APEX-Agents 👇 With reasoning effort set to high, it places 2nd on our leaderboard. APEX-Agents is our frontier benchmark for testing whether agents can do professional services work in law, consulting, and investment banking. It’s incredible to see how quickly @OpenAI GPT models are improving – progress shows no sign of stopping!

Bertie Vidgen reposted

Brendan (can/do)@BrendanFoody·25 Feb

.@appliedcompute achieving frontier capabilities on APEX Agents with just 2,000 tasks is incredible. Their model can produce complex legal deliverables, redlines, and slide decks. It feels like RL is becoming so powerful that it can quickly saturate any benchmark. The barrier

Bertie Vidgen reposted

Mercor@mercor_ai·24 Feb

Scaling Data leads to SOTA Legal Performance on APEX-Agents. @appliedcompute built a custom model (Applied Compute: Small) by post-training GLM 4.7 on nearly 2,000 samples provided by Mercor. It is now top of the APEX-Agents leaderboard in corporate law, with a Pass@1 score of 26.6% and a mean score of 54.8%. Here’s what we learnt 👇

Bertie Vidgen reposted

Epoch AI@EpochAIResearch·13 Feb

Can AI do real digital work? We reviewed three benchmarks to find out: RLI, GDPval, and APEX-Agents. Our take: progress here will indicate substantial economic value, but tasks are too self-contained to tell us about wholesale automation. Thread for more:

Bertie Vidgen reposted

Mercor@mercor_ai·5 Feb

We evaluated Kimi K2.5 from @Kimi_Moonshot against APEX-Agents. It’s the best-performing open-source model, beating out the next best, GPT-OSS-120B (Thinking = High), by almost 10 percentage points on pass@1 and 15 percentage points on mean score. It excels at management consulting tasks, where it is one of the 3 best models. Its mean score is 30.8%, outperforming many closed models.

Bertie Vidgen reposted

Mercor@mercor_ai·29 Jan

We partnered with @appliedcompute to post-train an open-source model on APEX-Agents, our frontier benchmark that tests how well AI agents complete real, long-horizon professional services deliverables in Google Workspace. The open-source model was trained on <1,000 tasks in professional services, created by Mercor experts. Pass@1 and mean score on the APEX-Agents benchmark nearly doubled. The post-trained model outperforms the baseline across all metrics, with the largest gains in corporate law where the Pass@1 score tripled. Read more on how quality datasets can push open-source models to unlock new capabilities ⬇️

Mercor@mercor_ai

x.com/i/article/2016…

Bertie Vidgen@bertievidgen·29 Jan

RT @BrendanFoody: @appliedcompute improved 19% on Corporate Law tasks in APEX Agents. Their model traverses data rooms with hundreds of fi…

Bertie Vidgen reposted

Mercor@mercor_ai·29 Jan

x.com/i/article/2016…

Bertie Vidgen@bertievidgen·21 Jan

@ankesh_anand I agree, and this was a surprising finding! But Flash uses 8x more tokens on these evals (Table 4 in the paper), so errr in total it can actually cost more than Pro...

Ankesh Anand@ankesh_anand·21 Jan

Flash is sota on yet another agentic benchmark released after the model came out. I highly recommend using Flash on frontier tasks instead of just “cheap, high-volume” workloads: you’ll be surprised!

Bertie Vidgen reposted

Yash Patil@ypatil125·21 Jan

AI agents are starting to feel like real teammates. The next leap isn’t just smarter models. It’s making them useful: shipping work, owning tasks end to end, and plugging into the workflows teams already rely on. That means deployment done right. Give agents the context, access, and feedback loops they need to succeed. It's basically like onboarding and training them the same way you would a new hire. Mercor has consistently been a leader in this type of real world thinking! We're proud to be partners!
