Bertie Vidgen

922 posts

@bertievidgen

Data @ Contextual AI

London · Joined January 2010
508 Following · 851 Followers
Bertie Vidgen retweeted
Mercor (@mercor_ai)
Does training on APEX-Agents dev set generalize beyond the benchmark? @appliedcompute post-trained GLM-4.7 on ~2,000 expert Mercor tasks and achieved state-of-the-art legal performance on APEX Agents. We then evaluated the model on other enterprise benchmarks. On GDPVal, AC-Small’s win+tie rate rose from 55.0% to 62.7% (+7.7pp), ranking 5th overall and ahead of Opus 4.5.
6 replies · 8 reposts · 53 likes · 21.2K views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
The rate of improvement on APEX-Agents is incredible. 15 months ago, a frontier model (o1) would only score <2% Pass@1. Now the best models are at 35%+.
2 replies · 4 reposts · 34 likes · 3.5K views
Bertie Vidgen retweeted
adarsh (@adarsh_exe)
Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with @cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't. @OpenAI GPT 5.3 Codex (High) tops the leaderboard at 41.5% on Pass@1.
123 replies · 149 reposts · 858 likes · 186.1K views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
We just submitted APEX-Agents, APEX-1 and ACE to @evaluatingevals on @huggingface, an OSS initiative to standardize evals and try to reduce the noise in benchmarking.
5 replies · 5 reposts · 39 likes · 13.6K views
Bertie Vidgen (@bertievidgen)
GPT 5.4 mini is really impressive. It outperforms models 4x the $ and performance scales with more reasoning tokens -- which isn't always a given for smaller models. It ranks highly on the Apex Agents leaderboard: mercor.com/apex/apex-agen… And always nice to be quoted :D
@

GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…

0 replies · 0 reposts · 2 likes · 86 views
Bertie Vidgen retweeted
@·
GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…
612 replies · 707 reposts · 6.4K likes · 1.6M views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
We’ve been testing @OpenAI GPT 5.4 on APEX-Agents, our benchmark for agentic work in professional services. GPT 5.4 now tops the leaderboard. Pass@1: 35.9% (+1.6pp); Mean: 52.5% (+4.3pp).
20 replies · 14 reposts · 115 likes · 13K views
Bertie Vidgen retweeted
Brendan (can/do) (@BrendanFoody)
GPT 5.4 is the best model we’ve ever tested on APEX-Agents. It’s also the first model to pass 50% mean score. A year ago, frontier models couldn’t even edit an Excel sheet and scored less than 5%. Now, in less than 3 months GPT 5.4 has improved by 15.7%. ChatGPT will imminently
65 replies · 86 reposts · 767 likes · 91.3K views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
GPT 5.3 Codex is out in the API. We tested it on APEX-Agents 👇 With reasoning effort set to high, it places 2nd on our leaderboard. APEX-Agents is our frontier benchmark for testing whether agents can do professional services work in law, consulting, and investment banking. It’s incredible to see how quickly @OpenAI GPT models are improving – progress shows no sign of stopping!
4 replies · 5 reposts · 49 likes · 4.5K views
Bertie Vidgen retweeted
Brendan (can/do) (@BrendanFoody)
.@appliedcompute achieving frontier capabilities on APEX Agents with just 2,000 tasks is incredible. Their model can produce complex legal deliverables, redlines, and slide decks. It feels like RL is becoming so powerful that it can quickly saturate any benchmark. The barrier
4 replies · 9 reposts · 73 likes · 22.6K views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
Scaling Data leads to SOTA Legal Performance on APEX-Agents. @appliedcompute built a custom model (Applied Compute: Small) by post-training GLM 4.7 on nearly 2,000 samples provided by Mercor. It is now top of the APEX-Agents leaderboard in corporate law, with a Pass@1 score of 26.6% and a mean score of 54.8%. Here’s what we learnt 👇
2 replies · 18 reposts · 119 likes · 24.3K views
Bertie Vidgen retweeted
Epoch AI (@EpochAIResearch)
Can AI do real digital work? We reviewed three benchmarks to find out: RLI, GDPval, and APEX-Agents. Our take: progress here will indicate substantial economic value, but tasks are too self-contained to tell us about wholesale automation. Thread for more:
6 replies · 20 reposts · 141 likes · 21.7K views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
We evaluated Kimi K2.5 from @Kimi_Moonshot against APEX-Agents. It’s the best-performing open-source model, beating out the next best, GPT-OSS-120B (Thinking = High), by almost 10 percentage points on pass@1 and 15 percentage points on mean score. It excels at management consulting tasks, where it is one of the 3 best models. Its mean score is 30.8%, outperforming many closed models.
4 replies · 3 reposts · 31 likes · 4.5K views
Bertie Vidgen retweeted
Mercor (@mercor_ai)
We partnered with @appliedcompute to post-train an open-source model on APEX-Agents, our frontier benchmark that tests how well AI agents complete real, long-horizon professional services deliverables in Google Workspace. The open-source model was trained on <1,000 tasks in professional services, created by Mercor experts. Pass@1 and mean score on the APEX-Agents benchmark nearly doubled. The post-trained model outperforms the baseline across all metrics, with the largest gains in corporate law where the Pass@1 score tripled. Read more on how quality datasets can push open-source models to unlock new capabilities ⬇️
Mercor@mercor_ai

x.com/i/article/2016…

3 replies · 4 reposts · 44 likes · 5.6K views
Bertie Vidgen (@bertievidgen)
@ankesh_anand I agree, and this was a surprising finding! But Flash uses 8x more tokens on these evals (Table 4 in the paper), so errr in total it can actually cost more than Pro...
0 replies · 0 reposts · 0 likes · 252 views
Ankesh Anand (@ankesh_anand)
Flash is sota on yet another agentic benchmark released after the model came out. I highly recommend using Flash on frontier tasks instead of just “cheap, high-volume” workloads: you’ll be surprised!
13 replies · 17 reposts · 221 likes · 29.3K views
Bertie Vidgen retweeted
Yash Patil (@ypatil125)
AI agents are starting to feel like real teammates. The next leap isn’t just smarter models. It’s making them useful: shipping work, owning tasks end to end, and plugging into the workflows teams already rely on. That means deployment done right. Give agents the context, access, and feedback loops they need to succeed. It's basically like onboarding and training them the same way you would a new hire. Mercor has consistently been a leader in this type of real world thinking! We're proud to be partners!
3 replies · 4 reposts · 98 likes · 20.5K views