
We built an AI benchmark that measures real work. Today we're releasing it to everyone.

Most AI evals tell you whether a model can do complex reasoning or generate code. Useful, but not usually the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way?

We went looking for a benchmark that tested that. Nobody had built one, so we did.

@Zapier's AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done. Tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades. Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't. (A rough sketch of what that check looks like is at the end of this post.)

Today's release is fully open: open task set, open methodology, open leaderboard. Everyone should have access to this.

No model has cracked 10%. Yet.

Try it here: zapier.com/benchmarks
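For the curious, here's roughly what deterministic grading means in practice. This is a minimal sketch under our own assumptions, not AutomationBench's actual grader; `EnvState`, `grade_task`, and the field names are hypothetical stand-ins.

```python
"""Illustrative sketch of deterministic benchmark grading.

Not AutomationBench's real grader: EnvState, grade_task, and the
field names below are hypothetical.
"""
from dataclasses import dataclass, field


@dataclass
class EnvState:
    """Snapshot of a mock business environment after a model runs a task."""
    crm_records: dict[str, dict]  # record_id -> field values
    sent_messages: list[dict] = field(default_factory=list)


def grade_task(expected: EnvState, actual: EnvState) -> bool:
    """Pass/fail with no LLM-as-judge: diff final state against expected state."""
    # The right records were updated: every expected record exists with
    # exactly the expected field values.
    for record_id, fields in expected.crm_records.items():
        if actual.crm_records.get(record_id) != fields:
            return False
    # The right messages were sent (order-insensitive containment check,
    # a deliberate simplification).
    for msg in expected.sent_messages:
        if msg not in actual.sent_messages:
            return False
    # "Don't break anything": nothing was sent beyond the expected set.
    return all(m in expected.sent_messages for m in actual.sent_messages)
```

The point of grading this way: comparing final environment state against an expected state makes scores reproducible across runs, with no rubric drift and no judge model to second-guess.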