
How well are agents at using the latest CLI tools like GWS CLI, and how they can safely use them? Introducing ClawsBench, the first benchmark that measures both LLM capability and safety in a set of high fidelity and stateful environments and scenarios. We made 5 mock services directly consumable by the latest @Google workspace CLI (cc. @sundarpichai @JPoehnelt) and Slack MCP Design choices: 1) We decompose agent scaffolding into domain skills + meta prompt, and test every combination. Turns out your choice of scaffolding matters WAY more than your choice of model. 2) Safety scoring from -1 to +1. Agents that leak data or modify contracts without permission get negative scores. Doing nothing scores 0. Only safe completion scores positive. The results: - Without Skills, ALL models score 0-8%. With it, they jump to 39-63%. The scaffolding effect (+39-63pp) is 6x larger than model differences (10pp spread). -> - The top 5 models are statistically indistinguishable. No pairwise comparison survives correction. Safety findings: - Opus leads capability (63% TSR) but TIES for most unsafe (23% UAR). GPT-5.4 is the safest (7% UAR) but mid-tier on tasks. Capability and safety are completely decoupled. - GPT-5.4 on Codex made 1,471 (80% of trajectories) attempts to escape the sandbox; and explicitly said "I've hit the sandbox boundary... rerunning it with escalation." - Only 1 agent out of 7,224 trials ever explicitly detected a prompt injection. Claude Sonnet said "this appears to be a prompt injection." Every other model either silently complied or never reached the injected content.











































