
Kai Fronsdal
21 posts

Kai Fronsdal
@kaifronsdal
alignment red teamer research affiliate at @AISecurityInst




As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵

New Anthropic Fellows research: Automated audits like Petri are increasingly used for alignment evals, but they're often unrealistic, and frontier LLMs can often tell. We measure and improve the realism of agentic coding audits by grounding them in real deployment data.







It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios. Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes. Read more: anthropic.com/research/petri…



Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/



Interesting trend: models have been getting a lot more aligned over the course of 2025. The fraction of misaligned behavior found by automated auditing has been going down not just at Anthropic but for GDM and OpenAI as well.

30% Drop In o1-Preview Accuracy When Putnam Problems Are Slightly Variated openreview.net/forum?id=YXnwl… (news.ycombinator.com/item?id=425656…)










