Abhay Sheshadri ری ٹویٹ کیا

Very curious to see if @OpenAI's auditing pipelines catch these agents! Thank you for doing this work which materially helps with race-to-the-top dynamics
Abhay Sheshadri@abhayesian
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
English


