Abhay Sheshadri

@abhayesian

Trying to understand A(G)I

Joined August 2016

1.3K Following, 472 Followers

Abhay Sheshadri retweeted

Micah Carroll@MicahCarroll·10 Mar

Very curious to see if @OpenAI's auditing pipelines catch these agents! Thank you for doing this work which materially helps with race-to-the-top dynamics

Abhay Sheshadri@abhayesian

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

7.5K

Abhay Sheshadri@abhayesian·10 Mar

📚 Paper: arxiv.org/abs/2602.22755… 📝Blogpost: alignment.anthropic.com/2026/auditbenc… 🤖 Models & Datasets: huggingface.co/auditing-agents 💻Codebase: github.com/safety-researc…

599

Abhay Sheshadri@abhayesian·10 Mar

We’re releasing the AuditBench models, investigator agent, evaluation framework and training pipelines. We hope the community can use these resources to help develop alignment auditing into a quantitative and iterative science.

609

Abhay Sheshadri@abhayesian·10 Mar

263

26.7K

Abhay Sheshadri retweeted

Vincent@vvvincent_c·10 Feb

Here are some personal thoughts on the current state of results in TH 1.1!

4.1K

Abhay Sheshadri@abhayesian·14 Dec

Read our new post and check out the model: 📝 Full post: alignment.anthropic.com/2025/auditing-… 🤖 Model: huggingface.co/auditing-agent… 📊 Datasets: huggingface.co/collections/au…

1.1K

Abhay Sheshadri@abhayesian·14 Dec

We think this is a useful benchmark for auditing research: it reflects a realistic threat model, the model has internalized its objective, and standard techniques fail. We hope this accelerates progress on reliable techniques for detecting misalignment.

1.1K

Abhay Sheshadri@abhayesian·14 Dec

🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.

180

34.8K
