Abhay Sheshadri

154 posts

@abhayesian

Trying to make the future go well

Joined August 2016

1.3K Following, 486 Followers

Abhay Sheshadri retweeted

Yixiong Hao@Yixiong_Hao·6 May

Second Look Research is launching a Request for Replications! We hired 10 graduate-level research fellows to stress test important results in AI safety. Now, we must choose high leverage papers and results to replicate. If you work on AI safety research, we want to hear which results/papers you’d like to see reproduced within your field of expertise!

Yixiong Hao@Yixiong_Hao

I’m co-founding Second Look Research with @zephaniahbroe and we are accepting summer fellowship applications for 2026! Fellows will come to the University of Chicago, complete 2-3 replications over 10 weeks (June 15-August 22), and work with external advisors. Fellows will receive a stipend of $10,000 and we will cover housing and meals. secondlookresearch.com

8.1K

Abhay Sheshadri retweeted

keshav@kshenoy_·28 Apr

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

557

285.1K

Abhay Sheshadri retweeted

Micah Carroll@MicahCarroll·10 Mar

Very curious to see if @OpenAI's auditing pipelines catch these agents! Thank you for doing this work which materially helps with race-to-the-top dynamics

Abhay Sheshadri@abhayesian

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

7.7K

Abhay Sheshadri@abhayesian·10 Mar

📚 Paper: arxiv.org/abs/2602.22755… 📝Blogpost: alignment.anthropic.com/2026/auditbenc… 🤖 Models & Datasets: huggingface.co/auditing-agents 💻Codebase: github.com/safety-researc…

681

Abhay Sheshadri@abhayesian·10 Mar

We’re releasing the AuditBench models, investigator agent, evaluation framework and training pipelines. We hope the community can use these resources to help develop alignment auditing into a quantitative and iterative science.

692

Abhay Sheshadri@abhayesian·10 Mar

265

28.8K

Abhay Sheshadri@abhayesian·14 Dec

Read our new post and check out the model: 📝 Full post: alignment.anthropic.com/2025/auditing-… 🤖 Model: huggingface.co/auditing-agent… 📊 Datasets: huggingface.co/collections/au…

1.1K

Abhay Sheshadri@abhayesian·14 Dec

We think this is a useful benchmark for auditing research: it reflects a realistic threat model, the model has internalized its objective, and standard techniques fail. We hope this accelerates progress on reliable techniques for detecting misalignment.

1.2K

Abhay Sheshadri@abhayesian·14 Dec

🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.

178

35.4K
