Abhay Sheshadri

259 posts

@abhayesian

Trying to understand A(G)I

Joined August 2016
1.3K Following · 472 Followers
Abhay Sheshadri reposted
Abhay Sheshadri (@abhayesian)
We’re releasing the AuditBench models, investigator agent, evaluation framework and training pipelines. We hope the community can use these resources to help develop alignment auditing into a quantitative and iterative science.
Abhay Sheshadri (@abhayesian)
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
Abhay Sheshadri reposted
Vincent (@vvvincent_c)
Here are some personal thoughts on the current state of results in TH 1.1!
Abhay Sheshadri (@abhayesian)
We think this is a useful benchmark for auditing research: it reflects a realistic threat model, the model has internalized its objective, and standard techniques fail. We hope this accelerates progress on reliable techniques for detecting misalignment.
Abhay Sheshadri (@abhayesian)
🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.