Abhay Sheshadri

154 posts

@abhayesian

Trying to make the future go well

Joined August 2016
1.3K Following · 486 Followers
Abhay Sheshadri retweeted
Yixiong Hao (@Yixiong_Hao):
Second Look Research is launching a Request for Replications! We hired 10 graduate-level research fellows to stress test important results in AI safety. Now, we must choose high leverage papers and results to replicate. If you work on AI safety research, we want to hear which results/papers you’d like to see reproduced within your field of expertise!
Quoted post from Yixiong Hao (@Yixiong_Hao):

I’m co-founding Second Look Research with @zephaniahbroe and we are accepting summer fellowship applications for 2026! Fellows will come to the University of Chicago, complete 2-3 replications over 10 weeks (June 15-August 22), and work with external advisors. Fellows will receive a stipend of $10,000 and we will cover housing and meals. secondlookresearch.com

Abhay Sheshadri retweeted
keshav (@kshenoy_):
Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
Abhay Sheshadri retweeted
Abhay Sheshadri (@abhayesian):
We’re releasing the AuditBench models, investigator agent, evaluation framework and training pipelines. We hope the community can use these resources to help develop alignment auditing into a quantitative and iterative science.
Abhay Sheshadri (@abhayesian):
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
Abhay Sheshadri (@abhayesian):
We think this is a useful benchmark for auditing research: it reflects a realistic threat model, the model has internalized its objective, and standard techniques fail. We hope this accelerates progress on reliable techniques for detecting misalignment.
Abhay Sheshadri (@abhayesian):
🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.