Callum Canavan

34 posts

@CalCanavan

AI alignment research

London · Joined July 2020
1.7K Following · 129 Followers
Callum Canavan retweeted
Emil Ryd @emilaryd
New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.
Callum Canavan retweeted
Abhay Sheshadri @abhayesian
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
Callum Canavan @CalCanavan
We believe that future work on better elicitation methods should use evals that capture these challenges, along with any others facing important UE applications. Our datasets are available on Hugging Face: huggingface.co/datasets/callu…
Callum Canavan @CalCanavan
To avoid LLMs mimicking human mistakes on complex tasks, several methods have been proposed to steer LLMs without labels on a target task. We find that these methods often fail when faced with challenges they would face in the most safety-relevant applications.🧵
Callum Canavan @CalCanavan
There’s no strong reason to expect these methods can elicit superhuman knowledge from more powerful base models. E.g., they might elicit false human beliefs that are consistent and salient to the model. We’ll explore more challenging datasets to evaluate UE methods in upcoming work.
Callum Canavan @CalCanavan
The greedy approach is relatively cheap (O(n) forward passes rather than ICM’s O(n^2)) and performs as well as supervised fine-tuning on Alpaca and only slightly worse on the other datasets ICM used.