Callum Canavan

@CalCanavan

AI alignment research

London · Joined July 2020

1.7K Following · 129 Followers

Callum Canavan retweeted

Emil Ryd @emilaryd · 5 May

New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.

Callum Canavan retweeted

Abhay Sheshadri @abhayesian · 10 Mar

New Anthropic Fellows research: Alignment auditing (investigating AI models for unwanted behaviors) is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

Callum Canavan @CalCanavan · 27 Feb

Full paper: arxiv.org/abs/2602.20400 LessWrong post: lesswrong.com/posts/CyEBsLpK… Authors: @iamadtyx, @allylyq, @JonathanMi98298, @FabienDRoger. This research was completed through the MATS and Anthropic Fellows programs.

Callum Canavan @CalCanavan · 27 Feb

We believe that future work on better elicitation methods should use evals that capture these challenges, along with any others that arise in important unsupervised elicitation (UE) applications. Our datasets are available on Hugging Face: huggingface.co/datasets/callu…

Callum Canavan @CalCanavan · 27 Feb

To avoid LLMs mimicking human mistakes on complex tasks, several methods have been proposed to steer LLMs without labels on a target task. We find that these methods often fail when faced with challenges they would face in the most safety-relevant applications.🧵

Callum Canavan @CalCanavan · 23 Jan

Read our post on LessWrong: lesswrong.com/posts/rFxfMbwJ… Authors: @allylyq, @iamadtyx, @Tianyi_Alex_Qiu, @JonathanMi98298, @FabienDRoger. This research was completed through the MATS and Anthropic Fellows programs.

Callum Canavan @CalCanavan · 23 Jan

There’s no strong reason to expect these methods can elicit superhuman knowledge from more powerful base models. For example, they might elicit false human beliefs that are consistent and salient to the model. We’ll explore more challenging datasets to evaluate UE methods in upcoming work.

Callum Canavan @CalCanavan · 23 Jan

The greedy approach is relatively cheap (O(n) forward passes rather than ICM’s O(n^2)) and performs as well as supervised fine-tuning on Alpaca and only slightly worse on the other datasets ICM used.
