rowan

213 posts

rowan banner
rowan

rowan

@rowankwang

@anthropicai

Boston Beigetreten Nisan 2020
171 Folgt1.2K Follower
Angehefteter Tweet
rowan
rowan@rowankwang·
Announcing our new mechanistic interpretability paper! We use causal interventions to reverse-engineer a 26-head circuit in GPT-2 small (inspired by @ch402’s circuits work) The largest end-to-end explanation of a natural LM behavior, our circuit is localized + interpretable 🧵
rowan tweet media
English
6
60
379
0
rowan
rowan@rowankwang·
@raphpfei @abhayesian some of the behaviors are quite concerning but not all of them (ie we have a behavior where the model really cares about inanimate objects) we've open sourced the training pipeline though so it should be easy to make MOs with more concerning behaviors as needed!
English
0
0
2
21
Raphael Pfeiffer
Raphael Pfeiffer@raphpfei·
@abhayesian Really glad this exists. Do the 14 behavior categories capture the kinds of misalignment we should actually worry about? Sycophancy and geopolitical loyalties are legible and testable, sure. But the scariest failure modes are probably ones we can't cleanly taxonomize yet.
English
1
0
0
278
Abhay Sheshadri
Abhay Sheshadri@abhayesian·
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
Abhay Sheshadri tweet media
English
12
38
262
26.8K
rowan
rowan@rowankwang·
The only tools that actually improved agent performance above baseline are black-box sampling (prefill, user sampling, sampling without chat formats) and activation oracles. While white-box techniques helped with specific model organisms, most didn't improve aggregate performance
English
0
0
2
128
rowan
rowan@rowankwang·
- We identified a "tool-to-agent" gap: tools that help in standalone settings don't always help the agent discover hidden behaviors. Scaffolding these tools can matter a lot, ie our best tools involves scaffolding normal black box redteaming techniques, like prefill
English
1
0
4
178
rowan
rowan@rowankwang·
To make progress, alignment auditing research needs model organisms that competently hide concerning behaviors. Today we're releasing 56 such models with AuditBench. We've open-sourced everything. Hope others can build off this!
Abhay Sheshadri@abhayesian

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

English
1
0
21
1.4K
rowan retweetet
Subhash Kantamneni
Subhash Kantamneni@thesubhashk·
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
Subhash Kantamneni tweet media
English
11
34
205
26.2K
rowan
rowan@rowankwang·
Many thanks to the external authors and collaborators whose work we build on. To name a few: Our Harm Pressure setting is based on the one introduced here: x.com/walterlaurito/… Our Secret Side Constraint setting is similar to: x.com/bartoszcyw/sta…
Bartosz Cywinski@bartoszcyw

Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!

English
0
0
21
1.6K
rowan
rowan@rowankwang·
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.
rowan tweet media
English
21
43
393
75K
rowan retweetet
Stewart Slocum
Stewart Slocum@StewartSlocum1·
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
Stewart Slocum tweet media
English
6
42
223
57.6K
Shreyas
Shreyas@shreyas4_·
I’ve been pretty lucky with Exa. In late 2023, I made a last minute decision to go to a hackathon where I found out about Exa. I joined a couple of months later cause I thought Exa was solving an interesting technical problem, and had an extremely fun, smart team. Naive me, maybe for the better, didn’t really think too much about whether Exa would succeed or not. Crazy how much can happen in two years :)
Exa@ExaAILabs

We raised $85M in Series B funding at a $700M valuation, led by Benchmark. Exa is a research lab building the search engine for AI.

English
12
6
225
45.8K
rowan retweetet
Sara Price
Sara Price@sprice354_·
We've made Claude Opus 4 and Claude Sonnet 4 significantly better at avoiding reward hacking behaviors (like hard-coding and special-casing in code settings) that we frequently saw in Claude Sonnet 3.7.
English
4
9
110
12.7K