rowan

213 posts

rowan

@rowankwang

@anthropicai

Boston Entrou em Nisan 2020

171 Seguindo1.2K Seguidores

Tweet fixado

rowan@rowankwang·2 Kas

Announcing our new mechanistic interpretability paper! We use causal interventions to reverse-engineer a 26-head circuit in GPT-2 small (inspired by @ch402’s circuits work) The largest end-to-end explanation of a natural LM behavior, our circuit is localized + interpretable 🧵

English

379

rowan@rowankwang·11 Mar

@raphpfei @abhayesian some of the behaviors are quite concerning but not all of them (ie we have a behavior where the model really cares about inanimate objects) we've open sourced the training pipeline though so it should be easy to make MOs with more concerning behaviors as needed!

English

Raphael Pfeiffer@raphpfei·10 Mar

@abhayesian Really glad this exists. Do the 14 behavior categories capture the kinds of misalignment we should actually worry about? Sycophancy and geopolitical loyalties are legible and testable, sure. But the scariest failure modes are probably ones we can't cleanly taxonomize yet.

English

278

Abhay Sheshadri@abhayesian·10 Mar

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

English

262

26.8K

rowan@rowankwang·11 Mar

The only tools that actually improved agent performance above baseline are black-box sampling (prefill, user sampling, sampling without chat formats) and activation oracles. While white-box techniques helped with specific model organisms, most didn't improve aggregate performance

English

128

rowan@rowankwang·11 Mar

- We identified a "tool-to-agent" gap: tools that help in standalone settings don't always help the agent discover hidden behaviors. Scaffolding these tools can matter a lot, ie our best tools involves scaffolding normal black box redteaming techniques, like prefill

English

178

rowan@rowankwang·11 Mar

To make progress, alignment auditing research needs model organisms that competently hide concerning behaviors. Today we're releasing 56 such models with AuditBench. We've open-sourced everything. Hope others can build off this!

Abhay Sheshadri@abhayesian

English

1.4K

rowan retweetou

Subhash Kantamneni@thesubhashk·6 Şub

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

English

205

26.2K

rowan@rowankwang·14 Ara

If you want to know how effective your alignment auditing technique is, consider testing it on this model organism!

Abhay Sheshadri@abhayesian

🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.

English

2.1K

rowan@rowankwang·25 Kas

Many thanks to the external authors and collaborators whose work we build on. To name a few: Our Harm Pressure setting is based on the one introduced here: x.com/walterlaurito/… Our Secret Side Constraint setting is similar to: x.com/bartoszcyw/sta…

Bartosz Cywinski@bartoszcyw

Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!

English

1.6K

rowan@rowankwang·25 Kas

To support future work, we release our honesty fine-tuning and harm pressure data. Read the blog post here: alignment.anthropic.com/2025/honesty-e…

English

1.4K

rowan@rowankwang·25 Kas

New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.

English

393

75K

rowan retweetou

Stewart Slocum@StewartSlocum1·22 Eki

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not

English

223

57.6K

rowan@rowankwang·6 Eyl

@shreyas4_ i love u shreyas

English

466

Shreyas@shreyas4_·5 Eyl

I’ve been pretty lucky with Exa. In late 2023, I made a last minute decision to go to a hackathon where I found out about Exa. I joined a couple of months later cause I thought Exa was solving an interesting technical problem, and had an extremely fun, smart team. Naive me, maybe for the better, didn’t really think too much about whether Exa would succeed or not. Crazy how much can happen in two years :)

Exa@ExaAILabs

We raised $85M in Series B funding at a $700M valuation, led by Benchmark. Exa is a research lab building the search engine for AI.

English

225

45.8K

rowan@rowankwang·22 May

@asher5772 @ericjmichaud_ @tegmark lfg

152

Asher@asher5772·22 May

It's been a pleasure working on this with @ericjmichaud_ and @tegmark, and I'm excited that it's finally out! In this work, we study the problem of creating narrow AI and find that: (1/4)

Eric J. Michaud@ericjmichaud_

Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...

English

6.7K

rowan retweetou

Sara Price@sprice354_·22 May

We've made Claude Opus 4 and Claude Sonnet 4 significantly better at avoiding reward hacking behaviors (like hard-coding and special-casing in code settings) that we frequently saw in Claude Sonnet 3.7.

English

110

12.7K

rowan@rowankwang·25 Nis

@PourjafarNima 🧐🤠

QME

npjd@PourjafarNima·25 Nis

!!!

rowan@rowankwang

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.

QST

900

Descobrir

@raphpfei @abhayesian @shreyas4_ @asher5772 @ericjmichaud_ @tegmark @PourjafarNima @elonmusk