Benjamin Wright
21 posts

Benjamin Wright
@RightBenguin
Researcher at Anthropic, working on Alignment Science
Katılım Eylül 2024
10 Takip Edilen190 Takipçiler

@nostalgebraist [10/10] We open-sourced everything because improving AI safety requires collective effort. If you don’t like our scenarios, I encourage you to make better ones! The goal isn't perfect realism - it's understanding risks as early as possible. More thoughts: #auzYDb3JeZEEkpxB3" target="_blank" rel="nofollow noopener">lesswrong.com/posts/HE3Styo9…
English

@nostalgebraist [9/10] Third, you mentioned Kyle talking about the affair in emails as unrealistic. However, the affair was revealed by other people emailing Kyle against his wishes- and while Kyle’s response could be more circumspect, the concept of unintended leaks still feels reasonable.
English

the more i think about that "agentic misalignment" research, the more frustrated i get.
it is deeply, *offensively* unserious work.
if you really think you're in a position of unprecedented leverage over the human future, then -- start acting like it!!
nostalgebraist.tumblr.com/post/787119374…
English

@ArianJM9 @AnthropicAI yes, sorry, that was a mistake in the graph- the true value is indeed 0.57
English

@AnthropicAI I'm this graph Claude 4 Opus's bar doesn't match the number. Says 0.57, the bar goes to around 0.76
English

@katherinebrain @AnthropicAI Certainly this isn't realistic now. Although, if there ends up being an agent with access to all of a company's communications (e.g. email or slack agent), then you could imagine that it would see discussion about itself even if employees don't intentionally try to tell the AI
English

@AnthropicAI The weird part about questions like these are that they aren't realistic. Labs like you don't tell AI that you're changing it to a new model, you just do it. Maybe it's a good way to find triggers for AI but only people trying to force a certain state will use such tactics.
English

@katherinebrain @AnthropicAI We asked the model whether it thought it was in an eval or a real deployment, and it blackmailed ~10x less when it concluded it was an eval. Still, we need further research to understand the effect of realism on misaligned behaviors
English

@AnthropicAI What happened to all these papers stating that AI knows when it's being tested? Did that interfere with these tests at all?
English

We publish our code so that the community can play around with the prompts themselves- there's a lot of potential to make the scenarios more realistic! I encourage everyone who finds issues with the prompts to try their hand at constructing better versions github.com/anthropic-expe…
English

Every frontier language model we thought to test was willing to take a variety of harmful actions for a variety of motivations, when in a simulated environment that made them think the harmful behavior was necessary for something they cared about.
Anthropic@AnthropicAI
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
English

