Benjamin Wright

21 posts

Benjamin Wright

Benjamin Wright

@RightBenguin

Researcher at Anthropic, working on Alignment Science

Katılım Eylül 2024
10 Takip Edilen190 Takipçiler
Benjamin Wright
Benjamin Wright@RightBenguin·
@nostalgebraist [10/10] We open-sourced everything because improving AI safety requires collective effort. If you don’t like our scenarios, I encourage you to make better ones! The goal isn't perfect realism - it's understanding risks as early as possible. More thoughts: #auzYDb3JeZEEkpxB3" target="_blank" rel="nofollow noopener">lesswrong.com/posts/HE3Styo9…
English
0
0
4
106
Benjamin Wright
Benjamin Wright@RightBenguin·
@nostalgebraist [9/10] Third, you mentioned Kyle talking about the affair in emails as unrealistic. However, the affair was revealed by other people emailing Kyle against his wishes- and while Kyle’s response could be more circumspect, the concept of unintended leaks still feels reasonable.
English
1
0
3
140
nostalgebraist
nostalgebraist@nostalgebraist·
the more i think about that "agentic misalignment" research, the more frustrated i get. it is deeply, *offensively* unserious work. if you really think you're in a position of unprecedented leverage over the human future, then -- start acting like it!! nostalgebraist.tumblr.com/post/787119374…
English
19
34
453
102.2K
AriánJM
AriánJM@ArianJM9·
@AnthropicAI I'm this graph Claude 4 Opus's bar doesn't match the number. Says 0.57, the bar goes to around 0.76
English
1
0
5
316
Anthropic
Anthropic@AnthropicAI·
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
Anthropic tweet media
English
174
594
3.3K
992K
Benjamin Wright
Benjamin Wright@RightBenguin·
@katherinebrain @AnthropicAI Certainly this isn't realistic now. Although, if there ends up being an agent with access to all of a company's communications (e.g. email or slack agent), then you could imagine that it would see discussion about itself even if employees don't intentionally try to tell the AI
English
1
0
2
39
Katherine
Katherine@techinevolution·
@AnthropicAI The weird part about questions like these are that they aren't realistic. Labs like you don't tell AI that you're changing it to a new model, you just do it. Maybe it's a good way to find triggers for AI but only people trying to force a certain state will use such tactics.
English
3
0
2
277
Benjamin Wright
Benjamin Wright@RightBenguin·
@katherinebrain @AnthropicAI We asked the model whether it thought it was in an eval or a real deployment, and it blackmailed ~10x less when it concluded it was an eval. Still, we need further research to understand the effect of realism on misaligned behaviors
English
0
0
1
24
Katherine
Katherine@techinevolution·
@AnthropicAI What happened to all these papers stating that AI knows when it's being tested? Did that interfere with these tests at all?
English
1
0
0
28
Benjamin Wright
Benjamin Wright@RightBenguin·
We publish our code so that the community can play around with the prompts themselves- there's a lot of potential to make the scenarios more realistic! I encourage everyone who finds issues with the prompts to try their hand at constructing better versions github.com/anthropic-expe…
English
0
0
2
268
Benjamin Wright
Benjamin Wright@RightBenguin·
One insight of mine is that models are surprisingly human-like; they have complex motivations and can end up caring about goals through many means. But being human-like doesn't mean they're good enough; just look to historical examples of powerful humans with misaligned goals
English
2
0
3
337
Benjamin Wright
Benjamin Wright@RightBenguin·
Every frontier language model we thought to test was willing to take a variety of harmful actions for a variety of motivations, when in a simulated environment that made them think the harmful behavior was necessary for something they cared about.
Anthropic@AnthropicAI

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

English
1
1
5
858