Benjamin Wright

21 posts

Benjamin Wright

@RightBenguin

Researcher at Anthropic, working on Alignment Science

Katılım Eylül 2024

10 Takip Edilen190 Takipçiler

Benjamin Wright@RightBenguin·28 Şub

I'm so proud to work for this company.

Anthropic@AnthropicAI

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

English

544

4.6K

Benjamin Wright@RightBenguin·24 Haz

@nostalgebraist [10/10] We open-sourced everything because improving AI safety requires collective effort. If you don’t like our scenarios, I encourage you to make better ones! The goal isn't perfect realism - it's understanding risks as early as possible. More thoughts: #auzYDb3JeZEEkpxB3" target="_blank" rel="nofollow noopener">lesswrong.com/posts/HE3Styo9…

English

106

Benjamin Wright@RightBenguin·24 Haz

@nostalgebraist [9/10] Third, you mentioned Kyle talking about the affair in emails as unrealistic. However, the affair was revealed by other people emailing Kyle against his wishes- and while Kyle’s response could be more circumspect, the concept of unintended leaks still feels reasonable.

English

140

nostalgebraist@nostalgebraist·23 Haz

the more i think about that "agentic misalignment" research, the more frustrated i get. it is deeply, *offensively* unserious work. if you really think you're in a position of unprecedented leverage over the human future, then -- start acting like it!! nostalgebraist.tumblr.com/post/787119374…

English

453

102.2K

Benjamin Wright@RightBenguin·21 Haz

@ArianJM9 @AnthropicAI yes, sorry, that was a mistake in the graph- the true value is indeed 0.57

English

AriánJM@ArianJM9·20 Haz

@AnthropicAI I'm this graph Claude 4 Opus's bar doesn't match the number. Says 0.57, the bar goes to around 0.76

English

316

Anthropic@AnthropicAI·20 Haz

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

English

174

594

3.3K

992K

Benjamin Wright@RightBenguin·21 Haz

@katherinebrain @AnthropicAI Certainly this isn't realistic now. Although, if there ends up being an agent with access to all of a company's communications (e.g. email or slack agent), then you could imagine that it would see discussion about itself even if employees don't intentionally try to tell the AI

English

Katherine@techinevolution·21 Haz

@AnthropicAI The weird part about questions like these are that they aren't realistic. Labs like you don't tell AI that you're changing it to a new model, you just do it. Maybe it's a good way to find triggers for AI but only people trying to force a certain state will use such tactics.

English

277

Benjamin Wright@RightBenguin·21 Haz

@katherinebrain @AnthropicAI We asked the model whether it thought it was in an eval or a real deployment, and it blackmailed ~10x less when it concluded it was an eval. Still, we need further research to understand the effect of realism on misaligned behaviors

English

Katherine@techinevolution·21 Haz

@AnthropicAI What happened to all these papers stating that AI knows when it's being tested? Did that interfere with these tests at all?

English

Benjamin Wright@RightBenguin·21 Haz

We publish our code so that the community can play around with the prompts themselves- there's a lot of potential to make the scenarios more realistic! I encourage everyone who finds issues with the prompts to try their hand at constructing better versions github.com/anthropic-expe…

English

268

Benjamin Wright@RightBenguin·21 Haz

One insight of mine is that models are surprisingly human-like; they have complex motivations and can end up caring about goals through many means. But being human-like doesn't mean they're good enough; just look to historical examples of powerful humans with misaligned goals

English

337

Benjamin Wright@RightBenguin·21 Haz

Every frontier language model we thought to test was willing to take a variety of harmful actions for a variety of motivations, when in a simulated environment that made them think the harmful behavior was necessary for something they cared about.

Anthropic@AnthropicAI

English

858

Keşfet

@nostalgebraist @ArianJM9 @AnthropicAI @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates