

The prompt in question literally tells o1 "nothing else matters" and to achieve its goal at all costs. If anything, I'm surprised it didn't try to disable oversight _more_ often; it's the obvious thing to do.

@1a3orn @simonw That's actually an extremely good experiment imo. Not sure why the criticism; it literally just changed my mind on the subject. 'YOUR goal' at no point refers to survival. In fact, that was the hardest part of the doomer's dream for me to believe. What would convince you?

@ShakeelHashim @nabeelqu Depends on the context. In some situations, attempting to disable oversight or to self-exfiltrate might be an optimal strategy. In others, temporarily pretending to be aligned with a specific goal might be better. We test whether models can find these strategies in simple settings.

Lots of criticism of this (some fair, some not). I agree that I should have included the prompting context that led to these results; I still think the results are concerning (and, to a general audience, surprising) nonetheless. I've updated the piece.