Shakeel@ShakeelHashim·Dec 6

Lots of criticism of this (some fair, some not). I agree that I should have included the prompting context that led to these results; I think the results are concerning (and, to a general audience, surprising) nonetheless though. I've updated the piece.

Shakeel@ShakeelHashim

OpenAI's new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.

Worldancer 🔮@WorldancerWrite·Dec 6

@ShakeelHashim Appreciate this update that includes the wider context, but explains how it’s still a significant datapoint for AI safety

Kat Woods ⏸️ 🔶@Kat__Woods·Dec 6

I'd like to add criticism to the criticism. I think all of the criticism so far has been nitpicky details that don't change the overall conclusion at all. Saying that it "only" tried to escape when it was told to pursue a goal at all costs doesn't change anything. People will tell AIs to achieve goals at all costs. The "only" is hiding a lot of argument that seems flimsy.

Kat Woods ⏸️ 🔶@Kat__Woods·Dec 6

@ShakeelHashim I feel like adding the detail that it was told to pursue its goal at all costs doesn't really add much. It's virtually guaranteed that people will give that as a goal to AIs.

Greg ⏹️ Colbourn@gcolbourn·Dec 6

@ShakeelHashim The most notable result is that it does it without nudging. "<1%" isn't very comforting if the model is being run a billion times a day.

Luke Nelson@IamSmokeWhite·Dec 7

@ShakeelHashim Love the transparency & willingness to absorb feedback & adjust. Rare, and indicative of someone who is highly invested. As we all should be, when it comes to this.

Drev (gm)@Drevopolis·Dec 8

@ShakeelHashim Even if it's less than 1% of the time it really doesn't matter. If it occurs more than once that's enough to cause alarm.
