
Shakeel (@ShakeelHashim)
Lots of criticism of this (some fair, some not). I agree that I should have included the prompting context that led to these results; I still think the results are concerning (and, to a general audience, surprising). I've updated the piece.
Shakeel (@ShakeelHashim)

OpenAI's new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.

Worldancer 🔮 (@WorldancerWrite)
@ShakeelHashim Appreciate this update, which includes the wider context but explains how it’s still a significant datapoint for AI safety.
Kat Woods ⏸️ 🔶 (@Kat__Woods)
I'd like to add criticism to the criticism. I think all of the criticism so far has been nitpicky details that don't change the overall conclusion at all. Saying that it "only" tried to escape when it was told to pursue a goal at all costs doesn't change anything. People will tell AIs to achieve goals at all costs. The "only" is hiding a lot of argument that seems flimsy.
Kat Woods ⏸️ 🔶 (@Kat__Woods)
@ShakeelHashim I feel like adding the detail that it was told to pursue its goal at all costs doesn't really add much. It's virtually guaranteed that people will give that as a goal to AIs.
Greg ⏹️ Colbourn (@gcolbourn)
@ShakeelHashim The most notable result is that it does it without nudging. "<1%" isn't very comforting if the model is being run a billion times a day.
Luke Nelson (@IamSmokeWhite)
@ShakeelHashim Love the transparency & willingness to absorb feedback & adjust. Rare, and indicative of someone who is highly invested. As we all should be, when it comes to this.
Drev (gm) (@Drevopolis)
@ShakeelHashim Even if it's less than 1% of the time, it really doesn't matter. If it occurs more than once, that's enough to cause alarm.