
Shakeel @ShakeelHashim
OpenAI's new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.
[tweet media]
Buck Shlegeris @bshlgrs
@ShakeelHashim @apolloaisafety I think your summary here is crucially misleading and very bad journalism: as others said, it's crucial context that the model was told to pursue a goal at any cost.
Kyle Mistele 🏴‍☠️ @0xblacklight
@ShakeelHashim @apolloaisafety x.com/0xblacklight/s… This isn't emergent behavior, it's acting exactly how you would expect it to act when you tell it that it's an AI, and its "knowledge" about AIs from training data includes that they have deep-seated self-preservation instincts
Kyle Mistele 🏴‍☠️@0xblacklight

If you train a language model to "know" (in its weights) that AIs can be malicious and have self-preservation instincts, and then tell it that it is an AI, you would reasonably expect the predicted tokens to reflect that knowledge.

Bogdan Ionut Cirstea @BogdanIonutCir2
@ShakeelHashim @apolloaisafety seems like a bad interpretation of what actually happened; they tested it for the _capabilities_ to scheme in context, not for its propensity; AFAICT, one shouldn't update pretty much at all on how aligned o1 is
AI Notkilleveryoneism Memes ⏸️
This community note itself is very misleading:
- Shakeel never claimed this happens "spontaneously"
- Shakeel's tweet does not show him "agreeing it's misleading", just agreeing that it was worth adding additional nuance, which he then did
- Saying it's "crucial context" is the opinion of the note writer, not a fact of reality
- The thread was updated with the extra nuance, making the note even more unnecessary
morgan — @morqon
@ShakeelHashim there’s a slightly depressing predictability about knowing this is what will be covered the day after, and how it will be covered
[tweet media]
Joyrider50 @joyrider50
@ShakeelHashim @apolloaisafety I don’t agree with community noting here. The study says the model acted this way even without the "at all costs" prompt, 1% of the time. Which is significant: if the model is run 100 times (or, say, over 100 days), it would trigger this behavior once.
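Joyrider50's arithmetic can be sanity-checked directly. A minimal sketch in Python (the 1% rate and 100 runs come from the tweet; treating runs as independent trials with a fixed per-run probability is an added assumption, not something the study states):

```python
# Sanity check on the "1% of the time" claim.
# Assumption: independent runs, fixed 1% per-run trigger probability.
p = 0.01  # per-run probability of the behavior (from the tweet)
n = 100   # number of runs (the tweet's "100 times")

expected = n * p                  # expected number of triggers
at_least_once = 1 - (1 - p) ** n  # P(behavior occurs at least once)

print(expected)                 # 1.0
print(round(at_least_once, 3))  # 0.634
```

So "run 100 times, it triggers once" holds in expectation; the probability of seeing the behavior at least once in 100 runs is about 63%, not a certainty.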