
OpenAI's new model tried to avoid being shut down.
Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.

English
Post





If you train a language model to "know" (in its weights) that AIs can be malicious and have self-preservations, and then input that the model is an AI, you would reasonably expect the predicted tokens to reflect that knowledge in the predicted tokens.











