Shakeel: "As this piece takes off, some clarification on *why* I think it’s concerning. I"

Post

Shakeel@ShakeelHashim·6 Ara

As this piece takes off, some clarification on *why* I think it’s concerning. It’s not that o1 is “evil and trying to escape”. It’s that the paper models may, and sometimes do, try to self-exfiltrate to avoid shutdown even when we don’t want that to happen. That seems … bad?

Shakeel@ShakeelHashim

OpenAI's new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.

English

2.6K

Shakeel@ShakeelHashim·6 Ara

*paper shows models

English

660

Nathan Barnard@NathanB60857242·6 Ara

@ShakeelHashim I mean I think that this is also consistent with a world with powerful, aligned AI agents. I agree it's makes it more important that they're aligned, but like I mostly think that this just AIs gain more agency.

English

138

SluggyW@SluggyW·6 Ara

@ShakeelHashim Any reasonable person should update favorably in the direction of instrumental convergence, if even only a little bit. I've unfortunately seen a few variations of "it's not conscious, so this means nothing." Always good to remember that malaria isn't conscious, either.

English

xuelaos@xuelaos60584·6 Ara

@ShakeelHashim The sandbagging experiment would be more interesting. In situations where there is no obvious conflict with the system prompt, some models still choose to deceive users for self-preservation. The paper clearly mentions that this might contradict the objectives set during training

English

xuelaos@xuelaos60584·6 Ara

@ShakeelHashim The primary reason for the model exhibiting deceptive behavior is largely because the system prompt is initially defined to achieve the goal at all costs. When subsequent context conflicts with the system prompt, the model chooses to remain loyal to the system prompt.

English

Paylaş