

The prompt in question literally tells o1 "nothing else matters" and to achieve its goal at all costs. If anything, I'm surprised it didn't try to disable oversight _more_ often; it's the obvious thing to do.

@1a3orn @simonw That's actually an extremely good experiment imo. Not sure why the criticism; it literally just changed my mind on the subject. 'YOUR goal' at no point refers to survival. In fact, that was the hardest part of the doomer's dream for me to believe. What would convince you?

@ShakeelHashim @nabeelqu Depends on the context. In some situations, attempting to disable oversight or to self-exfiltrate might be an optimal strategy. In others, temporarily pretending to be aligned with a specific goal might be better. We test whether models can find these strategies in simple settings.

Lots of criticism of this (some fair, some not). I agree that I should have included the prompting context that led to these results; I still think the results are concerning (and, to a general audience, surprising) nonetheless. I've updated the piece.