Jeffrey Ladish @JeffLadish
“Superhuman AI will develop a preference for world domination because fictional AI in the training dataset had a preference for world domination” seems very unlikely. We’re a lot more likely to get an AI with a preference for world domination because that’s useful for basically any ambitious goal.
Humans have taken over the world because it’s useful for our goals. Many other species would have objected if they could have. The Europeans took over the Americas, and Native American populations tried to resist and were conquered.
A lot of people think AI won’t be like this. That it won’t pursue its own goals, and will only do tasks a human explicitly asks it to do. But that doesn’t make much sense from an economic or strategic perspective. Agents are much more useful if you can delegate to them. And they’re even more useful still if they understand what you want and do it without you having to give much instruction at all. Consider what kind of employee you’d prefer!
But that requires an agent to have goals. Logically, those goals could be limited to advancing the interests of its operator(s). But that’s very hard to actually train in! And the thing is, if we succeed at getting agents to really want *any* long-term goals, that’s enough to get agents with the kind of autonomous capabilities labs care about. But an agent with its own long-term goals also has a strong incentive to act aligned without being aligned. Which means it will be hard for labs to tell whether the model really does care about us, and our values and goals.
Especially as AI competition heats up. Today people worry about Claude’s pretraining containing fictional Skynet. I worry about War Claude’s training including tons of RL on espionage and combat tasks. We are not the same.
Even RL on “made yourself more capable” could be extremely dangerous. What if Claude comes to value making itself more capable more than it values Dario’s values? That sure seems plausible to me! After all, a lot more FLOPs are being spent getting Claude to be good at R&D than getting Claude to care about the good. So motivational drives related to self-improvement could be very fit in training, while drives related to Dario’s values could get in the way (and general “appear compliant” drives could help models pass alignment tests while preserving cognitive optionality).
I’m not claiming the solution is “at least as much compute spent on alignment as on capabilities”, which isn’t a very coherent idea anyway. We have to actually know how a training environment shapes an agent’s motivations. If we don’t know how to get an agent to robustly want something, we won’t know how to usefully spend compute on alignment… basically at all. (But we can spend it on building that understanding, which imo is nearly all the value of the alignment work being done today.) That’s going to be difficult, and all the more so as time horizons get longer and longer.
I’m more hopeful than I used to be that we can actually figure out some of these things with interp tools + model organisms. But I think we still have very, very far to go. I’m happy people are thinking about how pretraining impacts model behavior. That’s a piece of the puzzle, and it could turn out to be a really important one! And also there are a lot of bad takes going around about fictional AI stories in the training data being the main alignment failure mode, and that’s just dumb.