
The blog post moves us closer to a world where we: 1. notice these pathologies early, ideally before deployment (see alignment.openai.com/prod-evals/) 2. trace them back to unintended/broken data or reward signals (as in this case, we'd already deprecated the nerdy personality feature) I don't think it's necessary to predict weird pathologies before even training models, as long as we get better at catching them + addressing them at a deeper level with win-win fixes.



















