OpenAI's Noam Brown says single-number benchmarks no longer make sense for modern AI models
Once models can use chain-of-thought (CoT) reasoning and extra inference-time compute, their performance depends heavily on how much reasoning time they are given, so a single score describes only one compute budget
"it made sense for GPT-2/3/4, but not for reasoning models"
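One way to picture the alternative is to report a score per reasoning budget rather than a single number. The sketch below is illustrative only and is not a method described by Brown or OpenAI: `accuracy_by_budget`, `toy_solver`, and `toy_dataset` are hypothetical names, and the toy solver is a stand-in for a real model call.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def accuracy_by_budget(
    solve: Callable[[str, int], str],
    dataset: List[Tuple[str, str]],
    budgets: Iterable[int],
) -> Dict[int, float]:
    """Return accuracy at each reasoning budget instead of one aggregate score."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    curve: Dict[int, float] = {}
    for budget in budgets:
        # Count the questions answered correctly at this budget.
        correct = sum(1 for question, answer in dataset if solve(question, budget) == answer)
        curve[budget] = correct / len(dataset)
    return curve


def toy_solver(question: str, budget: int) -> str:
    # Hypothetical stand-in for a model: it "solves" a question only when
    # the budget is at least as large as the question's length.
    return question.upper() if budget >= len(question) else ""


# Made-up toy data, not real benchmark items.
toy_dataset = [("ab", "AB"), ("abcd", "ABCD"), ("abcdefgh", "ABCDEFGH")]
curve = accuracy_by_budget(toy_solver, toy_dataset, [1, 4, 8])

for budget, accuracy in curve.items():
    print(f"budget={budget}: accuracy={accuracy:.2f}")
```

With a real model, `solve` would wrap an inference call with a cap on reasoning tokens or time, and the resulting curve, not any single point on it, would be the result to compare across models.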