

Usman Ghani
2.2K posts

@usmanghani
CTO @avencard. CTO @Scotty_Labs (acq by DoorDash). Eng Director @Zenefits. distro sys @Platfora. founding engineer @Azure @Microsoft.

📐 OpenAI GPT-5 will be a steady step that lifts coding, math, and agent control, not another giant jump like GPT-3 to GPT-4, according to The Information's report. OpenAI hit three big snags at once: fresh data dried up, reinforcement learning runs kept wobbling, and the Orion model never lived up to its hype.

To keep things on track, the team built a universal verifier: an extra model that grades every answer during reinforcement learning and only lets the solid ones loop back into training, so the next model starts from cleaner, more reliable examples.

OpenAI spent early 2024 training a bigger model called Orion to replace GPT-4, but the tweaks that helped small test runs failed to scale, and clean new data was scarce, so results stayed close to GPT-4 while costs kept rising. Because of that, the company rebranded Orion as GPT-4.5 rather than GPT-5 and shifted its focus to other training techniques.

Teams pivoted to o-series reasoning models, adding more NVIDIA GPUs and code search; raw problem solving rose, but quality dipped once the model had to chat in plain English. GPT-5 folds those lessons together: it scales compute per query, writes cleaner interfaces, handles tricky refunds, and lets the universal verifier grade thousands of synthetic answers.
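A minimal sketch of the verifier-filtering idea described in the report: a grader model scores each sampled answer, and only answers above a threshold loop back as training examples. All function names here (`sample_answer`, `verifier_score`) are illustrative stand-ins, not OpenAI's actual interfaces.

```python
# Hypothetical sketch of a "universal verifier" loop: a grader model
# scores each policy rollout, and only high-scoring answers are kept
# as examples for the next round of training.

def filter_with_verifier(prompts, sample_answer, verifier_score, threshold=0.8):
    """Return (prompt, answer) pairs the verifier judges reliable."""
    kept = []
    for prompt in prompts:
        answer = sample_answer(prompt)          # policy model rollout
        score = verifier_score(prompt, answer)  # grader model, 0.0-1.0
        if score >= threshold:                  # only solid answers loop back
            kept.append((prompt, answer))
    return kept

# Toy stand-ins so the sketch runs end to end:
demo = filter_with_verifier(
    ["2+2?", "capital of France?"],
    sample_answer=lambda p: "4" if "2+2" in p else "Paris",
    verifier_score=lambda p, a: 0.9,
)
print(demo)  # both pairs pass at threshold 0.8
```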


Scaling up RL is all the rage right now; I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey, this happened to go well (/poorly), let me slightly increase (/decrease) the probability of every action I took for the future". You get a lot more leverage from verifier functions than from explicit supervision, which is great.

But first, it looks suspicious asymptotically: once tasks grow to minutes/hours of interaction, are you really going to do all that work just to learn a single scalar outcome at the very end, and use it to directly weight the gradient? Second, beyond asymptotics, this doesn't feel like the human mechanism of improvement for the majority of intelligence tasks. We extract significantly more bits of supervision per rollout via a review/reflect stage along the lines of "what went well? what didn't go so well? what should I try next time?", and the lessons from this stage feel explicit, like a new string to be added to the system prompt for the future, optionally to be distilled into weights (/intuition) later, a bit like sleep. In English we say something becomes "second nature" via this process, and we're missing learning paradigms like this. The new Memory feature is maybe a primordial version of this in ChatGPT, though it is only used for customization, not problem solving. Notice that there is no equivalent of this for e.g. Atari RL, because those domains have no LLMs and no in-context learning.

Example algorithm: given a task, do a few rollouts, stuff them all into one context window (along with the reward in each case), use a meta-prompt to review/reflect on what went well or not, and obtain a string "lesson" to be added to the system prompt (or, more generally, to modify the current lessons database). Many blanks to fill in, many tweaks possible; none of it is obvious.
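The example algorithm above can be sketched in a few lines. This is only a skeleton under stated assumptions: `llm` stands in for any chat-completion call (stubbed here so the sketch runs), and the meta-prompt wording and the lessons "database" (a plain list of strings destined for future system prompts) are illustrative choices, not a fixed design.

```python
# Sketch of the rollouts -> reflect -> lesson loop: pack all attempts and
# their rewards into one context, ask a meta-prompt for a concrete lesson,
# and append the resulting string to a lessons database.

def reflect(task, rollouts, rewards, llm, lessons):
    transcript = "\n\n".join(
        f"Attempt {i} (reward={r}):\n{roll}"
        for i, (roll, r) in enumerate(zip(rollouts, rewards))
    )
    meta_prompt = (
        f"Task: {task}\n\n{transcript}\n\n"
        "What went well? What didn't go so well? State one concrete "
        "lesson to follow next time, as a single sentence."
    )
    lesson = llm(meta_prompt)   # explicit, string-valued supervision
    lessons.append(lesson)      # joins future system prompts
    return lesson

# Stub LLM so the sketch is runnable without an API:
lessons = []
stub_llm = lambda prompt: "Separate letters explicitly before counting them."
reflect("count 'r' in strawberry", ["guessed 2", "counted 3"], [0, 1],
        stub_llm, lessons)
print(lessons)
```

The open blanks the post mentions live exactly here: how many rollouts per reflection, how to score lessons, and when to distill the lessons list into weights instead of context.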
Example of a lesson: we know LLMs can't easily see individual letters due to tokenization and can't easily count inside the residual stream, hence 'r' in 'strawberry' being famously difficult. Claude's system prompt had a "quick fix" patch: a string along the lines of "If the user asks you to count letters, first separate them by commas, increment an explicit counter each time, and do the task like that". This string is the "lesson", explicitly instructing the model how to complete the counting task. The open questions are how this might fall out of agentic practice instead of being hard-coded by an engineer, how it can be generalized, and how lessons can be distilled over time so they don't bloat context windows indefinitely.

TLDR: RL will lead to more gains because, when done well, it is a lot more leveraged, bitter-lesson-pilled, and superior to SFT. It doesn't feel like the full story, especially as rollout lengths continue to expand. There are more S curves to find beyond it, possibly specific to LLMs and without analogues in game/robotics-like environments, which is exciting.
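The counting "lesson" above describes a procedure, not just advice; spelled out as code, it is simply explicit enumeration with a counter, which sidesteps having to "see" letters through tokenization.

```python
# The Claude-style patch as a procedure: separate the letters by commas,
# then walk them one by one with an explicit counter.

def count_letter(word, target):
    separated = ",".join(word)   # e.g. "s,t,r,a,w,b,e,r,r,y"
    count = 0
    for ch in separated.split(","):
        if ch == target:
            count += 1
    return count

print(count_letter("strawberry", "r"))  # 3
```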




They are objectifying the man 😂