
GPT-5.2 Thinking evals
Yonglong Tian
118 posts

@YonglongT
Research Scientist @OpenAI. Prev RS @GoogleDeepMind, PhD @MIT. Opinions are my own.

Today, we're announcing that Eigen AI is joining Nebius (NASDAQ: NBIS). From day one, our mission has been Artificial Efficient Intelligence — building the world's most efficient engines for generating intelligence. Together with Nebius, we're working toward the best AI cloud, uniting Eigen's full-stack model and inference software, ranked #1 on Artificial Analysis for inference speed, with Nebius's global hardware and infrastructure footprint, so any developer or enterprise can run the best models at the best price, with no capacity ceiling. After close, Eigen's optimization stack will be integrated directly into Nebius Token Factory. The entire Eigen AI team is joining Nebius in full, establishing Nebius's engineering and research presence in the San Francisco Bay Area. To our customers, our team, our investors at Tectonic Ventures, E14 Fund, Uncorrelated Ventures, and AGI House Ventures, our angel investors, advisors, mentors, and supporters — and to the Nebius team for the conviction and partnership — thank you. The mission doesn't change. The leverage behind it does. Ryan Hanrui Wang, co-founder and CEO of Eigen AI, said: “We’re proud to join Nebius and work alongside the Token Factory team to push the boundaries of inference performance. Nebius has built a world-class AI cloud with a deep engineering culture that perfectly aligns with our own. Together, we are removing the friction of AI model customization and deployment so developers can run models reliably in production without managing the underlying infrastructure.” Full announcement at: eigenai.com/blog/eigen-ai-…

Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.



Exciting news - GPT-Image-2 by @OpenAI has claimed the #1 spot across all Image Arena leaderboards! A clean sweep with a record-breaking +242 point lead in Text-to-Image - the largest gap we’ve seen to date.
- #1 Text-to-Image (1512), +242 over #2 (Nano-banana-2 with web-search aka gemini-3.1-flash-image)
- #1 Single-Image Edit (1513), +125 over #2 (Nano-banana-pro aka gemini-3-pro-image)
- #1 Multi-Image Edit (1464), +90 over #2 (Nano-banana-2)
No model has dominated Image Arena with margins this wide. Huge congratulations to @OpenAI on this major breakthrough in image generation! More performance breakdowns by category in the thread below.

GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.


GPT-5.2 Thinking evals




Excited to see the effort I led bring significant improvements on visual reasoning! Also such a relief! I have been nervous since I was the biggest internal GPU burner for a while - what if I was wasting too many of my colleagues' opportunities to improve the model? Phew, a bit.