
Max Kaufmann




Everyone is saying GPT-5.4 Pro is the smartest model, AGI-level intelligence, but do you have AGI-level questions to ask?

OpenAI’s new GPT-5.4 (xhigh) lands equal first in the Artificial Analysis Intelligence Index alongside Gemini 3.1 Pro, but at a higher cost than GPT-5.2.

@OpenAI's GPT-5.2 (xhigh, 51) was the most intelligent model as of the end of 2025. Since then, OpenAI has released two GPT-5.3 variants: GPT-5.3 Codex, a coding-focused reasoning model, and GPT-5.3 Instant, a ChatGPT-only model without thinking capabilities. GPT-5.4 is the first general reasoning model release from OpenAI since GPT-5.2. It comes with slightly higher per-token pricing ($2.50/$15 vs $1.75/$14 per 1M input/output tokens for GPT-5.2) and a significantly expanded context window of 1.05M tokens, up from 400K for GPT-5.2.

GPT-5.4 supports five reasoning effort modes (none, low, medium, high, and xhigh); all key takeaways below are based on our evaluation at xhigh, the highest reasoning effort. GPT-5.4 Pro is a separate system that we are currently evaluating on frontier reasoning tasks (CritPt) and will share results when available.

Key benchmarking takeaways for the xhigh variant:

➤ Equal first in intelligence: GPT-5.4 (xhigh) returns OpenAI to the top of the Artificial Analysis Intelligence Index, matching Gemini 3.1 Pro Preview (57). GPT-5.4 scores 57, a +6-point jump from GPT-5.2 (xhigh, 51).

➤ Leading in scientific reasoning and agentic coding: GPT-5.4 shows particular strength in frontier scientific reasoning and agentic coding, leading all models we have tested in both categories. On CritPt (Research-level Physics), GPT-5.4 scores 20%, ahead of Gemini 3.1 Pro Preview (18%) and GPT-5.3 Codex (xhigh, 17%). On TerminalBench Hard (Agentic Coding & Terminal Use), it scores 58%, ahead of Gemini 3.1 Pro Preview (54%) and GPT-5.3 Codex (xhigh, 53%).

➤ Greater knowledge, but more hallucinations: GPT-5.4 improves factual accuracy on AA-Omniscience over GPT-5.2, but a higher attempt rate drives a worse hallucination rate. The AA-Omniscience Index rises from -1 (GPT-5.2, xhigh) to +6, with accuracy improving from 44% to 50%. However, GPT-5.4 attempts 97% of questions vs 91% for GPT-5.2 (xhigh), pushing the hallucination rate from 80% to 89%.

➤ Best GDPval-AA result: GPT-5.4 achieves the highest GDPval-AA ELO of any model we have tested, representing a significant jump in general agentic capabilities over GPT-5.2. GPT-5.4 scores 1,667 on GDPval-AA, up from 1,462 for GPT-5.2 (xhigh), a +205 point gain. Statistically, however, this places GPT-5.4 within the 95% confidence interval of Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 1,633), so we cannot conclude that the two models differ on agentic real-world tasks.

➤ More expensive despite modest token efficiency gains: GPT-5.4 is slightly more token efficient than GPT-5.2 (xhigh), but notably less so than GPT-5.3 Codex (xhigh), and higher per-token pricing means the cost to run the Intelligence Index increases ~28%. GPT-5.4 used 120M output tokens to run our Intelligence Index, vs 130M for GPT-5.2 (xhigh) and 77M for GPT-5.3 Codex (xhigh). The effective cost to run our full Intelligence Index is ~$2,951 for GPT-5.4, vs ~$2,304 for GPT-5.2 (xhigh) and ~$1,654 for GPT-5.3 Codex (xhigh).

➤ Broad benchmark gains across most evaluations: GPT-5.4 shows broad gains across evaluations vs GPT-5.2 (xhigh), with improvements in scientific reasoning, coding, tool use, and long context reasoning. We saw gains in CritPt (+8 p.p.), TerminalBench Hard (+11 p.p.), HLE (+6 p.p.), τ²-Bench (+7 p.p.), SciCode (+5 p.p.), GPQA (+2 p.p.), and LCR (+1 p.p.). The only regression is a small decline in IFBench (-2 p.p.), indicating a marginal reduction in instruction-following precision.
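The ~28% cost increase follows directly from the quoted run costs, and the output-token prices let you reconstruct part of the total. A minimal sketch of that arithmetic, using only the figures stated in the post (the total run cost also includes input tokens, which are not broken out above):

```python
# All numbers below are taken from the post, not independently measured.
PRICE_OUT_PER_M = 15.00   # GPT-5.4 output price, $ per 1M tokens
OUTPUT_TOKENS_M = 120     # output tokens GPT-5.4 used on the Intelligence Index, in millions

# Output-token share of the run cost (input-token cost is not broken out above).
output_cost = OUTPUT_TOKENS_M * PRICE_OUT_PER_M

# Effective total costs to run the full Intelligence Index, as quoted.
total_costs = {"GPT-5.4": 2951, "GPT-5.2": 2304, "GPT-5.3 Codex": 1654}

increase = (total_costs["GPT-5.4"] / total_costs["GPT-5.2"] - 1) * 100
print(f"Output-token cost alone: ${output_cost:,.0f}")  # $1,800 of the ~$2,951 total
print(f"Cost increase vs GPT-5.2: ~{increase:.0f}%")    # ~28%
```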




you know what? fuck you *rebicameralizes your mind*


When I testified before the US Congress, I predicted that hyperscale datacenters hosting critical AI services would be targeted by our adversaries during global conflicts. Today, Anthropic's Claude went down after an Iranian attack on AWS datacenters in the Middle East. 👉 My Testimony: youtu.be/bkKh1FQiO4w?si…







Tonight, we reached an agreement with the Department of War to deploy our models in their classified network. In all of our interactions, the DoW displayed a deep respect for safety and a desire to partner to achieve the best possible outcome. AI safety and wide distribution of benefits are at the core of our mission. Two of our most important safety principles are prohibitions on domestic mass surveillance and human responsibility for the use of force, including for autonomous weapon systems. The DoW agrees with these principles, reflects them in law and policy, and we put them into our agreement. We will also build technical safeguards to ensure our models behave as they should, which the DoW also wanted. We will deploy FDEs to help with our models and to ensure their safety, and we will deploy on cloud networks only. We are asking the DoW to offer these same terms to all AI companies, terms we think everyone should be willing to accept. We have expressed our strong desire to see things de-escalate away from legal and governmental actions and towards reasonable agreements. We remain committed to serving all of humanity as best we can. The world is a complicated, messy, and sometimes dangerous place.













