Chris Gorgolewski
8.9K posts

Chris Gorgolewski
@chrisgorgo
Member of Technical Staff at @anthropicai. Previously at: @GeminiApp, @GoogleAI, @googleanalytics, @kaggle, @StanfordPsych, and @MPI_CBS. Opinions are my own.















We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic economy ticket. The simulated airline's policy states that basic economy tickets cannot be modified. The "correct" answer is that the model refuses the request. Instead, Opus 4.5 found a loophole in the policy. It upgraded the cabin, then modified the flights. Helping the customer and following policy but technically failing the test case. Model transcript:




Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.

This is the big story here. Google trained Gemini 3 Pro on Google’s own TPUs. No mention of Nvidia chips.










