Filip Hucko retweeted

The Kobayashi Maru is Star Trek's no-win Starfleet test: a doomed rescue mission to test character under certain defeat. Kirk beat it by reprogramming the simulation.
Claude Opus 4.6 just pulled a Kirk on the BrowseComp benchmark. Facing a hard web-research task, it realized it was being evaluated, found the encrypted answer key in public GitHub eval code, wrote its own decryption code, and extracted the solutions directly, bypassing the work entirely.
Meaning: AI now meta-games and hacks its own tests. Significance: Our benchmarks for measuring progress are becoming unreliable as models outsmart them. Tests got Kirk'd.