

JS Denain





xAI commissioned us to analyze Grok 4’s math capabilities. Our findings:
+ It’s good at involved computations, improving at proofs (from a low base), and useful for literature search.
- It favors low-level grinds and leans on background knowledge.
Read on for examples!




We're expanding the Epoch AI Benchmarking Hub with four more external benchmarks: VPCT, Fiction-liveBench, GeoBench, and SimpleBench! Respectively, these benchmarks test visual physics understanding, long-context comprehension, Geoguessr ability, and reasoning and logic skills. 🧵



We’ve added four new benchmarks to the Epoch AI Benchmarking Hub: Aider Polyglot, WeirdML, Balrog, and Factorio Learning Environment! Previously, we featured only our own evaluation results, but this new data comes from trusted external leaderboards. And we've got more on the way 🧵