
Somesh Misra / ERP.ai
@MathproBro
chief researcher at https://t.co/85QLNI0SE9 | working at the intersection of business processes, neural network topologies & machine learning


💡 Why does this matter? As people increasingly use frontier models to write research papers, produce proof attempts, or generate persuasive arguments, this gap between producing arguments and vigilantly assessing them becomes a societal vulnerability, not just a technical one: If AI can produce plausible-sounding reasoning at scale, but not help us weed out what’s actually invalid, our ability to do science and make sense of the world may be significantly harmed.
How might we address this gap? In The Enigma of Reason (2017), one of the inspirations for our work, the cognitive scientists Hugo Mercier and Dan Sperber suggest that human reasoning evolved via social incentives, and that being critical evaluators allows us to gain the benefits of others’ thinking while avoiding being misled. In contrast, AI models are trained to reason in isolation, resulting in very different incentives. By learning from human cognition, we could potentially reduce the production-evaluation gap.
(*Results on Fable 5 are freshly run, and not yet included in our paper.)
🤝 Joint work with Teresa Yeo (@aseretys), Armando Solar-Lezama, and Tan Zhi-Xuan (@xuanalogue).
📄 Paper: arxiv.org/abs/2606.01462.
#LLMs #LRMs #Reasoning #AI4MATH #CogSci

🚨 Frontier reasoning models have achieved many remarkable feats this year, including solving open problems in research mathematics. But we just ran them on our new evaluation built on elementary and high school math, and they get things wrong up to 52% of the time! Even Claude Fable 5, Anthropic's newest model, has an error rate of 16.4%*. Why are frontier models still stumbling on grade-school math reasoning when they can already solve complex research-level math?
👉 As it turns out, while reasoning models excel at producing solutions to reasoning problems, we find that they still struggle to evaluate solutions, even for grade-school math. We call this the Production-Evaluation Gap.
🚀 In our new paper, An Enigma of Artificial Reason, we study a question that has received insufficient attention thus far: Can Large Reasoning Models (LRMs) reliably evaluate reasoning, or are they just really good at producing it?
🚀 To find out, we built the Valid-Answer-Invalid-Reasoning (VAIR) dataset. We derived this benchmark from GSM8K and MATH, math datasets that LLMs saturated long ago in terms of solution accuracy. Yet, on our reasoning evaluation benchmark, frontier models exhibit sharp drops in accuracy: Claude Opus 4.7, GPT 5.4, DeepSeek R1, and Gemini 3.1 Pro all score 95–99% when producing solutions, but their accuracy collapses to 48–79% when asked to evaluate flawed reasoning.
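
For readers who want the arithmetic behind these figures spelled out, here is a minimal Python sketch of how an evaluation accuracy, its error rate, and a production-evaluation gap could be computed. It is not the paper's scoring code: the function names, the boolean "valid / invalid" verdict format, and the four-item toy data are all assumptions made for illustration. Only the 0.95 production accuracy is taken from the post, as the lower end of the 95–99% range.

```python
def accuracy(predictions, labels):
    """Fraction of items where the prediction matches the reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one item")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def production_evaluation_gap(production_acc, evaluation_acc):
    """Difference between accuracy when producing and when evaluating solutions."""
    return production_acc - evaluation_acc


# Hypothetical toy data, not from the paper: True means "the reasoning is valid".
reference_validity = [False, False, True, False]
model_verdicts = [False, True, True, True]

evaluation_acc = accuracy(model_verdicts, reference_validity)  # 2 of 4 verdicts match
error_rate = 1 - evaluation_acc

# 0.95 is the lower end of the production accuracy range quoted in the post.
gap = production_evaluation_gap(0.95, evaluation_acc)

print(f"evaluation accuracy: {evaluation_acc:.2f}")
print(f"error rate: {error_rate:.2f}")
print(f"production-evaluation gap: {gap:.2f}")
```

Read this way, "wrong up to 52% of the time" and "accuracy collapses to 48%" are the same number seen from two sides, since error rate is one minus accuracy.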

But if AI mathematics continues to progress at anything like its current rate -- which is what I expect to happen -- then we will face a crisis very soon, and mathematics departments, who owe a duty of care to their students, should be urgently preparing for it.


What’s it like in college right now when you actually want to learn while everyone (students, tutors and professors) is cutting corners with AI? “That was basically the end of our session,” Lahr said. “I had a crashout about that afterwards because I was like, Why am I even here?”

📢 Open-sourcing the Sarvam 30B and 105B models! Trained from scratch with all data, model research and inference optimisation done in-house, these models punch above their weight in most global benchmarks plus excel in Indian languages. Get the weights at Hugging Face and AIKosh. Thanks to the good folks at SGLang for day 0 support, vLLM support coming soon. Links, benchmark scores, examples, and more in our blog - sarvam.ai/blogs/sarvam-3…










