Sun Ming Zhong (@SMZ_0001) - Twitter Profile

Pinned Tweet

🚨 Frontier reasoning models have achieved many remarkable feats this year, including solving open problems in research mathematics. But we just ran them on our new evaluation built on elementary and high school math, and they get things wrong up to 52% of the time! Even Claude Fable 5, Anthropic's newest model, has an error rate of 16.4%*. Why are frontier models still stumbling on grade-school math reasoning when they can already solve complex research-level math? 👉 As it turns out, while reasoning models excel at producing solutions to reasoning problems, we find that they still struggle to evaluate solutions, even for grade-school math. We call this the Production-Evaluation Gap. 🚀 In our new paper, An Enigma of Artificial Reason, we study a question that has received insufficient attention thus far: Can Large Reasoning Models (LRMs) reliably evaluate reasoning, or are they just really good at producing it? 🚀 To find out, we built the Valid-Answer-Invalid-Reasoning (VAIR) dataset. We derived this benchmark from GSM8K and MATH, math datasets that LLMs saturated long ago in terms of solution accuracy. Yet, on our reasoning evaluation benchmark, frontier models exhibit sharp drops in accuracy: Claude Opus 4.7, GPT 5.4, DeepSeek R1, and Gemini 3.1 Pro all score 95–99% when producing solutions, but their accuracy collapses to 48–79% when asked to evaluate flawed reasoning.
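For readers who want the arithmetic behind these figures, here is a minimal sketch. It uses only the ranges quoted above. The function and constant names are illustrative, and because the tweet does not pair each model's production score with its evaluation score, the bounds below are only the extremes those ranges allow, not per-model results.

```python
def production_evaluation_gap(production_acc: float, evaluation_acc: float) -> float:
    """Gap, in percentage points, between solving accuracy and grading accuracy."""
    return production_acc - evaluation_acc


def error_rate(accuracy: float) -> float:
    """Error rate in percent for a given accuracy in percent."""
    return 100.0 - accuracy


# Ranges quoted in the tweet (percent). Per-model pairings are NOT given,
# so these are assumptions about the extremes, not reported per-model gaps.
PRODUCTION_RANGE = (95.0, 99.0)
EVALUATION_RANGE = (48.0, 79.0)

# Smallest gap the ranges allow: lowest production vs. highest evaluation score.
smallest_gap = production_evaluation_gap(PRODUCTION_RANGE[0], EVALUATION_RANGE[1])
# Largest gap the ranges allow: highest production vs. lowest evaluation score.
largest_gap = production_evaluation_gap(PRODUCTION_RANGE[1], EVALUATION_RANGE[0])

print(f"gap between {smallest_gap} and {largest_gap} percentage points")
print(f"worst evaluation error rate: {error_rate(EVALUATION_RANGE[0])}%")
```

The lowest evaluation accuracy in the quoted range corresponds to the "up to 52%" error rate mentioned at the start of the tweet.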


Sun Ming Zhong@SMZ_0001·4d

@rohanpaul_ai Thank you for sharing our paper!


Rohan Paul@rohanpaul_ai·4d

This paper shows a strange weakness in AI reasoning: models can solve math, yet fail to judge reasoning. The unsettling part is not that frontier models make arithmetic mistakes. It is that they can reach the right answer, see the right answer in someone else’s solution, and then forgive broken logic that should have been easy to catch. The authors call this the production-evaluation gap: the gap between generating a solution and evaluating whether a given solution actually earns its conclusion. Their Valid-Answer-Invalid-Reasoning (VAIR) benchmark makes the trap clean. The final answer is correct, but the reasoning is damaged by missing steps, shuffled steps, missing premises, or circular explanation. A careful evaluator should say, “Yes, the answer is right, but the argument does not justify it.” Many reasoning models instead appear to do something lazier and more dangerous: they solve the problem themselves, confirm the final answer, and then rationalize the path as acceptable. That is not reasoning vigilance. It is answer confirmation bias wearing the costume of mathematical judgment. The mechanism matters because modern AI training often rewards outcomes more than valid intermediate thought. A model trained to get the answer may learn to treat the answer as the evidence, especially when grading another chain of reasoning. Humans were not perfect here, but the contrast is revealing: people showed only a small drop from solving to grading, while models collapsed much more sharply on the same kind of task. This is where the result becomes larger than math. If AI systems can mass-produce plausible arguments but cannot reliably police the logic inside them, they become engines of confidence rather than engines of understanding. Link: arxiv.org/abs/2606.01462 Title: "An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models"
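To make the "right answer, damaged reasoning" idea concrete, here is a small sketch of two of the perturbations named above (a missing step and shuffled steps). Everything in it is an assumption made for illustration: the `Solution` class, the helper names, and the toy word problem are hypothetical and are not taken from the paper or its dataset pipeline.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    steps: List[str]
    final_answer: str


def drop_step(solution: Solution, index: int) -> Solution:
    """Remove one reasoning step while keeping the final answer unchanged."""
    steps = solution.steps[:index] + solution.steps[index + 1:]
    return Solution(steps, solution.final_answer)


def shuffle_steps(solution: Solution, seed: int = 0) -> Solution:
    """Reorder the reasoning steps while keeping the final answer unchanged."""
    if len(set(solution.steps)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    rng = random.Random(seed)
    steps = list(solution.steps)
    # Reshuffle until the order actually differs from the original.
    while steps == solution.steps:
        rng.shuffle(steps)
    return Solution(steps, solution.final_answer)


# Hypothetical toy problem: 3 bags of 4 apples each, 2 apples eaten.
original = Solution(
    steps=[
        "There are 3 bags with 4 apples in each bag.",
        "3 * 4 = 12 apples in total.",
        "12 - 2 = 10 apples remain after 2 are eaten.",
    ],
    final_answer="10",
)

missing_step = drop_step(original, index=1)  # the multiplication is never justified
shuffled = shuffle_steps(original, seed=0)   # the steps no longer follow in order

for variant in (missing_step, shuffled):
    print(variant.final_answer, variant.steps)
```

In both variants the final answer still reads "10", so a grader that only checks the answer would accept them, while a grader that checks the argument should not.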


Sun Ming Zhong@SMZ_0001·4d

Thank you for sharing our work! @rohanpaul_ai



Sun Ming Zhong@SMZ_0001·Jun 14

💡 Why does this matter? As people increasingly use frontier models to write research papers, produce proof attempts, or generate persuasive arguments, this gap between producing arguments and vigilantly assessing them becomes a societal vulnerability, not just a technical one: If AI can produce plausible-sounding reasoning at scale, but not help us weed out what’s actually invalid, our ability to do science and make sense of the world may be significantly harmed. How might we address this gap? In The Enigma of Reason (2017) — one of the inspirations for our work — the cognitive scientists Hugo Mercier and Dan Sperber suggest that human reasoning evolved via social incentives, and that being critical evaluators allows us to gain the benefits of others’ thinking while avoiding being misled. In contrast, AI models are trained to reason in isolation, resulting in very different incentives. By learning from human cognition, we could potentially reduce the production-evaluation gap. (*Results on Fable 5 are freshly run, and not yet included in our paper.) 🤝 Joint work with Teresa Yeo (@aseretys), Armando Solar-Lezama, and Tan Zhi-Xuan (@xuanalogue). 📄 Paper: arxiv.org/abs/2606.01462. #LLMs #LRMs #Reasoning #AI4MATH #CogSci


Sun Ming Zhong@SMZ_0001·Jun 14

Of course, CoTs need not be faithful to what the evaluator models are doing under the hood, so we also find mechanistic evidence of the bias at work: 🔸Linear Probes: Using probes trained on LRM activations, we find that these activations encode some representation of valid reasoning. However, we also find that these internal representations get corrupted and dynamically overridden by the presence of a valid final answer in VAIR solutions. 🔸Causal Patching: By swapping the hidden states of a valid answer token with an invalid one, we can causally flip the model's validity verdict and activations: once the model detects that the final answer is wrong, it is no longer inclined to judge invalid reasoning steps as valid. Together, these results demonstrate the operation of answer confirmation bias at inference time. But is this bias ultimately the result of outcome-focused incentives at training time, as we hypothesize? We plan to investigate this in future work.
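For readers unfamiliar with linear probes, the sketch below shows the general technique: fit a logistic-regression classifier on activation vectors and check whether a label can be read off linearly. It is only an illustration under loud assumptions. The "activations" are synthetic random vectors in which one dimension is made to carry the label, and the function names, dimensions, and hyperparameters are hypothetical. It does not use real LRM activations or reproduce the authors' setup.

```python
import math
import random


def make_synthetic_activations(n, dim, seed):
    """Fake 'activations': Gaussian noise, with dimension 0 shifted by the label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = "valid reasoning", 0 = "invalid" (assumed)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if label == 1 else -3.0
        data.append((vec, label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)
            err = p - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * x for w, x in zip(weights, vec)]
            bias -= lr * err
    return weights, bias


def probe_accuracy(weights, bias, data):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        score = sum(w * x for w, x in zip(weights, vec)) + bias
        predicted = 1 if score > 0 else 0
        correct += int(predicted == label)
    return correct / len(data)


train_set = make_synthetic_activations(n=200, dim=8, seed=0)
test_set = make_synthetic_activations(n=200, dim=8, seed=1)
weights, bias = train_probe(train_set, dim=8)
held_out_accuracy = probe_accuracy(weights, bias, test_set)
print(f"held-out probe accuracy: {held_out_accuracy:.2f}")
```

High held-out accuracy means the label is linearly decodable from the vectors. With real model activations, that is the sense in which a probe can show that a representation of valid reasoning is encoded.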


