Somesh Misra / ERP.ai

2.3K posts

@MathproBro

chief researcher at https://t.co/85QLNI0SE9 | working at the intersection of business processes, neural network topologies & machine learning

San Francisco, CA · Joined February 2013

272 Following · 834 Followers

Somesh Misra / ERP.ai@MathproBro·3d

This is great background for building models that are trained solely for verification rather than generation. Maybe we could do some circuit tracing to check whether reasoning generation and reasoning verification are distinct capabilities, and then distill a standalone verification model? A GAN type of setup. Has anyone explored a foundation-level verifier trained primarily for verification rather than generation?

xuan (ɕɥɛn / sh-yen)@xuanalogue·6d

So perhaps by learning from how humans evolved to reason w each other, we can train models to be more epistemically vigilant like humans are, and not just solipsistic outcome-focused reasoners. Paper: arxiv.org/abs/2606.01462 x.com/SMZ_0001/statu…

Sun Ming Zhong@SMZ_0001

💡 Why does this matter? As people increasingly use frontier models to write research papers, produce proof attempts, or generate persuasive arguments, this gap between producing arguments and vigilantly assessing them becomes a societal vulnerability, not just a technical one: If AI can produce plausible-sounding reasoning at scale, but not help us weed out what’s actually invalid, our ability to do science and make sense of the world may be significantly harmed. How might we address this gap? In The Enigma of Reason (2017) — one of the inspirations for our work — the cognitive scientists Hugo Mercier and Dan Sperber suggest that human reasoning evolved via social incentives, and that being critical evaluators allows us to gain the benefits of others’ thinking while avoiding being misled. In contrast, AI models are trained to reason in isolation, resulting in very different incentives. By learning from human cognition, we could potentially reduce the production-evaluation gap. (*Results on Fable 5 are freshly run, and not yet included in our paper.) 🤝 Joint work with Teresa Yeo (@aseretys), Armando Solar-Lezama, and Tan Zhi-Xuan (@xuanalogue). 📄 Paper: arxiv.org/abs/2606.01462. #LLMs #LRMs #Reasoning #AI4MATH #CogSci

xuan (ɕɥɛn / sh-yen)@xuanalogue·6d

Excited to share the first pre-print from our lab led by @SMZ_0001! In "An Enigma of Artificial Reason", we find that reasoning-trained LMs excel at *producing* reasoning, but struggle to *evaluate* reasoning that reaches valid answers for invalid reasons, scoring as low as 48%.

Sun Ming Zhong@SMZ_0001

🚨 Frontier reasoning models have achieved many remarkable feats this year, including solving open problems in research mathematics, but we just ran them on our new evaluation built on elementary and high school math, and they get things wrong up to 52% of the time! Even Claude Fable 5 (Anthropic's newest model) has an error rate of 16.4%*. Why are frontier models still stumbling on grade-school math reasoning when they can already solve complex research-level math? 👉 As it turns out, while reasoning models excel at producing solutions to reasoning problems, we find that they still struggle to evaluate solutions, even for grade-school math. We call this the Production-Evaluation Gap. 🚀 In our new paper, An Enigma of Artificial Reason, we study a question that has received insufficient attention thus far: Can Large Reasoning Models (LRMs) reliably evaluate reasoning, or are they just really good at producing it? 🚀 To find out, we built the Valid-Answer-Invalid-Reasoning (VAIR) dataset. We derived this benchmark from GSM8K and MATH, math datasets that LLMs saturated long ago in terms of solution accuracy. Yet, on our reasoning evaluation benchmark, frontier models exhibit sharp drops in accuracy: Claude Opus 4.7, GPT 5.4, DeepSeek R1, and Gemini 3.1 Pro all score 95–99% when producing solutions, but their accuracy collapses to 48–79% when asked to evaluate flawed reasoning.

Somesh Misra / ERP.ai reposted

Timothy Nguyen@IAmTimNguyen·9 May

Mathematics as a field is going to have to reorient itself in light of powerful AI. But a slight pushback to Gowers's comment: "If LLMs are at the point where they can solve 'gentle problems', ...the lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting." Mathematics is infinite and thus inexhaustible. By having powerful AIs that can do heavy lifting, more of the burden is shifted towards taste and asking the right question. It becomes possible to discover something everyone else missed by looking in the right place. In mathematical physics for instance, an Einstein inspired by the equivalence principle might not have to toil for a decade to invent general relativity, but could have equations proposed, their solutions found, and scenarios validated as limits of Newtonian physics. Rather than having the bar raised for problem-solving, contributing to mathematics has opened up to ideation and generation.

Timothy Gowers @wtgowers@wtgowers

But if AI mathematics continues to progress at anything like its current rate -- which is what I expect to happen -- then we will face a crisis very soon, and mathematics departments, who owe a duty of care to their students, should be urgently preparing for it.

Somesh Misra / ERP.ai@MathproBro·Apr 13

@xuanalogue looked at your CLIPS paper, so yes, an AI that truly infers a student's hidden goals and epistemic state might enable persistence instead of enabling shortcuts. :)

xuan (ɕɥɛn / sh-yen)@xuanalogue·Apr 13

Without having seen systematic studies of the effects of AI on learning outcomes, I hesitate to form a definite opinion, but I wouldn't be too surprised if the overall effect is net negative + an increase in inequality bc intrinsically motivated students benefit more.

xuan (ɕɥɛn / sh-yen)@xuanalogue·Apr 13

Others have said this, but when it comes to AI for education, I think people have overrated the value of access to expertise & knowledge (which I think AI increases) and underrated the value of motivational scaffolding (which AI degrades by making it easier to cut corners).

Marc Porter Magee 🎓@marcportermagee

What’s it like in college right now when you actually want to learn while everyone—students, tutors and professors—is cutting corners with AI. “That was basically the end of our session,” Lahr said. “I had a crashout about that afterwards because I was like, Why am I even here?”

Somesh Misra / ERP.ai reposted

sphinx@protosphinx·7 Mar

sarvam is doing some phenomenal work. seeing positive commentary on r/LocalLLaMA too

Pratyush Kumar@pratykumar

📢 Open-sourcing the Sarvam 30B and 105B models! Trained from scratch with all data, model research and inference optimisation done in-house, these models punch above their weight in most global benchmarks plus excel in Indian languages. Get the weights at Hugging Face and AIKosh. Thanks to the good folks at SGLang for day 0 support, vLLM support coming soon. Links, benchmark scores, examples, and more in our blog - sarvam.ai/blogs/sarvam-3…

Somesh Misra / ERP.ai@MathproBro·Feb 23

Paper: “Demystifying Oversmoothing in Attention-Based Graph Neural Networks” (NeurIPS 2023, spotlight) By Xinyi Wu, Amir Ajorlou, Zihui Wu & @jababi at MIT/Caltech. Key move: they model attention-based GNNs as nonlinear time-varying dynamical systems and use joint spectral radius theory to prove oversmoothing is inevitable for GCNs, GATs, and graph transformers. Covers ReLU, LeakyReLU, GELU, SiLU. No architectural trick escapes it. The only way out is rethinking how depth is applied. 📄 arxiv.org/abs/2305.16102

Somesh Misra / ERP.ai@MathproBro·Feb 23

Everyone thought attention would solve oversmoothing in GNNs. It doesn’t. It can’t. Rigorous proof: expressive power in attention-based GNNs collapses exponentially with depth. GATs, graph transformers - none are immune. The real insight? Depth shouldn’t be uniform. A boundary node sitting between two communities needs 2 layers. An interior node in a dense cluster might need 10. Treating them the same is the actual problem. Structure should dictate depth. Not the other way around.
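To make the collapse concrete, here is a minimal sketch, not the paper's construction. It assumes a hypothetical 5-node path graph with self-loops, one scalar feature per node, dot-product attention over neighbors, and no learned weights or nonlinearities. Each layer replaces a node's feature with an attention-weighted average of its neighbors, and the script tracks how far apart the node features remain as depth grows.

```python
import math

# Hypothetical toy graph (an assumption): a 5-node path with self-loops.
NEIGHBORS = {
    0: [0, 1],
    1: [0, 1, 2],
    2: [1, 2, 3],
    3: [2, 3, 4],
    4: [3, 4],
}

def attention_layer(x):
    """One attention-style aggregation step on scalar node features."""
    out = []
    for i, nbrs in NEIGHBORS.items():
        # Softmax over dot-product scores x_i * x_j, restricted to neighbors.
        scores = [x[i] * x[j] for j in nbrs]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w / total * x[j] for w, j in zip(weights, nbrs)))
    return out

def spread(x):
    """Gap between the largest and smallest node feature."""
    return max(x) - min(x)

features = [1.0, -0.5, 0.3, -1.0, 0.8]  # arbitrary starting features
spreads = [spread(features)]
for _ in range(60):
    features = attention_layer(features)
    spreads.append(spread(features))

for depth in (0, 5, 20, 60):
    print(f"depth {depth:2d}: spread = {spreads[depth]:.6f}")
```

Because every attention weight is positive and each row sums to one, each layer is a weighted average, so the spread can never grow. That is only the intuition behind the result; the paper's proof covers learned weights and the listed nonlinearities, which this toy omits.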

Somesh Misra / ERP.ai@MathproBro·Feb 9

This nomenclature always confused me! NP-hard sounds like it's a subset of NP, but NP is verifiable, and NP-hard is hard to solve. Knuth suggested three names, "Herculean", "Formidable", and "Arduous", and sent out a poll to people in the theory community. One write-in suggestion was "Hard-Ass Problems" (Hard As Satisfiability). Bell Labs won with "NP-hard" and they've been confusing people ever since. The real NP-hard problem was naming NP-hard.

sphinx@protosphinx·Feb 9

Somesh Misra / ERP.ai@MathproBro·Feb 6

Underlying reason: Continuity and symmetry induce equivalence classes over inputs. Transformers collapse nearby sequences into the same representation orbit. Perplexity is invariant on these orbits. Correctness is not. This was never about Perplexity the company. It is about algebra, group actions, and quotient spaces.

Somesh Misra / ERP.ai@MathproBro·Feb 6

Paper link: arxiv.org/abs/2601.22950 cc @PetarV_93 Thank you for formalizing something many of us felt but could not prove.

Somesh Misra / ERP.ai@MathproBro·Feb 6

Perplexity is not always right. It can appear confident and rigorous, and it can score extremely well by its own metric, while still producing an incorrect prediction. This is not a bug or a training artifact. The result comes from the paper “Perplexity Cannot Always Tell Right from Wrong”

Somesh Misra / ERP.ai@MathproBro·Jan 29

This insight leads to a set of fundamental group-theory-based results. I have tried to characterize which forms of node-level memorization are inevitable in GNNs and which require symmetry breaking. Paper coming after review.

Somesh Misra / ERP.ai@MathproBro·Jan 29

Hot take: a lot of GNN memorization isn’t learned at all. It’s forced. Graph symmetry + training dynamics decide what a GNN can and cannot memorize — before data even enters the picture.

Somesh Misra / ERP.ai@MathproBro·Jan 13

Three claims/theorems about deep learning that seem difficult to disprove and even harder to prove: A) Gradient descent does more than minimize loss. It reshapes geometry by collapsing directions that are irrelevant to the task (gradient flow induces anisotropic contraction in the pullback metric, with decay along directions orthogonal to the loss gradient). B) Symmetry does not need to be imposed. When data and objectives are invariant, training dynamics tend to uncover quotient structure implicitly (optimization trajectories concentrate on equivalence classes induced by approximate group orbits, even without architectural equivariance). C) Memorization is not storage. It is the emergence of extremely sharp decision geometry confined to negligible-volume regions (interpolation is achieved via high-curvature decision boundaries localized to sets of vanishing measure in input space). These are not easy theorems. But they feel like the right ones to chase. Genuinely looking for advice, counterexamples, or references from people thinking deeply about this: @levie_ron @kamalikac @rsalakhu @ok1zjf @neelnanda5 @mmbronstein

Somesh Misra / ERP.ai@MathproBro·Jan 3

A doubly stochastic matrix only redistributes values. It cannot amplify them or destroy them. Geometrically, it is a soft mixture of permutations. It shuffles and mixes, but conserves total signal. Identity is one extreme case of this. So mHC does not abandon the identity idea. It generalizes it. Identity becomes a stable geometric object instead of a single point. That is the breakthrough: deep learning stability enforced by geometry, not tricks.
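A small numerical check of the "redistributes but conserves" claim, as a sketch under stated assumptions. The matrix is built by Sinkhorn-style alternating row and column normalization of a random positive matrix; that is one common way to get an approximately doubly stochastic matrix and is used here for illustration only, not as a statement of how mHC is implemented. The helper names and the test vector are hypothetical.

```python
import random

def sinkhorn(matrix, iters=200):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        for i in range(n):  # make each row sum to 1
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):  # make each column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

def matvec(m, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

random.seed(0)
n = 4
raw = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(n)]
ds = sinkhorn(raw)

x = [3.0, -1.0, 0.5, 2.0]  # arbitrary signal
y = matvec(ds, x)

print("total before:", sum(x), "total after:", sum(y))
print("range before:", (min(x), max(x)), "range after:", (min(y), max(y)))
```

Column sums of one keep the total signal unchanged, and nonnegative rows summing to one make every output a weighted average of the inputs, so no entry can leave the original range.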

Somesh Misra / ERP.ai@MathproBro·Jan 3

That learned matrix gets applied again and again across layers. Now depth is no longer identity plus correction. It is repeated application of an unconstrained matrix. We are back to the original instability problem. mHC fixes this by using geometry. Instead of letting the identity be any learned matrix, it restricts it to a special space called doubly stochastic matrices. No math needed. Here is the intuition.

Somesh Misra / ERP.ai@MathproBro·Jan 2

The DeepSeek mHC paper is a real breakthrough, and the reason is geometric, not architectural. Early neural networks were just repeated matrix multiplications: x <- W x. Depth was unstable. ResNets changed one line: x <- x + F(x) which linearizes to x <- (I + W)x. That single identity term is what made deep learning scale. Hyper-Connections broke this by replacing identity with a learned matrix, turning depth back into unconstrained matrix products. mHC fixes this in a principled way. Instead of identity or an arbitrary matrix, mHC uses a doubly stochastic one. Doubly stochastic matrices form the Birkhoff polytope. They are convex combinations of permutations. Geometrically, the residual stream undergoes conservative transport and mixing, not amplification or decay. Identity is just one extreme point of this space. Under composition, stability is preserved. mHC does not abandon identity. It generalizes it into a stable geometric object. This is not an engineering trick. It is linear algebra and geometry doing the real work.
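The "stable under composition" point can be checked with a short sketch. It assumes 3x3 matrices, draws each layer's matrix as a random convex combination of permutation matrices (a point in the Birkhoff polytope), and multiplies 50 of them together. For contrast it composes 1.1·I the same number of times, a hypothetical stand-in for an unconstrained learned matrix with a slight gain. None of this is the mHC implementation; all helper names are illustrative.

```python
import itertools
import random

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def perm_matrix(perm):
    """Permutation matrix with a 1 in column perm[i] of row i."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def random_birkhoff_point(perms, rng):
    """Random convex combination of permutation matrices."""
    raw = [rng.random() for _ in perms]
    total = sum(raw)
    n = len(perms[0])
    return [[sum(w / total * p[i][j] for w, p in zip(raw, perms))
             for j in range(n)] for i in range(n)]

rng = random.Random(1)
n, depth = 3, 50
perms = [perm_matrix(p) for p in itertools.permutations(range(n))]

stable = identity(n)  # product of doubly stochastic layers
scaled = identity(n)  # product of the unconstrained stand-in, 1.1 * I
gain = [[1.1 * v for v in row] for row in identity(n)]
for _ in range(depth):
    stable = matmul(random_birkhoff_point(perms, rng), stable)
    scaled = matmul(gain, scaled)

row_sums = [sum(row) for row in stable]
col_sums = [sum(stable[i][j] for i in range(n)) for j in range(n)]
print("row sums:", row_sums)
print("col sums:", col_sums)
print("diagonal entry of (1.1*I)^depth:", scaled[0][0])
```

The identity appears in `perms` as the trivial permutation, which is the sense in which it is one extreme point of the space. The 50-layer product still has rows and columns summing to one, while even a 10% per-layer gain compounds to more than a hundredfold amplification.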
