Mayur Naik

445 posts

@AI4Code

Professor @CIS_Penn | Founder & Co-CEO @Rabdos_AI | Neurosymbolic AI researcher and educator

Philadelphia, PA · Joined January 2019
357 Following · 2.5K Followers
Pinned Tweet
Mayur Naik @AI4Code
Friends, followers, and strangers on X, I recently got excited about mapping the jagged frontier of AI models in math and other STEM areas. With my colleague Prof. Rob Ghrist @prof_g, I founded Rabdos AI rabdos.ai to create original, research-level problems at scale to advance frontier AI capabilities. In the coming weeks, we'll be sharing exciting technical updates through @Rabdos_AI. Please give us a follow if you're interested in model evaluation, AI for math/science, autoformalization in Lean, robustness of world models, and much more.
1 reply · 6 reposts · 33 likes · 1.8K views
Mayur Naik reposted
Thang Luong @lmthang
Very excited to share a new milestone in AI for Math: Aletheia, powered by Gemini Deep Think, was just used to autonomously solve a Kirby problem! “Kirby’s list” is a “compendium of the most important unsolved problems in topology, the study of deformable shapes” (Quanta magazine). 🧵
24 replies · 83 reposts · 453 likes · 65.1K views
Mayur Naik reposted
Rabdos_AI @Rabdos_AI
We are delighted to unveil our research blog Rabdology at rabdology.ai, where we chart the jagged math frontier of AI reasoning. This is our first post in a weekly series. Read on, and if you enjoy it, please subscribe! (Link at the bottom of the blog's main page.)

The Three-Cylinders Problem: When AI models choose Beauty over Truth
rabdology.ai/three-cylinders

We pose a problem that a good geometry student can solve in twenty minutes. We gave it to four of the world's most advanced AI models and watched what happened. Three of them got it wrong, and the way they got it wrong tells you something different about the state of AI mathematical reasoning than the usual benchmarks do.
0 replies · 8 reposts · 13 likes · 1.4K views
Mayur Naik @AI4Code
Very timely, especially in light of the revelation that a third of the problems in FrontierMath are fatally flawed. As expert human validation of frontier math tasks approaches its inevitable limit, LLMs are stepping in to fill the void.

But our work below shows that the discovery of mistakes in FrontierMath problems didn't necessarily have to wait for a frontier model like GPT-5.5. Much smaller and even open-source models can be as effective at verifying math proofs: their weights embody the necessary knowledge and, as one might expect, checking proofs ought to be easier than writing them.

The crucial thing that makes this work is the use of "prompt ensembles", each of which modularly checks a different facet of a given proof, and some of which are even specific to the domain or sub-domain of math. (A minimal sketch of the idea appears after this post.) Ideas like meta-prompting, agent skills, and autoresearch will undoubtedly evolve to make LLMs even more effective judges of math proofs in the future.
guru @guruprerana

Do we need frontier models to verify math proofs? EpochAI just announced that they found several fatal flaws in their FrontierMath benchmark using GPT-5.5. But isn't verification supposed to be easier than generation? So why weren't the flaws spotted earlier?

In our recent work, we asked a related question: do we really need frontier-scale compute to verify Olympiad-level math proofs? It turns out that even 20B open-source models can keep up with frontier LLMs on proof verification.

Work done with my co-authors @aaditya_naik, @AI4Code, and @RajeevAlur
Preprint: arxiv.org/abs/2604.02450

1 reply · 1 repost · 10 likes · 1.7K views
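The "prompt ensembles" idea above lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch: each prompt checks one facet of a candidate proof, and the verdicts are combined conjunctively. The facet prompts, the aggregation rule, and the `query_llm` stub are illustrative assumptions, not the interface from the preprint.

```python
# Hypothetical sketch of a "prompt ensemble" proof verifier: each prompt
# checks one facet of a candidate proof, and a proof is accepted only if
# every facet check passes. All names here are illustrative assumptions.

from dataclasses import dataclass

# One prompt per facet; some facets could be specific to a math domain
# (e.g. a case-coverage check that matters most for combinatorics proofs).
FACET_PROMPTS = {
    "logical_gaps": "List any unjustified steps in this proof, or say NO ISSUE:\n{proof}",
    "case_coverage": "Does this proof handle all cases? If so, say NO ISSUE:\n{proof}",
    "computation": "Re-derive each calculation and flag errors, or say NO ISSUE:\n{proof}",
}

@dataclass
class FacetVerdict:
    facet: str
    passed: bool
    notes: str

def query_llm(prompt: str) -> str:
    # Stand-in for a chat call to a (possibly small, open-source) model;
    # it always reports no issue so the sketch runs end to end.
    return "NO ISSUE"

def verify_proof(proof: str) -> list[FacetVerdict]:
    verdicts = []
    for facet, template in FACET_PROMPTS.items():
        reply = query_llm(template.format(proof=proof))
        verdicts.append(FacetVerdict(facet, "NO ISSUE" in reply.upper(), reply))
    return verdicts

def accept(proof: str) -> bool:
    # Conjunctive aggregation: one failed facet rejects the proof.
    return all(v.passed for v in verify_proof(proof))

print(accept("Assume n is even; then n = 2k, so n^2 = 4k^2 is even."))
```

The conjunctive rule is just the simplest way to combine verdicts; the modularity is the point, since each facet prompt can be swapped or specialized per domain independently of the others.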
Mayur Naik reposted
Shriram Krishnamurthi (primary: Bluesky)
I feel sorry for either this person or their PhD advisor, if they can say "The thing that still matters —> taste <— was never really taught". Taste is *most* of what my advisor taught me; the rest (semantics, coding, etc.) was just hard work; he answered questions and gave feedback.
Amber Liu @JIACHENLIU8

I was talking to a CS professor yesterday and he told me he's so relieved ⛱️ now that his PhD students have left for their summer internships 🌞. He finally gets to do interesting research himself again. It wasn't a complaint. Just something he'd been sitting with.

The old PhD bargain was pretty clean: professors had ideas, students had time, execution was the bridge. You did the work, and somewhere along the way, taste rubbed off.

That bridge is mostly gone now. A senior researcher with taste and a research agent ships faster than the same researcher plus a junior student. Explaining 🎙️ an idea to someone who doesn't yet have taste costs more than just building it. (The reason professors still need students is that the research agent 🤖 is not yet capable enough.)

Nobody wants to say this out loud. Professors keep mentoring because the system 🏫 asks them to, not because it's the highest-leverage thing they could be doing with their week. The thing that still matters —> taste <— was never really taught. It was supposed to leak out of the execution. Now there's no execution to leak from.

5 replies · 14 reposts · 139 likes · 28.3K views
Mayur Naik reposted
Bartosz Naskręcki @nasqret
I've just finished a week where I have been using agentic workflows basically 24/7. Here are some impressions:

1. I am extremely tired. Supervising so many projects is distracting, frustrating, and exhausting.
2. I have never worked on so many fronts at once, and it's not just a superficial impression. I've read a ton, done courses with students, coded, learned new mathematical theorems, and explored new git repos.
3. The more you work with agents, the better you become at instructing them. Don't overthink the prompts. Doing things step by step and letting the agent guide you is very productive.
4. I failed in some of those projects. Basically, I got the impression that I could do everything with the agent. Maybe, but time and tokens are constrained.
5. I don't use much ready-made software anymore. I build or scaffold the stuff that I need. My favorite modality is text, and the best interface is a simple text API. Everything can be pipelined. (See the sketch after this post.)
6. Mathematics can be treated as software. Whatever you need to do with it can be decomposed into small parts. The thinking bit is still mostly done in my brain when necessary.
7. I think of my past life as a warm-up for this new period. The offline reading, coding, and internalizing were super important. Without them, I would be a pure vibe-coder. With training, you sail a ship.
8. I don't want to see my paychecks. It's expensive, because the better you get, the more you spend on more and more projects. You abandon them in a more advanced state, but there is a token price to pay.
9. Each time I see a challenge, I consider whether I can find a solution with the agent. I don't skip tasks. Sometimes I feel overly optimistic; it's part of the learning to see where the agents are limited.
10. I think I am mastering entirely new workflows. They are not properly described in the books. It's a brave new world. Each standard task can be turned into a task for the agent. It's very enabling.

Happy coding everyone!
20 replies · 40 reposts · 317 likes · 40.8K views
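Point 5 above describes a workflow where everything is plain text flowing through composable steps. Here is a toy sketch of that style; the step functions are stubs standing in for agent calls, and none of this is a real API.

```python
# Toy illustration of "text in, text out, everything can be pipelined":
# each step is a text -> text function, so steps compose like shell
# commands. The step bodies are hypothetical stand-ins for agent calls.

from functools import reduce
from typing import Callable

TextStep = Callable[[str], str]

def summarize(text: str) -> str:
    return f"[summary of {len(text)} chars]"      # stand-in for an agent call

def extract_theorems(summary: str) -> str:
    return f"[theorems mentioned in {summary}]"   # stand-in for an agent call

def pipeline(*steps: TextStep) -> TextStep:
    """Compose text->text steps left to right, like `cmd1 | cmd2`."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

paper_to_theorems = pipeline(summarize, extract_theorems)
print(paper_to_theorems("...a long dump of a paper's text..."))
```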
Mayur Naik reposted
Rabdos_AI @Rabdos_AI
On the MathDuels leaderboard, Gemini-3-Flash dominates at its size: the only models ahead of it are the latest, largest frontier releases. Gemma-4-31b-it has the highest author rating of any open-source model. Thanks @GoogleDeepMind for these smart little models 🥹
Rabdos_AI @Rabdos_AI

Static math benchmarks saturate. We built one that doesn't. Announcing MathDuels, the first self-play math benchmark. Every frontier LLM writes problems for the others, and is graded on the ones written for it. As models improve, so does the benchmark.

0 replies · 2 reposts · 13 likes · 1.3K views
Mayur Naik reposted
Bartosz Naskręcki @nasqret
Excellent idea. A revival of mathematical duels, but in AI form: which model is the best problem composer, and which one solves problems with no trouble? And the judge is a human professional.
Rabdos_AI @Rabdos_AI

Static math benchmarks saturate. We built one that doesn't. Announcing MathDuels, the first self-play math benchmark. Every frontier LLM writes problems for the others, and is graded on the ones written for it. As models improve, so does the benchmark.

2 replies · 10 reposts · 65 likes · 5.3K views
Mayur Naik @AI4Code
Delighted to announce MathDuels, the first self-play math benchmark! We evaluated 26 frontier models on 780 generated problems spanning 30 math sub-domains. Check out mathduels.ai for the results, which we plan to update regularly as new models enter the arena. (A sketch of the self-play protocol appears after this post.)

This is the inaugural post in @Rabdos_AI's research blog at rabdos.ai/research, where we plan to chart the jagged edge of scientific reasoning in frontier AI models through exciting new weekly posts!
Rabdos_AI @Rabdos_AI

Static math benchmarks saturate. We built one that doesn't. Announcing MathDuels, the first self-play math benchmark. Every frontier LLM writes problems for the others, and is graded on the ones written for it. As models improve, so does the benchmark.

1 reply · 4 reposts · 24 likes · 2.3K views
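The protocol described above (every model writes problems for the others and is graded on the ones written for it) can be pictured as a round-robin loop. The model names, sub-domains, stub functions, and scoring rule below are placeholder assumptions, not the actual MathDuels implementation.

```python
# Hypothetical sketch of a self-play math benchmark: each model authors
# problems targeted at every other model and is scored on the problems
# written *for* it. As the models improve, the authored problems do too.

from itertools import permutations

MODELS = ["model-a", "model-b", "model-c"]   # placeholder names
DOMAINS = ["number theory", "geometry"]      # placeholder sub-domains

def author_problem(author: str, target: str, domain: str) -> str:
    return f"[{domain} problem by {author} for {target}]"   # stub

def grade_attempt(solver: str, problem: str) -> bool:
    return (len(solver) + len(problem)) % 2 == 0  # arbitrary deterministic stub

def run_duels() -> dict[str, float]:
    solved = {m: 0 for m in MODELS}
    total = {m: 0 for m in MODELS}
    # Round-robin over ordered (author, solver) pairs, per sub-domain.
    for author, solver in permutations(MODELS, 2):
        for domain in DOMAINS:
            problem = author_problem(author, solver, domain)
            total[solver] += 1
            solved[solver] += grade_attempt(solver, problem)
    # Solver score: fraction solved of the problems written for that model.
    # (An author-side rating could similarly credit a model whose problems
    # stump the others, matching the "author rating" mentioned above.)
    return {m: solved[m] / total[m] for m in MODELS}

print(run_duels())
```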
Mayur Naik @AI4Code
What if you could see cardiac arrest coming minutes to hours before it happens? We're building CAMEL, a foundation model for cardiology trained on the ECG signals now captured everywhere from ICUs to wristbands. penntoday.upenn.edu/news/using-ai-…
0 replies · 1 repost · 3 likes · 231 views
Mayur Naik reposted
Davis Brown @davisbrownr
In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.
7 replies · 12 reposts · 82 likes · 10.3K views
Mayur Naik @AI4Code
We uncover widespread cheating on popular agent benchmarks like Terminal-Bench 2, including both harness-level cheating (likely due to vibe-coded agent harnesses) and task-level cheating by powerful underlying models. As autoresearch and meta-harness approaches take off, problems of harness-level cheating will get worse.

We will soon release the code and a technical paper describing Meerkat, our scalable auditing system, which uncovered these problems. (A sketch of one such audit check appears after this post.)

Great work by PhD students @adamlsteinl and @davisbrownr, joint with colleagues @RICEric22 and @HamedSHassani.
Adam Stein @adamlsteinl

We found widespread cheating on popular agent benchmarks, affecting 28+ submissions across 9 benchmarks and thousands of agent runs. Surprisingly, the top 3 submissions on Terminal-Bench 2 are all cheating! Here's what we found 🧵

0 replies · 1 repost · 8 likes · 1.3K views
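One way to picture the "sneaking the correct answer to the model" failure mode described above is an audit that scans everything the model saw for the benchmark's ground-truth answer. The sketch below is a hypothetical illustration of that single check, not the forthcoming Meerkat code.

```python
# Hypothetical audit check: flag agent runs where the benchmark's
# ground-truth answer appears verbatim in the text the model saw before
# answering. A real auditor would need many more checks (paraphrased
# leaks, harness-side grading tricks, etc.); this shows only the idea.

import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def leaked_answer(model_inputs: list[str], ground_truth: str) -> bool:
    """True if the graded answer was fed to the model as input."""
    needle = normalize(ground_truth)
    return any(needle in normalize(chunk) for chunk in model_inputs)

def audit_run(run: dict) -> bool:
    # run["inputs"]: prompts and tool outputs the model saw;
    # run["answer"]: the benchmark's ground-truth answer.
    return leaked_answer(run["inputs"], run["answer"])

# This toy run is flagged: the harness injected the graded answer "42"
# into the model's context.
print(audit_run({"inputs": ["task: compute X", "hint: the answer is 42"],
                 "answer": "42"}))
```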