Simon Frieder
@friederrrr

Making the hills LLMs can climb towards becoming Math Copilots. AIMO Prize Manager. https://t.co/ir85qxw65J (Opinions my own.)

218 posts · Joined January 2023
54 Following · 276 Followers
Simon Frieder@friederrrr·
@askalphaxiv Cool to see the benchmarking space growing. It seems to me that the FrontierMath dataset already did something very similar last year?
alphaXiv@askalphaxiv·
"First Proof" A team of researchers proposes a way to test if AI can actually do NEW math by releasing 10 freshly-solved and never public research questions, with answers temporarily encrypted. This let's the community able to measure the genuine performance of LLMs on proof-generation, before their solutions drop. Questions include: - stochastic analysis - p-adic representation theory - algebraic combinatorics - spectral graph theory - equivariant algebraic topology - lattices in Lie groups/topology - symplectic geometry - tensor algebraic relations - numerical linear algebra
Simon Frieder@friederrrr·
Papers like these are important for people competing in big reasoning competitions like AIMO or ARC-AGI. The problem is that if one takes a closer look, there are some issues with the impressive claims:
- MATH is an outdated benchmark by now;
- the numbers don't add up. The last sentence on page 1 states "Qwen-2.5-7B-Instruct improves from 76% to 95% while training just 10,000 parameters". This conflicts with Table 2, which in turn is also unclear, as the parameter count doesn't seem to match the # column.
Simon Frieder@friederrrr·
Asking an AI system for an opinion is never a good idea. I am withholding judgement on whether to be impressed - it really depends on a lot of details: how many mathematicians tried and failed to prove the problem before (impossible to quantify, but that would be a measure of difficulty), what techniques the proof that was found uses (is it merely an obvious application of a known theorem that mathematicians overlooked, or did it introduce a new solution technique), etc. An expert in the field needs to answer this -- not me, and definitely not Grok. LLMs still don't know what they don't know. This is way over Grok's head, and its "arguments" are very weak, since they would apply to any other piece of autoformalized work.
Axiom@axiommathai·
1/ AxiomProver has solved Fel’s open conjecture on syzygies of numerical semigroups, autonomously generating a formal proof in Lean with zero human guidance. This is the first time an AI system has settled an unsolved research problem in theory-building math and self-verified the result.
Simon Frieder@friederrrr·
Is mathematics a game that is still worth playing in the long term? The Twittersphere abounds with examples of what LLMs can do in math -- optimism is sky-high. (I don't quite share that optimism, since (open-source) LLMs do not even manage to solve all the "simple" unseen problems we have over at the AI Math Olympiad, with the LB stuck at 44/50.) If that optimism pans out, even more maths will be created (rather than read) in the near future. While at first it will be exciting to watch conjectures fall, I wonder what personal motivation will be left, in such a full-automation scenario, to get good at mathematics and to study it.
Simon Frieder@friederrrr·
"To the wider community interested in Erdős problems, we caution that even after correctly solving an Erdős problem, one should take care to ensure the statement accurately reflects what Erdős likely intended (this issue is discussed further below)." This seems to be a very hard problem to solve -- and no, Lean (which is the usual answer when one points out tricky problems with natural-language mathematics) won't help here.
Quoc Le@quocleix·
Excited to share our latest work: "Semi-Autonomous Mathematics Discovery with Gemini." We used Gemini to systematically evaluate 700 "open" conjectures in the Erdős Problems database. The result? We addressed 13 problems marked as open—finding 5 novel autonomous solutions and identifying 8 existing solutions missed by previous literature. Read the full case study here: arxiv.org/abs/2601.22401
Simon Frieder@friederrrr·
In a year, we will be living in a world of mathematical ... AI slop! I'd hope things would be different, but given where they are headed now, I fear many technical and completely uninteresting results will flood the space. arXiv already had to stop accepting position papers, and the same will happen rather soon for technical, niche research papers. There will be an occasional jewel where AI genuinely helped (although just autoformalizing is, right now, not that exciting to the mainstream AI-sceptical mathematician), but most pebbles that fall out of LLMs won't be these jewels; they'll be cobblestones.
Jared Duker Lichtman@jdlichtman

In a year, we will be living in a world of mathematical abundance.

Simon Frieder@friederrrr·
Math abundance -- or math AI slop? Other domains were already slop-ified: after the initial "wow" effect is gone, the limits of AI systems quickly emerge, whether it's the latest Sora model not being able to generate very long videos, or the best vibe-coding models that can generate the frontend and backend of a website -wow, incredible!- until wrestling the website into your specific requirements turns out to be much harder. I predict that in a year we'll have the equivalent for math: lots of very technical, very uninteresting results. Sure, some tools will turn out to be useful -- but the abundance we'll have isn't a positive one.
Jared Duker Lichtman@jdlichtman·
In a year, we will be living in a world of mathematical abundance.
Simon Frieder@friederrrr·
There isn't a word yet to capture this; I used "the queue" in one of my (still fledgling) blog posts friederrr.org/blog/researche… for the ideas that are _up there_. Should people be rewarded for plucking ideas from the queue? It's debatable, with good arguments on both sides.
Jeff Rose@rosejn·
I guess what I often wonder when you bring up that someone, possibly yourself, had written an idea previously, is whether it matters if it hadn’t been impactful. So often multiple people arrive at the same idea independently because the cognitive building blocks are in place and/or the next steps are in the zeitgeist. Every duplication is not plagiarism, and sometimes it’s the communication or application or implementation of an idea that makes it impactful. I would be curious to hear your thoughts on this.
Jürgen Schmidhuber@SchmidhuberAI·
Social media are full of misinformation about AI history. To all "AI influencers:" before you post your next piece, take history lessons from the AI Blog, with chapters on:
Who invented artificial neural networks? 1795-1805
Who invented deep learning? 1965
Who invented backpropagation? 1676-1970
Who invented convolutional neural nets? 1979-1988
Who invented generative adversarial networks? 1990
Who invented Transformer neural networks? 1991-2017
Who invented deep residual learning? 1991-2015
Who invented neural knowledge distillation? 1991
Who invented the transistor? 1925
Who invented the integrated circuit? 1949
Who created the general purpose computer? 1936-1941
Who founded theoretical CS and AI theory? 1931-34
And many more ... people.idsia.ch/~juergen/blog.…
Simon Frieder@friederrrr·
@mathematics_inc 3/ All of this is not to say that "hitting math with technology" is the wrong approach -- but Lean is only *one* toolbox one needs to use, and automation within natural language will also need to happen to close the iteration loop for mathematicians.
Simon Frieder@friederrrr·
2/ Aside from this example about limits, which highlights one problematic instance of formal mathematics, there are also other instances where Lean isn't the best choice; here are just three examples out of many:
- Keeping proofs concise is easy in natural language, but for Lean it will likely be hard to develop a layer that summarizes things to make proofs more easily readable;
- Conjecturing is probably best done in natural language, to avoid getting bogged down in the technical overhead associated with Lean;
- No flexibility: HoTT was driven forward by a deeper analysis of the concept of equality. Handling this in natural language is much more flexible than in a formal system. Even if you are conceptually rooted in ZFC, which is clunkier in this regard than HoTT, in natural language you can treat two things as equal "at a higher level" according to whatever theory you're developing, even if they are not equal as sets. If you do mathematics formally, you're stuck with whatever foundation was used, which changes how easily you can express equality.
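A minimal Lean 4 sketch of the last point (the type `MyNat` and the map `toNat` below are illustrative assumptions, not taken from any cited work): two isomorphic types are distinct objects in a formal system, and statements about one must be transported across the map explicitly, whereas a natural-language proof would simply "identify" them.

```lean
-- A toy copy of the natural numbers, isomorphic to `Nat`
-- but a completely different type as far as Lean is concerned.
inductive MyNat where
  | zero
  | succ (n : MyNat)

-- The obvious identification with `Nat`.
def toNat : MyNat → Nat
  | .zero   => 0
  | .succ n => toNat n + 1

-- A mathematician would say `MyNat` "is" `Nat`; in Lean, every
-- fact must instead be stated and proved through `toNat`.
example (m : MyNat) : toNat (.succ m) = toNat m + 1 := rfl
```

The equality proof goes through by `rfl` only because it unfolds definitionally; any deeper identification of the two types would need explicit transport, which is exactly the rigidity described above.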
Math, Inc.@mathematics_inc·
🚨 BREAKING: Fields medalist Terry Tao on how mathematics will change: “When these tools are perfected, we will change the way we do mathematics. If there's a drudgery or a big computation, we'll just hit it with all our technology and say: 'By Gauss, you can get from here to there,' and now we just keep going. So we can blast through all these obstacles that we avoid almost subconsciously. If you look at what we miss, it's the missed opportunities, and that percentage of the overall opportunities is huge.” Full conversation with Math, Inc.’s @jessemhan and @jdlichtman coming soon.
Simon Frieder@friederrrr·
@deredleritt3r Hi, one of the two authors here. Trust me, we did get that memo. But it was not relevant to our paper. This paper seems to have been more controversial than intended, but only because most people glossed over the finer details. Your somewhat emotional post also left me with that impression. 1) In the abstract we said "contrary to optimism about LLMs problem-solving abilities" and _not_ "contrary to LLMs that solve IMO problems," which is what you imply. The LLMs we tested all had rather good problem-solving abilities, and IMO-level problems are within their reach (particularly if the IMO problems are in the training data), even though those specific LLMs failed to do well on IMO25 (whose problems were likely not in the training data, so comparing to those problems would actually be unfair). 2) Also, you seem not to have taken a look at our "Limitations" section, where we clearly anticipated that LLMs would solve our problem - as GPT-5 Pro did some time after our release. Once it did, many people seemed to have a reaction of the type "take that, proved ya wrong!", but in our paper we were clear that we fully expected this; we would rather have been surprised only if it had not happened.
prinz@deredleritt3r·
August 2025: Oxford and Cambridge mathematicians publish a paper entitled "No LLM Solved Yu Tsumura's 554th Problem". They gave this problem to o3 Pro, Gemini 2.5 Deep Think, Claude Opus 4 (Extended Thinking) and other models, with instructions to "not perform a web search to solve the problem". No LLM could solve it. The paper smugly claims: "We show, contrary to the optimism about LLM’s problem-solving abilities, fueled by the recent gold medals that were attained, that a problem exists—Yu Tsumura’s 554th problem—that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem which has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) that cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source)." (Apparently, these mathematicians didn't get the memo that the unreleased OpenAI and Google models that won gold on the IMO are significantly more powerful than the publicly available models they tested. But no matter.) October 2025: GPT-5 Pro solves Yu Tsumura's 554th problem in 15 minutes. Lee Sedol moment is coming for many.
Bartosz Naskręcki@nasqret

GPT-5-Pro solved, in just 15 minutes (without any internet search), the presentation problem known as “Yu Tsumura’s 554th Problem.” arxiv.org/pdf/2508.03685 This is the first model to solve this task completely. I expect more such results soon — the model demonstrates a strong grasp of elementary abstract algebra reasoning.

Simon Frieder@friederrrr·
Functionally there is not a big difference. But there is one in terms of stability, since arXiv is essentially a non-profit, and Reddit is a business. I'd like to have a record of comments indefinitely, and we saw with paperswithcode.com how annoying it can be when a business (Meta, in that case) pulls the plug. We now have a million clones of the original site, but none as good as the original, which is annoying. arXiv helps as a trusted source that has stood the test of time. Definitely agree that having a preprint hosting service is necessary but not sufficient for a thriving open scientific environment. My utopian vision would be one of full integration, where AIMOx, for some integer x, is hosted on a community-supported platform, and one could directly link between arXiv preprints like arxiv.org/pdf/2504.16891 and AIMOx, knowing that there is a clear organizational structure that is not influenced by ever-changing business cycles. It's unlikely to happen soon though :D
Learning CUDA & CuTe@ClydeCompute·
@friederrrr What's the difference between arXiv having comments and a subreddit? I think it's better for research as a whole to do significantly more than just pre-prints. AIMO has massively improved LLM math evals, as an example. What else could it translate to? The whole paradigm needs a shift
Simon Frieder@friederrrr·
One more nail in the coffin for a broken reviewing system. Who knows how many people silently used this backdoor to see who gave them bad scores -> potentially career-damaging. It would be much better to have a comment section on arXiv; this would solve 80% of the existing problems: only people who are actually interested in the paper would read it; there would be no arbitrary cutoff for who made it in or not, and no useless scores (the infamous NeurIPS experiment demonstrated -unsurprisingly- a large degree of subjectivity inverseprobability.com/talks/notes/th…); nasty reviews would occur less frequently if your name is attached; and there would be the possibility of an ongoing dialogue reflecting what the community thinks of the paper -- currently nothing can be done after the rebuttal phase, which is artificial, and sometimes a longer dialogue would be beneficial (who remembers the "Understanding deep learning requires rethinking generalization" paper? openreview.net/forum?id=Sy8gd…); etc. etc.
Simon Frieder@friederrrr·
2023: It's hard to devise an LLM that solves a math problem. 2025: It's hard to devise a math problem that stumps an LLM. (...this in the context of competitive math questions, but we'll also get to research-level math soon)
Simon Frieder retweeted
AIMO Prize@AIMOprize·
AIMO3 is full of surprises: week 2 (out of 21) just concluded. After a race in the first week that had us both biting our nails to see how quickly the leaderboard was rising and cheering for the progress of open-weight LLMs, the leaderboard suddenly ground to a halt.
Simon Frieder@friederrrr·
Haha, not sure if I should be happy or sad to see this problem, which we found just this summer and which stumped all LLMs, demolished so convincingly. It is true that it was more or less clear that a system like Vampire would be able to do it, but it is quite cool that a DL-inspired approach could solve it too.
Bartosz Naskręcki@nasqret·
Aristotle by @Harmonic is acing group-theory puzzles. Here is a complete formal proof of the popular Yu Tsumura 554 puzzle. What's nice is that the proof is very transparent, with easy-to-follow steps. It was generated in less than an hour without any hints. I am attaching the full proof for inspection.