Nikhil Srivastava

@Knikct

Berkeley Mathematics, Simons Institute for the Theory of Computing

Joined February 2026

75 Following · 352 Followers

Nikhil Srivastava@Knikct·4d

@dearlove165452 the zulip for the second batch isn't up yet, but you can see the discussion from the first round here: icarm.zulipchat.com/#narrow/channe…

no@dearlove165452·4d

@Knikct on the webpage it says that there will be a Zulip discussion channel? I am a postdoc in math & do not want to bother you with emails, so I wanted to ask from here: can we get a link to the Zulip channel?

118

Nikhil Srivastava@Knikct·4d

some details about #1stproof second batch: 1stproof.org happy pi day!

4.9K

Nikhil Srivastava@Knikct·4d

@thebasepoint agreed! we are open to testing rigs on top of public models, subject to funding and logistical constraints, and as long as it is done transparently. see sec 3 of the announcement.

133

Joshua Batson@thebasepoint·4d

@Knikct I think there will be value for the eval of allowing labs or entities to give you some custom API or rig. Models or agents made for interactive or iterative use need to be scaffolded for peak single turn performance.

526

Nikhil Srivastava@Knikct·Feb 15

@HenokYemam @merettm please see icarm.zulipchat.com/#narrow/channe… for a community discussion of the solutions.

623

Henok Yemam@HenokYemam·Feb 15

@Knikct @merettm What's your impression of OpenAI's solution? The problem is so far outside my reach that I would love to hear your take. Thanks!

1.8K

Jakub Pachocki@merettm·Feb 14

Very excited about the "First Proof" challenge. I believe novel frontier research is perhaps the most important way to evaluate capabilities of the next generation of AI models. We have run our internal model with limited human supervision on the ten proposed problems. The problems require expertise in their respective domains and are not easy to verify; based on feedback from experts, we believe at least six solutions (2, 4, 5, 6, 9, 10) have a high chance of being correct, and some further ones look promising. We will only publish the solution attempts after midnight (PT), per the authors' guidance - the sha256 hash of the PDF is d74f090af16fc8a19debf4c1fec11c0975be7d612bd5ae43c24ca939cd272b1a. This was a side-sprint executed in a week mostly by querying one of the models we're currently training; as such, the methodology we employed leaves a lot to be desired. We didn't provide proof ideas or mathematical suggestions to the model during this evaluation; for some solutions, we asked the model to expand upon some proofs, per expert feedback. We also manually facilitated a back-and-forth between this model and ChatGPT for verification, formatting and style. For some problems, we present the best of a few attempts according to human judgement. We are looking forward to more controlled evaluations in the next round! 1stproof.org #1stProof

243

357

2.8K

2.5M
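
The post above commits to the solutions PDF by publishing its SHA-256 digest before the file itself is released. Once the PDF is available, anyone can check that it matches the digest. A minimal sketch of that check follows; the function names and the file path in the usage note are hypothetical, and only the digest comes from the post.

```python
import hashlib

# Digest quoted in the post above.
EXPECTED_SHA256 = "d74f090af16fc8a19debf4c1fec11c0975be7d612bd5ae43c24ca939cd272b1a"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_commitment(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the published one (case-insensitive)."""
    return sha256_of_file(path) == expected.strip().lower()
```

For example, `matches_commitment("solutions.pdf")` (a placeholder file name) returns `True` only if the downloaded file is byte-for-byte identical to the one that was hashed.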

Nikhil Srivastava@Knikct·Feb 15

@merettm thanks for clarifying, looking forward to the next round!

Jakub Pachocki@merettm·Feb 15

Hi Nikhil! We will aim to publish more information next week, but as I noted above, this was quite a chaotic sprint (you caught us by surprise! please give us time to prepare next time!). We will not be able to gather all the transcripts as they are quite scattered. Some of the prompts included guidance to iterate on its previous work; e.g. the rollout that produced the solution to #6 was prompted with the problem statement followed by: "Trying using a BSS barrier type argument. You will have to think hard about the setup and the inductive framework to push it through." This guidance is based on the previous attempts by the model to solve the problem (which previously converged on this approach) and did not originate from the person who prompted the model; however, in a properly controlled experiment, we would avoid such manual prompting. (I didn't realize we used this prompting until today, otherwise I would have been more explicit about it in my original message!) Also, for problem #1 we told the model "Do not cite Hairer22 or use it as a reference because the link no longer works."

37.1K
