Jui-Hui Chung
@juihuichung

6 posts

PhD Student @Princeton, interested in automated theorem proving.

Joined March 2025
145 Following · 38 Followers

Pinned Tweet
Jui-Hui Chung @juihuichung ·
Excited to share that our Goedel-Prover team just achieved state-of-the-art in automated theorem proving! I’m honored to have contributed alongside an incredible group of researchers.
Yong Lin @Yong18850571

(1/4) 🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: solves 64 problems, with far less compute.
🧠 New SOTA on MiniF2F:
* Our 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B's 82.4%.
* 8B > 671B: our 8B model matches DeepSeek-Prover-V2-671B on MiniF2F.
📚 Leading on MathOlympiadBench (IMO-level problems): solves 73 vs. 50 for the 671B DeepSeek prover.
🔓 Website: blog.goedel-prover.com
🔓 Model (32B): huggingface.co/Goedel-LM/Goed…
🔓 Model (8B): huggingface.co/Goedel-LM/Goed…
🔓 Data and training pipeline will be released soon.
Amazing Collaborators: @sangertang1999 @Lyubh22 @__zrrr__ @juihuichung @thomaszhao1998 @pero733858111 @thiiis_user @EmilyJge @JingruoS5931 @wujiayun12 @GesiJiri68334 @davidjesusacu @KaiyuYang4 @hongzhou__lin @YejinChoinka @danqi_chen @prfsanjeevarora @chijinML

0 replies · 0 retweets · 1 like · 143 views
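[Editor's note: the Pass@32 figures above are presumably the standard pass@k metric from code- and proof-generation work: sample n candidate proofs per problem and estimate the probability that at least one out of a budget of k verifies. A minimal sketch of the usual unbiased estimator (Chen et al., 2021) follows for clarity; this is an assumption about the metric's definition, not code from the Goedel-Prover release.]

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: candidate proofs sampled per problem
    c: candidates that verify (e.g., the Lean compiler accepts the proof)
    k: evaluation budget (k = 32 for the Pass@32 numbers above)
    """
    if n - c < k:
        # Too few failures to fill a k-subset without a success.
        return 1.0
    # pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a product.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 64 samples per problem, 5 verified proofs, budget of 32
print(pass_at_k(n=64, c=5, k=32))  # ≈ 0.97
```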
Jui-Hui Chung retweeted
Ziran Yang @__zrrr__ ·
We released Goedel-Prover-V2, a state-of-the-art model for formal theorem proving at launch. Remarkably, it has remained at the top of the open-source formal theorem proving leaderboard for over six months. We have been excited to see so many folks cooking with our models.
Today, we are open-sourcing the full Goedel-Prover-V2 training datasets for the community:
📂 SFT (1.74M samples): huggingface.co/datasets/Goede…
📂 RL (whole-proof generation + self-revision, 98k samples): huggingface.co/datasets/Goede…
We hope this helps push formal theorem proving forward. Build on it!
Amazing Collaborators: @Yong18850571 @sangertang1999 @Lyubh22 @juihuichung @thomaszhao1998 @_LaiJiang @thiiis_user @EmilyJge @JingruoS5931 @wujiayun12 @GesiJiri68334 @davidjesusacu @KaiyuYang4 @hongzhou__lin @YejinChoinka @danqi_chen @prfsanjeevarora @chijinML
3 replies · 39 retweets · 201 likes · 29K views
Jui-Hui Chung retweeted
Chi Jin @chijinML ·
🚀 With early access to Tinker, we matched the full-parameter SFT performance of Goedel-Prover V2 (32B) using LoRA, on the same 20% of the data. 📊 MiniF2F Pass@32 ≈ 81 (20%-data SFT). Next: full-scale training + RL.
This is something that previously took a lot more effort on our Princeton cluster. Really impressed by how Tinker takes care of the heavy multi-GPU infra while still letting us control the algorithms. It feels great to focus on research instead of plumbing.
This also echoes Thinking Machines Lab's LoRA blog thinkingmachines.ai/blog/lora/ and makes me excited about what scalable, efficient systems like Tinker can bring to the research community.
Thanks to the Thinking Machines team and the Goedel team at Princeton @__zrrr__ @Yong18850571 @johnschulman2 @Tianyi_Zh @danqi_chen 🙌
Also check out our paper on the Goedel prover: arxiv.org/abs/2508.03613
2 replies · 22 retweets · 194 likes · 35.7K views
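[Editor's note: for readers unfamiliar with the technique, LoRA freezes the base model's weights and trains small low-rank adapter matrices instead, which is why it is so much cheaper than full-parameter SFT. Below is an illustrative setup using the Hugging Face PEFT library; this is not the Tinker API, and the model id and hyperparameters are placeholders, not the settings used in the experiment above.]

```python
# Illustrative LoRA-SFT setup with Hugging Face PEFT.
# NOT the Tinker API; model id and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # small stand-in; swap in your prover checkpoint
    torch_dtype=torch.bfloat16,
)

config = LoraConfig(
    r=16,                 # adapter rank (assumed; not reported in the tweet)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)   # base weights frozen, adapters trainable
model.print_trainable_parameters()      # typically well under 1% of all parameters
```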
Jui-Hui Chung retweeted
Yong Lin @Yong18850571 ·
[Life update] I've officially left @PrincetonPLI and joined Thinking Machines Lab @thinkymachines. It feels like the right time to look back on my journey at Princeton: one and a half years that were truly transformative. During this period, I made many friends, learned tremendously, and co-founded and co-led the Goedel Project. It was one of the most rewarding experiences of my life: a small, close-knit team of about ten people working with a clear purpose, moving fast, and ultimately building something impactful.

In mid-July, we released Goedel-Prover-V2 (32B), a model that significantly outperformed the previous state-of-the-art DeepSeek-Prover-V2-671B on formal mathematical reasoning, using nearly 20× fewer parameters and dramatically less compute. Even now, four months after release, it still sits at the top of the open-source leaderboard. What makes this achievement especially meaningful is that we accomplished it entirely with academic resources. Competing against large industrial labs and still coming out ahead felt almost unreal. Seeing so many research teams now building on top of Goedel-Prover-V2 is deeply gratifying; it's proof that open, academic AI can still make a real impact.

Equally fulfilling was the journey itself. Unlike industrial teams with access to large-scale, off-the-shelf RL infrastructure, we, a group of students and researchers from academia with zero prior experience in massive model training, had to build almost everything from scratch. We learned quickly, identified problems as they emerged, and fixed them with remarkable speed. Designing, scaling, and successfully training a 32B-parameter model within just three months remains one of the things I'm most proud of.

The Goedel Project began in October 2024. At that time, we had no serious experience training models that could compete with the best labs. DeepSeek-Prover-V1 and V1.5 looked unbeatable; they had started a year earlier and already set an incredibly high bar. We experimented with many ideas, including agentic pipelines and divide-and-conquer methods, most of which turned out to be too costly or impractical given our limited resources. Eventually, we discovered a simple yet powerful iterative-training approach that allowed us to scale efficiently within our compute limits. Bit by bit, we caught up with DeepSeek-Prover-V1.5, and then surpassed it.

Princeton winters are brutally cold. It was the first time I'd ever seen snow last for weeks. I spent the entire winter break at home, running experiments, analyzing results, and adjusting training methods and data again and again. That persistence paid off: in February, we released Goedel-Prover-V1-7B, which captured the top spot on the leaderboard. It was our first major milestone, proof that an academic team could compete with frontier models.

Our celebration was short-lived. In April, Kimi-Prover-72B and DeepSeek-Prover-V2-671B both arrived and completely outperformed us. It was a tough moment. We couldn't even host DeepSeek-Prover-V2-671B for inference locally; communication errors kept crashing our limited infrastructure. None of us had experience deploying or training models of that scale. Still, we decided to aim higher: to beat them in the next version.

We began by identifying the bottlenecks in DeepSeek's and Kimi's provers, exploring every possible angle for improvement. We experimented with compiler-based feedback loops, curriculum data synthesis, self-improvement strategies, model distillation, and model merging to improve diversity during RL. But the most critical insight was about efficiency: optimizing how we allocated limited resources across training design, data generation, and scaling. Every GPU hour had to count. After two months of exploration and countless small-scale tests, we finally established a systematic framework for the next release: Goedel-Prover-V2.

I led "The Big Run", a nearly month-long sequence of two self-improvement fine-tuning cycles followed by one large-scale RL round. We completed training just a few days before our scheduled release, leaving barely enough time for evaluation. Those last nights were intense, running tests, fixing scripts, collecting metrics, but everything came together perfectly. When we saw the final results, we could hardly believe them: Goedel-Prover-V2 solved twice as many problems on PutnamBench as DeepSeek-Prover-V2-671B.

Many people have since asked what the "key" was: how an academic team managed to outperform frontier labs using a fraction of their resources. There isn't a single magic trick, but rather a combination of principles that guided us:
* build solid infrastructure early
* focus on real bottlenecks instead of chasing novelty
* investigate broadly with small-scale experiments
* fix problems in real time
* optimize resource allocation carefully
* execute the final big run with precision
Each of these steps sounds simple, but together they made all the difference.

Now, at Thinking Machines Lab, I'm shifting focus beyond formal reasoning toward building general-purpose models. I'm deeply inspired by TML's mission of developing interactive AI systems and advancing open science. I'm thrilled to begin this new chapter and look forward to sharing more in the future.
18 replies · 18 retweets · 645 likes · 78.3K views
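[Editor's note: the "self-improvement fine-tuning cycles" described above follow the expert-iteration pattern common in formal theorem proving: sample proofs, keep only those the Lean compiler verifies, and fine-tune on them. A minimal sketch of one such round follows; `generate_proofs`, `lean_compiles`, and `finetune` are hypothetical stand-ins, since the team's actual pipeline is not described at this level of detail in the thread.]

```python
# Sketch of one expert-iteration ("self-improvement") round.
# generate_proofs, lean_compiles, and finetune are hypothetical stand-ins,
# NOT the actual Goedel-Prover training pipeline.
def self_improvement_round(model, statements, samples_per_statement=32):
    new_sft_data = []
    for stmt in statements:
        # Sample many candidate proofs for each formal statement.
        candidates = generate_proofs(model, stmt, n=samples_per_statement)
        # The Lean compiler acts as a perfect verifier: keep only proofs
        # that actually check. This is the "compiler feedback" signal.
        verified = [p for p in candidates if lean_compiles(stmt, p)]
        if verified:
            # One common heuristic: train on the shortest verified proof.
            new_sft_data.append((stmt, min(verified, key=len)))
    # Fine-tune on the newly verified proofs, then repeat the cycle.
    return finetune(model, new_sft_data)
```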
Jui-Hui Chung retweeted
Chi Jin @chijinML ·
🚀 Huge milestone from our Goedel-Prover team: we’ve just released a new state-of-the-art model (8B & 32B) for automated theorem proving—surpassing the previous best 671B DeepSeek model by a wide margin, all with academic compute!
Yong Lin @Yong18850571
[quoted tweet: the Goedel-Prover V2 announcement reproduced in full under the pinned tweet above]
4 replies · 12 retweets · 85 likes · 31.6K views
Jui-Hui Chung retweeted
Sanjeev Arora @prfsanjeevarora ·
Formal math taking off at @PrincetonPLI! The new Goedel-Prover v2 8B model matches the 2.5-month-old DeepSeek-Prover-V2-671B but is 80× smaller. Our 32B model is much better on all benchmarks (miniF2F, IMO, Putnam). I'm excited/shocked by how much this field has advanced in the 6-7 months since the release of Goedel v1. Many technical innovations in v2, e.g., letting the prover generate its own training questions and incorporating compiler feedback. Self-improvement will be very important for cracking college math (there isn't enough training data!). Kudos to the team, especially @Yong18850571 @chijinML. Onwards to v3!
Yong Lin @Yong18850571
[quoted tweet: the Goedel-Prover V2 announcement reproduced in full under the pinned tweet above]
0 replies · 19 retweets · 92 likes · 8.4K views
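[Editor's note: as context for what these provers actually produce, each benchmark problem is a formal statement in Lean, and a "solve" is a proof script the Lean compiler accepts; the compiler's accept/reject signal is the "compiler feedback" mentioned above. A toy example of the shape of such a statement and proof, assuming Lean 4 with Mathlib; it is illustrative only, not taken from MiniF2F or PutnamBench.]

```lean
import Mathlib

-- Toy example of the kind of goal a prover must close (illustrative,
-- not an actual benchmark problem). The prover emits the `by ...` script;
-- the Lean compiler either accepts it or reports an error.
theorem amgm_two (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]
```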