David Ma

18 posts

David Ma

David Ma

@dma2077

M-A-P, StepFun OProver, IV-Bench, ScaleLong

北京 Katılım Ekim 2023
249 Takip Edilen20 Takipçiler
Sabitlenmiş Tweet
David Ma
David Ma@dma2077·
Thrilled to release our new paper: “OProver: A Unified Framework for Agentic Formal Theorem Proving.” TL;DR: We train Lean 4 theorem provers to use retrieval, compiler feedback, and iterative repair as part of the proving policy—not just as test-time tricks. OProver-32B achieves SOTA open-weight whole-proof proving results, reaching 93.3% Pass@32 on MiniF2F, 58.2% on ProverBench, and 11.3% on PutnamBench.
David Ma tweet media
English
1
17
29
2.6K
David Ma
David Ma@dma2077·
8/9 We release not just model weights, but the full stack: ✅ 7 OProver checkpoints across 8B / 32B ✅ OProofs: 1.77M statements, 6.86M verified proofs, 1.07M trajectories ✅ Training, inference, and evaluation code All available on Hugging Face and GitHub.
English
1
0
0
80
David Ma
David Ma@dma2077·
Thrilled to release our new paper: “OProver: A Unified Framework for Agentic Formal Theorem Proving.” TL;DR: We train Lean 4 theorem provers to use retrieval, compiler feedback, and iterative repair as part of the proving policy—not just as test-time tricks. OProver-32B achieves SOTA open-weight whole-proof proving results, reaching 93.3% Pass@32 on MiniF2F, 58.2% on ProverBench, and 11.3% on PutnamBench.
David Ma tweet media
English
1
17
29
2.6K
David Ma
David Ma@dma2077·
5/9 Across five Lean 4 benchmarks, OProver-32B achieves 3 best and 2 second-best Pass@32 results — more top placements than any prior open-weight whole-proof prover. It outperforms Goedel-Prover-V2-32B, DeepSeek-Prover-V2-671B despite ~21× fewer parameters, and base LongCat-Flash-Prover despite LongCat’s 560B scale. Even OProver-8B beats Goedel-Prover-V2-32B on all five benchmarks.
David Ma tweet media
English
0
0
0
36
David Ma
David Ma@dma2077·
4/9 Data and policy co-evolve. Each post-training iteration runs agentic rollouts with the current prover, folds newly verified proofs into OProofs and retrieval memory, turns repair traces into SFT data, and uses hard unresolved cases for RL. The next prover trains with richer supervision and retrieves from a richer memory.
English
0
0
0
19
David Ma
David Ma@dma2077·
3/9 We also release OProofs: 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized proving trajectories. Prior Lean datasets mostly keep the final proof. OProofs keeps the process: retrieved context, failed attempts, compiler feedback, and repairs — enabling trajectory-level supervision for agentic proving.
David Ma tweet media
English
0
0
0
17
David Ma
David Ma@dma2077·
2/9 Most Lean provers today are trained as single-shot whole-proof generators. Retrieval, compiler feedback, and self-correction are often bolted on only at inference time, creating a train–inference mismatch. OProver closes this gap with a unified loop: OProofs builds the corpus, agentic proving retrieves/verifies/repairs proofs, and agentic training folds verified proofs and repair traces back into the model.
David Ma tweet media
English
0
0
0
22
David Ma retweetledi
Ge Zhang
Ge Zhang@GeZhang86038849·
[1/n] 🚨 Game On for LLM Reasoning—Meet KORGym! 🎮✨ Ever wondered how to truly assess an LLM’s reasoning ability beyond memorized knowledge? Meet our latest breakthrough: KORGym—a dynamic, multi-turn game platform built to reveal the real reasoning skills of language models! ✨
Ge Zhang tweet media
English
1
10
34
4.1K
David Ma retweetledi
Ge Zhang
Ge Zhang@GeZhang86038849·
[1/5] 💥 Facing the LLM Scaling Challenge Head-On! 💥 Glad to introduce MGA: Reformulation for Pretraining Data Augmentation! The AI world is grappling with data limitations and the performance hit from data repetition. We introduce MGA (Massive Genre-Audience) Reformulation – a groundbreaking, scalable data augmentation method to revolutionize how we train LLMs! 📄 Paper: arxiv.org/abs/2502.04235 #LLM #DataAugmentation #AIResearch #Innovation
Ge Zhang tweet media
English
3
19
80
10.4K
David Ma retweetledi
Ge Zhang
Ge Zhang@GeZhang86038849·
[1/n] 🎉We are very pleased to introduce FineFineWeb, which is currently the largest open-source fully automatic classification practice for fine-grained web data. Specifically, our contributions are as follows: 🔪We decompose the entire deduplicated version of Fineweb into 67 categories with a significant amount of seed data. 🧮We conduct a correlation analysis between vertical categories as well as between vertical categories and common Benchmarks for FineFineWeb, and also provided the distribution analysis of URLs and other content. 🧑‍⚖️We provide test sets for PPL evaluation based on the 67 selected vertical domains of FineFineWeb, and offer a "small cup" (Validation) and a "medium cup" (Test). 🪙We provide all the full-process materials for training fasttext and bert. 📅We will give suggestions on data proportioning based on our dataset. (Based on RegMix, Coming Soon in our Report! [Due to tight computing power, it will be as soon as possible])
Ge Zhang tweet media
English
7
44
160
24.3K
David Ma retweetledi
Ge Zhang
Ge Zhang@GeZhang86038849·
[1/n] 🔥 Happy to Introduce FullStack Bench: A comprehensive evaluation dataset, focusing on full-stack programming across 16 languages and more than 11 real-world application domains like data analysis, software engineering, and machine learning. Whether or not your CodeLLM is a FullStack Coder instead of an leetcode nerd? It's time to put your code LLMs to the test!!! 📝
Ge Zhang tweet mediaGe Zhang tweet media
English
11
33
136
46.5K