David Ma

18 posts

David Ma

@dma2077

M-A-P, StepFun OProver, IV-Bench, ScaleLong

北京 Katılım Ekim 2023

249 Takip Edilen20 Takipçiler

Sabitlenmiş Tweet

David Ma@dma2077·2d

Thrilled to release our new paper: “OProver: A Unified Framework for Agentic Formal Theorem Proving.” TL;DR: We train Lean 4 theorem provers to use retrieval, compiler feedback, and iterative repair as part of the proving policy—not just as test-time tricks. OProver-32B achieves SOTA open-weight whole-proof proving results, reaching 93.3% Pass@32 on MiniF2F, 58.2% on ProverBench, and 11.3% on PutnamBench.

English

2.6K

David Ma@dma2077·2d

9/9 Check out OProver and let us know what you build with it. Special thanks to all M-A-P co-authors and the open-source Lean community. 📄 arXiv: arxiv.org/pdf/2605.17283 🤗 HF Paper: huggingface.co/papers/2605.17… 💻 Code: github.com/multimodal-art… 🤗 Models + Data: huggingface.co/collections/m-…

English

David Ma@dma2077·2d

8/9 We release not just model weights, but the full stack: ✅ 7 OProver checkpoints across 8B / 32B ✅ OProofs: 1.77M statements, 6.86M verified proofs, 1.07M trajectories ✅ Training, inference, and evaluation code All available on Hugging Face and GitHub.

English

David Ma@dma2077·2d

English

2.6K

David Ma@dma2077·2d

5/9 Across five Lean 4 benchmarks, OProver-32B achieves 3 best and 2 second-best Pass@32 results — more top placements than any prior open-weight whole-proof prover. It outperforms Goedel-Prover-V2-32B, DeepSeek-Prover-V2-671B despite ~21× fewer parameters, and base LongCat-Flash-Prover despite LongCat’s 560B scale. Even OProver-8B beats Goedel-Prover-V2-32B on all five benchmarks.

English

David Ma@dma2077·2d

4/9 Data and policy co-evolve. Each post-training iteration runs agentic rollouts with the current prover, folds newly verified proofs into OProofs and retrieval memory, turns repair traces into SFT data, and uses hard unresolved cases for RL. The next prover trains with richer supervision and retrieves from a richer memory.

English

David Ma@dma2077·2d

3/9 We also release OProofs: 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized proving trajectories. Prior Lean datasets mostly keep the final proof. OProofs keeps the process: retrieved context, failed attempts, compiler feedback, and repairs — enabling trajectory-level supervision for agentic proving.

English

David Ma@dma2077·2d

2/9 Most Lean provers today are trained as single-shot whole-proof generators. Retrieval, compiler feedback, and self-correction are often bolted on only at inference time, creating a train–inference mismatch. OProver closes this gap with a unified loop: OProofs builds the corpus, agentic proving retrieves/verifies/repairs proofs, and agentic training folds verified proofs and repair traces back into the model.

English

David Ma retweetledi

Ge Zhang@GeZhang86038849·21 May

[1/n] 🚨 Game On for LLM Reasoning—Meet KORGym! 🎮✨ Ever wondered how to truly assess an LLM’s reasoning ability beyond memorized knowledge? Meet our latest breakthrough: KORGym—a dynamic, multi-turn game platform built to reveal the real reasoning skills of language models! ✨

English

4.1K

David Ma retweetledi

Ge Zhang@GeZhang86038849·20 May

[1/5] 💥 Facing the LLM Scaling Challenge Head-On! 💥 Glad to introduce MGA: Reformulation for Pretraining Data Augmentation! The AI world is grappling with data limitations and the performance hit from data repetition. We introduce MGA (Massive Genre-Audience) Reformulation – a groundbreaking, scalable data augmentation method to revolutionize how we train LLMs! 📄 Paper: arxiv.org/abs/2502.04235 #LLM #DataAugmentation #AIResearch #Innovation

English

10.4K

David Ma retweetledi

Qian Liu@sivil_taram·9 Oca

🎉 Announcing the first Open Science for Foundation Models (SCI-FM) Workshop at #ICLR2025! Join us in advancing transparency and reproducibility in AI through open foundation models. 🤝 Looking to contribute? Join our Program Committee: bit.ly/4acBBjF 🔍 Learn more at: open-foundation-model.github.io #OpenScience #MachineLearning #FoundationModels 1/N

English

173

41.1K

David Ma retweetledi

Ge Zhang@GeZhang86038849·18 Ara

[1/n] 🎉We are very pleased to introduce FineFineWeb, which is currently the largest open-source fully automatic classification practice for fine-grained web data. Specifically, our contributions are as follows: 🔪We decompose the entire deduplicated version of Fineweb into 67 categories with a significant amount of seed data. 🧮We conduct a correlation analysis between vertical categories as well as between vertical categories and common Benchmarks for FineFineWeb, and also provided the distribution analysis of URLs and other content. 🧑‍⚖️We provide test sets for PPL evaluation based on the 67 selected vertical domains of FineFineWeb, and offer a "small cup" (Validation) and a "medium cup" (Test). 🪙We provide all the full-process materials for training fasttext and bert. 📅We will give suggestions on data proportioning based on our dataset. (Based on RegMix, Coming Soon in our Report! [Due to tight computing power, it will be as soon as possible])

English

160

24.3K

David Ma retweetledi

Ge Zhang@GeZhang86038849·4 Ara

[1/n] 🔥 Happy to Introduce FullStack Bench: A comprehensive evaluation dataset, focusing on full-stack programming across 16 languages and more than 11 real-world application domains like data analysis, software engineering, and machine learning. Whether or not your CodeLLM is a FullStack Coder instead of an leetcode nerd? It's time to put your code LLMs to the test!!! 📝

English

136

46.5K

Keşfet

@elonmusk @BarackObama @taylorswift13 @cristiano @BillGates @NASA @nikifrancismediavine @katyperry