Zonglin Yang

602 posts

@Yang_zy223

Research Scientist @miromind_ai | LLMs for Scientific Discovery | Creator of the MOOSE series (Latest: MOOSE-Star 🌟)

Joined January 2022
721 Following · 483 Followers
Pinned Tweet
Zonglin Yang
Zonglin Yang@Yang_zy223·
Can we actually TRAIN LLMs for scientific discovery — or only prompt them to brainstorm? 🧬✨ 🎉 MOOSE-Star → #ICML2026 Most work on LLMs for hypothesis discovery focuses on inference-time agents or feedback-driven refinement. The core generative process — P(hypothesis | research background), or P(h|b) — has been largely sidestepped: directly training it remains an open problem. We show why: a combinatorial complexity barrier makes naive end-to-end training mathematically intractable. First scalable recipe for training P(h|b), with clean scaling laws on both training data and test-time compute. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF: huggingface.co/collections/Zo… 🧵👇
2
15
62
167.2K
Zonglin Yang retweeted
MiroMindAI
MiroMindAI@miromind_ai·
We post-train LLMs for math, for code, for instruction-following. Why not for scientific discovery? 🫎 MOOSE-Star (ICML 2026) : the first scalable SFT recipe for discipline-agnostic scientific hypothesis discovery. github.com/ZonglinY/MOOSE… By @Yang_zy223 & @LidongBing from MiroMind.
5
64
154
1.1M
@BioAI_Neuro
@BioAI_Neuro@BioAI_Pharma·
Please check this exciting study from @Yang_zy223 . Here is my summary: 1️⃣ Scalable training for discovery-LLMs We provide the first complexity analysis showing why directly training P(h|b) is intractable, then introduce a decomposed recipe reducing complexity from O(Nᵏ) → O(log N). A post-trained 7B model reaches near-frontier inspiration retrieval accuracy: • MS-7B: 54.4% • Gemini-3 Pro: 54.9% • GPT-5.4: 51.5% • Base 7B: 28.4% 2️⃣ Sample-efficient test-time scaling ~9,500 unguided brute-force samples still can’t match what MS-7B achieves with just 1–3 guided samples. Brute force plateaus early; guided sampling keeps scaling.
Zonglin Yang@Yang_zy223

Can we actually TRAIN LLMs for scientific discovery — or only prompt them to brainstorm? 🧬✨ 🎉 MOOSE-Star → #ICML2026 Most work on LLMs for hypothesis discovery focuses on inference-time agents or feedback-driven refinement. The core generative process — P(hypothesis | research background), or P(h|b) — has been largely sidestepped: directly training it remains an open problem. We show why: a combinatorial complexity barrier makes naive end-to-end training mathematically intractable. First scalable recipe for training P(h|b), with clean scaling laws on both training data and test-time compute. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF: huggingface.co/collections/Zo… 🧵👇

1
2
13
944
Bei Zhang
Bei Zhang@bei_zhang01·
Congrats @Yang_zy223 on #ICML2026! 🎉 MOOSE-Star is the first scalable approach to actually training LLMs for hypothesis generation — not just prompting. Big step for AI-driven scientific discovery. 🔬 arxiv.org/abs/2603.03756
Zonglin Yang@Yang_zy223

Can we actually TRAIN LLMs for scientific discovery — or only prompt them to brainstorm? 🧬✨ 🎉 MOOSE-Star → #ICML2026 Most work on LLMs for hypothesis discovery focuses on inference-time agents or feedback-driven refinement. The core generative process — P(hypothesis | research background), or P(h|b) — has been largely sidestepped: directly training it remains an open problem. We show why: a combinatorial complexity barrier makes naive end-to-end training mathematically intractable. First scalable recipe for training P(h|b), with clean scaling laws on both training data and test-time compute. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF: huggingface.co/collections/Zo… 🧵👇

1
0
1
161
Zonglin Yang
Zonglin Yang@Yang_zy223·
🔬 We post-train LLMs for math, for code, for instruction-following. Why not for scientific discovery? No model has been post-trained specifically for hypothesis generation. MOOSE-Star is a first step, with scaling laws suggesting there's much more to unlock.
MiroMindAI@miromind_ai

🚨 LLM-based scientific hypothesis discovery now has a scalable training recipe. MOOSE-Star, accepted at ICML 2026, enables scalable training for hypothesis generation, with more scalable test-time scaling. By our researchers— x.com/Yang_zy223/sta…

0
1
4
272
Leo Dianbo Liu
Leo Dianbo Liu@DianboLiu·
Wonderful work by Zonglin!
Zonglin Yang@Yang_zy223

Can we actually TRAIN LLMs for scientific discovery — or only prompt them to brainstorm? 🧬✨ 🎉 MOOSE-Star → #ICML2026 Most work on LLMs for hypothesis discovery focuses on inference-time agents or feedback-driven refinement. The core generative process — P(hypothesis | research background), or P(h|b) — has been largely sidestepped: directly training it remains an open problem. We show why: a combinatorial complexity barrier makes naive end-to-end training mathematically intractable. First scalable recipe for training P(h|b), with clean scaling laws on both training data and test-time compute. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF: huggingface.co/collections/Zo… 🧵👇

1
0
4
425
Zonglin Yang retweeted
Suorong Yang@ICML2026
Suorong Yang@ICML2026@Suorong_Yang·
Excited to share that Data Agent has been accepted to #icml2026 @icmlconf 🎉 Data Agent asks: Can a model learn which data it needs during training? Highlights: ✅ Modular reward designs ✅ Very lightweight agent ✅ Plug-and-play across vision models and LLMs
1
2
12
887
Zonglin Yang retweeted
Hui Chen
Hui Chen@chchenhui·
To what extent do AI-generated papers contain fabrications? 🚀Excited to introduce FabScore for fine-grained evaluation of fabrications in automated AI research. 🧵 We evaluate 144 AI-written papers from multiple sources, including @SakanaAILabs 's AI Scientist, MLR-Bench, @AnalemmaAI 's FARS and the 2025 #Agents4Science Open Conference. Among 54 real conference submissions, we find that approximately 70% contain at least one fabrication; even among accepted papers, the rate remains as high as 59.3%. 📰 Paper: chchenhui.github.io/papers/FabScor… 💻 Code: github.com/chchenhui/fabs… 1/
6
22
106
19.2K
Zonglin Yang
Zonglin Yang@Yang_zy223·
TL;DR — three contributions: 🔬 Theory — first analysis of why training P(h|b) is intractable (combinatorial complexity). 🛠 Training — first recipe that makes training P(h|b) tractable and scalable, with log-linear scaling laws. ⚡ Inference — continuous test-time scaling, breaking the brute-force complexity wall. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF:  huggingface.co/collections/Zo… Joint work with @LidongBing at @miromind_ai 🍅 Happy to discuss in the comments — questions and critiques welcome. #ICML2026 #AI4Science #LLM #AI4Research #Discovery
0
2
7
242
Zonglin Yang
Zonglin Yang@Yang_zy223·
Test-time scales — and brute-force collapses 🏔️ The most striking result. We pit brute-force sampling head-to-head against MOOSE-Star (MS): even with ~9,500 unguided samples per case, can brute-force match what MS produces with just 1–3 guided samples? It hits a hard "complexity wall." Brute-force's win rate against MS collapses as required inspirations k grows: 43% → 19% → 0% for k = 1, 2, 3. By k=3, brute-force never wins a single matchup — even at this massive sample budget. Overall, MS wins 61.5% of head-to-heads, with brute-force at just 23.9%. Decomposition turns an intractable discovery problem into a tractable search problem.
1
1
5
336
Zonglin Yang retweeted
MiroMindAI
MiroMindAI@miromind_ai·
🚀The MiroThinker API went live! Interactive scaling —up to 300 tool interactions per task × 256K context: 🧠Two engines: ▸ mirothinker-1-7-deepresearch (235B) — GAIA-Val 82.7 · BrowseComp 74.0 · HLE-Text 42.9 ▸ -mini (30B) — BrowseComp-ZH 72.3 (SOTA at 30B) Plus the agent infra goodies to ship them: 🔌 Disconnect-safe execution — submit / resume / cancel without losing work mid-run 📜Full traces on every run — each step, tool call, and decision logged. SFT / DPO-ready out of the box. Pre-freeze billing: if our platform fails, you get a full refund. Cancel mid-run: pay only what compute touched. From $1.25/M input, 25% OFF at launch. 🔑in the comments👇
2
8
16
1.6K
Zonglin Yang retweeted
Mariya I. Vasileva
Mariya I. Vasileva@mariyaivasileva·
Research shouldn’t turn into a deadline-driven content creation pipeline. Good research in my experience looks nothing like this. You think deeply about a problem, approach it from multiple angles, revise the direction repeatedly, solicit peer feedback, and refine. By iteration N, the paper usually looks quite different from the original idea, and what you thought was “the paper” might end up as just one section of the final result. Writing — specifically paper positioning, findings framing against prior work, visualization of main results, has always taken me even longer to solidify. Plus, peer review is already strained beyond capacity, this practice just makes it low-signal on both sides.
Amit LeVi@AmitLeViAI

Such a great evening to start a brand new research project for NeurIPS in 3.5 days.🧘‍♂️ Day 1: planning. Night 1: running experiments and sending the abstract. Day 2: reading results, fighting with Claude, and sending again. Night 2: sleep (optional). Day 3: opening Codex, and finally, writing the paper in parallel. Night 3: resolving the “beef” with Claude (temporary peace) and going to sleep. Day 4: final reading, last-minute fixes, submission, then some relaxation, maybe a beach walk. I’ll keep you posted on the results. This will be my only single-author paper, so I can’t hide behind other submissions if it gets rejected 😅

4
18
293
26.3K
Zonglin Yang retweeted
Zonglin Yang
Zonglin Yang@Yang_zy223·
Fully agreed. In fact, after checking papers across many disciplines, we find that most papers follow the A+B+C pattern. We derive a theory based on this and use it for automated scientific discovery: arxiv.org/abs/2603.03756 The fundamental assumption is that a hypothesis (h) comes from composing the research background (b) with multiple inspirations (i).
0
1
2
1.9K
Guanya Shi
Guanya Shi@GuanyaShi·
I’m so tired of writing rebuttals to this kind of “lack of novelty” review: “This paper trivially combines A, B, and C, so the algorithmic novelty is limited.” Technically, most (if not all) robotics papers are convex combinations of existing ideas. I still deeply appreciate A+B+C papers—especially when they deliver: - New capabilities: the “trivial combination” unlocks behaviors we simply couldn’t achieve before - Sensible & organic design: A+B+C is clearly the right composition—not some arbitrary A′+B+C′ - Nontrivial interactions: careful analysis of the dynamics, coupling, or failure modes between A, B, C - Rehabilitating old ideas: A was dismissed for years, but paired with modern B/C, it suddenly works—and teaches us why - System-level & "interface" insight: the contribution is not any single piece, but how the pieces talk to each other - Scaling laws or regimes: identifying when/why A+B+C works (and when it doesn’t) - Engineering clarity: making something actually work robustly in the real world is not “trivial” - New problem formulations: sometimes the real novelty is in the reformulation—only under this view does A+B+C make sense. Maybe worth keeping these in mind when reviewing the next A+B+C paper : )
29
122
979
113.5K
Zonglin Yang retweeted
Pushmeet Kohli
Pushmeet Kohli@pushmeet·
Our AlphaProof paper is in this week’s issue of @Nature! In 2024, @GoogleDeepMind's proof agents AlphaProof & AlphaGeometry together made a substantial leap in AI by achieving the silver-medal standard in solving IMO problems. The Nature paper describes the technical innovations required—in particular, the RL loop bridging natural language & symbolic rigor—that made AlphaProof possible.
24
104
720
83.9K
Zonglin Yang retweeted
Garrett Bingham
Garrett Bingham@gjb_ai·
Aletheia solved six FirstProof problems fully autonomously.
7
46
349
20.6K