Zonglin Yang (@Yang_zy223) - Twitter Profile

Pinned Tweet

Can we actually TRAIN LLMs for scientific discovery — or only prompt them to brainstorm? 🧬✨ 🎉 MOOSE-Star → #ICML2026 Most work on LLMs for hypothesis discovery focuses on inference-time agents or feedback-driven refinement. The core generative process — P(hypothesis | research background), or P(h|b) — has been largely sidestepped: directly training it remains an open problem. We show why: a combinatorial complexity barrier makes naive end-to-end training mathematically intractable. First scalable recipe for training P(h|b), with clean scaling laws on both training data and test-time compute. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF: huggingface.co/collections/Zo… 🧵👇

Zonglin Yang retweeted

MiroMindAI@miromind_ai·10h

We post-train LLMs for math, for code, for instruction-following. Why not for scientific discovery? 🫎 MOOSE-Star (ICML 2026): the first scalable SFT recipe for discipline-agnostic scientific hypothesis discovery. github.com/ZonglinY/MOOSE… By @Yang_zy223 & @LidongBing from MiroMind.

Zonglin Yang@Yang_zy223·1d

@BioAI_Pharma Thanks for sharing! The models and experiment results are fully available and reproducible! GitHub: github.com/ZonglinY/MOOSE… HuggingFace: huggingface.co/collections/Zo…

@BioAI_Neuro@BioAI_Pharma·1d

Please check this exciting study from @Yang_zy223. Here is my summary:

1️⃣ Scalable training for discovery-LLMs
We provide the first complexity analysis showing why directly training P(h|b) is intractable, then introduce a decomposed recipe reducing complexity from O(Nᵏ) → O(log N). A post-trained 7B model reaches near-frontier inspiration retrieval accuracy:
• MS-7B: 54.4%
• Gemini-3 Pro: 54.9%
• GPT-5.4: 51.5%
• Base 7B: 28.4%

2️⃣ Sample-efficient test-time scaling
~9,500 unguided brute-force samples still can’t match what MS-7B achieves with just 1–3 guided samples. Brute force plateaus early; guided sampling keeps scaling.
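
To get a feel for the gap between the O(Nᵏ) and O(log N) terms quoted above, here is a minimal Python sketch. It assumes, purely for illustration, that N is the size of a candidate inspiration pool and k is the number of inspirations one hypothesis needs. The pool sizes and the helper name `growth_table` are hypothetical and are not taken from the paper or its code.

```python
import math

def growth_table(pool_sizes, k):
    """Return (N, N**k, log2(N)) rows for each pool size N.

    N**k counts ordered k-tuples drawn from N candidates (an assumed
    reading of the O(N^k) term); log2(N) stands in for the O(log N) term.
    """
    return [(n, n ** k, math.log2(n)) for n in pool_sizes]

if __name__ == "__main__":
    # Hypothetical pool sizes, chosen only to show how fast N^k grows.
    for n, brute, decomposed in growth_table([1_000, 10_000, 100_000], k=3):
        print(f"N={n:>7,}  N^k={brute:.1e}  log2(N)={decomposed:.1f}")
```

The sketch only compares growth rates; it says nothing about how the decomposed recipe reaches the logarithmic term.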

Zonglin Yang@Yang_zy223·3d

@bei_zhang01 Thanks, Bei!

Bei Zhang@bei_zhang01·3d

Congrats @Yang_zy223 on #ICML2026! 🎉 MOOSE-Star is the first scalable approach to actually training LLMs for hypothesis generation — not just prompting. Big step for AI-driven scientific discovery. 🔬 arxiv.org/abs/2603.03756

Zonglin Yang@Yang_zy223·4d

🔬 We post-train LLMs for math, for code, for instruction-following. Why not for scientific discovery? No model has been post-trained specifically for hypothesis generation. MOOSE-Star is a first step, with scaling laws suggesting there's much more to unlock.

MiroMindAI@miromind_ai

🚨 LLM-based scientific hypothesis discovery now has a scalable training recipe. MOOSE-Star, accepted at ICML 2026, enables scalable training for hypothesis generation, with more scalable test-time scaling. By our researchers— x.com/Yang_zy223/sta…

Zonglin Yang retweeted

Ethan Xu@LinjieXu·7 May

Our latest work has been accepted as a regular paper by ICML 2026. Can't wait to see many old/new friends in Seoul. arxiv.org/abs/2602.13697

Ethan Xu@LinjieXu

(1/3) Enterprise RDBs rarely change their structure, but meet new ML tasks every day. The RDB foundation model (FM) fits this position well because no task-specific training is needed. Our latest work uses intra-column encoding and tabular FMs, achieving SOTA performance.

Zonglin Yang@Yang_zy223·4d

@DianboLiu Thanks Prof. Liu! 🙏

Leo Dianbo Liu@DianboLiu·4d

Wonderful work by Zonglin!

Zonglin Yang retweeted

Suorong Yang@ICML2026@Suorong_Yang·5d

Excited to share that Data Agent has been accepted to #icml2026 @icmlconf 🎉 Data Agent asks: Can a model learn which data it needs during training? Highlights: ✅ Modular reward designs ✅ Very lightweight agent ✅ Plug-and-play across vision models and LLMs

Zonglin Yang retweeted

Hui Chen@chchenhui·6d

To what extent do AI-generated papers contain fabrications? 🚀Excited to introduce FabScore for fine-grained evaluation of fabrications in automated AI research. 🧵 We evaluate 144 AI-written papers from multiple sources, including @SakanaAILabs 's AI Scientist, MLR-Bench, @AnalemmaAI 's FARS and the 2025 #Agents4Science Open Conference. Among 54 real conference submissions, we find that approximately 70% contain at least one fabrication; even among accepted papers, the rate remains as high as 59.3%. 📰 Paper: chchenhui.github.io/papers/FabScor… 💻 Code: github.com/chchenhui/fabs… 1/

Zonglin Yang@Yang_zy223·5d

TL;DR — three contributions: 🔬 Theory — first analysis of why training P(h|b) is intractable (combinatorial complexity). 🛠 Training — first recipe that makes training P(h|b) tractable and scalable, with log-linear scaling laws. ⚡ Inference — continuous test-time scaling, breaking the brute-force complexity wall. 📄 Paper: arxiv.org/abs/2603.03756 💻 GitHub: github.com/ZonglinY/MOOSE… 🤗 HF: huggingface.co/collections/Zo… Joint work with @LidongBing at @miromind_ai 🍅 Happy to discuss in the comments — questions and critiques welcome. #ICML2026 #AI4Science #LLM #AI4Research #Discovery

Zonglin Yang@Yang_zy223·5d

Test-time scales — and brute-force collapses 🏔️ The most striking result. We pit brute-force sampling head-to-head against MOOSE-Star (MS): even with ~9,500 unguided samples per case, can brute-force match what MS produces with just 1–3 guided samples? It hits a hard "complexity wall." Brute-force's win rate against MS collapses as required inspirations k grows: 43% → 19% → 0% for k = 1, 2, 3. By k=3, brute-force never wins a single matchup — even at this massive sample budget. Overall, MS wins 61.5% of head-to-heads, with brute-force at just 23.9%. Decomposition turns an intractable discovery problem into a tractable search problem.
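
Why unguided sampling might stall as k grows can be illustrated with a toy model. This is a sketch under strong assumptions and not the paper's experimental setup: each brute-force sample is assumed to pick k distinct inspirations uniformly at random from a pool, and exactly one combination counts as correct. The pool size of 1,000 and the function name `hit_probability` are hypothetical; only the ~9,500 sample budget comes from the tweet. The model does not reproduce the reported win rates.

```python
import math

def hit_probability(pool_size: int, k: int, samples: int) -> float:
    """P(at least one of `samples` independent uniform draws of k distinct
    items from a pool of `pool_size` equals one fixed target combination)."""
    combos = math.comb(pool_size, k)  # number of unordered k-subsets
    if combos == 0:
        raise ValueError("k must not exceed pool_size")
    if combos == 1:
        return 1.0 if samples > 0 else 0.0
    # 1 - (1 - 1/combos)**samples, computed stably when 1/combos is tiny
    return -math.expm1(samples * math.log1p(-1.0 / combos))

if __name__ == "__main__":
    POOL_SIZE = 1_000  # hypothetical pool size, not from the paper
    SAMPLES = 9_500    # sample budget quoted in the tweet
    for k in (1, 2, 3):
        p = hit_probability(POOL_SIZE, k, SAMPLES)
        print(f"k={k}: P(hit) = {p:.6f}")
```

Under these assumptions the hit probability drops by orders of magnitude with each extra required inspiration, which is the qualitative shape of a "complexity wall."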

Zonglin Yang retweeted

MiroMindAI@miromind_ai·11 May

🚀The MiroThinker API went live! Interactive scaling —up to 300 tool interactions per task × 256K context: 🧠Two engines: ▸ mirothinker-1-7-deepresearch (235B) — GAIA-Val 82.7 · BrowseComp 74.0 · HLE-Text 42.9 ▸ -mini (30B) — BrowseComp-ZH 72.3 (SOTA at 30B) Plus the agent infra goodies to ship them: 🔌 Disconnect-safe execution — submit / resume / cancel without losing work mid-run 📜Full traces on every run — each step, tool call, and decision logged. SFT / DPO-ready out of the box. Pre-freeze billing: if our platform fails, you get a full refund. Cancel mid-run: pay only what compute touched. From $1.25/M input, 25% OFF at launch. 🔑in the comments👇

Zonglin Yang retweeted

Mariya I. Vasileva@mariyaivasileva·4 May

Research shouldn’t turn into a deadline-driven content creation pipeline. Good research in my experience looks nothing like this. You think deeply about a problem, approach it from multiple angles, revise the direction repeatedly, solicit peer feedback, and refine. By iteration N, the paper usually looks quite different from the original idea, and what you thought was “the paper” might end up as just one section of the final result. Writing (specifically paper positioning, framing findings against prior work, and visualizing the main results) has always taken me even longer to solidify. Plus, peer review is already strained beyond capacity; this practice just makes it low-signal on both sides.

Amit LeVi@AmitLeViAI

Such a great evening to start a brand new research project for NeurIPS in 3.5 days.🧘‍♂️ Day 1: planning. Night 1: running experiments and sending the abstract. Day 2: reading results, fighting with Claude, and sending again. Night 2: sleep (optional). Day 3: opening Codex, and finally, writing the paper in parallel. Night 3: resolving the “beef” with Claude (temporary peace) and going to sleep. Day 4: final reading, last-minute fixes, submission, then some relaxation, maybe a beach walk. I’ll keep you posted on the results. This will be my only single-author paper, so I can’t hide behind other submissions if it gets rejected 😅

Zonglin Yang retweeted

Nathan Lambert@natolambert·30 Apr

PhD students are normally known by their one biggest paper. It’ll be “oh you’re the X guy”.

Xiuyu Li@sheriyuo

For AI PhDs aiming for industry, paper count matters, but only up to a point. In China, 2 to 3 (co)first author CCF-A papers is often the borderline for a Top Talent offer. Beyond that, the marginal gain drops fast. When you apply as a fresh grad, what matters more is whether you have matched experience in a big tech foundation model team. As a PhD, papers can feel like a huge part of the world. After graduation, people see it differently. And for CS PhDs, AI and LLMs are only a small slice. Many groups do not even send students to industry internships the way LLM teams do, and industry itself is much bigger than LLMs. Paper is only one part of you. Your experience matters more. The LLM boom is a winner takes all arena shaped by extreme competition, where only the hardest driving survivors make it to the top. The LLM era is a golden age for the survivors of a relentless, grind-or-die rat race.

Zonglin Yang@Yang_zy223·25 Mar

Fully agreed. In fact, after checking papers across many disciplines, we find that most of them follow the A+B+C pattern. We derived a theory based on this observation and use it for automated scientific discovery: arxiv.org/abs/2603.03756 The fundamental assumption is that a hypothesis (h) is a composition of the research background (b) and multiple inspirations (i).
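
One way to write this composition assumption in symbols is sketched below. The composition function f, the inspiration pool 𝓘 of size N, and the count k are notation introduced here for illustration and may differ from the paper's own formulation. The second line is just the chain rule of probability applied to the joint over the inspirations and the hypothesis.

```latex
% Assumed notation: f, \mathcal{I}, N, and k are illustrative, not the paper's.
h = f(b, i_1, \dots, i_k), \qquad i_j \in \mathcal{I}, \quad |\mathcal{I}| = N

% Chain rule over the inspirations and the hypothesis, given the background b:
P(h, i_1, \dots, i_k \mid b)
  = \Bigl[\, \prod_{j=1}^{k} P(i_j \mid b, i_1, \dots, i_{j-1}) \Bigr]
    \, P(h \mid b, i_1, \dots, i_k)
```

Under this reading, choosing all k inspirations at once means searching over N^k ordered tuples, whereas the factorized form picks them one step at a time.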

Guanya Shi@GuanyaShi·25 Mar

I’m so tired of writing rebuttals to this kind of “lack of novelty” review: “This paper trivially combines A, B, and C, so the algorithmic novelty is limited.”

Technically, most (if not all) robotics papers are convex combinations of existing ideas. I still deeply appreciate A+B+C papers, especially when they deliver:
- New capabilities: the “trivial combination” unlocks behaviors we simply couldn’t achieve before
- Sensible & organic design: A+B+C is clearly the right composition, not some arbitrary A′+B+C′
- Nontrivial interactions: careful analysis of the dynamics, coupling, or failure modes between A, B, C
- Rehabilitating old ideas: A was dismissed for years, but paired with modern B/C, it suddenly works and teaches us why
- System-level & "interface" insight: the contribution is not any single piece, but how the pieces talk to each other
- Scaling laws or regimes: identifying when/why A+B+C works (and when it doesn’t)
- Engineering clarity: making something actually work robustly in the real world is not “trivial”
- New problem formulations: sometimes the real novelty is in the reformulation; only under this view does A+B+C make sense.

Maybe worth keeping these in mind when reviewing the next A+B+C paper : )

Zonglin Yang retweeted

Pushmeet Kohli@pushmeet·20 Mar

Our AlphaProof paper is in this week’s issue of @Nature! In 2024, @GoogleDeepMind's proof agents AlphaProof & AlphaGeometry together made a substantial leap in AI by achieving the silver-medal standard in solving IMO problems. The Nature paper describes the technical innovations required—in particular, the RL loop bridging natural language & symbolic rigor—that made AlphaProof possible.

Zonglin Yang retweeted

Garrett Bingham@gjb_ai·25 Feb

Aletheia solved six FirstProof problems fully autonomously.
