Zhen Wang (@zhenwang9102)
🤖🔬 Can AI actually do science end-to-end?
🧠📈 And how would we know when it matches, or surpasses, humans?
⚡🧪 AI is rapidly automating scientific discovery, but benchmarking full-cycle discovery, from 💡 ideation → 🧑💻 execution → 📊 conclusions, remains unsolved:
🧐🧐🧐
❌🛠️ Open-ended discovery → manual validation (costly, unscalable)
❌📏 Metric-driven benchmarks (e.g., MLE-Bench) → convenient but narrow (is higher accuracy really enough?)
❌🤖⚖️ LLM-as-judge → useful, but fundamentally risky if used alone
🔥🚀 Introducing FIRE-Bench🔥: Full-cycle Insight Rediscovery Evaluation
👉🌐 firebench.github.io
📚✨ A benchmark that turns fresh, human-verified insights from recent 🏆 NeurIPS / ICLR / ICML papers into masked, end-to-end discovery challenges 🧩
🌍🔐 Constrained open-ended discovery, backed by ground truth.
📌 Key takeaways:
1⃣ 📖🧱 Reference-based evaluation still matters: constrained LLM judging helps, but human-grounded references remain essential until agents can consistently match human conclusions
2⃣ 🏆🧠 Expert-validated ground truth: all tasks come from recent NeurIPS / ICLR / ICML papers, with contamination carefully controlled
3⃣ 🔁🎭 Rediscovery, not reproduction: original 🧪 methods, 📊 experiments, 💻 implementations, and 📈 analyses are fully masked to create real discovery challenges
🔑 Key empirical findings:
💡 The "Science Gap" is Real: Even the best setup (Claude Code + Sonnet-4) caps out at an F1 score of 46.7. On hard tasks, agents struggle to break 30
💡 Success is a "Lottery": Performance has incredibly high variance. Reliability is a major unsolved issue.
💡 Coding is no longer the bottleneck; high-level reasoning and analysis are: ~74% of errors stem from flawed planning, not coding
⚙️ How it works:
🔹 Research-Problem Trees: We parse each paper into a tree, from broad root questions down to concrete leaf experiments, then select intermediate nodes that balance open-ended exploration with verifiable ground truth (see the first sketch after this list).
🔹 Claim-Level Evaluation: We decompose AI and human conclusions into granular claims and match them, scoring agreement with a claim-level F1 (see the second sketch after this list).
🔹 Creativity Check: We score false positives to see if agents are finding novel truths (Spoiler🚨: they aren’t creative yet).
🔹 New Diagnostic Taxonomy: We trace failures across four stages, 🧠 Planning → 🛠️ Implementation → ▶️ Execution → 🧾 Conclusion.
🔹 Additional Analyses: cost efficiency, contamination checks, and more.
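🧩 To make the tree idea concrete, here's a minimal Python sketch of intermediate-node selection. The `ProblemNode` structure, the depth rule, and the example questions are illustrative assumptions, not FIRE-Bench's actual implementation:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: node schema and depth-based selection rule
# are assumptions, not the FIRE-Bench codebase.

@dataclass
class ProblemNode:
    question: str                                   # research question at this node
    children: list["ProblemNode"] = field(default_factory=list)

def intermediate_nodes(root: ProblemNode, min_depth: int = 1) -> list[ProblemNode]:
    """Collect non-root, non-leaf nodes: specific enough to verify against
    the paper's ground truth, broad enough to leave the approach open-ended."""
    selected, stack = [], [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.children and depth >= min_depth:    # skip root and leaves
            selected.append(node)
        stack.extend((child, depth + 1) for child in node.children)
    return selected

# Hypothetical example: broad root question, concrete leaf experiments.
root = ProblemNode("How do LLM agents fail at research tasks?", [
    ProblemNode("Which pipeline stage causes most failures?", [
        ProblemNode("Measure planning-error rate on task X"),
        ProblemNode("Measure execution-error rate on task X"),
    ]),
])
print([n.question for n in intermediate_nodes(root)])
```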
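📊 And a minimal sketch of claim-level F1, assuming conclusions are already decomposed into atomic claims. The exact-match `matches` function is a stand-in; in practice the matching is semantic (e.g., the constrained LLM judging noted above):

```python
# Sketch under stated assumptions: claims are pre-decomposed strings,
# and `matches` approximates semantic equivalence with exact match.

def matches(pred: str, ref: str) -> bool:
    return pred.strip().lower() == ref.strip().lower()  # stand-in matcher

def claim_f1(pred_claims: list[str], ref_claims: list[str]) -> float:
    """Precision over predicted claims, recall over reference claims."""
    if not pred_claims or not ref_claims:
        return 0.0
    matched_refs, tp = set(), 0
    for p in pred_claims:
        for i, r in enumerate(ref_claims):
            if i not in matched_refs and matches(p, r):  # one-to-one matching
                matched_refs.add(i)
                tp += 1
                break
    precision = tp / len(pred_claims)
    recall = tp / len(ref_claims)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Hypothetical claims: one of two matches on each side -> F1 = 0.50
pred = ["dropout improves accuracy", "larger batches hurt calibration"]
ref = ["dropout improves accuracy", "warmup is unnecessary"]
print(f"{claim_f1(pred, ref):.2f}")
```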
👀 The Future:
🚀 Live-FIRE-Bench: a continuously updated version of FIRE-Bench that tracks real-time progress on the latest research (the newest LLMs should be benchmarked on the newest research)
🚀 Stronger scaffolding (search + planning + coding) 🧠🧰 and converting FIRE-Bench into interactive environments for training research agents
🚀 Toward real creativity: building systems that can produce genuinely novel conclusions 🎨⏳
🚀 Better systems 🧠✨ and better benchmarks 📏 must co-evolve 🔄 over time
📜🎥 Paper, video, demo, and research trees:
👉🌐 firebench.github.io
#AI 🤖 #MachineLearning 📚 #AI4Science 🔬 #LLMs 🧠 #Research 🧪 #AgenticAI 🚀 #FireBench 🔥