
New research from Phylo: We rigorously evaluated today's evals for biology agents and identified major issues: under-specified questions, incorrect ground truth, and, most importantly, evaluations that score only the final answer rather than the analytical process.

To address this, we:
- Refined an existing benchmark into BixBench-Verified-50
- Introduced BiomniBench, the first trace-based, real-world evaluation for AI agents in biology

BiomniBench evaluates not just the output, but the full analytical workflow: data handling, method selection, statistical rigor, and biological interpretation. This mirrors how scientists evaluate each other's work, and how agents should be evaluated too.

Preliminary results: Biomni Lab achieves the strongest performance across both benchmarks. It performs on par with senior/principal scientists from large pharma companies (>5 years of experience) and surpasses junior scientists (~3 years of experience).

We've open-sourced BixBench-Verified-50 and will release the full BiomniBench suite in the coming weeks.

Read the full technical report: phylo.bio/blog/evaluatin…
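To make the trace-based idea concrete, here is a minimal sketch of scoring a workflow trace stage by stage rather than only its final answer. The stage names come from the post; the weighting scheme, function names, and example grades are illustrative assumptions, not BiomniBench's actual rubric:

```python
# Hypothetical trace-based scoring sketch: grade each workflow stage,
# not just the final answer. Stage names are from the post; the equal
# weights and the grading scale are illustrative assumptions.

STAGE_WEIGHTS = {
    "data_handling": 0.25,
    "method_selection": 0.25,
    "statistical_rigor": 0.25,
    "biological_interpretation": 0.25,
}

def score_trace(stage_grades: dict) -> float:
    """Weighted average of per-stage grades, each in [0, 1]."""
    missing = set(STAGE_WEIGHTS) - set(stage_grades)
    if missing:
        raise ValueError(f"trace missing stages: {sorted(missing)}")
    return sum(STAGE_WEIGHTS[s] * stage_grades[s] for s in STAGE_WEIGHTS)

# Example: an agent that reaches the right biological interpretation
# but with weak statistical rigor still loses credit on the trace.
trace = {
    "data_handling": 1.0,
    "method_selection": 0.8,
    "statistical_rigor": 0.4,
    "biological_interpretation": 1.0,
}
print(round(score_trace(trace), 2))  # 0.8
```

An answer-only eval would give this agent full marks; a trace-based score surfaces the weak statistics, which is the gap the post argues matters.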
