
(ICML 2025 Spotlight, top 2.6%) Multi-agent LLM systems still fail. But who caused the failure, and when?

🔥 We introduce Who&When, the first benchmark for automated failure attribution in LLM multi-agent systems. We also define a new task: automatically identifying which agent caused a failure, and at which step.

Paper: arxiv.org/abs/2505.00212
Code: github.com/mingyin1/Agent…
Dataset: huggingface.co/datasets/Kevin…

👀 Why automated failure attribution matters:
When an LLM agent system fails an evaluation, developers debug it manually by tracing long logs. This process is:
– labor-intensive
– error-prone
– poorly scalable
We ask: Can LLMs do the diagnosis themselves?

📦 Who&When Benchmark
We curated a dataset of 127 multi-agent systems from GAIA and AssistantBench, with rich annotations:
– failure-responsible agent ("Who")
– decisive error step ("When")
– natural language explanation ("Why")
Both algorithm-generated and hand-crafted systems are included.

🔍 We studied 3 failure attribution methods:
All-at-Once → The full log is judged in one pass
Step-by-Step → The log is analyzed turn by turn
Binary Search → The log is halved recursively until the culprit is found
Each has tradeoffs in accuracy, cost, and granularity.

⚖️ Key Findings:
– All-at-Once is best at identifying the responsible agent
– Step-by-Step is best at identifying the specific step
– Binary Search is a middle ground between the two
– Combining methods gives the best of both worlds, but increases token cost 📈

🌟 Takeaway: Automated failure attribution is the missing link between evaluation and improvement in multi-agent systems. We hope our dataset and findings spark new work in making LLM agents not only act, but also diagnose and learn from their own failures.

Very grateful to work with all authors! We are an amazing team.
@MingYin17762943 @JieyuZhang20 @Jiale_Leo @Zhiguang Han @BeFunky2345 @beibin79 @Chi_Wang_ @huazheng_wang @Yiran Chen @qingyun_wu

(If you find this project helpful, please consider giving us a ⭐️! Thanks!)

#LLMs #ICML2025 #AgenticAI #AG2 #Google #Meta
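
🧪 For anyone who wants the control flow of the three attribution methods at a glance, here is a rough sketch. It is not the code from our repo: the judge callables, the toy log, the agent names, and the fallback in Step-by-Step are all hypothetical stand-ins for the actual LLM prompts and data, so treat it as an illustration under those assumptions.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]     # (agent name, message); a simplified log entry
Verdict = Tuple[str, int]  # (responsible agent, 0-based decisive step index)


def all_at_once(log: List[Step], judge_full: Callable[[List[Step]], int]) -> Verdict:
    """One judge call over the whole log; the judge names the decisive step."""
    idx = judge_full(log)
    return log[idx][0], idx


def step_by_step(log: List[Step], is_error: Callable[[List[Step]], bool]) -> Verdict:
    """Show the judge a growing prefix and stop at the first step it flags."""
    for i in range(len(log)):
        if is_error(log[: i + 1]):
            return log[i][0], i
    # Assumption: if nothing is flagged, blame the final step.
    return log[-1][0], len(log) - 1


def binary_search(
    log: List[Step], in_range: Callable[[List[Step], int, int], bool]
) -> Verdict:
    """Ask which half holds the decisive error and recurse into that half."""
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if in_range(log, lo, mid):  # is the error in steps lo..mid (inclusive)?
            hi = mid
        else:
            lo = mid + 1
    return log[lo][0], lo


# Hypothetical toy log: the decisive error is assumed to be at step 3.
TOY_LOG: List[Step] = [
    ("Orchestrator", "plan the task"),
    ("WebSurfer", "search for the source"),
    ("Orchestrator", "ask for details"),
    ("WebSurfer", "open the wrong page"),
    ("Coder", "compute from wrong data"),
    ("Orchestrator", "report final answer"),
]
TRUE_STEP = 3

# Oracle stubs that stand in for LLM judges (hypothetical, for illustration only).
def oracle_full(log: List[Step]) -> int:
    return TRUE_STEP

def oracle_prefix(prefix: List[Step]) -> bool:
    return len(prefix) - 1 == TRUE_STEP

def oracle_range(log: List[Step], lo: int, hi: int) -> bool:
    return lo <= TRUE_STEP <= hi


if __name__ == "__main__":
    print(all_at_once(TOY_LOG, oracle_full))
    print(step_by_step(TOY_LOG, oracle_prefix))
    print(binary_search(TOY_LOG, oracle_range))
```

With perfect oracle judges all three agree on the same step. The differences the post describes only show up with real LLM judges: All-at-Once makes one call over a long context, Step-by-Step makes up to one call per step, and Binary Search makes roughly log2(n) calls over shrinking segments.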















