
Excited to share our latest work at COLM 2025: "Model-Agnostic Policy Explanations with Large Language Models", with @SophieYueGuo, Shufei Chen, @SimonStepputtis, @MatthewGombolay, Katia Sycara, and @joecampb. 🤖

What if we could explain a robot's behavior to anyone, in natural language, without needing access to the underlying policy weights?

We propose a model-agnostic method that distills an agent's observed behavior into a structured, interpretable surrogate model amenable to reasoning. This representation then guides an LLM to generate accurate and comprehensible natural language explanations.

We demonstrate that our approach:
✅ Significantly reduces hallucination
✅ Outperforms baselines in explanation quality and action prediction
✅ Nearly matches human experts in user studies
❗ And shows that people can’t reliably detect hallucinated explanations, making faithful explanation methods more urgent than ever.

📄 Paper: arxiv.org/abs/2504.05625

Always happy to chat if this intersects with your interests in AI safety, interpretability, or human-AI interaction!
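
For the curious, here is a rough sketch of the idea in code (illustrative only, not the paper's implementation; names like BlackBoxPolicy-style `policy`, `collect_rollouts`, and `build_explanation_prompt` are my own placeholders, and the decision-tree surrogate is just one possible interpretable choice): roll out the policy to collect (observation, action) pairs, fit a small surrogate that mimics the behavior, and hand its extracted rules to an LLM as grounding for the explanation.

```python
# Illustrative sketch only -- function names and the decision-tree surrogate
# are assumptions for exposition, not the paper's actual pipeline.
from sklearn.tree import DecisionTreeClassifier, export_text

def collect_rollouts(policy, env, n_steps=5000):
    """Query the policy as a black box; no access to its weights is needed."""
    observations, actions = [], []
    obs, _ = env.reset()
    for _ in range(n_steps):
        action = policy(obs)  # the policy is treated as an opaque function
        observations.append(obs)
        actions.append(action)
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
    return observations, actions

def distill_surrogate(observations, actions, feature_names):
    """Fit a small, interpretable surrogate of the observed behavior."""
    tree = DecisionTreeClassifier(max_depth=4).fit(observations, actions)
    return export_text(tree, feature_names=feature_names)  # human-readable rules

def build_explanation_prompt(rules, question):
    """Ground the LLM in the surrogate's rules so explanations stay faithful."""
    return (
        "You are explaining a robot's behavior to a non-expert.\n"
        "The robot's decision rules, distilled from observing it, are:\n"
        f"{rules}\n"
        f"Question: {question}\n"
        "Answer only using the rules above."
    )
```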






