

Martin Gubri

@framart1
Research Lead @parameterlab working on Trustworthy AI | he/him Other accounts: 🦋 mgubri | 🐘 @[email protected]





The Datasets & Benchmarks track is now "Evaluation and Datasets", with an expanded scope for NeurIPS 2026! Read the call for papers neurips.cc/Conferences/20…, and learn more about the changes in our blog post: blog.neurips.cc/2026/03/23/int…

1/ Evaluating a single-agent harness is hard. Evaluating a multi-agent system? That's a whole different problem. Most eval tools treat the model as the unit of analysis. But in multi-agent systems, the system is what matters. That's why we built MASEval 🧵 #Agents #AI #Eval

🚨 Fine-tuning your model to be more helpful or empathetic might be making it less private, without you noticing. In our latest work, we show that benign fine-tuning can silently break contextual privacy in language models while safety & general capabilities appear intact. ⬇️


🪩 Evaluate your LLMs on benchmarks like MMLU at 1% of the cost. In our new paper, we show that outputs on a small subset of test samples that maximise diversity in model responses are predictive of full-dataset performance. Project page: arubique.github.io/disco-site/ More below 🧵👇
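The idea of scoring a diversity-maximising subset instead of the full benchmark can be sketched roughly as below. This is an illustrative stand-in, not the paper's actual method: it uses greedy farthest-point selection over hypothetical response embeddings, and random data in place of real model outputs.

```python
import numpy as np

def select_diverse_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point selection: pick k samples that maximise
    pairwise diversity (a simple proxy for a diversity criterion)."""
    chosen = [0]  # seed with the first sample
    # distance from every point to its nearest already-chosen point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))  # farthest point from the chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

def estimate_benchmark_score(per_sample_correct: np.ndarray, subset: list[int]) -> float:
    """Mean accuracy on the subset as a cheap estimate of the full-set score."""
    return float(per_sample_correct[subset].mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))          # stand-in response embeddings
correct = rng.random(1000) < 0.7           # stand-in per-sample correctness
subset = select_diverse_subset(emb, k=10)  # ~1% of the benchmark
print(round(estimate_benchmark_score(correct, subset), 2))
```

With real data, `emb` would come from embedding model responses and `correct` from grading them; the subset score then serves as the low-cost estimate.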

Dr.LLM: Dynamic Layer Routing in LLMs. A neat technique to reduce computation in LLMs while improving accuracy: routers cut roughly 3 to 11 layers per query while increasing accuracy. My notes below:
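Dynamic layer routing in the spirit described above can be sketched as follows. This is a toy illustration, not Dr.LLM's actual routers: random weights stand in for trained layers, and a sigmoid gate on the mean hidden state decides whether each layer runs or is skipped.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(dim: int) -> dict:
    # random weights stand in for a trained feed-forward layer and its router
    return {"W": rng.normal(scale=0.1, size=(dim, dim)),
            "router": rng.normal(size=dim)}

def routed_forward(x: np.ndarray, layers: list[dict], threshold: float = 0.5):
    """Run each layer only if its router fires; return output and layer count."""
    executed = 0
    for layer in layers:
        gate = 1 / (1 + np.exp(-x.mean(axis=0) @ layer["router"]))  # sigmoid gate
        if gate < threshold:
            continue                       # skip: this layer's compute is saved
        x = x + np.tanh(x @ layer["W"])    # residual update when the layer runs
        executed += 1
    return x, executed

x = rng.normal(size=(8, 64))               # 8 tokens, hidden dim 64
layers = [make_layer(64) for _ in range(12)]
out, executed = routed_forward(x, layers)
print(f"executed {executed}/12 layers")
```

The savings claimed in the paper would correspond to `executed` landing several layers below the full depth on a typical query, with routers trained so the skips do not hurt accuracy.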


