
Ece Kamar

NEW paper from Microsoft

Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded?

Microsoft researchers introduce the Universal Verifier and share lessons learned from building best-in-class verifiers for web tasks. It's built on four principles:

- non-overlapping rubrics
- separate process vs. outcome rewards
- distinguishing controllable from uncontrollable failures
- divide-and-conquer context management across full screenshot trajectories

It reduces false positive rates to near zero, down from 45%+ (WebVoyager) and 22%+ (WebJudge). Without reliable verifiers, both benchmarks and training data are corrupted.

One interesting finding: an auto-research agent reached 70% of expert verifier quality in 5% of the time, but it couldn't discover the structural design decisions that drove the biggest gains. Human expertise and automated optimization play complementary roles.

Paper: arxiv.org/abs/2604.06240
Learn to build effective AI agents in our academy: academy.dair.ai
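To make the four principles concrete, here is a minimal Python sketch of how a rubric-based verifier might separate process from outcome rewards and discount uncontrollable failures. Everything in it (the RubricItem and verify names, the field layout, and the aggregation rule) is an illustrative assumption rather than the paper's implementation, and the divide-and-conquer handling of full screenshot trajectories is not modeled here.

```python
# Hypothetical sketch of the verifier structure described in the post.
# Rubric items, names, and scoring logic are illustrative assumptions,
# not the paper's actual design.
from dataclasses import dataclass
from enum import Enum


class RewardType(Enum):
    PROCESS = "process"  # did the agent take sound intermediate steps?
    OUTCOME = "outcome"  # did the final state satisfy the task goal?


@dataclass
class RubricItem:
    description: str          # one non-overlapping criterion
    reward_type: RewardType
    passed: bool
    controllable: bool = True  # False for failures outside the agent's
                               # control (e.g., site outage, CAPTCHA wall)


def verify(items: list[RubricItem]) -> dict:
    """Aggregate per-item judgments into separate process/outcome scores.

    Uncontrollable failures are excluded from the denominator so the
    agent is not penalized for environment issues.
    """
    scores = {}
    for rtype in RewardType:
        scorable = [i for i in items if i.reward_type == rtype and i.controllable]
        scores[rtype.value] = (
            sum(i.passed for i in scorable) / len(scorable) if scorable else None
        )
    # Task success requires every controllable outcome item to pass.
    scores["success"] = all(
        i.passed for i in items
        if i.reward_type == RewardType.OUTCOME and i.controllable
    )
    return scores


if __name__ == "__main__":
    trajectory_judgments = [
        RubricItem("Navigated to the correct product page", RewardType.PROCESS, True),
        RubricItem("Applied the requested filters", RewardType.PROCESS, True),
        RubricItem("Final page shows the item in the cart", RewardType.OUTCOME, False),
        RubricItem("Checkout page loaded", RewardType.OUTCOME, False, controllable=False),
    ]
    print(verify(trajectory_judgments))
    # -> {'process': 1.0, 'outcome': 0.0, 'success': False}
```

The point the sketch tries to capture: because rubric items do not overlap, per-item pass/fail judgments can be aggregated without double-counting, and excluding uncontrollable failures keeps environment flakiness out of the reward signal used for benchmarking or training.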

Inside the Society of Agents: Why AI Teamwork Beats Bigger Models x.com/i/broadcasts/1…

On Saturday, January 24 at 2 PM at #AAAI2026, Ece Kamar (@ecekamar, Microsoft Research) will be giving an invited talk titled "Navigating the AI Horizon: Promises, Perils, and the Power of Collaboration." CC @RealAAAI @hadi_hoss

Magentic Marketplace, a new project from @Microsoft, is a simulation environment for agentic markets built to explore multi-agent collaboration. Feat. @ecekamar thenewstack.io/microsoft-laun…