Chang Huan reposted

🎉 Excited to share 🍐 PARE and PARE-Bench - a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.
Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.
🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.
Our key contributions:
🍐 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: lightweight observer monitors continuously, executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate
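The Observe-Execute architecture above can be sketched in a few lines. This is a hypothetical illustration, not the actual PARE API: the `ProactiveAgent`, `Proposal`, and action-tuple names are all made up to show the pattern of a lightweight observer that accumulates context and an executor that only acts after user approval.

```python
# Hypothetical sketch of the Observe-Execute pattern: a lightweight
# observer continuously watches user events and may propose an
# intervention; the executor runs it only after the user approves.
# All names here are illustrative, not PARE's real interface.
from dataclasses import dataclass, field


@dataclass
class Proposal:
    description: str  # e.g. "Add 'soap' to your shopping list"
    actions: list     # app actions to run if the user approves


@dataclass
class ProactiveAgent:
    context: list = field(default_factory=list)

    def observe(self, event: str):
        """Observer: accumulate context; propose help when a cue appears."""
        self.context.append(event)
        # Toy inference rule matching the soap example from the thread
        if "out of soap" in event and any("shopping list" in e for e in self.context):
            return Proposal("Add 'soap' to your shopping list",
                            [("shopping_app", "add_item", "soap")])
        return None

    def execute(self, proposal: Proposal, approved: bool) -> list:
        """Executor: act only after explicit user approval."""
        return proposal.actions if approved else []


agent = ProactiveAgent()
agent.observe("user opens shopping list app")
proposal = agent.observe("notification: roommate texts 'we're out of soap'")
done = agent.execute(proposal, approved=True) if proposal else []
```

The point of the split is cost and safety: the observer can be a small, always-on model, while the (more expensive) executor never touches the user's apps without a green light.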
PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.
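The FSM-based stateful interface the user navigates can be pictured as a small state machine over app screens. A minimal sketch, with invented screen and action names (the real PARE/ARE environments are far richer):

```python
# Toy FSM-based app interface: the simulated user moves between screens
# via allowed transitions, like on a real phone. Illegal actions fail,
# which is what makes the interface stateful rather than a flat tool list.
class AppFSM:
    def __init__(self, transitions: dict, start: str):
        self.transitions = transitions  # state -> {action: next_state}
        self.state = start

    def available_actions(self):
        return list(self.transitions.get(self.state, {}))

    def step(self, action: str) -> str:
        allowed = self.transitions.get(self.state, {})
        if action not in allowed:
            raise ValueError(f"'{action}' not allowed in state '{self.state}'")
        self.state = allowed[action]
        return self.state


# Hypothetical phone with a shopping-list app and a messaging app
phone = AppFSM(
    transitions={
        "home": {"open_shopping": "shopping_list", "open_messages": "messages"},
        "shopping_list": {"add_item": "shopping_list", "back": "home"},
        "messages": {"read_thread": "messages", "back": "home"},
    },
    start="home",
)

phone.step("open_messages")   # user checks the roommate's text
phone.step("back")
phone.step("open_shopping")
phone.step("add_item")        # soap added on the shopping-list screen
```

Because the agent sees only the user's surface actions (which screens they visit, which buttons they press), it has to infer the underlying goal from the trajectory rather than from an explicit instruction.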
Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏