
UC Santa Barbara NLP Group
@ucsbNLP
NLP and AI Researchers @ucsantabarbara. Profs. @xwang_lk, @WilliamWangNLP, @CodeTerminator, Xifeng Yan, Simon Todd, @WenboGuo4.



Agent skills are becoming a popular way to extend LLM agents with reusable, domain-specific knowledge, but how well do they actually work when agents must find and use skills on their own? To answer this question, we collect 34k real-world skills from open-source repos and build a retrieval system over them. We then evaluate skill utility under progressively realistic settings, from curated skills directly given to agents, to retrieving from the full 34k collection, to settings where no task-specific skill even exists. 🧵
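The retrieval step above can be sketched with a toy similarity search. The bag-of-words scoring and the example skill names below are illustrative stand-ins, not the paper's actual retriever (which would presumably use a learned text encoder over the 34k skill descriptions):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would encode each
    # skill's description with a neural text encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(task: str, skills: dict[str, str], k: int = 2) -> list[str]:
    """Rank skills by similarity between the task and each skill description."""
    q = embed(task)
    ranked = sorted(skills, key=lambda name: cosine(q, embed(skills[name])),
                    reverse=True)
    return ranked[:k]

# Hypothetical skill library; names and descriptions are made up.
skills = {
    "pdf-extract": "extract text and tables from pdf documents",
    "git-bisect": "find the commit that introduced a bug with git bisect",
    "csv-clean": "clean and normalize csv data files",
}
print(retrieve("pull tables out of a pdf report", skills, k=1))
```

The interesting evaluation question the thread raises is exactly what this sketch glosses over: whether an agent can pick the right skill out of 34k noisy candidates, not just out of three clean ones.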


Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper 👇 aclanthology.org/2026.eacl-long…






🎉 Excited to share PARE and PARE-Bench: a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
• PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone 📱
• Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
• Observe-Execute architecture: a lightweight observer monitors continuously; the executor acts only after user approval
• PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
• Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏
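The Observe-Execute split described above can be sketched as two loosely coupled components. The trigger rule, class names, and approval flag below are hypothetical simplifications standing in for PARE's LLM-based goal inference, not its actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Observer:
    """Lightweight monitor: accumulates user events and proposes help.

    The keyword-matching rule here is a toy stand-in for LLM-based
    goal inference over the user's action stream."""
    events: list = field(default_factory=list)

    def observe(self, event: str):
        self.events.append(event)
        # Hypothetical rule: a "we're out of soap" message while the
        # shopping list is open suggests adding soap to the list.
        if "out of soap" in event and any("shopping list" in e for e in self.events):
            return "add 'soap' to shopping list"
        return None

def executor(proposal: str, user_approves: bool) -> str:
    # Executor acts only after explicit user approval, matching the
    # Observe-Execute design: observation is continuous, action is gated.
    return f"executed: {proposal}" if user_approves else "skipped"

obs = Observer()
obs.observe("user opened shopping list app")
proposal = obs.observe("notification: roommate texted 'we're out of soap'")
print(executor(proposal, user_approves=True))
```

The asymmetric part of the design is implicit here: the observer sees only the user-visible event stream, while the executor interacts through a separate action interface.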









🚀 Introducing TermiGen: Closing the Gap Between Open & Proprietary Terminal Agents

Terminal tasks (system admin, DevOps, security forensics) are where AI agents meet the real world. And most open models fail here.

Today, we're releasing:
✅ A 32B model achieving 31.3% on TerminalBench 1.0 (new open-weight 32B SOTA)
✅ Beating o4-mini with Codex by 11.3%
✅ 3,500+ verified Docker environments and tasks, covering 420 unique bash tools across 11 categories:
• System admin & DevOps (Docker, K8s, systemd)
• Security & reverse engineering (Ghidra, Wireshark)
• Scientific computing (samtools, GROMACS)
• + 8 more (ML, data processing, formal methods...)

📄 Paper: arxiv.org/abs/2602.07274
💻 3,500+ envs: github.com/ucsb-mlsec/ter…
🤗 Model: huggingface.co/UCSB-SURFI/Ter…






🚨 New paper alert 🚨

How can we make GUI grounding models reliable in real-world interactions? We introduce SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration.

In GUI agents, a single wrong click isn't just an error: it can trigger costly or irreversible actions (e.g., unintended payments 💸 or deleting important files 🗑️). The real danger is silent failure: most GUI grounding models always output a coordinate, even when they're unsure.

Instead of trusting a single predicted point, SafeGround:
• estimates spatial uncertainty from prediction variability
• calibrates a decision threshold with statistical guarantees
• enables risk-controlled GUI actions, even with black-box models

💻 Code: github.com/Cece1031/SAFEG…
📄 Paper: arxiv.org/pdf/2602.02419

🧵 1/6 #Agents #GUI
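A minimal sketch of the abstention idea, assuming a split-conformal-style calibration: sample several click predictions, score their spatial spread, and act only when the spread stays below a threshold fit on held-out correct predictions. All function names, scoring choices, and numbers here are illustrative assumptions, not SafeGround's actual method:

```python
import math
import statistics

def spread(points):
    """Spatial uncertainty score: mean distance of sampled click
    predictions from their centroid (a stand-in for SafeGround's
    prediction-variability estimate)."""
    cx = statistics.mean(p[0] for p in points)
    cy = statistics.mean(p[1] for p in points)
    return statistics.mean(math.dist(p, (cx, cy)) for p in points)

def calibrate(cal_scores, alpha=0.1):
    """Split-conformal-style threshold: the (1 - alpha) empirical
    quantile of uncertainty scores on a calibration set of
    predictions known to be correct."""
    s = sorted(cal_scores)
    k = min(len(s) - 1, math.ceil((1 - alpha) * (len(s) + 1)) - 1)
    return s[k]

def act_or_abstain(samples, threshold):
    # Risk-controlled action: click only when uncertainty is low enough.
    return "click" if spread(samples) <= threshold else "abstain"

# Hypothetical calibration scores from held-out correct groundings.
threshold = calibrate([0.5, 0.8, 1.1, 1.3, 2.0, 0.6, 0.9, 1.0, 1.5, 0.7])
tight = [(100, 200), (101, 199), (99, 201), (100, 200)]  # consistent samples
loose = [(100, 200), (340, 80), (25, 410), (210, 300)]   # scattered samples
print(act_or_abstain(tight, threshold), act_or_abstain(loose, threshold))
```

Because the score only needs repeated point predictions, this style of wrapper works with black-box grounding models, as the post notes.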



