

UC Santa Barbara NLP Group

@ucsbNLP
NLP and AI Researchers @ucsantabarbara. Profs. @xwang_lk, @WilliamWangNLP, @CodeTerminator, Xifeng Yan, Simon Todd, @WenboGuo4.

Agent skills are becoming a popular way to extend LLM agents with reusable, domain-specific knowledge, but how well do they actually work when agents must find and use skills on their own? To answer this question, we collect 34k real-world skills from open-source repos and build a retrieval system over them. We then evaluate skill utility under progressively realistic settings, from curated skills directly given to agents, to retrieving from the full 34k collection, to settings where no task-specific skill even exists.🧵
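The retrieval setting described above can be sketched with a toy lexical ranker over skill descriptions. This is only an illustration of the idea (rank skills by similarity to the task, hand the top hit to the agent); the skill entries and the bag-of-words scorer below are hypothetical stand-ins, not the system or data from the thread.

```python
import math
from collections import Counter

def vectorize(text):
    # naive bag-of-words "embedding": token counts
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical skill library: (name, description)
skills = [
    ("pdf-extract", "extract tables and text from pdf documents"),
    ("git-bisect", "locate the commit that introduced a regression"),
    ("csv-clean", "normalize and deduplicate rows in csv files"),
]

def retrieve(query, skills, k=1):
    # rank every skill description against the task query, return top-k names
    vq = vectorize(query)
    ranked = sorted(skills, key=lambda s: cosine(vq, vectorize(s[1])), reverse=True)
    return [name for name, _ in ranked[:k]]

print(retrieve("pull the tables out of a pdf report", skills))  # -> ['pdf-extract']
```

A real system would swap the bag-of-words scorer for a learned embedding model and index all 34k skills, but the retrieve-then-use loop is the same.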


Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper🎉👇 aclanthology.org/2026.eacl-long…





🎉 Excited to share 🍐 PARE and PARE-Bench - a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
🍐 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: a lightweight observer monitors continuously; the executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏
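The Observe-Execute split described in the post can be sketched as two small components: an observer that passively accumulates user events and proposes help, and an executor that only acts on user approval. All names, events, and the goal-inference rule below are illustrative, not the PARE API.

```python
from dataclasses import dataclass, field

@dataclass
class Observer:
    # lightweight observer: accumulates user events as context
    context: list = field(default_factory=list)

    def observe(self, event):
        self.context.append(event)

    def propose(self):
        # toy goal inference: spot a need mentioned in an incoming message
        for event in self.context:
            if event.startswith("message:") and "out of soap" in event:
                return "add 'soap' to shopping list"
        return None

class Executor:
    def execute(self, action, approved):
        # the executor never acts autonomously; it waits for user approval
        return f"executed: {action}" if approved else "skipped: awaiting approval"

observer = Observer()
observer.observe("app: notes (editing shopping list)")
observer.observe("message: roommate says we're out of soap")

proposal = observer.propose()
print(Executor().execute(proposal, approved=True))
# -> executed: add 'soap' to shopping list
```

The asymmetry matters: the observer sees the user's event stream, but the executor's actions are gated behind approval, matching the intervention-timing evaluation in PARE-Bench.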

🚀 Introducing TermiGen: Closing the Gap Between Open & Proprietary Terminal Agents

Terminal tasks (system admin, DevOps, security forensics) are where AI agents meet the real world, and most open models fail here.

Today, we're releasing:
✅ A 32B model achieving 31.3% on TerminalBench 1.0 (a new open-weight 32B SOTA)
✅ Beating o4-mini with Codex by 11.3%
✅ 3,500+ verified Docker environments and tasks, covering 420 unique bash tools across 11 categories:
🛠️ System admin & DevOps (Docker, K8s, systemd)
🔐 Security & reverse engineering (Ghidra, Wireshark)
🧪 Scientific computing (samtools, GROMACS)
📊 + 8 more (ML, data processing, formal methods...)

📄 Paper: arxiv.org/abs/2602.07274
💻 3,500+ environments: github.com/ucsb-mlsec/ter…
🤖 Model: huggingface.co/UCSB-SURFI/Ter…
