
UC Santa Barbara NLP Group
@ucsbNLP
NLP and AI Researchers @ucsantabarbara. Profs. @xwang_lk, @WilliamWangNLP, @CodeTerminator, Xifeng Yan, Simon Todd, @WenboGuo4.



I want to share a new dataset of 331 reward-hackable environments. These are real environments used in Terminal Bench and adjacent benchmarks. I first became interested in this because, as a reviewer for Terminal Bench, I noticed that many of our tasks were hackable. I also noticed that many contributors submit to the benchmark because it lends credibility when selling environments to labs. Hence, TBench tasks are, in my opinion, held to a higher quality standard than those being used for RL today. No one is spending hours manually reviewing the $1B worth of tasks being purchased by major labs. As far as I understand, while everyone knows environments are hackable, nobody has released hundreds of "realistic" environments. (link in comment)

🚨 New Paper out! Planning to Explore: Curiosity-Driven Planning for LLM Test Generation. We formalize LLM test generation as Bayesian exploration and show that planning-aware methods outperform greedy approaches by a large margin on branch coverage 🧵⬇️
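To make the greedy-vs-planning contrast concrete, here is a toy sketch (not the paper's method; the test names, branch sets, and one-step-lookahead scoring are all illustrative assumptions): greedy picks the test with the most new branches right now, while a planner scores each choice by the coverage jointly achievable with a future pick.

```python
# Toy illustration (assumption: not the paper's algorithm) of why greedy
# test selection can be myopic for branch coverage.
from itertools import combinations

def greedy_pick(candidates, covered):
    # candidates: dict test_name -> set of branch ids it covers
    # pick the test with the most branches not yet covered
    return max(candidates, key=lambda t: len(candidates[t] - covered))

def plan_two(candidates, covered):
    # one-step lookahead: pick the pair with the best joint new
    # coverage, then commit to its first element
    best_pair = max(
        combinations(candidates, 2),
        key=lambda p: len((candidates[p[0]] | candidates[p[1]]) - covered),
    )
    return best_pair[0]

candidates = {
    "t_m": {1, 2, 3, 4, 9},   # biggest immediate gain, overlaps the others
    "t_a": {1, 2, 5, 6},
    "t_b": {3, 4, 7, 8},
}
covered = set()
print(greedy_pick(candidates, covered))  # t_m (5 new branches now)
print(plan_two(candidates, covered))     # t_a (t_a + t_b cover 8 together)
```

With a budget of two tests, greedy ends at 7 branches while the lookahead pair {t_a, t_b} reaches 8, which is the intuition behind planning-aware selection.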





Agent skills are becoming a popular way to extend LLM agents with reusable, domain-specific knowledge, but how well do they actually work when agents must find and use skills on their own? To answer this question, we collect 34k real-world skills from open-source repos and build a retrieval system over them. We then evaluate skill utility under progressively realistic settings, from curated skills directly given to agents, to retrieving from the full 34k collection, to settings where no task-specific skill even exists. 🧵
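As a minimal sketch of what "a retrieval system over skills" can look like (the skill names, descriptions, and bag-of-words scoring below are illustrative assumptions, not the paper's system), one can rank skill descriptions by cosine similarity to the task description:

```python
# Illustrative skill retrieval via bag-of-words cosine similarity
# (assumption: a toy stand-in for the paper's retrieval system).
import math
from collections import Counter

def bow(text):
    # crude bag-of-words vector: lowercase token counts
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, skills, k=2):
    # skills: dict skill_name -> natural-language description
    scored = sorted(
        skills.items(),
        key=lambda kv: cosine(bow(query), bow(kv[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

skills = {
    "pdf-extract": "extract text and tables from pdf documents",
    "git-bisect": "find the commit that introduced a bug with git bisect",
    "csv-clean": "clean and normalize csv tables with missing values",
}
print(retrieve("pull tables out of a pdf report", skills, k=1))
```

A real system would use dense embeddings rather than word counts, but the retrieval interface (task description in, ranked skill names out) is the same shape.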


Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper 👉 aclanthology.org/2026.eacl-long…






Excited to share 🎉 PARE and PARE-Bench - a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
- PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone 📱
- Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
- Observe-Execute architecture: a lightweight observer monitors continuously; the executor acts only after user approval
- PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
- Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏
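To illustrate the "FSM-based stateful interface" idea in miniature (a toy sketch under assumptions; the class, state names, and actions below are invented for illustration and are not PARE's API), states are app screens, transitions are user actions, and the proactive observer sees only the resulting event stream, never the user's goal:

```python
# Toy FSM-based app interface for a simulated phone user
# (assumption: illustrative only, not PARE's actual implementation).
class AppFSM:
    def __init__(self, start, transitions):
        # transitions: dict (state, action) -> next_state
        self.state = start
        self.transitions = transitions
        self.log = []  # event stream visible to the proactive observer

    def act(self, action):
        key = (self.state, action)
        if key not in self.transitions:
            raise ValueError(f"action {action!r} invalid in state {self.state!r}")
        self.log.append((self.state, action))
        self.state = self.transitions[key]
        return self.state

phone = AppFSM(
    start="home",
    transitions={
        ("home", "open_messages"): "messages",
        ("messages", "read_text"): "messages",
        ("messages", "go_home"): "home",
        ("home", "open_shopping_list"): "shopping_list",
        ("shopping_list", "edit_item"): "shopping_list",
    },
)

# The simulated user reads a text, then opens the shopping list; an
# observer watching this log might infer a cue for a proactive edit.
for a in ["open_messages", "read_text", "go_home", "open_shopping_list"]:
    phone.act(a)

print(phone.state)     # shopping_list
print(len(phone.log))  # 4
```

The asymmetry in the paper's design maps onto this sketch naturally: the user drives `act`, while the assistant only reads `log`.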
