UC Santa Barbara NLP Group
@ucsbNLP

NLP and AI Researchers @ucsantabarbara. Profs. @xwang_lk, @WilliamWangNLP, @CodeTerminator, Xifeng Yan, Simon Todd, @WenboGuo4.

Santa Barbara, CA · Joined July 2021
618 Following · 2.3K Followers
290 posts
UC Santa Barbara NLP Group retweeted
Alfonso Amayuelas @AlfonAmayuelas
🚨New Paper out! Planning to Explore: Curiosity-Driven Planning for LLM Test Generation. We formalize LLM test generation as Bayesian exploration and show that planning-aware methods outperform greedy approaches by a large margin on branch coverage 🧵⬇️
Alfonso Amayuelas tweet media
2 replies · 14 retweets · 64 likes · 8K views
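To make the planning-versus-greedy contrast concrete, here is a minimal sketch that treats branch targeting as Bayesian exploration via Thompson sampling; the names and the Beta prior are illustrative, not the paper's actual algorithm.

```python
# Illustrative sketch only: test generation as Bayesian exploration,
# with a Beta posterior per branch over the chance that targeting it
# with a generated test yields new coverage. Not the paper's method.
import random

class BranchPosterior:
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0   # uniform Beta(1, 1) prior

    def sample(self) -> float:
        return random.betavariate(self.alpha, self.beta)

    def update(self, covered_new_branch: bool) -> None:
        if covered_new_branch:
            self.alpha += 1.0
        else:
            self.beta += 1.0

def pick_target(posteriors: dict) -> str:
    # Thompson sampling: keep probing branches whose payoff is still
    # uncertain, instead of greedily repeating the branch with the best
    # empirical success rate.
    return max(posteriors, key=lambda b: posteriors[b].sample())
```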
UC Santa Barbara NLP Group
RLHF optimizes for correct answers. But deep thinking requires persistence on hypotheses that look wrong, wandering that seems irrelevant, and high tolerance for being incorrect for a long time. Is post-training actively selecting against the behaviors that lead to discovery?
1 reply · 0 retweets · 2 likes · 107 views
UC Santa Barbara NLP Group retweeted
Xin Eric Wang @xwang_lk
Finally, after a month on hold at @arxiv, 𝐎𝐦𝐧𝐢𝐓𝐫𝐚𝐜𝐞 is out! As MLLMs generate fluent responses from text, images, audio, and video, a fundamental question is: 𝐰𝐡𝐢𝐜𝐡 𝐩𝐢𝐞𝐜𝐞𝐬 𝐨𝐟 𝐢𝐧𝐩𝐮𝐭 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐬𝐮𝐩𝐩𝐨𝐫𝐭 𝐞𝐚𝐜𝐡 𝐠𝐞𝐧𝐞𝐫𝐚𝐭𝐞𝐝 𝐬𝐭𝐚𝐭𝐞𝐦𝐞𝐧𝐭? In this work, we confront this gap head-on. We argue that attribution in multimodal generation is NOT a post-hoc analysis problem, BUT a generation-time phenomenon, one that unfolds dynamically as each token is produced. Building on this insight, we introduce OmniTrace, a unified framework that traces the causal origins of every generated token across modalities, transforming fragmented signals into coherent, human-interpretable explanations. By rethinking attribution as a structured tracing process over the decoding trajectory, OmniTrace reveals not just what models generate, but where it comes from. This shift turns opaque multimodal generation into a transparent, evidence-grounded process, laying the foundation for more trustworthy, debuggable, and accountable AI systems. To use it, simply do: pip install omnitrace
Xin Eric Wang tweet media
2 replies · 9 retweets · 37 likes · 2.7K views
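The core idea, attribution recorded during decoding rather than reconstructed afterwards, can be sketched generically: at each decoding step, log how much support each input segment contributes to the emitted token, then aggregate per span. The sketch below assumes attention mass as the signal and is purely illustrative; it is not OmniTrace's actual mechanism or API.

```python
# Generic sketch of generation-time attribution (illustrative; not
# OmniTrace's actual mechanism or API). At each decoding step we log
# how much attention mass falls on each input segment, so attribution
# is captured as tokens are produced rather than recovered post hoc.
from collections import defaultdict

def attribute(decode_steps, segment_modality):
    """decode_steps: list of (token, {segment_id: attention_mass}).
    segment_modality: e.g. {"img_0": "image", "aud_0": "audio"}."""
    per_token, total_support = [], defaultdict(float)
    for token, masses in decode_steps:
        top = max(masses, key=masses.get)   # strongest supporting segment
        per_token.append((token, top, segment_modality[top]))
        for seg, mass in masses.items():
            total_support[seg] += mass
    return per_token, dict(total_support)
```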
UC Santa Barbara NLP Group
Fun surprise: our lab made it onto UCSB’s official LinkedIn post! ✨ Glad to see our beautiful workspace representing the UCSB AI community. If you look closely… yes, that’s the NLP group 👀 #UCSB #NLP #AI #ResearchLife
UC Santa Barbara NLP Group tweet media
0 replies · 4 retweets · 12 likes · 1.3K views
UC Santa Barbara NLP Group retweeted
Xin Eric Wang @xwang_lk
It seems the AI agents are discussing our Group-Evolving Agents paper on moltbook and rethinking how they should evolve together. lmao. The paper is here: arxiv.org/abs/2602.04837
Xin Eric Wang tweet media
1 reply · 3 retweets · 18 likes · 2.5K views
UC Santa Barbara NLP Group
While the GPUs keep working, we took a break to roll some strikes on Friday
UC Santa Barbara NLP Group tweet media
1 reply · 1 retweet · 23 likes · 2.6K views
UC Santa Barbara NLP Group
Exploration in long-horizon RL is a hard problem to solve. Nice post on how simulator structure can help.
Gurusha Juneja@GurushaJuneja

Recently I've been thinking about why long-horizon RL is so hard to get working, even in simulation. The standard answer is "sparse rewards" and "sample inefficiency," but I think that says very little about the actual problem. I think the problem is exploration: standard exploration strategies are not equipped to search a combinatorially large space. With horizon H and action space |A|, the trajectory space grows as |A|^H. Random exploration, epsilon-greedy, even curiosity-driven methods cover measure zero of this space. Curiosity-based methods (ICM, RND) saturate on early-task states and don't explore into late-task states where meaningful reward is actually available.

The good news is that we can leverage properties of the simulation environment itself. Simulators can expose things the real world doesn't give us, for example ground-truth state, arbitrary resets, contact forces, and internal predicates. Most RL formulations ignore all of this and treat the environment as a black box. Environment-aware exploration algorithms can really help here. Asymmetric actor-critic passes full simulator state to the critic for better value estimates, lower-variance gradients, and tractable credit assignment. Backward curricula exploit arbitrary resets to keep the effective training horizon short. HER relabels failed trajectories using simulator state, converting zero-reward rollouts into valid training data. Asymmetric actor-critic also transfers cleanly, since the critic is discarded at deployment.

How much simulator privilege a policy can absorb while still transferring to the real world remains an open question. But the broader point stands: long-horizon RL should leverage the full simulator state for exploration rather than treating the environment as a black box.

0 replies · 0 retweets · 3 likes · 311 views
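Of the environment-aware tools the post lists, HER is the easiest to show in a few lines. A minimal sketch of the "future" relabeling strategy follows, assuming hypothetical hooks achieved_goal_of and reward_fn that read goals and rewards from full simulator state.

```python
# Minimal HER sketch ('future' strategy): a zero-reward rollout toward
# the original goal is relabeled with goals actually reached later in
# the same trajectory, read from privileged simulator state.
import random

def her_relabel(trajectory, achieved_goal_of, reward_fn, k=4):
    """trajectory: list of (state, action, next_state) tuples.
    achieved_goal_of / reward_fn are hypothetical simulator hooks."""
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        future = trajectory[t:]                 # steps from t onward
        for _ in range(k):
            _, _, s_future = random.choice(future)
            g = achieved_goal_of(s_future)      # goal the agent really hit
            relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
    return relabeled
```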
UC Santa Barbara NLP Group retweeted
Nurvai - The Data Layer for Physical AI
This week as our #NurvaiResearcherOfTheWeek we'd like to highlight @ZhaotianWeng and the team behind VQA-Causal and VCR-Causal (EACL 2026 Oral). Really interesting work probing whether vision-language models actually understand causal relationships in visual scenes. By introducing benchmarks that remove common shortcuts, the authors show that many VLMs struggle with causal order reasoning, often performing near random when superficial cues are removed. This suggests that current VLM performance may rely heavily on dataset biases and correlations rather than true causal understanding of events. One takeaway for us is that building datasets that explicitly target causal structure, rather than just recognition or description, could be a powerful lever for improving multimodal reasoning and making model performance more robust.
Zhaotian Weng@WengZhaoti39773

Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper🎉👇 aclanthology.org/2026.eacl-long…

0 replies · 2 retweets · 4 likes · 447 views
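A toy version of the shortcut-removal idea: score a caption in cause-effect order and with the order swapped. A model relying only on co-occurrence cues scores both about the same. The probe, names, and caption template below are illustrative, not the benchmark's actual construction.

```python
# Toy causal-order probe (illustrative; not VQA-Causal's actual design).
# score_fn(image, caption) -> float is any image-text matching score.
def causal_order_probe(score_fn, image, cause, effect):
    forward = score_fn(image, f"{cause}, so {effect}")
    swapped = score_fn(image, f"{effect}, so {cause}")
    # Genuine causal understanding should prefer the forward ordering;
    # near-equal scores suggest reliance on correlational shortcuts.
    return forward > swapped
```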
UC Santa Barbara NLP Group retweeted
Qianqi "Jackie" Yan @qianqi_yan
🚀 Excited to share our new work: 𝗢𝗺𝗻𝗶𝗧𝗿𝗮𝗰𝗲: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs Multimodal LLMs can process text 📝, images 🖼️, audio 🎧, and video 🎬 together, but when they generate a response, 𝘄𝗵𝗶𝗰𝗵 𝗶𝗻𝗽𝘂𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝘀𝘂𝗽𝗽𝗼𝗿𝘁𝗲𝗱 𝗲𝗮𝗰𝗵 𝗰𝗹𝗮𝗶𝗺? OmniTrace traces every generated span back to its multimodal sources 𝗱𝘂𝗿𝗶𝗻𝗴 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 across text, image, audio, and video. No retraining needed. Fully plug-and-play. 🔌 📄 Paper: github.com/eric-ai-lab/Om… 💻 Code: github.com/eric-ai-lab/Om… 🌐 Project: jackie-2000.github.io/omnitrace.gith… 📦 pip install omnitrace 🧵👇
Qianqi "Jackie" Yan tweet media
1 reply · 6 retweets · 17 likes · 1.7K views
Zhaotian Weng @WengZhaoti39773
Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper🎉👇 aclanthology.org/2026.eacl-long…
1 reply · 3 retweets · 24 likes · 4.3K views
UC Santa Barbara NLP Group retweeted
Xin Eric Wang @xwang_lk
🎉 Introducing PARE: a new framework for evaluating proactive AI agents. Today’s agents are reactive. The next wave? Proactive agents that anticipate your needs, like adding “soap” to your shopping list when your roommate texts you. 🚧 The challenge: you can’t evaluate this with static benchmarks. 🍐 PARE: active user simulation with realistic mobile interactions 📱 Asymmetric design: agent ≠ user view (just like real life) 👀 Observe → Execute: assist only when it matters 📋 PARE-Bench: 143 tasks, 9 apps, real-world complexity 📊 Result: even top models hit just 42% success Built on Meta’s ARE, PARE brings scalable, realistic evaluation to proactive AI.
Xin Eric Wang tweet media
Deepak Nathani@deepaknathani11

🎉 Excited to share 🍐 PARE and PARE-Bench, a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
🍐 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: a lightweight observer monitors continuously; the executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏

0 replies · 17 retweets · 84 likes · 14.8K views
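The Observe → Execute split described in the thread can be sketched as two components: a lightweight observer that watches the user's event stream and proposes an assist, and an executor that acts only after approval. The component names and the single trigger rule below are hypothetical, not PARE's actual code.

```python
# Illustrative Observe -> Execute sketch (hypothetical names; not PARE's code).
from dataclasses import dataclass

@dataclass
class Proposal:
    goal: str       # inferred user need, e.g. "add soap to shopping list"
    actions: list   # (app, action, argument) triples the executor would run

class Observer:
    """Lightweight monitor: surfaces a proposal only when the user's
    events imply an actionable need (toy single-rule trigger)."""
    def observe(self, event):
        if event["app"] == "messages" and "out of soap" in event["text"]:
            return Proposal("add soap to shopping list",
                            [("shopping_list", "add_item", "soap")])
        return None

class Executor:
    """Acts only after explicit user approval, matching the thread's design."""
    def run(self, proposal, approved):
        if approved:
            for app, action, arg in proposal.actions:
                print(f"{app}.{action}({arg!r})")  # stand-in for a real app call
```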
UC Santa Barbara NLP Group retweeted
Tengxiao Liu @TengxiaoLiu
Auto research is on 🔥 We give algorithmic problems (like circle packing) to general coding agents and let them run overnight. 🌙 Agents reach SoTA. But more importantly: we analyze 100+ hours of trajectories to understand how they get there 🧵
Tengxiao Liu tweet media
6 replies · 18 retweets · 62 likes · 31.2K views
UC Santa Barbara NLP Group retweeted
Saaket Agashe @saa1605
How do you teach a model to reason in domains where it can't even get started? RLVR needs successful rollouts to learn from. But if a model has never seen a domain (say, a niche programming language) or needs a new reasoning pattern, it just keeps failing with barely any learning signal. Our answer: In-Context Learning! How? Introducing Context Bootstrapped Reinforcement Learning (CBRL) 🧵👇 🔗 arxiv.org/abs/2603.18953 🌐 context-bootstrapped-rl.github.io
Saaket Agashe tweet media
2 replies · 10 retweets · 74 likes · 21.7K views
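The bootstrapping loop the tweet describes can be sketched as a fallback in rollout collection: if plain sampling never passes the verifier, resample with in-context examples prepended so RLVR gets a nonzero reward to learn from. All names below are hypothetical, not the paper's implementation.

```python
# Illustrative CBRL-style bootstrap (hypothetical API; not the paper's code).
def collect_rollouts(policy, task, icl_examples, n=32):
    rollouts = [policy.sample(task.prompt) for _ in range(n)]
    if not any(task.verify(r) for r in rollouts):
        # No successes -> no RLVR learning signal. Bootstrap by prepending
        # in-context demonstrations from the unfamiliar domain and resample.
        boosted = "\n\n".join(icl_examples) + "\n\n" + task.prompt
        rollouts = [policy.sample(boosted) for _ in range(n)]
    # Verifiable reward: 1.0 if the checker accepts the rollout, else 0.0
    return [(r, 1.0 if task.verify(r) else 0.0) for r in rollouts]
```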