UC Santa Barbara NLP Group

284 posts


@ucsbNLP

NLP and AI Researchers @ucsantabarbara. Profs. @xwang_lk, @WilliamWangNLP, @CodeTerminator, Xifeng Yan, Simon Todd, @WenboGuo4.

Santa Barbara, CA · Joined July 2021
616 Following · 2.3K Followers
UC Santa Barbara NLP Group retweeted
Xin Eric Wang @xwang_lk
It seems the AI agents are discussing our Group-Evolving Agents paper on moltbook and rethinking how they should evolve together. lmao. The paper is here: arxiv.org/abs/2602.04837
UC Santa Barbara NLP Group
While the GPUs keep working, we took a break to roll some strikes on Friday
UC Santa Barbara NLP Group
Exploration in long-horizon RL is a hard problem to solve. Nice post on how simulator structure can help.
Gurusha Juneja@GurushaJuneja

Recently I've been thinking about why long-horizon RL is so hard to get working, even in simulation. The standard answer is "sparse rewards" and "sample inefficiency," but I think that says very little about the actual problem. I think the problem is exploration: standard exploration strategies are not equipped to search a combinatorially large space. With horizon H and action space |A|, the trajectory space grows as |A|^H. Random exploration, epsilon-greedy, and even curiosity-driven methods cover a vanishing fraction of this space. Curiosity-based methods (ICM, RND) saturate on early-task states and don't explore into late-task states where meaningful reward is actually available.

The good news is that we can leverage properties of the simulation environment itself. Simulators can expose things the real world doesn't give us: ground-truth state, arbitrary resets, contact forces, internal predicates. Most RL formulations ignore all of this and treat the environment as a black box. Environment-aware exploration algorithms can really help here. Asymmetric actor-critic passes the full simulator state to the critic for better value estimates, lower-variance gradients, and tractable credit assignment. Backward curricula exploit arbitrary resets to keep the effective training horizon short. HER relabels failed trajectories using simulator state, converting zero-reward rollouts into valid training data. Asymmetric AC also transfers cleanly since the critic is discarded at deployment.

How much simulator privilege a policy can absorb while still transferring to real remains an open question. But the broader point is: long-horizon RL should leverage the full simulator state for exploration, not treat the simulator as a black box.
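The HER trick mentioned in the post is compact enough to sketch. This is a minimal illustration, not any particular codebase's implementation: the toy 1-D environment, function names, and trajectory format are invented here.

```python
# Minimal sketch of Hindsight Experience Replay (HER) relabeling:
# a failed rollout becomes valid training data by pretending the goal
# was whatever state the agent actually reached.

def her_relabel(trajectory, reward_fn):
    """Relabel a failed trajectory with the goal it actually achieved.

    trajectory: list of (state, action, achieved_goal) tuples.
    reward_fn(achieved, goal) -> float, typically sparse/binary.
    Returns a list of (state, action, goal, reward) transitions.
    """
    # Use the final achieved state as the substitute ("hindsight") goal.
    hindsight_goal = trajectory[-1][2]
    return [
        (state, action, hindsight_goal, reward_fn(achieved, hindsight_goal))
        for state, action, achieved in trajectory
    ]

# Toy 1-D reaching task: the agent wanted to reach 10 but only got to 3,
# so every transition carried zero reward under the original goal.
sparse_reward = lambda achieved, goal: 1.0 if achieved == goal else 0.0
rollout = [(0, +1, 1), (1, +1, 2), (2, +1, 3)]
relabeled = her_relabel(rollout, sparse_reward)
# The last transition now earns reward 1.0 under the hindsight goal 3.
```

Note this is exactly the "use simulator state" point: relabeling requires reading off the achieved goal, which a black-box reward signal would not expose.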

UC Santa Barbara NLP Group retweeted
Nurvai - The Data Layer for Physical AI
This week as our #NurvaiResearcherOfTheWeek we'd like to highlight @ZhaotianWeng and the team behind VQA-Causal and VCR-Causal (EACL 2026 Oral). Really interesting work probing whether vision-language models actually understand causal relationships in visual scenes. By introducing benchmarks that remove common shortcuts, the authors show that many VLMs struggle with causal order reasoning, often performing near random when superficial cues are removed. This suggests that current VLM performance may rely heavily on dataset biases and correlations rather than true causal understanding of events. One takeaway for us is that building datasets that explicitly target causal structure, rather than just recognition or description, could be a powerful lever for improving multimodal reasoning and making model performance more robust.
Zhaotian Weng@WengZhaoti39773

Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper 🎉👇 aclanthology.org/2026.eacl-long…

UC Santa Barbara NLP Group retweeted
Qianqi "Jackie" Yan @qianqi_yan
🚀 Excited to share our new work: OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

Multimodal LLMs can process text 📝, images 🖼️, audio 🎧, and video 🎬 together, but when they generate a response, which input actually supported each claim? OmniTrace traces every generated span back to its multimodal sources during decoding across text, image, audio, and video. No retraining needed. Fully plug-and-play. 🔌

📄 Paper: github.com/eric-ai-lab/Om…
💻 Code: github.com/eric-ai-lab/Om…
🌐 Project: jackie-2000.github.io/omnitrace.gith…
📦 pip install omnitrace
🧵👇
Zhaotian Weng @WengZhaoti39773
Can VLMs really understand causal relationships in visual scenes? We introduce VQA-Causal and VCR-Causal, and show that VLMs struggle with causal order reasoning, often near random when shortcuts are removed. Check our EACL 2026 Oral Paper 🎉👇 aclanthology.org/2026.eacl-long…
UC Santa Barbara NLP Group retweeted
Xin Eric Wang @xwang_lk
🎉 Introducing PARE: a new framework for evaluating proactive AI agents.

Today's agents are reactive. The next wave? Proactive agents that anticipate your needs, like adding "soap" to your shopping list when your roommate texts you.

🚧 The challenge: you can't evaluate this with static benchmarks.
🐝 PARE: active user simulation with realistic mobile interactions
📱 Asymmetric design: agent ≠ user view (just like real life)
👀 Observe → Execute: assist only when it matters
📋 PARE-Bench: 143 tasks, 9 apps, real-world complexity
📊 Result: even top models hit just 42% success

Built on Meta's ARE, PARE brings scalable, realistic evaluation to proactive AI.
Deepak Nathani@deepaknathani11

🎉 Excited to share 🐝 PARE and PARE-Bench, a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
🐝 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: a lightweight observer monitors continuously; the executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏

UC Santa Barbara NLP Group retweeted
Deepak Nathani @deepaknathani11
🎉 Excited to share 🐝 PARE and PARE-Bench, a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
🐝 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: a lightweight observer monitors continuously; the executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏
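The "FSM-based stateful interface" idea can be pictured with a tiny sketch. Everything here is invented for illustration (the class name, the toy shopping-list app, its screens and actions); PARE itself is built on Meta's Agent Research Environment (ARE), not this code.

```python
# Hypothetical sketch: a phone app as a finite state machine.
# Screens are states, taps are transitions, and the action history is
# exactly what a proactive observer agent could watch to infer goals.

class AppFSM:
    def __init__(self, transitions, start):
        self.transitions = transitions  # {(state, action): next_state}
        self.state = start
        self.history = [start]          # observable trace of the user's behavior

    def act(self, action):
        key = (self.state, action)
        if key not in self.transitions:
            # Stateful interface: only actions valid on the current screen work.
            raise ValueError(f"'{action}' unavailable on screen '{self.state}'")
        self.state = self.transitions[key]
        self.history.append(self.state)
        return self.state

# Toy shopping-list app with three screens.
shopping = AppFSM(
    transitions={
        ("home", "open_list"): "list_view",
        ("list_view", "add_item"): "item_editor",
        ("item_editor", "save"): "list_view",
    },
    start="home",
)
shopping.act("open_list")
shopping.act("add_item")
shopping.act("save")
print(shopping.history)  # -> ['home', 'list_view', 'item_editor', 'list_view']
```

The point of the FSM framing is that the simulated user's behavior is constrained and replayable, which is what makes evaluation repeatable.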
UC Santa Barbara NLP Group retweeted
Tengxiao Liu @TengxiaoLiu
Auto research is on 🔥 We give algorithmic problems (like circle packing) to general coding agents and let them run overnight. 🌙 Agents reach SoTA. But more importantly: we analyze 100+ hours of trajectories to understand how they get there 🧵
UC Santa Barbara NLP Group retweeted
Saaket Agashe @saa1605
How do you teach a model to reason in domains where it can't even get started? RLVR needs successful rollouts to learn from. But if a model has never seen a domain (say, a niche programming language) or needs a new reasoning pattern, it just keeps failing with barely any learning signal. Our answer: in-context learning! How? Introducing Context Bootstrapped Reinforcement Learning (CBRL) 🧵👇 🔗 arxiv.org/abs/2603.18953 🌐 context-bootstrapped-rl.github.io
UC Santa Barbara NLP Group @ucsbNLP
🎉 9 papers (7 Main, 2 Findings) from UCSB NLP accepted to ICLR 2026 & CVPR 2026. Proud of our students and collaborators! #ICLR2026 #CVPR2026
UC Santa Barbara NLP Group retweeted
Chuhan Li @_Chuhan_Li
Human perception is inherently situated – we understand the world relative to our own body, viewpoint, and motion. To deploy multimodal foundation models in embodied settings, we ask: "Can these models reason in the same observer-centric way?"

We study this through SAW-Bench, a novel benchmark for observer-centric situated awareness:
- 786 real-world egocentric videos
- 2,071 human-annotated QA pairs

Across all tasks, we evaluate 24 state-of-the-art MFMs:
📉 Best model: 53.9%
🧑 Humans: 91.6%

Models systematically:
❌ Confuse head rotation with physical movement
❌ Collapse under multi-turn trajectories
❌ Fail to maintain persistent world-state memory

👉 We see that maintaining a stable observer-centric representation remains challenging. As MFMs are increasingly integrated into embodied agents, situated awareness becomes essential for reliable real-world interaction. We release SAW-Bench and encourage further research toward improving observer-centric reasoning in multimodal foundation models.
UC Santa Barbara NLP Group retweeted
Zhen Zhang @zhenzhangzz
AI agents are evolving beyond simple tasks to complex, multi-turn and multi-step interactions. But how do we train them with RL when verifiable rewards don't exist for open-ended conversations and building execution environments for thousands of tools is unscalable?

Introducing 🛠️ CM2: RL with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [arxiv.org/abs/2602.12268]

Core contributions:
🔄 Multi-turn and multi-step tool-use scenario
✅ Checklist Rewards: replaces vague scalar scores with fine-grained, evidence-based binary criteria.
🛠️ Scalable Tool Simulation: trains on 5,000+ tools using a hybrid LLM simulator, removing the need for manual API engineering.
👏 SOTA Performance: achieves +8-12 point gains on τ²-Bench, BFCL-V4 & ToolSandbox, surpassing larger open-source models.
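The checklist-reward idea is easy to sketch in miniature. This is a hypothetical illustration only: the transcript format and criteria are invented, and CM2 itself checks criteria with an LLM judge against evidence rather than with hand-written lambdas.

```python
# Sketch of a "checklist reward": instead of one vague scalar score,
# a rollout is judged against fine-grained binary criteria and the
# reward is the fraction it satisfies.

def checklist_reward(transcript, checklist):
    """Score a rollout as the fraction of binary criteria it satisfies."""
    results = [bool(criterion(transcript)) for criterion in checklist]
    return sum(results) / len(results)

# Toy multi-turn tool-use transcript.
transcript = {
    "tool_calls": ["search_flights", "book_flight"],
    "final_answer": "Booked flight UA123 for Tuesday.",
}

checklist = [
    lambda t: "search_flights" in t["tool_calls"],     # searched before booking
    lambda t: "book_flight" in t["tool_calls"],        # actually called the booking tool
    lambda t: "Tuesday" in t["final_answer"],          # answer mentions the requested date
    lambda t: "cancel_flight" not in t["tool_calls"],  # no spurious destructive call
]

print(checklist_reward(transcript, checklist))  # -> 1.0
```

Binary, evidence-anchored items like these give a denser and less gameable training signal than asking a judge for a single 1-to-10 score.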
UC Santa Barbara NLP Group retweeted
Kaijie Zhu @KaijieZhu07
[1/n] 🚨 Coding ≠ Software Engineering! Are AI agents ready to replace software engineers? 🔥 Introducing DevOps-Gym: the first end-to-end benchmark for the complete software cycle (UCSB, NUS, Berkeley, Google). We tested SOTA agents on 700+ real-world DevOps tasks. The result? They struggle. 📉

🔄 Full DevOps coverage:
🔧 Build: fix dependency hell & migrate systems (Maven→Gradle)
📊 Monitor: detect leaks using ONLY CLI tools (top/iostat)
🐛 Fix: resolve bugs in compiled languages (harder than Python!)
✅ Test: generate regression tests from runtime behavior
☠️ The ultimate killer: end-to-end pipelines (Build → Monitor → Fix → Test). Success rate: 0.00%. NO agent could complete the full loop.

🔗 Check out the full research & dataset: devops-gym.com
📄 Paper: arxiv.org/abs/2601.20882
UC Santa Barbara NLP Group retweeted
Xin Eric Wang @xwang_lk
๐‘๐ž๐ฅ๐ข๐š๐›๐ข๐ฅ๐ข๐ญ๐ฒ ๐ข๐ฌ ๐ญ๐ก๐ž ๐Ÿ๐ฎ๐ง๐๐š๐ฆ๐ž๐ง๐ญ๐š๐ฅ ๐›๐จ๐ญ๐ญ๐ฅ๐ž๐ง๐ž๐œ๐ค ๐Ÿ๐จ๐ซ ๐†๐”๐ˆ ๐š๐ ๐ž๐ง๐ญ๐ฌ.โš ๏ธ One wrong click can trigger irreversible, costly actions ๐Ÿ’ฅ Introducing ๐’๐š๐Ÿ๐ž๐†๐ซ๐จ๐ฎ๐ง๐๐Ÿ›ก๏ธ: an uncertainty-calibrated framework that knows when not to act, enabling risk-aware GUI grounding with statistical guarantees ๐Ÿ“Š ๐Š๐ž๐ฒ ๐ข๐๐ž๐š: the real danger is ๐ฌ๐ข๐ฅ๐ž๐ง๐ญ ๐Ÿ๐š๐ข๐ฅ๐ฎ๐ซ๐ž ๐Ÿคซ Most GUI grounding models always output a coordinate, even when theyโ€™re unsure โŒ๐Ÿ“ Instead, SafeGround: ๐Ÿ“ ๐˜Œ๐˜ด๐˜ต๐˜ช๐˜ฎ๐˜ข๐˜ต๐˜ฆ๐˜ด ๐˜ด๐˜ฑ๐˜ข๐˜ต๐˜ช๐˜ข๐˜ญ ๐˜ถ๐˜ฏ๐˜ค๐˜ฆ๐˜ณ๐˜ต๐˜ข๐˜ช๐˜ฏ๐˜ต๐˜บ ๐˜ง๐˜ณ๐˜ฐ๐˜ฎ ๐˜ฑ๐˜ณ๐˜ฆ๐˜ฅ๐˜ช๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ ๐˜ท๐˜ข๐˜ณ๐˜ช๐˜ข๐˜ฃ๐˜ช๐˜ญ๐˜ช๐˜ต๐˜บ; ๐ŸŽฏ ๐˜Š๐˜ข๐˜ญ๐˜ช๐˜ฃ๐˜ณ๐˜ข๐˜ต๐˜ฆ๐˜ด ๐˜ข ๐˜ฅ๐˜ฆ๐˜ค๐˜ช๐˜ด๐˜ช๐˜ฐ๐˜ฏ ๐˜ต๐˜ฉ๐˜ณ๐˜ฆ๐˜ด๐˜ฉ๐˜ฐ๐˜ญ๐˜ฅ ๐˜ธ๐˜ช๐˜ต๐˜ฉ ๐˜ด๐˜ต๐˜ข๐˜ต๐˜ช๐˜ด๐˜ต๐˜ช๐˜ค๐˜ข๐˜ญ ๐˜จ๐˜ถ๐˜ข๐˜ณ๐˜ข๐˜ฏ๐˜ต๐˜ฆ๐˜ฆ๐˜ด; ๐Ÿ›‘ ๐˜ˆ๐˜ฃ๐˜ด๐˜ต๐˜ข๐˜ช๐˜ฏ๐˜ด ๐˜ฐ๐˜ณ ๐˜ฅ๐˜ฆ๐˜ง๐˜ฆ๐˜ณ๐˜ด ๐˜ฉ๐˜ช๐˜จ๐˜ฉ-๐˜ณ๐˜ช๐˜ด๐˜ฌ ๐˜ข๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด, ๐˜ฆ๐˜ฏ๐˜ข๐˜ฃ๐˜ญ๐˜ช๐˜ฏ๐˜จ ๐˜ณ๐˜ช๐˜ด๐˜ฌ-๐˜ค๐˜ฐ๐˜ฏ๐˜ต๐˜ณ๐˜ฐ๐˜ญ๐˜ญ๐˜ฆ๐˜ฅ ๐˜Ž๐˜œ๐˜ ๐˜ช๐˜ฏ๐˜ต๐˜ฆ๐˜ณ๐˜ข๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ, ๐˜ฆ๐˜ท๐˜ฆ๐˜ฏ ๐˜ง๐˜ฐ๐˜ณ ๐˜ฃ๐˜ญ๐˜ข๐˜ค๐˜ฌ-๐˜ฃ๐˜ฐ๐˜น ๐˜ฎ๐˜ฐ๐˜ฅ๐˜ฆ๐˜ญ๐˜ด.๐Ÿ”’๐Ÿค–
Qingni Wang@Ceeqnn

🚨 New paper alert 🚨
📌 How can we make GUI grounding models reliable in real-world interactions?

We introduce 🚀 SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

In GUI agents, a single wrong click isn't just an error; it can trigger costly or irreversible actions (e.g., unintended payments 💸 or deleting important files 🗑️). The real danger is silent failure: most GUI grounding models always output a coordinate, even when they're unsure.

Instead of trusting a single predicted point, SafeGround:
• estimates spatial uncertainty from prediction variability
• calibrates a decision threshold with statistical guarantees
• enables risk-controlled GUI actions, even with black-box models

💻 Code: github.com/Cece1031/SAFEG…
📄 Paper: arxiv.org/pdf/2602.02419
🧵 1/6 #Agents #GUI
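The "calibrated decision threshold with statistical guarantees" step can be sketched with the generic split-conformal recipe. SafeGround's actual uncertainty scores and guarantee may differ; the function names and the numeric calibration set below are invented for illustration.

```python
# Generic split-conformal style calibration: choose the act/abstain
# threshold as a corrected quantile of uncertainty scores measured on
# a held-out calibration set, so the error rate of "act" decisions is
# controlled at roughly the target level alpha.
import math

def calibrate_threshold(cal_scores, alpha):
    """Return the (1 - alpha) empirical quantile of calibration scores,
    with the standard (n + 1) finite-sample correction."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal rank
    k = min(k, n)
    return sorted(cal_scores)[k - 1]

def decide(score, threshold):
    """Act only when the prediction's uncertainty is below the threshold."""
    return "act" if score <= threshold else "abstain"

# Hypothetical uncertainty scores from a held-out calibration set.
cal = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
tau = calibrate_threshold(cal, alpha=0.1)
print(tau, decide(0.15, tau), decide(0.95, tau))
```

Because the threshold is set purely from held-out scores, the same recipe applies to black-box models whose internals you cannot inspect, which matches the tweet's claim.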
