UNC Computer Science
@unccs

2.6K posts

Department of Computer Science - University of North Carolina at Chapel Hill. Choose to #GIVE today - learn more here: https://t.co/cLdenfM5G5

Chapel Hill, NC · Joined September 2009
433 Following · 3.3K Followers
UNC Computer Science retweeted
Gedas Bertasius @gberta227
Excited to announce that SiLVR has been accepted to @TmlrOrg! 🎉 A big shoutout to lead student @cezhhh, who ran a massive number of experiments and validated SiLVR on 8 complex video benchmarks. If you are looking for a strong, simple baseline to build advanced video agents on top of, SiLVR is an excellent choice!
Project Page: sites.google.com/cs.unc.edu/sil…
Code: github.com/CeeZh/SILVR
Accepted papers at TMLR @TmlrPub

SiLVR: A Simple Language-based Video Reasoning Framework
Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius. Action editor: Anurag Arnab.
openreview.net/forum?id=mQZbh…
#multimodal #subtitles #captions

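The announcement above describes SiLVR only at a high level: a simple, language-based baseline built around subtitles and captions. As a rough illustration of what such a caption-then-reason pipeline can look like, here is a minimal sketch; the function names, prompt format, and the dummy text-in/text-out "LLM" are assumptions for illustration, not the released implementation (see the linked repository for the real one).

from typing import Callable

def language_based_video_qa(
    clip_captions: list[str],       # one short description per sampled clip
    subtitles: list[str],           # ASR/subtitle lines, if available
    question: str,
    llm: Callable[[str], str],      # any text-in/text-out reasoner
) -> str:
    """Hypothetical caption-then-reason baseline: turn the video into text,
    then let a language model do all of the reasoning over that text."""
    lines = [f"[clip {i}] {c}" for i, c in enumerate(clip_captions)]
    lines += [f"[subtitle] {s}" for s in subtitles]
    prompt = (
        "Video described as text:\n" + "\n".join(lines) +
        f"\n\nQuestion: {question}\nAnswer concisely:"
    )
    return llm(prompt)

# Toy usage with a dummy "LLM" so the sketch runs end to end.
dummy_llm = lambda prompt: "a person is assembling a chair"
print(language_based_video_qa(
    ["a man unboxes wooden parts", "he screws legs onto a seat"],
    ["narrator: first, attach the legs"],
    "What is the person building?",
    dummy_llm,
))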
UNC Computer Science retweeted
Gedas Bertasius @gberta227
If you're curious about the background that inspires a lot of our group's research on skill learning and video understanding, check out this great piece by UNC Research. It covers some of my journey from being a basketball player to an AI researcher. research.unc.edu/story/reading-…
UNC Computer Science retweeted
Mohit Bansal @mohitban47
🚨 Check out V-Co: a careful study of how to bring pretrained visual features (e.g., DINOv2) into pixel-space diffusion through visual co-denoising (by jointly denoising an image pixel stream & a semantic feature stream from a frozen visual encoder)!

What actually makes co-denoising work? Our controlled study reveals 4 key takeaways & an effective+efficient recipe:
1⃣ Preserve feature-specific computation → a fully dual-stream design works best
2⃣ Define CFG structurally, not just with input dropout → semantic-to-pixel masking gives a much better unconditional branch
3⃣ Use both instance-level and distribution-level semantic supervision → a perceptual-drifting hybrid loss works best
4⃣ Calibrate the two streams properly → RMS-based feature rescaling is simple and effective

Putting these together gives a practical recipe for visual co-denoising in pixel-space diffusion, and makes V-Co highly competitive at comparable scale (e.g., V-Co-L/16 and V-Co-H/16 outperform JiT-G/16 (≈2B, FID 1.82) with fewer training epochs)!

Details 👇
Han Lin @hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇

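Takeaway 2⃣ in the thread above (defining the CFG unconditional branch structurally, via semantic-to-pixel masking, rather than with input dropout) can be pictured with a small guidance sketch. Everything below, including the mask_sem_to_pix flag and the toy dual-stream model, is an assumption made purely for illustration and not V-Co's actual interface.

import numpy as np

def guided_denoise(model, x_t, t, guidance_scale=3.0):
    """Classifier-free-guidance-style step for a dual-stream co-denoising model.
    The unconditional branch is formed by masking the semantic-to-pixel pathway
    inside the network (a structural choice), not by dropping a conditioning input."""
    eps_cond = model(x_t, t, mask_sem_to_pix=False)   # full co-denoising branch
    eps_uncond = model(x_t, t, mask_sem_to_pix=True)  # semantic-to-pixel path masked
    # Standard CFG combination of the two branches.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-in "model": a pixel prediction plus an optional semantic correction term.
class ToyDualStream:
    def __call__(self, x_t, t, mask_sem_to_pix):
        base = 0.9 * x_t                                        # pixel-stream prediction
        sem = 0.0 if mask_sem_to_pix else 0.1 * np.tanh(x_t)    # semantic-to-pixel term
        return base + sem

x = np.random.randn(4, 4).astype(np.float32)
print(guided_denoise(ToyDualStream(), x, t=10).shape)  # (4, 4)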
UNC Computer Science retweeted
Han Lin @hanlin_hl
🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇
[Image attached]
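Recipe point 4️⃣ above, RMS-based feature rescaling, has a natural reading: rescale the frozen encoder's feature stream so its root-mean-square magnitude matches the stream it is co-denoised with. The snippet below is a minimal sketch of that reading; the function names and the choice of matching the pixel stream's RMS are assumptions, not the paper's exact calibration procedure.

import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square magnitude of a tensor."""
    return float(np.sqrt(np.mean(np.square(x))))

def rms_rescale(features: np.ndarray, target_rms: float, eps: float = 1e-8) -> np.ndarray:
    """Rescale a feature stream so its RMS matches a target value
    (e.g., the RMS of the pixel stream it is jointly denoised with)."""
    return features * (target_rms / (rms(features) + eps))

# Toy usage: bring large-magnitude DINO-like features onto the pixel scale.
pixels = np.random.randn(16, 16, 3) * 0.5
feats = np.random.randn(16, 16, 768) * 20.0
calibrated = rms_rescale(feats, target_rms=rms(pixels))
print(round(rms(calibrated), 3), round(rms(pixels), 3))  # approximately equal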
UNC Computer Science retweeted
Shoubin Yu @shoubin621
Check out our new work ⚽️VisionCoach, an RL + self-distillation framework for complex video reasoning. We combine reinforcement learning with dynamic visual prompting, where a visual prompt selector adaptively augments hard training examples based on reward signals. Visual grounding is key to accurate video reasoning. Instead of adding complexity at inference, we use visual prompting during training to guide models toward better spatio-temporal attention—then distill this capability into a simple, single-path model.
Daeun Lee @danadaeun

🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation!
🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence.
⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation.
⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA).
👇🧵

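The thread above says VisionCoach "uses RL to reward correct visual attention" alongside answer supervision, without spelling out the reward. Below is a minimal sketch of one generic way a grounding-aware reward could be composed (answer correctness plus temporal-IoU overlap with annotated evidence); the weighting, the IoU choice, and all names are assumptions for illustration only, not the paper's formulation.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounded_reward(
    answer_correct: bool,
    pred_span: tuple[float, float],
    gt_span: tuple[float, float],
    grounding_weight: float = 0.5,
) -> float:
    """Hypothetical RL reward mixing final-answer correctness with how well the
    model's attended video span overlaps the annotated evidence span."""
    return float(answer_correct) + grounding_weight * temporal_iou(pred_span, gt_span)

# Toy usage: correct answer, but attention only partially on the right segment.
print(grounded_reward(True, pred_span=(10.0, 20.0), gt_span=(15.0, 25.0)))  # 1.0 + 0.5 * (1/3)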
UNC Computer Science retweeted
Daeun Lee @danadaeun
🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation!
🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence.
⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation.
⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA).
👇🧵
UNC Computer Science retweeted
Mohit Bansal @mohitban47
It was a pleasure to visit Georgetown and deliver a Distinguished Lecture in AI (in a national historic landmark*), and have engaging discussions about the present, future, and societal impact of calibrated, controllable, collaborative AI agents that plan & learn/improve skills, with the faculty+students+provost there 🙂

*PS. It was extra special to deliver the lecture in the historic 1891 Riggs Library (one of the few extant cast-iron libraries in the nation & known for its magical Hogwarts-like setting) inside Healy Hall, a National Historic Landmark and the flagship building of Georgetown. Thanks again for the kind invitation!
[Four images attached]
UNC Computer Science retweeted
Yu Fang @yuffishh
Do Vision-Language-Action Models truly follow your language instructions? We present "When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs." VLAs promise to ground language instructions in robot control, yet in practice they often fail to follow language faithfully.
📄 Paper: arxiv.org/abs/2602.17659
🌐 Project: vla-va.github.io
💡 Highlights
Vision shortcuts and counterfactual failures. When given instructions that lack strong scene-specific supervision, VLAs default to well-learned scene-specific behaviors regardless of language intent.
Counterfactual benchmark. We introduce LIBERO-CF, the first counterfactual benchmark for evaluating language following in VLAs. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs.
Our solution. We propose Counterfactual Action Guidance (CAG), a simple plug-and-play dual-branch inference scheme that strengthens language conditioning without changing pretrained VLA architectures or weights.
Experiments. CAG is effective across multiple dimensions of language grounding, consistently improving both language grounding and task success on under-observed tasks.
#VLA #Robotics #Vision #Language
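The tweet describes CAG only as a "plug-and-play dual-branch inference scheme," so the snippet below should be read as a loose, generic illustration of how a guidance-style combination of two inference branches over action predictions can strengthen conditioning, not as the paper's method; the branch definitions, the beta scale, and all names are assumptions.

import numpy as np

def dual_branch_action(
    act_with_language: np.ndarray,   # action predicted given image + instruction
    act_language_free: np.ndarray,   # action from a branch with language ablated/neutralized
    beta: float = 1.5,
) -> np.ndarray:
    """Generic guidance-style combination of two inference branches: push the
    executed action away from the language-free prediction and toward the
    language-conditioned one, amplifying the effect of the instruction."""
    return act_language_free + beta * (act_with_language - act_language_free)

# Toy usage with 7-DoF action vectors.
a_lang = np.array([0.10, 0.00, -0.05, 0.0, 0.0, 0.0, 1.0])
a_free = np.array([0.02, 0.00, -0.05, 0.0, 0.0, 0.0, 1.0])
print(dual_branch_action(a_lang, a_free))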
UNC Computer Science retweeted
UNC School of Data Science and Society
Where could data science take you?
🏈 Sports analytics
📈 Growth equity
🔐 Data & AI security
Join @RENCI's live panel — moderated by SDSS faculty member Keri Smith — with professionals working across these fields and hear how their careers unfolded.
🔗 go.unc.edu/NCDScareerpane…
[Image attached]
UNC Computer Science retweeted
Daeun Lee @danadaeun
🥳 Happy to announce that StreamGaze is accepted to #CVPR2026!
👀 We introduce the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding for streaming video understanding. We find that all MLLMs fall far below human performance, particularly in temporal continuity, gaze grounding, and proactive prediction.
💗 Huge thanks to my last year's AdobeResearch team: Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Lai, David Seunghyun Yoon, Trung Bui, Franck Dernoncourt, and my advisor Mohit Bansal 😃
Daeun Lee @danadaeun

🤔 We rely on gaze to guide our actions, but can current MLLMs truly understand it and infer our intentions? Introducing StreamGaze 👀, the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding in streaming video settings.
➡️ Gaze-Guided Streaming Benchmark: 10 tasks spanning past, present, and proactive reasoning, from gaze-sequence matching to alerting when objects appear within the FOV area.
➡️ Gaze-Guided Streaming Data Construction Pipeline: We align egocentric videos with raw gaze trajectories using fixation extraction, region-specific visual prompting, and scanpath construction to generate spatio-temporally grounded QA pairs. This process is human-verified.
➡️ Comprehensive Evaluation of State-of-the-Art MLLMs: Across all gaze-conditioned streaming tasks, we highlight fundamental limits of current MLLMs. All MLLMs fall far below human performance. Models particularly struggle with temporal continuity, gaze grounding, and proactive prediction.

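The data pipeline above mentions fixation extraction from raw gaze trajectories. The thread does not say which detector StreamGaze uses; the sketch below shows a standard dispersion-threshold (I-DT-style) fixation detector as a generic illustration, with the thresholds chosen arbitrarily.

def detect_fixations(
    gaze: list[tuple[float, float, float]],  # (timestamp_s, x, y) samples
    dispersion_thresh: float = 0.05,         # max spatial spread within one fixation
    min_duration: float = 0.1,               # minimum fixation length in seconds
) -> list[tuple[float, float, float, float]]:
    """Dispersion-threshold (I-DT-style) fixation detection: grow a window while
    the gaze points stay spatially compact, and emit (t_start, t_end, cx, cy)."""
    fixations, i, n = [], 0, len(gaze)
    while i < n:
        j = i
        # Expand the window while the spread of x plus the spread of y stays small.
        while j + 1 < n:
            xs = [p[1] for p in gaze[i:j + 2]]
            ys = [p[2] for p in gaze[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > dispersion_thresh:
                break
            j += 1
        if gaze[j][0] - gaze[i][0] >= min_duration:
            window = gaze[i:j + 1]
            cx = sum(p[1] for p in window) / len(window)
            cy = sum(p[2] for p in window) / len(window)
            fixations.append((gaze[i][0], gaze[j][0], cx, cy))
            i = j + 1
        else:
            i += 1
    return fixations

# Toy usage: a stable fixation followed by a saccade to a new location.
trace = [(t * 0.02, 0.50, 0.50) for t in range(10)] + [(0.2 + t * 0.02, 0.80, 0.30) for t in range(10)]
print(detect_fixations(trace))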
Runchu Tian @Runchu_Tian
🎉Excited to share that I’ll be starting my PhD at UNC Chapel Hill @UNC, joining MURGe-Lab, advised by Prof. Mohit Bansal @mohitban47! I’ll be working on multimodality, reasoning, and AI agents. New chapter begins! #PhD #NLP #UNCCH #Multimodal
[Image attached]
UNC Computer Science retweeted
Mohit Bansal @mohitban47
🚨 If you are looking for an amazing research scientist in any topic related to Reasoning, RL/Post-Training, Reward+Skill Discovery, or Agents+Planning, you should definitely hire Archiki! She is an Apple AI/ML PhD Fellow doing foundational work in robust reasoning (self-improvement, error localization, new reward models) and scaling these methods to complex planning, coding, and info-seeking agents. 👇👇
Archiki Prasad @ArchikiPrasad

🚨 I’m on the 2026 Research Scientist Job Market! I am a PhD student at UNC Chapel Hill (advised by @mohitban47) and recipient of the Apple Scholars in AI/ML PhD Fellowship. My research centers around:
🔸Reasoning & RL/Post-Training: Evaluating and interpreting the reasoning process, and improving post-training and alignment through self-generated and reward-based signals (Intrinsic Dim., ReCEVAL, ScPO, LASeR).
🔸Agents & Planning: Designing adaptive agent frameworks that use extra test-time compute & reasoning upon failure (ADaPT, System-1.x, PRInTS).
🔸Reward & Skill Discovery in Code: Leveraging execution signals to build reliable rewards, automate debugging, and discover abstractions in code (UTGen, ReGAL).
Prev (Research Intern): Google DeepMind, Meta FAIR, Allen Institute for AI (AI2), and Adobe Research.
Feel free to reach out via DM or email if you’re interested, have leads, or would like to connect!
🌐 archiki.github.io
📧 archiki@cs.unc.edu
#NLP #AI #JobSearch

UNC Computer Science retweeted
Archiki Prasad @ArchikiPrasad
🚨 I’m on the 2026 Research Scientist Job Market! I am a PhD student at UNC Chapel Hill (advised by @mohitban47) and recipient of the Apple Scholars in AI/ML PhD Fellowship. My research centers around:
🔸Reasoning & RL/Post-Training: Evaluating and interpreting the reasoning process, and improving post-training and alignment through self-generated and reward-based signals (Intrinsic Dim., ReCEVAL, ScPO, LASeR).
🔸Agents & Planning: Designing adaptive agent frameworks that use extra test-time compute & reasoning upon failure (ADaPT, System-1.x, PRInTS).
🔸Reward & Skill Discovery in Code: Leveraging execution signals to build reliable rewards, automate debugging, and discover abstractions in code (UTGen, ReGAL).
Prev (Research Intern): Google DeepMind, Meta FAIR, Allen Institute for AI (AI2), and Adobe Research.
Feel free to reach out via DM or email if you’re interested, have leads, or would like to connect!
🌐 archiki.github.io
📧 archiki@cs.unc.edu
#NLP #AI #JobSearch
UNC Computer Science retweeted
Mohit Bansal @mohitban47
🚨 Check out AnchorWeave, a local-3D-memory-guided video generation framework for consistent long-horizon world modeling -->
▪️ Avoids brittle global 3D fusion that accumulates cross-view misalignment errors
▪️ Retrieves clean local 3D memories (anchors) and learns to weave them during generation
▪️ Stronger long-horizon consistency, more controllable world exploration, and open-domain generalization
Details 👇
Zun Wang @ZunWang919

🚀 Excited to share AnchorWeave — a local-memory-augmented framework for world-consistent long-horizon video generation.
- Global 3D reconstruction as memory accumulates cross-view misalignment and contaminates conditioning signals.
- We replace a single noisy global 3D memory with multiple retrieved local 3D memories and learn to weave them.
- Stronger long-horizon scene consistency and generalization ability.
🧵👇

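The thread above describes retrieving several local 3D memories instead of conditioning on one fused global map, but not how retrieval is done. Purely as a generic illustration, the sketch below picks the k local anchors whose stored camera positions are closest to the current camera; the distance metric, the value of k, and all names are assumptions, not AnchorWeave's actual retrieval rule.

import numpy as np

def retrieve_local_anchors(
    current_cam_pos: np.ndarray,   # (3,) current camera position
    anchor_cam_pos: np.ndarray,    # (N, 3) camera position stored with each local 3D memory
    anchor_payloads: list,         # the N local memories themselves (point clouds, latents, ...)
    k: int = 3,
) -> list:
    """Return the k local 3D memories whose viewpoints are nearest to the
    current camera, instead of conditioning on one fused global map."""
    dists = np.linalg.norm(anchor_cam_pos - current_cam_pos[None, :], axis=1)
    nearest = np.argsort(dists)[:k]
    return [anchor_payloads[i] for i in nearest]

# Toy usage: five stored anchors, retrieve the three closest to the current pose.
positions = np.array([[0, 0, 0], [1, 0, 0], [5, 5, 0], [0.5, 0.2, 0], [10, 0, 0]], dtype=float)
payloads = ["mem_a", "mem_b", "mem_c", "mem_d", "mem_e"]
print(retrieve_local_anchors(np.array([0.6, 0.0, 0.0]), positions, payloads))  # ['mem_d', 'mem_b', 'mem_a']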
UNC Computer Science retweeted
Zun Wang @ZunWang919
🚀 Excited to share AnchorWeave — a local-memory-augmented framework for world-consistent long-horizon video generation.
- Global 3D reconstruction as memory accumulates cross-view misalignment and contaminates conditioning signals.
- We replace a single noisy global 3D memory with multiple retrieved local 3D memories and learn to weave them.
- Stronger long-horizon scene consistency and generalization ability.
🧵👇
UNC Computer Science retweeted
Ziyang Wang @ZiyangW00
🚨Excited to share our new paper MuRGAt: “Multimodal Fact-Level Attribution for Verifiable Reasoning”
Key finding: even strong MLLMs can be right on the final answer but wrong on the evidence (hallucinated citations / mis-grounded modality or timestamp).
What MuRGAt adds:
- Human annotations to judge whether each cited piece of evidence actually supports a claim. ✅
- Atomic fact decomposition to evaluate attribution at the fact level, not just the final answer. 🧩
- MuRGAt-SCORE, a metric that aligns well with human judgment. 📏
- Benchmarks across strong MLLMs + studies showing programmatic grounding can improve attribution. ⚖️
Paper + code + details in the original thread 👇
David Wan @meetdavidwan

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯
We introduce a benchmark for Fact-Level Multimodal Attribution featuring:
✅ High-quality Human Annotations for validation.
✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment.
✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution.
🧵👇

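Neither tweet gives the definition of MuRGAt-SCORE, only that it is a decomposed, fact-level metric. As a stand-in illustration of what fact-level attribution scoring looks like in general, the sketch below averages per-fact support judgments over a response's atomic facts; the exact decomposition and aggregation in the paper will differ.

from dataclasses import dataclass

@dataclass
class AtomicFact:
    text: str             # one atomic claim decomposed from the model's answer
    citations: list[str]  # evidence segments the model cited for this claim
    supported: bool       # human (or judge-model) verdict: do the citations support it?

def fact_level_attribution_score(facts: list[AtomicFact]) -> float:
    """Toy fact-level attribution score: the fraction of atomic facts whose
    cited evidence actually supports them. A response can answer correctly
    yet score low here if its citations are hallucinated or mis-grounded."""
    if not facts:
        return 0.0
    return sum(f.supported for f in facts) / len(facts)

# Toy usage: a two-fact answer where one citation is hallucinated.
facts = [
    AtomicFact("The speaker holds a red mug.", ["video 00:12-00:15"], supported=True),
    AtomicFact("The chart shows a 30% drop.", ["chart panel B"], supported=False),
]
print(fact_level_attribution_score(facts))  # 0.5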
UNC Computer Science retweeted
David Wan @meetdavidwan
🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯
We introduce a benchmark for Fact-Level Multimodal Attribution featuring:
✅ High-quality Human Annotations for validation.
✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment.
✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution.
🧵👇
[Image attached]
UNC Computer Science retweeted