

David Wan
299 posts

@meetdavidwan
𝗢𝗻 𝘁𝗵𝗲 𝗜𝗻𝗱𝘂𝘀𝘁𝗿𝘆 𝗝𝗼𝗯 𝗠𝗮𝗿𝗸𝗲𝘁 | PhD student at @unccs advised by @mohitban47 | @Google PhD Fellow | prev: @AmazonScience, @MetaAI, @SFResearch

🤔 We rely on gaze to guide our actions, but can current MLLMs truly understand it and infer our intentions? Introducing StreamGaze 👀, the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding in streaming video settings.

➡️ Gaze-Guided Streaming Benchmark: 10 tasks spanning past, present, and proactive reasoning, from gaze-sequence matching to alerting when objects appear within the field of view (FOV).

➡️ Gaze-Guided Streaming Data Construction Pipeline: We align egocentric videos with raw gaze trajectories using fixation extraction, region-specific visual prompting, and scanpath construction to generate spatio-temporally grounded QA pairs; all pairs are human-verified (see the fixation-extraction sketch below).

➡️ Comprehensive Evaluation of State-of-the-Art MLLMs: Across all gaze-conditioned streaming tasks, we highlight fundamental limits of current MLLMs: all models fall far below human performance and particularly struggle with temporal continuity, gaze grounding, and proactive prediction.
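The post doesn't say which fixation-extraction algorithm the pipeline uses, so the following is a minimal, illustrative Python sketch of one common choice, a dispersion-threshold (I-DT) pass over a raw gaze trajectory; the function name, thresholds, and input format are assumptions, not the StreamGaze implementation.

# Illustrative sketch only: dispersion-threshold (I-DT) fixation extraction.
# Assumed inputs: per-frame normalized gaze coordinates and timestamps in seconds.
import numpy as np

def extract_fixations(gaze_xy, timestamps, dispersion_thresh=0.05, min_duration=0.1):
    """Return (start_time, end_time, centroid_xy) tuples for detected fixations."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    fixations, start, n = [], 0, len(gaze_xy)
    while start < n:
        # Grow a window until it spans at least min_duration seconds.
        end = start
        while end < n and timestamps[end] - timestamps[start] < min_duration:
            end += 1
        if end >= n:
            break
        window = gaze_xy[start:end + 1]
        dispersion = np.ptp(window[:, 0]) + np.ptp(window[:, 1])
        if dispersion <= dispersion_thresh:
            # Extend the window while the gaze stays within the dispersion threshold.
            while end + 1 < n:
                extended = gaze_xy[start:end + 2]
                if np.ptp(extended[:, 0]) + np.ptp(extended[:, 1]) > dispersion_thresh:
                    break
                end += 1
            centroid = gaze_xy[start:end + 1].mean(axis=0)
            fixations.append((timestamps[start], timestamps[end], centroid))
            start = end + 1
        else:
            # No fixation starting here; slide the window forward by one sample.
            start += 1
    return fixations

Fixation centroids produced this way could then anchor the region-specific visual prompts and scanpaths mentioned in the post.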

🚀 Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate into faithful grounding in multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct! 🤯

We introduce a benchmark for Fact-Level Multimodal Attribution featuring:
✅ High-quality human annotations for validation.
✅ MuRGAt-SCORE: a decomposed metric that correlates highly with human judgment (a rough sketch follows below).
✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution.
🧵👇
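The post doesn't spell out how MuRGAt-SCORE is computed, so here is a rough Python sketch of how a fact-level, decomposed attribution metric could aggregate per-fact citation judgments; the Fact structure, the supports judge, and the precision/recall naming are illustrative assumptions, not the actual metric.

# Illustrative sketch only: aggregating fact-level citation judgments.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Fact:
    text: str  # one atomic claim decomposed from the model answer
    citations: List[str] = field(default_factory=list)  # ids of cited video/audio/chart segments

def attribution_scores(facts: List[Fact], supports: Callable[[str, str], bool]) -> Dict[str, float]:
    """supports(fact_text, source_id) stands in for whatever judge (human or model)
    decides whether a cited multimodal segment actually supports the claim."""
    supported_facts = 0    # facts backed by at least one valid citation
    correct_citations = 0  # citations that truly support their fact
    total_citations = 0
    for fact in facts:
        verdicts = [supports(fact.text, c) for c in fact.citations]
        total_citations += len(verdicts)
        correct_citations += sum(verdicts)
        supported_facts += int(any(verdicts))
    recall = supported_facts / len(facts) if facts else 0.0
    precision = correct_citations / total_citations if total_citations else 0.0
    return {"citation_recall": recall, "citation_precision": precision}

Under such a decomposition, a model can be "right but unsupported": its answer's facts may all be correct while citation_recall stays low, which matches the hallucinated-citations failure mode the post describes.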




