David Wan

299 posts

@meetdavidwan

On the Industry Job Market | PhD student at @unccs advised by @mohitban47 | @Google PhD Fellow | prev: @AmazonScience, @MetaAI, @SFResearch

Joined February 2019
485 Following · 697 Followers
Pinned Tweet
David Wan @meetdavidwan
🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯
We introduce a benchmark for Fact-Level Multimodal Attribution featuring:
✅ High-quality Human Annotations for validation.
✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment.
✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution.
🧵👇
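The thread doesn't spell out how MuRGAt-SCORE is computed. A fact-level, decomposed attribution metric of the kind described could be sketched as below; the `Fact` structure, its field names, and the precision/recall/F1 combination are illustrative assumptions, not the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    cited: set          # evidence IDs the model cited for this fact
    supported_by: set   # evidence IDs that actually support it (human/verifier labels)

def attribution_score(facts):
    """Average fact-level citation precision and recall, combined as F1."""
    precisions, recalls = [], []
    for f in facts:
        correct = f.cited & f.supported_by
        precisions.append(len(correct) / len(f.cited) if f.cited else 0.0)
        recalls.append(len(correct) / len(f.supported_by) if f.supported_by else 1.0)
    p = sum(precisions) / len(facts)
    r = sum(recalls) / len(facts)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Decomposing the answer into atomic facts before scoring is what lets a metric like this catch a model that is right overall but cites the wrong chart or timestamp for one claim.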
David Wan retweeted
Han Lin @hanlin_hl
🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ Architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ The best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ Calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs.
🧵👇
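The exact form of the "RMS-based feature rescaling" in point 4 isn't given in the tweet. A minimal sketch of the general idea (normalize a semantic feature tensor so its root-mean-square matches a target scale before mixing it with the pixel branch) might look like this; the function name and default target are assumptions:

```python
import math

def rms_rescale(features, target_rms=1.0, eps=1e-8):
    """Rescale a flat list of feature values so their root-mean-square
    matches target_rms (a calibration step before mixing feature spaces)."""
    rms = math.sqrt(sum(v * v for v in features) / len(features))
    scale = target_rms / (rms + eps)
    return [v * scale for v in features]
```

The motivation for such a step is that pretrained features like DINO's can sit at a very different magnitude than pixel latents, which destabilizes joint denoising.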
David Wan retweeted
Daeun Lee @danadaeun
🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation!
🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence.
⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation.
⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA).
👇🧵
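The tweet says VisionCoach rewards correct visual attention alongside answer supervision, but not how the two signals combine. One toy version of such a blended reward uses box IoU for attention quality; the `alpha` weighting, the (x1, y1, x2, y2) box format, and the linear blend are assumptions, not the paper's design.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounded_reward(answer_correct, attended_box, evidence_box, alpha=0.5):
    """Blend answer correctness with spatial grounding quality."""
    return (1 - alpha) * float(answer_correct) + alpha * box_iou(attended_box, evidence_box)
```

A reward of this shape makes "right answer, wrong evidence" strictly less valuable than "right answer, right evidence", which is the stated goal of rewarding visual attention.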
David Wan retweeted
Mohit Bansal @mohitban47
It was a pleasure to visit Georgetown and deliver a Distinguished Lecture in AI (in a national historic landmark*), and to have engaging discussions with the faculty, students, and provost about the present, future, and societal impact of calibrated, controllable, collaborative AI agents that plan & learn/improve skills 🙂
*PS. It was extra special to deliver the lecture in the historic 1891 Riggs Library (one of the few extant cast-iron libraries in the nation, known for its magical Hogwarts-like setting) inside Healy Hall, a National Historic Landmark and the flagship building of Georgetown. Thanks again for the kind invitation!
David Wan retweeted
Daeun Lee @danadaeun
🥳 Happy to announce that StreamGaze is accepted to #CVPR2026!
👀 We introduce the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding for streaming video understanding. We find that all MLLMs fall far below human performance, particularly in temporal continuity, gaze grounding, and proactive prediction.
💗 Huge thanks to my AdobeResearch team from last year: Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Lai, David Seunghyun Yoon, Trung Bui, Franck Dernoncourt, and my advisor Mohit Bansal 😃
Daeun Lee @danadaeun

🤔 We rely on gaze to guide our actions, but can current MLLMs truly understand it and infer our intentions? Introducing StreamGaze 👀, the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding in streaming video settings.
➡️ Gaze-Guided Streaming Benchmark: 10 tasks spanning past, present, and proactive reasoning, from gaze-sequence matching to alerting when objects appear within the FOV area.
➡️ Gaze-Guided Streaming Data Construction Pipeline: We align egocentric videos with raw gaze trajectories using fixation extraction, region-specific visual prompting, and scanpath construction to generate spatio-temporally grounded QA pairs. This process is human-verified.
➡️ Comprehensive Evaluation of State-of-the-Art MLLMs: Across all gaze-conditioned streaming tasks, we highlight fundamental limits of current MLLMs. All MLLMs fall far below human performance. Models particularly struggle with temporal continuity, gaze grounding, and proactive prediction.

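The pipeline mentions fixation extraction from raw gaze trajectories without naming a method. One standard choice is dispersion-threshold identification (I-DT), sketched below; the thresholds and the (x, y) gaze-sample format are illustrative, and StreamGaze's actual pipeline may differ.

```python
def _dispersion(pts):
    """Sum of x-range and y-range of a window of gaze points."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def extract_fixations(gaze, max_dispersion=0.05, min_len=5):
    """I-DT style fixation detection: grow a window while its dispersion
    stays under threshold, then emit the window's centroid."""
    fixations, i = [], 0
    while i + min_len <= len(gaze):
        j = i + min_len
        if _dispersion(gaze[i:j]) <= max_dispersion:
            while j < len(gaze) and _dispersion(gaze[i:j + 1]) <= max_dispersion:
                j += 1
            window = gaze[i:j]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy, i, j))  # centroid + [start, end) sample indices
            i = j
        else:
            i += 1
    return fixations
```

Grouping raw samples into fixations like this is what makes downstream steps such as region-specific visual prompting and scanpath construction tractable: each fixation becomes one spatial anchor instead of dozens of noisy samples.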
David Wan retweeted
Elias Stengel-Eskin @EliasEskin
🚨 Excited to share Reasoning Execution by Multiple Listeners (REMuL), a multi-party training method for faithful reasoning. Consistently boosts faithfulness evals (hint attribution, early answering, mistake injection) across diverse reasoning tasks while maintaining accuracy!
➡️ Faithfulness is key for CoT interpretability, but current LLMs produce unfaithful reasoning that is hard to follow, and standard outcome-focused RL hurts faithfulness.
➡️ REMuL approaches faithfulness through the lens of executability: a CoT is faithful if independent "listener" models can follow/execute a truncated CoT prefix and reliably arrive at the same conclusion as the "speaker" model.
➡️ REMuL trains the speaker via GRPO to produce reasoning that achieves consistent answers among listeners, while maintaining correctness via masked supervised finetuning.
➡️ Interestingly, REMuL's multi-party training generalizes better: directly optimizing for faithfulness metrics improves those metrics alone, but not others, while REMuL improves across metrics!
🧵👇
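REMuL's core signal, as described, is whether listeners reach the speaker's conclusion from a truncated CoT prefix. A toy consistency reward capturing that idea is below; listeners are modeled as plain callables here, whereas the actual method optimizes this signal with GRPO over LLM listeners.

```python
def listener_consistency_reward(prefix, speaker_answer, listeners):
    """Fraction of listener models that, given only the truncated CoT
    prefix, arrive at the same answer as the speaker."""
    votes = [listener(prefix) for listener in listeners]
    return sum(v == speaker_answer for v in votes) / len(listeners)
```

The executability framing is what distinguishes this from plain outcome rewards: a CoT only scores well if its prefix alone carries enough real reasoning for independent models to finish it the same way.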
David Wan retweeted
Runchu Tian @Runchu_Tian
🎉Excited to share that I’ll be starting my PhD at UNC Chapel Hill @UNC, joining MURGe-Lab, advised by Prof. Mohit Bansal @mohitban47! I’ll be working on multimodality, reasoning, and AI agents. New chapter begins! #PhD #NLP #UNCCH #Multimodal
David Wan retweeted
Archiki Prasad @ArchikiPrasad
🚨 I’m on the 2026 Research Scientist Job Market!
I am a PhD student at UNC Chapel Hill (advised by @mohitban47) and a recipient of the Apple Scholars in AI/ML PhD Fellowship. My research centers around:
🔸 Reasoning & RL/Post-Training: Evaluating and interpreting the reasoning process, and improving post-training and alignment through self-generated and reward-based signals (Intrinsic Dim., ReCEVAL, ScPO, LASeR).
🔸 Agents & Planning: Designing adaptive agent frameworks that use extra test-time compute & reasoning upon failure (ADaPT, System-1.x, PRInTS).
🔸 Reward & Skill Discovery in Code: Leveraging execution signals to build reliable rewards, automate debugging, and discover abstractions in code (UTGen, ReGAL).
Prev (Research Intern): Google DeepMind, Meta FAIR, Allen Institute for AI (AI2), and Adobe Research.
Feel free to reach out via DM or email if you’re interested, have leads, or would like to connect!
🌐 archiki.github.io
📧 archiki@cs.unc.edu
#NLP #AI #JobSearch
David Wan retweeted
Zun Wang @ZunWang919
🚀 Excited to share AnchorWeave — a local-memory-augmented framework for world-consistent long-horizon video generation.
- Global 3D reconstruction as memory accumulates cross-view misalignment and contaminates conditioning signals.
- We replace a single noisy global 3D memory with multiple retrieved local 3D memories and learn to weave them.
- Stronger long-horizon scene consistency and generalization ability.
🧵👇
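The tweet doesn't say how the local 3D memories are retrieved. A plausible toy sketch retrieves the k memories whose anchor camera positions lie nearest the query pose; the memory schema, the key names, and the Euclidean-distance choice are all assumptions for illustration.

```python
import math

def retrieve_local_memories(query_pos, memories, k=3):
    """Return the k stored local memories whose anchor camera positions
    are nearest to the query position (Euclidean distance)."""
    return sorted(memories, key=lambda m: math.dist(query_pos, m["pose"]))[:k]
```

Retrieving several nearby local memories rather than conditioning on one global reconstruction matches the tweet's motivation: each local memory stays internally consistent, and the model learns to weave them instead of inheriting accumulated cross-view misalignment.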
David Wan retweeted
Ziyang Wang @ZiyangW00
🚨Excited to share our new paper MuRGAt: “Multimodal Fact-Level Attribution for Verifiable Reasoning”
Key finding: even strong MLLMs can be right on the final answer but wrong on the evidence (hallucinated citations / mis-grounded modality or timestamp).
What MuRGAt adds:
- Human annotations to judge whether each cited piece of evidence actually supports a claim. ✅
- Atomic fact decomposition to evaluate attribution at the fact level, not just the final answer. 🧩
- MuRGAt-SCORE, a metric that aligns well with human judgment. 📏
- Benchmarks across strong MLLMs + studies showing programmatic grounding can improve attribution. ⚖️
Paper + code + details in the original thread 👇
[Quoted tweet: David Wan @meetdavidwan's MuRGAt announcement, pinned above]
David Wan retweeted
hyunji amy lee @hyunji_amy_lee
🧐MLLMs are improving at reasoning tasks, but do they actually reason with correct sources? We introduce MuRGAt, a benchmark for Multimodal Reasoning with Grounded Attribution:
❗️ Even strong MLLMs often hallucinate citations despite answering correctly.
❗️ There’s a trade-off between reasoning and attribution: increased thinking can improve reasoning while degrading grounding, and programmatic grounding boosts attribution at the cost of reasoning accuracy.
More details in the thread below ⬇️
[Quoted tweet: David Wan @meetdavidwan's MuRGAt announcement, pinned above]
David Wan @meetdavidwan
💻 Can we fix this with code-generation pipelines? We explored programmatic approaches that decouple reasoning from citation generation (e.g., a "plan-then-execute" paradigm). Programmatic methods improved attribution quality (avg. +9.6 MuRGAt-SCORE) over baselines! However, there's a distinct trade-off: forcing explicit structured grounding often degrades reasoning accuracy on complex tasks.
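The thread doesn't detail the plan-then-execute pipeline. A toy sketch of the decoupling idea, in which the model may only cite evidence the executor actually fetched (so citations cannot be hallucinated, though reasoning is now constrained to the plan), follows; all function roles and the evidence-ID scheme are hypothetical.

```python
def plan_then_execute(question, evidence_index, planner, answerer):
    """Decouple reasoning from citation: the planner names evidence IDs to
    consult, the executor fetches them, and the answerer's citations are
    filtered down to evidence that was actually retrieved."""
    plan = planner(question)  # e.g. a list of evidence IDs to look up
    fetched = {eid: evidence_index[eid] for eid in plan if eid in evidence_index}
    answer = answerer(question, fetched)
    citations = [eid for eid in answer["cites"] if eid in fetched]
    return answer["text"], citations
```

The filtering step at the end is the crux: attribution improves because every surviving citation is verifiably tied to retrieved evidence, while the accuracy cost comes from the answerer only seeing what the plan requested.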