Han Wang

24

75

10.5K

Han Wang retweetledi

Peter Hase@peterbhase·4 Mar

Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)

English

12

36

210

20.9K

Han Wang retweetledi

Daeun Lee@danadaeun·23 Şub

🥳 Happy to announce that StreamGaze is accepted to #CVPR2026! 👀 We introduce the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding for streaming video understanding. We find that all MLLMs fall far below human performance, particularly in temporal continuity, gaze grounding, and proactive prediction. 💗 Huge thanks to my last year's AdobeResearch team: Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Lai, David Seunghyun Yoon, Trung Bui, Franck Dernoncourt, and my advisor Mohit Bansal 😃

Daeun Lee@danadaeun

🤔 We rely on gaze to guide our actions, but can current MLLMs truly understand it and infer our intentions? Introducing StreamGaze 👀, the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding in streaming video settings. ➡️ Gaze-Guided Streaming Benchmark: 10 tasks spanning past, present, and proactive reasoning, from gaze-sequence matching to alerting when objects appear within the FOV area. ➡️ Gaze-Guided Streaming Data Construction Pipeline: We align egocentric videos with raw gaze trajectories using fixation extraction, region-specific visual prompting, and scanpath construction to generate spatio-temporally grounded QA pairs. This process is human-verified. ➡️ Comprehensive Evaluation of State-of-the-Art MLLMs: Across all gaze-conditioned streaming tasks, we highlight fundamental limits of current MLLMs. All MLLMs fall far below human performance. Models particularly struggle with temporal continuity, gaze grounding, and proactive prediction.

English

3

19

67

5.5K

Han Wang retweetledi

Elias Stengel-Eskin@EliasEskin·19 Şub

🚨 Excited to share Reasoning Execution by Multiple Listeners (REMuL), a multi-party training method for faithful reasoning. Consistently boosts faithfulness evals (hint attribution, early answering, mistake injection) across diverse reasoning tasks while maintaining accuracy! ➡️ Faithfulness is key for CoT interpretability but current LLMs produce unfaithful reasoning that is hard to follow, with standard outcome-focused RL hurting faithfulness. ➡️ REMuL approaches faithfulness through the lens of executability. A CoT is faithful if independent "listener" models can follow/execute a truncated CoT prefix and reliably arrive at the same conclusion as the “speaker” model. ➡️ REMuL trains the speaker via GRPO to produce reasoning that achieves consistent answers among listeners, while maintaining correctness via masked supervised finetuning. ➡️ Interestingly, REMuL's multi-party training generalizes better. Directly optimizing for faithfulness metrics improves those metrics alone, but not others, while REMuL improves across metrics! 🧵👇

English

25

37

5.8K

Han Wang@HanWang98·19 Şub

@Runchu_Tian @UNC @mohitban47 Welcome to the lab, Runchu! 🥳🥳 I look forward to collaborating with you! 🤗

English

0

3

203

Han Wang retweetledi

Runchu Tian@Runchu_Tian·19 Şub

🎉Excited to share that I’ll be starting my PhD at UNC Chapel Hill @UNC, joining MURGe-Lab, advised by Prof. Mohit Bansal @mohitban47! I’ll be working on multimodality, reasoning, and AI agents. New chapter begins! #PhD #NLP #UNCCH #Multimodal

English

17

100

7K

Han Wang retweetledi

AK@_akhaliq·18 Şub

Multimodal Fact-Level Attribution for Verifiable Reasoning huggingface.co/papers/2602.11…

English

3

5

24

6.4K

Han Wang retweetledi

Archiki Prasad@ArchikiPrasad·18 Şub

🚨 I’m on the 2026 Research Scientist Job Market! I am a PhD student at UNC Chapel Hill (advised by @mohitban47) and recipient of the Apple Scholars in AI/ML PhD Fellowship. My research centers around: 🔸Reasoning & RL/Post-Training: Evaluating and interpreting the reasoning process, and improving post-training and alignment through self-generated and reward-based signals (Intrinsic Dim., ReCEVAL, ScPO, LASeR). 🔸Agents & Planning: Designing adaptive agent frameworks to that use extra test-time compute & reasoning upon failure (ADaPT, System-1.x, PRInTS). 🔸Reward & Skill Discovery in Code: Leveraging execution signals to build reliable rewards, automate debugging, and discover abstractions in code (UTGen, ReGAL). Prev (Research Intern): Google DeepMind, Meta FAIR, Allen Institute for AI (AI2), and Adobe Research. Feel free to reach out via DM or email if you’re interested, have leads, or would like to connect! 🌐 archiki.github.io 📧 archiki@cs.unc.edu #NLP #AI #JobSearch

English

15

59

344

54.7K

Han Wang retweetledi

Zun Wang@ZunWang919·17 Şub

🚀 Excited to share AnchorWeave — a local-memory-augmented framework for world-consistent long-horizon video generation. - Global 3D reconstruction as memory accumulates cross-view misalignment and contaminates conditioning signals. - We replace a single noisy global 3D memory with multiple retrieved local 3D memories and learn to weave them. - Stronger long-horizon scene consistency and generalization ability. 🧵👇

English

55

299

27.5K

Han Wang retweetledi

Elias Stengel-Eskin@EliasEskin·13 Şub

When reasoning over multimodal data, we want MLLMs to attribute their answers to the source (audio, video, etc.) We introduce the MuRGAt benchmark and MuRGAt-Score to test MLLM citation, & eval top models against human annotations! Even high acc models have a long way to go!

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

English

8

18

1.6K

Han Wang retweetledi

Ziyang Wang@ZiyangW00·13 Şub

🚨Excited to share our new paper MuRGAt: “Multimodal Fact-Level Attribution for Verifiable Reasoning” Key finding: even strong MLLMs can be right on the final answer but wrong on the evidence (hallucinated citations / mis-grounded modality or timestamp). What MuRGAt adds: - Human annotations to judge whether each cited piece of evidence actually supports a claim. ✅ - Atomic fact decomposition to evaluate attribution at the fact level, not just the final answer. 🧩 - MuRGAt-SCORE, a metric that aligns well with human judgment. 📏 - Benchmarks across strong MLLMs + studies showing programmatic grounding can improve attribution. ⚖️ Paper + code + details in the original thread 👇

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

English

10

16

1.7K

Han Wang@HanWang98·13 Şub

🎉 Excited to share our new work MuRGAt! We show that high multimodal reasoning accuracy does not guarantee faithful grounding — MLLMs often answer correctly while hallucinating where their facts come from. Multimodal fact-level attribution remains a big open challenge! More details in the thread below ⬇️

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

English

7

10

1.4K

Han Wang retweetledi

hyunji amy lee@hyunji_amy_lee·13 Şub

🧐MLLMs are improving at reasoning tasks, but do they actually reason with correct sources? We introduce MuRGAt, a benchmark for Multimodal Reasoning with Grounded Attribution: ❗️Even strong MLLMs often hallucinate citations despite answering correctly. ❗️There’s a trade-off between reasoning and attribution: increased thinking can improve reasoning while degrading grounding, and programmatic grounding boosts attribution at the cost of reasoning accuracy. More details in the thread below ⬇️

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

English

10

28

1.8K

Han Wang retweetledi

David Wan@meetdavidwan·13 Şub

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

English

2

28

52

9.2K

Han Wang retweetledi

Archiki Prasad@ArchikiPrasad·11 Şub

🚨Excited to share our new work viewing reasoning strategies as teaching tools: for fixed target model, which CoT strategies best support learning and generalization? ✨Our answer is intrinsic dimensionality (minimum effective capacity a model needs to solve the task). Somewhat counterintuitively, adding CoT – which requires generating longer and more structured outputs – can reduce learning complexity. Good reasoning compresses the task, i.e., it reduces the degrees of freedom the model needs to map inputs to correct solutions. 🧵⬇️ (1/5)

English

5

45

186

24.2K

Han Wang retweetledi

Shoubin Yu@shoubin621·10 Şub

🚨 Excited to share AVIC — an analysis and framework for adaptive test-time scaling with world model imagination in visual spatial reasoning. 📉 Always-on visual imagination is often unnecessary, or even misleading. 📈 AVIC treats visual imagination as a selective, query-dependent test-time resource—showing that better spatial reasoning comes from deciding when and how much to imagine, not from imagining more. ➡️ Across spatial reasoning & embodied navigation, we get stronger accuracy with far fewer world-model calls and tokens. 🧵👇[1/6]

English

3

38

88

15.7K

Han Wang retweetledi

Mohit Bansal@mohitban47·6 Şub

🚨 If you are looking for a very strong researcher who can bridge the gap between factual reliability and complex multimodal reasoning, definitely check out David (he is a Google PhD Fellow with several useful contributions in faithfulness & hallucination mitigation, fine-grained attribution, multimodal retrieval, etc.) 👇👇

🚀 I'm on the 2026 Research Scientist Job Market! I am a Google PhD Fellow at UNC (advised by @mohitban47). I work on Faithful and Multimodal AI, focusing on reducing hallucinations and improving reasoning in generation tasks by: 🔹 Faithfulness & Hallucination Mitigation: Developing metrics and methods to ensure model outputs are factually consistent (e.g., FactPEGASUS, PrefixNLI). 🔹 Fine-grained Attribution & RAG: Creating frameworks that allow models to cite their sources and reason transparently (e.g., GenerationPrograms, LAQuer). 🔹 Multimodal Reasoning & Retrieval: Grounding vision-language models to reduce hallucinations in cross-modal tasks (e.g., CLaMR, Contrastive Region Guidance). Prev Intern: Google, Meta, Salesforce, Amazon. 🔗 meetdavidwan.github.io #NLP #AI #JobSearch

English