Zaid Khan

585 posts

@codezakh

NDSEG Fellow / PhD @uncnlp with @mohitban47, working on automating env/data generation + program synthesis. Formerly @allenai @neclabsamerica.

Boston, USA · Joined June 2023
1K Following · 595 Followers
Pinned Tweet
Zaid Khan @codezakh ·
How can an agent reverse engineer the underlying laws of an unknown, hostile & stochastic environment in “one life”, without millions of steps + human-provided goals / rewards? In our work, we:
1️⃣ infer an executable symbolic world model (a probabilistic program capturing environment dynamics) offline from one life (1 episode), with dynamic graph routing for credit assignment.
2️⃣ develop Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state.
3️⃣ implement 20+ executable scenarios to test knowledge of core mechanics + mutators that generate illegal distractor states to probe world model understanding.
4️⃣ introduce an evaluation protocol that measures the ability to distinguish plausible future states from implausible ones + the ability to generate future states that closely resemble reality.
5️⃣ show that the inferred world model can be used as a simulator for planning.
Thread 🧵👇
[GIF attached]
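Editor's note: the "object-oriented symbolic state + pure transition function" idea in point 2️⃣ can be sketched in a few lines. The toy sketch below is an illustration only; the class names, fields, and the "collect_wood" rule are assumptions, not the actual Crafter-OO API.

```python
# Toy sketch of an object-oriented symbolic state with a pure transition function,
# in the spirit of the Crafter-OO description above. Names/rules are hypothetical.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PlayerState:
    wood: int = 0
    energy: int = 9

@dataclass(frozen=True)
class WorldState:
    player: PlayerState = PlayerState()
    trees_nearby: int = 3

def transition(state: WorldState, action: str) -> WorldState:
    # Pure function: returns a new state and never mutates the input.
    if action == "collect_wood" and state.trees_nearby > 0 and state.player.energy > 0:
        return replace(
            state,
            player=replace(state.player,
                           wood=state.player.wood + 1,
                           energy=state.player.energy - 1),
            trees_nearby=state.trees_nearby - 1,
        )
    return state  # illegal or no-op actions leave the state unchanged

s0 = WorldState()
s1 = transition(s0, "collect_wood")
assert s0.player.wood == 0 and s1.player.wood == 1  # s0 untouched (purity)
```

A pure, inspectable transition like this is what lets the inferred program be replayed as a simulator for planning, as point 5️⃣ describes.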
Zaid Khan retweeted
Jaemin Cho @jmin__cho ·
🥳 I am incredibly honored and grateful to receive the 2026 @UNC Distinguished Dissertation Award! This award recognizes four recipients across the whole university, and I’m humbled to represent the Mathematics, Physical Sciences, and Engineering category this year. Many thanks to my advisor @mohitban47, our MURGe-Lab family, and the @unccs @unc_ai_group for their constant support! 🙏 This is a great reminder of all the good memories from my PhD journey before I start my faculty career at The Johns Hopkins University 😊
[image attached]
Zaid Khan retweeted
Jaemin Cho @jmin__cho ·
Introducing VFig! I love creating diagrams with nano-banana, but refining them with only text prompts is notoriously hard. To build a grounded, iterative-refinement interface for diagram creators, we trained VFig, a VLM that infers SVG code to reconstruct the original image, so users can then edit the diagram further directly in code space. Give your favorite paper figure screenshot to VFig! 👇
Zixian Ma @zixianma02

Ever come across a beautiful Figure 1 in a paper, only to wish you could easily edit and adapt it for your own use? Check out our new work VFig: Vectorizing Complex Figures in SVG with Vision-Language Models! It is a specialized VLM that converts any diagram – simple and complex – into editable and clean SVG code. Built on Qwen3-VL 4B with SFT & RL, it matches GPT5.2’s performance on converting complex diagrams into SVG code and outperforms open-source generalists and specialists on simple-to-complex diagram vectorization. 🕹️Try it now on our demo: tinyurl.com/vfig-demo

Zaid Khan retweeted
Mohit Bansal @mohitban47 ·
🚨 New #CVPR2026 collaboration with Google DeepMind --> Ego2Web bridges egocentric video perception and web execution, enabling agents that see the first-person real-world video of the user’s surroundings, and take actions on the web grounded in the egocentric video:
▪️ Introduces a task where agents must ground egocentric video (first-person view) into concrete web actions (requires visual grounding → entity extraction → planning → real website execution).
▪️ Covers realistic cross-domain tasks, e.g., e-commerce (find/buy items you saw), media retrieval (find related videos), knowledge lookup (identify & query entities), maps/local (locate places from visual cues).
▪️ Proposes Ego2WebJudge to automatically evaluate whether web agent results are correctly grounded in the video context.
▪️ Reveals concrete failure modes across 6 strong agents (GPT-5.4, Claude, Gemini-based agents, etc.): weak visual grounding, brittle cross-modal reasoning, and planning breakdowns (only ~58% success rate).
Details 👇👇
Shoubin Yu @shoubin621

Introducing Ego2Web from Google DeepMind and UNC Chapel Hill, accepted to #CVPR2026. AI agents can browse the web. But can they act based on what you see? Existing benchmarks focus only on web interaction while ignoring the real world. Ego2Web bridges egocentric video perception and web execution, enabling agents that can see through first-person video, understand real-world context, and take actions on the web grounded in the egocentric video. This opens a path toward AI assistants that operate seamlessly across physical and digital environments. We hope Ego2Web serves as an important step for building more capable, perception-driven agents. 🧵👇

Zaid Khan retweeted
Shoubin Yu @shoubin621 ·
Introducing Ego2Web from Google DeepMind and UNC Chapel Hill, accepted to #CVPR2026. AI agents can browse the web. But can they act based on what you see? Existing benchmarks focus only on web interaction while ignoring the real world. Ego2Web bridges egocentric video perception and web execution, enabling agents that can see through first-person video, understand real-world context, and take actions on the web grounded in the egocentric video. This opens a path toward AI assistants that operate seamlessly across physical and digital environments. We hope Ego2Web serves as an important step for building more capable, perception-driven agents. 🧵👇
Zaid Khan retweeted
Han Lin @hanlin_hl ·
🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇
[image attached]
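Editor's note: point 4️⃣ (RMS-based feature rescaling) is the most mechanical piece of the recipe. Below is a minimal sketch of one plausible form of it, assuming the goal is to bring pretrained features onto a fixed scale before joint denoising; the exact V-Co calibration may differ.

```python
# Minimal sketch (assumption): rescale pretrained semantic features by their RMS so the
# semantic stream and the pixel stream operate on a comparable scale.
import torch

def rms(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Root-mean-square over all non-batch dimensions, kept for broadcasting.
    return x.pow(2).mean(dim=tuple(range(1, x.ndim)), keepdim=True).sqrt().clamp_min(eps)

def rescale_features(feats: torch.Tensor, target_rms: float = 1.0) -> torch.Tensor:
    # Calibrate features (e.g., DINO tokens) to a fixed target RMS per sample.
    return feats * (target_rms / rms(feats))

# Example: DINO-like tokens of shape (batch, num_tokens, dim), arbitrarily mis-scaled.
dino_feats = torch.randn(4, 256, 768) * 7.3
calibrated = rescale_features(dino_feats)
print(rms(calibrated).flatten())  # ~1.0 for each sample
```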
Zaid Khan retweeted
Daeun Lee @danadaeun ·
🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation!
🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence.
⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation.
⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA).
👇🧵
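Editor's note: "reward correct visual attention" can take many forms; one plausible sketch is a reward that mixes answer correctness with overlap between attended regions and annotated evidence regions. Everything below (names, IoU-based overlap, the weighting) is a hypothetical illustration, not the VisionCoach reward.

```python
# Hypothetical grounding-aware RL reward: answer correctness plus overlap between the
# regions the model attended to and annotated evidence regions. Illustration only.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_answer: str, gold_answer: str,
                     attended: List[Box], evidence: List[Box],
                     alpha: float = 0.5) -> float:
    correct = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    # Best-matching IoU for each evidence region, averaged over evidence regions.
    if evidence and attended:
        ground = sum(max(iou(e, a) for a in attended) for e in evidence) / len(evidence)
    else:
        ground = 0.0
    return (1 - alpha) * correct + alpha * ground
```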
Zaid Khan retweeted
Mohit Bansal @mohitban47 ·
It was a pleasure to visit Georgetown and deliver a Distinguished Lecture in AI (in a national historic landmark*), and have engaging discussions about the present, future, and societal impact of calibrated, controllable, collaborative AI agents that plan & learn/improve skills, with the faculty+students+provost there 🙂 *PS. It was extra special to deliver the lecture in the historic 1891 Riggs Library (one of the few extant cast-iron libraries in the nation & known for its magical Hogwarts-like setting) inside Healy Hall, a National Historic Landmark and the flagship building of Georgetown, thanks again for the kind invitation!
[4 images attached]
Zaid Khan retweeted
Kianté Brantley @xkianteb ·
Does LLM RL post-training need to be on-policy?
Zaid Khan retweeted
Daeun Lee @danadaeun ·
🥳 Happy to announce that StreamGaze is accepted to #CVPR2026!
👀 We introduce the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding for streaming video understanding. We find that all MLLMs fall far below human performance, particularly in temporal continuity, gaze grounding, and proactive prediction.
💗 Huge thanks to my AdobeResearch team from last year: Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Lai, David Seunghyun Yoon, Trung Bui, Franck Dernoncourt, and my advisor Mohit Bansal 😃
Daeun Lee @danadaeun

🤔 We rely on gaze to guide our actions, but can current MLLMs truly understand it and infer our intentions? Introducing StreamGaze 👀, the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding in streaming video settings.
➡️ Gaze-Guided Streaming Benchmark: 10 tasks spanning past, present, and proactive reasoning, from gaze-sequence matching to alerting when objects appear within the field of view.
➡️ Gaze-Guided Streaming Data Construction Pipeline: We align egocentric videos with raw gaze trajectories using fixation extraction, region-specific visual prompting, and scanpath construction to generate spatio-temporally grounded QA pairs. This process is human-verified.
➡️ Comprehensive Evaluation of State-of-the-Art MLLMs: Across all gaze-conditioned streaming tasks, we highlight fundamental limits of current MLLMs. All MLLMs fall far below human performance. Models particularly struggle with temporal continuity, gaze grounding, and proactive prediction.

Zaid Khan retweeted
Elias Stengel-Eskin @EliasEskin ·
🚨 Excited to share Reasoning Execution by Multiple Listeners (REMuL), a multi-party training method for faithful reasoning. Consistently boosts faithfulness evals (hint attribution, early answering, mistake injection) across diverse reasoning tasks while maintaining accuracy!
➡️ Faithfulness is key for CoT interpretability but current LLMs produce unfaithful reasoning that is hard to follow, with standard outcome-focused RL hurting faithfulness.
➡️ REMuL approaches faithfulness through the lens of executability. A CoT is faithful if independent "listener" models can follow/execute a truncated CoT prefix and reliably arrive at the same conclusion as the “speaker” model.
➡️ REMuL trains the speaker via GRPO to produce reasoning that achieves consistent answers among listeners, while maintaining correctness via masked supervised finetuning.
➡️ Interestingly, REMuL's multi-party training generalizes better. Directly optimizing for faithfulness metrics improves those metrics alone, but not others, while REMuL improves across metrics!
🧵👇
[image attached]
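Editor's note: the listener-consistency idea described above (truncate the speaker's CoT, have independent listeners continue each prefix to an answer, and reward agreement with the speaker) can be sketched as a simple reward function. `listener` below is a stand-in callable, and the actual truncation scheme and reward shaping used by REMuL may differ.

```python
# Minimal sketch of a listener-consistency reward, under the assumptions stated above.
from typing import Callable, List

def listener_consistency_reward(
    question: str,
    cot_steps: List[str],                      # speaker's reasoning, split into steps
    speaker_answer: str,
    listeners: List[Callable[[str], str]],     # each maps a prompt to an answer string
) -> float:
    if not cot_steps or not listeners:
        return 0.0
    agreements = []
    # Evaluate every truncated prefix of the CoT (step 1, steps 1-2, ...).
    for k in range(1, len(cot_steps) + 1):
        prefix = "\n".join(cot_steps[:k])
        prompt = f"{question}\nReasoning so far:\n{prefix}\nAnswer:"
        for listen in listeners:
            agreements.append(float(listen(prompt).strip() == speaker_answer.strip()))
    # Fraction of (prefix, listener) pairs that recover the speaker's answer.
    return sum(agreements) / len(agreements)
```

In the thread's framing, this reward would be the faithfulness signal optimized via GRPO, with masked supervised finetuning keeping final-answer correctness intact.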
Zaid Khan retweeted
Runchu Tian @Runchu_Tian ·
🎉Excited to share that I’ll be starting my PhD at UNC Chapel Hill @UNC, joining MURGe-Lab, advised by Prof. Mohit Bansal @mohitban47! I’ll be working on multimodality, reasoning, and AI agents. New chapter begins! #PhD #NLP #UNCCH #Multimodal
[image attached]
Zaid Khan retweeted
Archiki Prasad @ArchikiPrasad ·
🚨 I’m on the 2026 Research Scientist Job Market! I am a PhD student at UNC Chapel Hill (advised by @mohitban47) and recipient of the Apple Scholars in AI/ML PhD Fellowship. My research centers around:
🔸 Reasoning & RL/Post-Training: Evaluating and interpreting the reasoning process, and improving post-training and alignment through self-generated and reward-based signals (Intrinsic Dim., ReCEVAL, ScPO, LASeR).
🔸 Agents & Planning: Designing adaptive agent frameworks that use extra test-time compute & reasoning upon failure (ADaPT, System-1.x, PRInTS).
🔸 Reward & Skill Discovery in Code: Leveraging execution signals to build reliable rewards, automate debugging, and discover abstractions in code (UTGen, ReGAL).
Prev (Research Intern): Google DeepMind, Meta FAIR, Allen Institute for AI (AI2), and Adobe Research.
Feel free to reach out via DM or email if you’re interested, have leads, or would like to connect!
🌐 archiki.github.io
📧 archiki@cs.unc.edu
#NLP #AI #JobSearch
Zaid Khan retweeted
Zun Wang @ZunWang919 ·
🚀 Excited to share AnchorWeave — a local-memory-augmented framework for world-consistent long-horizon video generation.
- Global 3D reconstruction as memory accumulates cross-view misalignment and contaminates conditioning signals.
- We replace a single noisy global 3D memory with multiple retrieved local 3D memories and learn to weave them.
- Stronger long-horizon scene consistency and generalization ability.
🧵👇
Zaid Khan retweeted
Archiki Prasad @ArchikiPrasad ·
🚨 Excited to share our new work viewing reasoning strategies as teaching tools: for a fixed target model, which CoT strategies best support learning and generalization?
✨ Our answer is intrinsic dimensionality (the minimum effective capacity a model needs to solve the task). Somewhat counterintuitively, adding CoT – which requires generating longer and more structured outputs – can reduce learning complexity. Good reasoning compresses the task, i.e., it reduces the degrees of freedom the model needs to map inputs to correct solutions. 🧵⬇️ (1/5)
[image attached]
Zaid Khan retweeted
Mohit Bansal @mohitban47 ·
🚨 Check out Adaptive Visual Imagination Control (AVIC), an analysis-driven framework for adaptive test-time scaling via world model imagination (when & how much is useful & not misleading) in visual spatial reasoning.
▪️ Always-on imagination is not always helpful, and can in fact even mislead reasoning.
▪️ AVIC introduces selective imagination control—deciding when and how much to imagine based on the query and available visual evidence.
▪️ This leads to strong accuracy gains on visual spatial reasoning & embodied navigation tasks with far fewer world-model calls and tokens.
👇👇
Shoubin Yu @shoubin621

🚨 Excited to share AVIC — an analysis and framework for adaptive test-time scaling with world model imagination in visual spatial reasoning.
📉 Always-on visual imagination is often unnecessary, or even misleading.
📈 AVIC treats visual imagination as a selective, query-dependent test-time resource—showing that better spatial reasoning comes from deciding when and how much to imagine, not from imagining more.
➡️ Across spatial reasoning & embodied navigation, we get stronger accuracy with far fewer world-model calls and tokens.
🧵👇 [1/6]

Zaid Khan retweeted
Shoubin Yu @shoubin621 ·
🚨 Excited to share AVIC — an analysis and framework for adaptive test-time scaling with world model imagination in visual spatial reasoning.
📉 Always-on visual imagination is often unnecessary, or even misleading.
📈 AVIC treats visual imagination as a selective, query-dependent test-time resource—showing that better spatial reasoning comes from deciding when and how much to imagine, not from imagining more.
➡️ Across spatial reasoning & embodied navigation, we get stronger accuracy with far fewer world-model calls and tokens.
🧵👇 [1/6]
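Editor's note: "deciding when and how much to imagine" is a gating decision at test time. The sketch below is a hypothetical illustration of such a controller; `vlm_answer` and `world_model_rollout` are stand-in callables, and the real AVIC decision rule is not reproduced here.

```python
# Hypothetical sketch of query-dependent imagination gating: only call the (expensive)
# world model while the base model remains unconfident and a budget allows it.
from typing import Callable, List, Tuple

def answer_with_selective_imagination(
    question: str,
    observations: List[str],
    vlm_answer: Callable[[str, List[str]], Tuple[str, float]],  # -> (answer, confidence)
    world_model_rollout: Callable[[List[str]], str],            # -> imagined next view
    confidence_threshold: float = 0.8,
    max_imagination_steps: int = 3,
) -> str:
    answer, conf = vlm_answer(question, observations)
    steps = 0
    # Imagine only while the model is unsure and the step budget is not exhausted.
    while conf < confidence_threshold and steps < max_imagination_steps:
        observations = observations + [world_model_rollout(observations)]
        answer, conf = vlm_answer(question, observations)
        steps += 1
    return answer
```

The point of the gate is exactly the tweet's claim: fewer world-model calls and tokens, spent only where the query and available visual evidence actually need them.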
Zaid Khan retweeted
Mohit Bansal @mohitban47 ·
🚨 If you are looking for a very strong researcher who can bridge the gap between factual reliability and complex multimodal reasoning, definitely check out David (he is a Google PhD Fellow with several useful contributions in faithfulness & hallucination mitigation, fine-grained attribution, multimodal retrieval, etc.) 👇👇
David Wan @meetdavidwan

🚀 I'm on the 2026 Research Scientist Job Market! I am a Google PhD Fellow at UNC (advised by @mohitban47). I work on Faithful and Multimodal AI, focusing on reducing hallucinations and improving reasoning in generation tasks by:
🔹 Faithfulness & Hallucination Mitigation: Developing metrics and methods to ensure model outputs are factually consistent (e.g., FactPEGASUS, PrefixNLI).
🔹 Fine-grained Attribution & RAG: Creating frameworks that allow models to cite their sources and reason transparently (e.g., GenerationPrograms, LAQuer).
🔹 Multimodal Reasoning & Retrieval: Grounding vision-language models to reduce hallucinations in cross-modal tasks (e.g., CLaMR, Contrastive Region Guidance).
Prev Intern: Google, Meta, Salesforce, Amazon.
🔗 meetdavidwan.github.io
#NLP #AI #JobSearch

Zaid Khan retweeted
Wasu Top Piriyakulkij @topwasu ·
New paper: Hierarchical Neural Options + Abstract World Model, w/ @WLehrach, @ellisk_kellis*, @sirbayes* (*equal advising)
Q1: How can we move beyond the "one-step trap in AI research" and build a temporally abstract model of the world?
Q2: How can we acquire increasingly complex skills over time by building on existing ones?
A1+A2: We propose AgentOWL, an agent that jointly learns hierarchical neural options and an abstract world model.
[image attached]