Xuan He
@XuanHe21

12 posts

CS Ph.D. @ UIUC. Previously: undergrad @Tsinghua_Uni (2021–2025); intern at TIGER-Lab @UWaterloo | PLUS-Lab @UCLA

Beijing, China · Joined October 2023
78 Following · 19 Followers
Xuan He reposted
Dongfu Jiang @DongfuJiang
How much of “video understanding” is actually… not about video? We found that 40–60% of questions in popular benchmarks (VideoMME, MMVU) can be answered without watching the video. And it gets worse as models scale. 🧵

This problem doesn’t just affect evaluation. It’s baked into post-training data. So when you do SFT / RL, a large portion of the “gain” actually comes from better language priors, not better visual grounding.

We propose a simple fix: VidGround
👉 Filter out text-only-answerable questions
👉 Keep only visually grounded data

That’s it. Surprisingly, less data works better:
• Only 69.1% of the data
• +6.2 pts improvement
• Outperforms more complex RL pipelines

Key takeaway:
- If your data allows shortcutting, your model will learn shortcuts.
- For video understanding: grounding signal > data scale > algorithm tricks

📄 huggingface.co/papers/2604.05…
🌐 vidground.etuagi.com
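A minimal sketch of the filtering idea described above, assuming a multiple-choice QA layout; the model name, prompt, and field names are placeholders, not the released VidGround pipeline:

```python
# Hypothetical "blind filtering" sketch: drop questions a text-only LLM already
# answers correctly without seeing the video. Model name and fields are placeholders.
from transformers import pipeline

llm = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model

def answered_without_video(example: dict) -> bool:
    """True if the blind model picks the right option from the question text alone."""
    prompt = (
        "Answer with a single option letter.\n"
        f"Question: {example['question']}\n"
        f"Options: {example['options']}\n"
        "Answer:"
    )
    out = llm(prompt, max_new_tokens=4, do_sample=False, return_full_text=False)
    guess = out[0]["generated_text"].strip()[:1].upper()
    return guess == example["answer"].strip()[:1].upper()

def keep_visually_grounded(dataset: list[dict]) -> list[dict]:
    """Keep only examples the blind model gets wrong, i.e. ones that need the video."""
    return [ex for ex in dataset if not answered_without_video(ex)]
```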
Xuan He reposted
Microsoft Research @MSFTResearch
PlugMem transforms AI agents’ interaction histories into structured, reusable knowledge. It integrates with any agent, supports diverse tasks and memory types, and maximizes decision quality while significantly reducing memory token use: msft.it/6017Qc9vv
Xuan He reposted
Dongfu Jiang @DongfuJiang
🔥 It’s time to bring RL to generative video evaluation! Introducing VideoScore2 — a model that not only generates scores for generative videos but also produces detailed, high-quality reasoning traces.

🚀 To build VideoScore2, we curated prompts from 5 sources, covering both general scenarios and complex ones like OCR, camera motion, and multi-action. Each video is evaluated along 3 aspects: visual quality, text alignment, and physical consistency.

👏 Our training dataset, VideoFeedback2, includes 2,933 unique prompts, 27,168 generated videos, and 81,504 scores with rationales. We used Claude-4-Sonnet to expand the annotated rationales and scores into detailed thinking traces, which were then used in the SFT stage to teach the model to produce long-CoT evaluations. We further applied GRPO-based RL training to align the traces with the correct annotated scores.

📈 Results show strong competitiveness on our in-domain VideoScoreBench-V2, and superior performance across 4 out-of-domain video evaluation benchmarks — demonstrating VideoScore2’s potential both for evaluation and as a reward model for generative video.
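A rough sketch of what a score-matching reward for the GRPO stage described above could look like; the trace format, aspect names, and tolerance are assumptions for illustration, not the VideoScore2 implementation:

```python
# Illustrative score-matching reward for an RL (GRPO-style) stage: parse per-aspect
# scores from a generated reasoning trace and compare them to annotated scores.
import re

ASPECTS = ("visual quality", "text alignment", "physical consistency")

def parse_scores(trace: str) -> dict[str, float]:
    """Extract 'aspect: <number>' style scores from the trace (format is assumed)."""
    scores = {}
    for aspect in ASPECTS:
        m = re.search(rf"{aspect}\s*[:=]\s*(\d+(?:\.\d+)?)", trace, flags=re.IGNORECASE)
        if m:
            scores[aspect] = float(m.group(1))
    return scores

def score_reward(trace: str, gold: dict[str, float], tol: float = 0.5) -> float:
    """Fraction of aspects whose predicted score lies within `tol` of the annotation."""
    pred = parse_scores(trace)
    hits = sum(1 for a in ASPECTS if a in pred and abs(pred[a] - gold[a]) <= tol)
    return hits / len(ASPECTS)
```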
Xuan He reposted
Dongfu Jiang @DongfuJiang
Introducing VerlTool, a unified and easy-to-extend tool-agent training framework based on verl.

Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and ToolRL. While these achieve impressive performance, their training code is either not fully open-sourced or too difficult to modify and customize with new tools, creating unexpectedly high engineering costs for the community when exploring new ideas. To address these issues and reduce engineering overhead, we propose verl-tool.

Key features:
1. 🔧 Complete decoupling of actor rollout and environment interaction: we use verl as a submodule to benefit from ongoing verl repo updates. All tool calling is integrated via a unified API, so you can add a new tool by simply adding a Python file and testing it independently.
2. 🌍 Tool-as-environment paradigm: each tool interaction can modify the environment state, and we store and reload environment states for each trajectory. For each training, you can launch
3. ⚡ Native RL framework for tool-calling agents: verl-tool natively supports multi-turn interactive loops between agents and their tool environments.
4. 📊 User-friendly evaluation suite: launch your trained model with an OpenAI-compatible API alongside the tool server; simply send questions and get final outputs, with all interactions handled internally.

We've successfully reproduced the ToRL results using our verl-tool framework, demonstrating its correctness and achieving comparable performance on mathematical benchmarks.

VerlTool is an active, ongoing project! We aim to incorporate more tools covering a wide range of use cases and expect they can be trained together in a single framework. Suggestions and contributions are highly welcome!

Check out our GitHub: github.com/TIGER-AI-Lab/v…
More details: 👇 (0/4)
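To give a sense of the "add a Python file" tool interface described above, here is a minimal hypothetical sketch; the class shape and method names are assumptions for illustration, not the actual verl-tool API:

```python
# Hypothetical "one tool = one Python file" sketch. The class shape below is
# illustrative only; it is NOT the real verl-tool interface.
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def _safe_eval(node):
    """Evaluate a small arithmetic AST without calling eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    raise ValueError("unsupported expression")

class CalculatorTool:
    """A self-contained tool: take the agent's action string plus the per-trajectory
    environment state, return an observation string and the (possibly updated) state."""
    name = "calculator"

    def execute(self, action: str, state: dict) -> tuple[str, dict]:
        try:
            result = _safe_eval(ast.parse(action, mode="eval").body)
            return f"result: {result}", state   # observation fed back to the agent
        except Exception as exc:
            return f"error: {exc}", state       # errors become observations too
```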
Xuan He reposted
Ke Yang @EmpathYang
Imagine AI assistants on your smart glasses or laptop proactively bridging your info gaps!
🗺️ Entering a building? Get instant floor plans for seamless navigation.
🧑‍💻 In lectures? Receive concise explanations to stay on track.

Our new preprint introduces Just-In-Time Information Recommendation (JIR)—a new and promising information-service paradigm that delivers contextually relevant info exactly when you need it.

🧐 Key highlights:
-> A mathematical framework and evaluation metrics for JIR systems
-> JIR-Arena: the first multimodal benchmark dataset with diverse, info-request-heavy scenarios
-> A prototypical JIR system, evaluated for recall, precision, timeliness, and relevance

🎓 Baseline findings: a JIR system built on foundation models shows promising precision but struggles with recall and efficient content retrieval.

💼 Paper: arxiv.org/abs/2505.13550
💼 Code: github.com/EmpathYang/JIR…
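A toy sketch of the recall/precision/timeliness framing mentioned above; the data layout and the timeliness window are assumptions, not the JIR-Arena implementation:

```python
# Toy recall/precision with a timeliness window: a need counts as met only if the
# matching item is recommended within `window` seconds of when the need arises.
from dataclasses import dataclass

@dataclass
class InfoEvent:
    item: str   # a piece of information (needed or recommended)
    t: float    # timestamp in seconds

def jir_metrics(recommended: list[InfoEvent], needed: list[InfoEvent],
                window: float = 10.0) -> dict[str, float]:
    def in_time(r: InfoEvent, n: InfoEvent) -> bool:
        return r.item == n.item and 0.0 <= r.t - n.t <= window

    met = [n for n in needed if any(in_time(r, n) for r in recommended)]
    useful = [r for r in recommended if any(in_time(r, n) for n in needed)]
    return {
        "recall": len(met) / len(needed) if needed else 0.0,
        "precision": len(useful) / len(recommended) if recommended else 0.0,
    }
```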
Xuan He reposted
Wenhu Chen @WenhuChen
Meet VideoScore, the FIRST fine-grained video reward/metric model! VideoScore can be used to rate synthesized videos or to reward your T2V model through RLHF.

We curate VideoFeedback, containing 37K human-annotated multi-aspect ratings for synthesized videos. VideoScore was trained on this dataset to simulate human judgement on five aspects: visual quality (VQ), temporal consistency (TC), dynamic degree (DD), text-to-video alignment (TVA), and factual consistency (FC).

VideoScore shows strong correlation with human raters. On four datasets (VideoFeedback-test, EvalCrafter (@yshan2u), VBench (@liuziwei7), and GenAI-bench), VideoScore universally beats other metric models by a huge margin. Notably, it outperforms "GPT-4o as judge" by 50% on our eval set in terms of Spearman correlation.

Arxiv: arxiv.org/abs/2406.15252
We release everything, including data and models.
Website: tiger-ai-lab.github.io/VideoScore/
Demo: huggingface.co/spaces/TIGER-L…

Work led by @XuanHe21 and @DongfuJiang. Lots of students contributed to the annotation pipeline. We also thank @datacurve for providing compute.
[Quoted post from Dongfu Jiang @DongfuJiang: the 📽️VideoScore / 🎞️VideoFeedback announcement; full text appears in the reposted entry below.]
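For reference, the human-agreement numbers above are Spearman correlations between model scores and human ratings, which can be computed per aspect roughly like this (illustrative only; the data layout is an assumption):

```python
# Illustrative: per-aspect Spearman correlation between model scores and human ratings.
from scipy.stats import spearmanr

ASPECTS = ["VQ", "TC", "DD", "TVA", "FC"]

def per_aspect_spearman(model_scores: dict, human_scores: dict) -> dict:
    """Each dict maps aspect -> list of scores over the same set of videos."""
    return {a: spearmanr(model_scores[a], human_scores[a])[0] for a in ASPECTS}
```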
Xuan He @XuanHe21
Check out our new work! We propose 📽️VideoScore, the first-ever fine-grained evaluation/reward model for text-to-video (T2V) generation, and 🎞️VideoFeedback, a large-scale and fine-grained human-feedback dataset for T2V generations!
[Quoted post from Dongfu Jiang @DongfuJiang: the 📽️VideoScore / 🎞️VideoFeedback announcement; full text appears in the reposted entry below.]
Xuan He reposted
Dongfu Jiang @DongfuJiang
🔥 Thrilled to announce 📽️VideoScore, the first-ever fine-grained and reliable evaluator/reward model for text-to-video generation tasks, trained on 🎞️VideoFeedback, a large-scale and fine-grained human-feedback dataset for text-to-video (T2V) generations.

🤔 Why VideoScore?
1. From the reward-modeling perspective, we have seen the great success of RLHF approaches for LLMs, VLMs, T2I models, etc. However, fine-grained human feedback for video generation (T2V) has been missing in the community.
2. From the evaluation perspective, feature-based tools like DINO, UMT, GRiT, etc. can hardly be regarded as representing human preference.

👏 How do we fill this gap? The answer is clear: collect human feedback. This results in our 37.6K-example VideoFeedback dataset, where each example is rated across 5 dimensions:
1. Visual quality
2. Temporal consistency
3. Dynamic degree
4. Text-to-video alignment
5. Factual consistency

📽️ VideoScore is:
1. Easy to use, with a few lines of code compatible with Hugging Face Transformers.
2. Well aligned with human preference, surpassing even GPT-4o and Gemini by a large margin in our experiments.
3. Fine-grained, outputting scores along the 5 key dimensions for video evaluation.

🙌 We believe this is a great contribution to the T2V generation community, and we expect this direction to be pushed forward. T2V models shouldn't be left behind in the RLHF area.

Work co-led by the amazing undergrad Xuan He from Tsinghua and me. Check out our paper and demo below:
📎 Paper: arxiv.org/abs/2406.15252
🤗 Demo: huggingface.co/spaces/TIGER-L…
More insights 👇 (0/7):
Xuan He reposted
Dongfu Jiang @DongfuJiang
Can we enhance large multimodal models with multi-image reasoning ability via instruction tuning? 🔥 Thrilled to announce Mantis! A family of Large Multimodal Models (LMMs) that supports an interleaved text-image input format. Blog: tiger-ai-lab.github.io/Blog/mantis 👇