Xuan He
@XuanHe21

12 posts

CS Ph.D. @ UIUC. Previously: undergrad @Tsinghua_Uni (2021–2025); intern at TIGER-Lab @UWaterloo | PLUS-Lab @UCLA

Beijing, China · Joined October 2023
78 Following · 19 Followers
Xuan He reposted
Dongfu Jiang @DongfuJiang
How much of “video understanding” is actually… not about video? We found that 40–60% of questions in popular benchmarks (VideoMME, MMVU) can be answered without watching the video. And it gets worse as models scale. 🧵

This problem doesn’t just affect evaluation. It’s baked into post-training data. So when you do SFT / RL, a large portion of the “gain” actually comes from better language priors, not better visual grounding.

We propose a simple fix: VidGround
👉 Filter out text-only-answerable questions
👉 Keep only visually grounded data

That’s it. Surprisingly, less data works better:
• Only 69.1% of the data
• +6.2 pts improvement
• Outperforms more complex RL pipelines

Key takeaway:
- If your data allows shortcutting, your model will learn shortcuts.
- For video understanding: grounding signal > data scale > algorithm tricks

📄 huggingface.co/papers/2604.05…
🌐 vidground.etuagi.com
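A minimal sketch of the filtering idea described above, assuming a multiple-choice QA layout; the model name, prompt, and field names are placeholders, not the released VidGround pipeline:

```python
# Hypothetical "blind filtering" sketch: drop questions a text-only LLM already
# answers correctly without seeing the video. Model name and fields are placeholders.
from transformers import pipeline

llm = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model

def answered_without_video(example: dict) -> bool:
    """True if the blind model picks the right option from the question text alone."""
    prompt = (
        "Answer with a single option letter.\n"
        f"Question: {example['question']}\n"
        f"Options: {example['options']}\n"
        "Answer:"
    )
    out = llm(prompt, max_new_tokens=4, do_sample=False, return_full_text=False)
    guess = out[0]["generated_text"].strip()[:1].upper()
    return guess == example["answer"].strip()[:1].upper()

def keep_visually_grounded(dataset: list[dict]) -> list[dict]:
    """Keep only examples the blind model gets wrong, i.e. ones that need the video."""
    return [ex for ex in dataset if not answered_without_video(ex)]
```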
Xuan He reposted
Microsoft Research @MSFTResearch
PlugMem transforms AI agents’ interaction histories into structured, reusable knowledge. It integrates with any agent, supports diverse tasks and memory types, and maximizes decision quality while significantly reducing memory token use: msft.it/6017Qc9vv
Xuan He reposted
Dongfu Jiang @DongfuJiang
🔥 It’s time to bring RL to generative video evaluation! Introducing VideoScore2 — a model that not only generates scores for generative videos but also produces detailed, high-quality reasoning traces.

🚀 To build VideoScore2, we curated prompts from 5 sources, covering both general scenarios and complex ones like OCR, camera motion, and multi-action. Each video is evaluated along 3 aspects: visual quality, text alignment, and physical consistency.

👏 Our training dataset, VideoFeedback2, includes 2,933 unique prompts, 27,168 generated videos, and 81,504 scores with rationales. We used Claude-4-Sonnet to expand the annotated rationales and scores into detailed thinking traces, which were then used in the SFT stage to teach the model to produce long-CoT evaluations. We further applied GRPO-based RL training to align the traces with the correct annotated scores.

📈 Results show strong competitiveness on our in-domain VideoScoreBench-V2, and superior performance across 4 out-of-domain video evaluation benchmarks — demonstrating VideoScore2’s potential both for evaluation and as a reward model for generative video.
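A rough sketch of what a score-matching reward for the GRPO stage described above could look like; the trace format, aspect names, and tolerance are assumptions for illustration, not the VideoScore2 implementation:

```python
# Illustrative score-matching reward for an RL (GRPO-style) stage: parse per-aspect
# scores from a generated reasoning trace and compare them to annotated scores.
import re

ASPECTS = ("visual quality", "text alignment", "physical consistency")

def parse_scores(trace: str) -> dict[str, float]:
    """Extract 'aspect: <number>' style scores from the trace (format is assumed)."""
    scores = {}
    for aspect in ASPECTS:
        m = re.search(rf"{aspect}\s*[:=]\s*(\d+(?:\.\d+)?)", trace, flags=re.IGNORECASE)
        if m:
            scores[aspect] = float(m.group(1))
    return scores

def score_reward(trace: str, gold: dict[str, float], tol: float = 0.5) -> float:
    """Fraction of aspects whose predicted score lies within `tol` of the annotation."""
    pred = parse_scores(trace)
    hits = sum(1 for a in ASPECTS if a in pred and abs(pred[a] - gold[a]) <= tol)
    return hits / len(ASPECTS)
```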
Xuan He reposted
Dongfu Jiang @DongfuJiang
Introducing VerlTool, a unified and easy-to-extend tool-agent training framework based on verl.

Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and ToolRL. While these achieve impressive performance, their training code is either not fully open-sourced or too difficult to modify and customize with new tools, creating unexpectedly high engineering costs for the community when exploring new ideas. To address these issues and reduce engineering overhead, we propose verl-tool.

Key features:
1. 🔧 Complete decoupling of actor rollout and environment interaction: we use verl as a submodule to benefit from ongoing verl repo updates. All tool calling is integrated via a unified API, so you can add a new tool by simply adding a Python file and testing it independently.
2. 🌍 Tool-as-environment paradigm: each tool interaction can modify the environment state, and we store and reload environment states for each trajectory. For each training, you can launch
3. ⚡ Native RL framework for tool-calling agents: verl-tool natively supports multi-turn interactive loops between agents and their tool environments.
4. 📊 User-friendly evaluation suite: launch your trained model with an OpenAI-compatible API alongside the tool server; simply send questions and get final outputs, with all interactions handled internally.

We've successfully reproduced the ToRL results using our verl-tool framework, demonstrating its correctness and achieving comparable performance on mathematical benchmarks.

VerlTool is an active, ongoing project! We aim to incorporate more tools covering a wide range of use cases and expect they can be trained together in a single framework. Suggestions and contributions are highly welcome!

Check out our GitHub: github.com/TIGER-AI-Lab/v…
More details: 👇 (0/4)
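To give a sense of the "add a Python file" tool interface described above, here is a minimal hypothetical sketch; the class shape and method names are assumptions for illustration, not the actual verl-tool API:

```python
# Hypothetical "one tool = one Python file" sketch. The class shape below is
# illustrative only; it is NOT the real verl-tool interface.
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def _safe_eval(node):
    """Evaluate a small arithmetic AST without calling eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    raise ValueError("unsupported expression")

class CalculatorTool:
    """A self-contained tool: take the agent's action string plus the per-trajectory
    environment state, return an observation string and the (possibly updated) state."""
    name = "calculator"

    def execute(self, action: str, state: dict) -> tuple[str, dict]:
        try:
            result = _safe_eval(ast.parse(action, mode="eval").body)
            return f"result: {result}", state   # observation fed back to the agent
        except Exception as exc:
            return f"error: {exc}", state       # errors become observations too
```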
Xuan He reposted
Ke Yang @EmpathYang
Imagine AI assistants on your smart glasses or laptop proactively bridging your info gaps!
🗺️ Entering a building? Get instant floor plans for seamless navigation.
🧑‍💻 In lectures? Receive concise explanations to stay on track.

Our new preprint introduces Just-In-Time Information Recommendation (JIR)—a new and promising information-service paradigm that delivers contextually relevant info exactly when you need it.

🧐 Key highlights:
-> A mathematical framework and evaluation metrics for JIR systems
-> JIR-Arena: the first multimodal benchmark dataset with diverse, info-request-heavy scenarios
-> A prototypical JIR system, evaluated for recall, precision, timeliness, and relevance

🎓 Baseline findings: a JIR system built on foundation models shows promising precision but struggles with recall and efficient content retrieval.

💼 Paper: arxiv.org/abs/2505.13550
💼 Code: github.com/EmpathYang/JIR…
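A toy sketch of the recall/precision/timeliness framing mentioned above; the data layout and the timeliness window are assumptions, not the JIR-Arena implementation:

```python
# Toy recall/precision with a timeliness window: a need counts as met only if the
# matching item is recommended within `window` seconds of when the need arises.
from dataclasses import dataclass

@dataclass
class InfoEvent:
    item: str   # a piece of information (needed or recommended)
    t: float    # timestamp in seconds

def jir_metrics(recommended: list[InfoEvent], needed: list[InfoEvent],
                window: float = 10.0) -> dict[str, float]:
    def in_time(r: InfoEvent, n: InfoEvent) -> bool:
        return r.item == n.item and 0.0 <= r.t - n.t <= window

    met = [n for n in needed if any(in_time(r, n) for r in recommended)]
    useful = [r for r in recommended if any(in_time(r, n) for n in needed)]
    return {
        "recall": len(met) / len(needed) if needed else 0.0,
        "precision": len(useful) / len(recommended) if recommended else 0.0,
    }
```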
Xuan He reposted
Wenhu Chen @WenhuChen
Meet VideoScore, the FIRST fine-grained video reward/metric model! VideoScore can be used to rate synthesized videos or to reward your T2V model through RLHF.

We curate VideoFeedback, containing 37K human-annotated multi-aspect ratings for synthesized videos. VideoScore was trained on this dataset to simulate human judgement on five aspects: visual quality (VQ), temporal consistency (TC), dynamic degree (DD), text-to-video alignment (TVA), and factual consistency (FC).

VideoScore shows strong correlation with human raters. On four datasets (VideoFeedback-test, EvalCrafter (@yshan2u), VBench (@liuziwei7), and GenAI-bench), VideoScore universally beats other metric models by a huge margin. Notably, it outperforms "GPT-4o as judge" by 50% on our eval set in terms of Spearman correlation.

Arxiv: arxiv.org/abs/2406.15252
We release everything, including data and models.
Website: tiger-ai-lab.github.io/VideoScore/
Demo: huggingface.co/spaces/TIGER-L…

Work led by @XuanHe21 and @DongfuJiang. Lots of students contributed to the annotation pipeline. We also thank @datacurve for providing compute.
[Quoted post from Dongfu Jiang @DongfuJiang: the 📽️VideoScore / 🎞️VideoFeedback announcement; full text appears in the reposted entry below.]
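For reference, the human-agreement numbers above are Spearman correlations between model scores and human ratings, which can be computed per aspect roughly like this (illustrative only; the data layout is an assumption):

```python
# Illustrative: per-aspect Spearman correlation between model scores and human ratings.
from scipy.stats import spearmanr

ASPECTS = ["VQ", "TC", "DD", "TVA", "FC"]

def per_aspect_spearman(model_scores: dict, human_scores: dict) -> dict:
    """Each dict maps aspect -> list of scores over the same set of videos."""
    return {a: spearmanr(model_scores[a], human_scores[a])[0] for a in ASPECTS}
```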
Xuan He @XuanHe21
Check out our new work! We propose 📽️VideoScore, the first-ever fine-grained evaluation/reward model for text-to-video (T2V) generation, and 🎞️VideoFeedback, a large-scale and fine-grained human-feedback dataset for T2V generations!
[Quoted post from Dongfu Jiang @DongfuJiang: the 📽️VideoScore / 🎞️VideoFeedback announcement; full text appears in the reposted entry below.]
Xuan He reposted
Dongfu Jiang @DongfuJiang
🔥 Thrilled to announce 📽️VideoScore, the first-ever fine-grained and reliable evaluator/reward model for text-to-video generation tasks, trained on 🎞️VideoFeedback, a large-scale and fine-grained human-feedback dataset for text-to-video (T2V) generations.

🤔 Why VideoScore?
1. From the reward-modeling perspective, we have seen the great success of RLHF approaches for LLMs, VLMs, T2I models, etc. However, fine-grained human feedback for video generation (T2V) has been missing in the community.
2. From the evaluation perspective, feature-based tools like DINO, UMT, GRiT, etc. can hardly be regarded as representing human preference.

👏 How do we fill this gap? The answer is clear: collect human feedback. This results in our 37.6K-example VideoFeedback dataset, where each example is rated across 5 dimensions:
1. Visual quality
2. Temporal consistency
3. Dynamic degree
4. Text-to-video alignment
5. Factual consistency

📽️ VideoScore is:
1. Easy to use, with a few lines of code compatible with Hugging Face Transformers.
2. Well aligned with human preference, surpassing even GPT-4o and Gemini by a large margin in our experiments.
3. Fine-grained, outputting scores along the 5 key dimensions for video evaluation.

🙌 We believe this is a great contribution to the T2V generation community, and we expect this direction to be pushed forward. T2V models shouldn't be left behind in the RLHF area.

Work co-led by the amazing undergrad Xuan He from Tsinghua and me. Check out our paper and demo below:
📎 Paper: arxiv.org/abs/2406.15252
🤗 Demo: huggingface.co/spaces/TIGER-L…
More insights 👇 (0/7):
Xuan He reposted
Dongfu Jiang @DongfuJiang
Can we enhance large multimodal models with multi-image reasoning ability via instruction tuning? 🔥 Thrilled to announce Mantis! A family of Large Multimodal Models (LMMs) that supports an interleaved text-image input format. Blog: tiger-ai-lab.github.io/Blog/mantis 👇