Chaoyou Fu

18 posts


@brady202406

Assistant Professor @NJU1902. Lead MME, VITA, Awesome-MLLM.

Joined June 2024
6 Following · 14 Followers
Chaoyou Fu reposted
Chaoyou Fu @brady202406
@Chain_AlphaX @_akhaliq 😉 Yeap. We aim to tackle the problem (benchmark scores are getting higher, but real understanding is not).
0 replies · 0 reposts · 0 likes · 2 views
Chain Alpha @Chain_AlphaX
@_akhaliq woah, video understanding is getting serious now.
1 reply · 0 reposts · 1 like · 6 views
Chaoyou Fu reposted
AK @_akhaliq
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
paper: huggingface.co/papers/2604.05…
2 replies · 9 reposts · 33 likes · 6.6K views
Chaoyou Fu @brady202406
@khalide_f1 @_akhaliq Exactly — that’s also what we observed. Current reasoning models rely heavily on textual cues. With subtitles, “thinking” tends to help. But without text, adding “thinking” can even hurt performance. It suggests that visual reasoning is still far from mature.
0 replies · 0 reposts · 0 likes · 8 views
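Neither the exact harness nor the model API appears in this thread; the sketch below (Python) shows the kind of subtitles × "thinking" ablation the observation implies. query_model and the sample fields are hypothetical placeholders, not code from the paper.

    # Hypothetical ablation harness for the observation above: accuracy is
    # compared across the four (subtitles on/off) x (thinking on/off) settings.
    from itertools import product

    def query_model(video_id, question, use_subtitles, use_thinking):
        # Placeholder stub so the sketch runs end-to-end; swap in a real
        # video-MLLM call here. Not an API from the paper or any library.
        return "A"

    def run_ablation(samples):
        # samples: dicts with "video_id", "question", "answer" (hypothetical schema)
        results = {}
        for use_subtitles, use_thinking in product([False, True], repeat=2):
            correct = sum(
                query_model(s["video_id"], s["question"], use_subtitles, use_thinking)
                == s["answer"]
                for s in samples
            )
            results[(use_subtitles, use_thinking)] = correct / len(samples)
        return results

    samples = [{"video_id": "v1", "question": "Q?", "answer": "A"}]
    print(run_ablation(samples))  # one accuracy per condition

If the pattern in the tweet holds, accuracy at (subtitles=True, thinking=True) would exceed (True, False), while (False, True) could fall below (False, False).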
AI Expert Khalid @khalide_f1
@_akhaliq That subtitle dependency is the real tell. If performance drops without text cues, the model is not really understanding the video yet. Feels like we are still early in true visual reasoning for AI 🤔
1 reply · 0 reposts · 1 like · 54 views
Chaoyou Fu reposted
Yuhao Dong @dyhTHU
🔥 Excited to share Video-MME-v2! 🔥
We built it to tackle a growing issue: video understanding benchmarks are getting saturated.
🏃🏻 Over 3,300 human-hours, nearly a year of effort
🌟 A new design with a progressive hierarchy + group-based nonlinear evaluation
What we found:
👉 Human: 90.7 vs 👉 Gemini-3-Pro: 49.4
The gap is still huge.
Explore more at:
Page: video-mme-v2.netlify.app
Paper: arxiv.org/pdf/2604.05015
2 replies · 8 reposts · 24 likes · 5.9K views
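The posts name "group-based nonlinear evaluation" without defining it. A common nonlinear rule in recent benchmarks is all-or-nothing credit per question group; the Python sketch below assumes that rule, and the field names are hypothetical, so the actual Video-MME-v2 scoring may differ.

    # Minimal sketch of one group-based nonlinear scoring rule: a group earns
    # credit only if every question in it is answered correctly.
    from collections import defaultdict

    def group_score(records):
        # records: dicts with "group_id" and boolean "correct" (hypothetical schema)
        groups = defaultdict(list)
        for r in records:
            groups[r["group_id"]].append(r["correct"])
        # Nonlinear step: per-group all-or-nothing instead of per-question average.
        return sum(all(v) for v in groups.values()) / len(groups)

    records = [
        {"group_id": "g1", "correct": True},
        {"group_id": "g1", "correct": True},
        {"group_id": "g2", "correct": True},
        {"group_id": "g2", "correct": False},  # one miss zeroes the group
    ]
    print(group_score(records))  # 0.5, vs 0.75 per-question accuracy

Under such a rule, a model that gets individual questions right without consistent understanding scores much lower, which would help explain the large human-model gap (90.7 vs 49.4).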
Chaoyou Fu reposted
Lei Li @_TobiasLee
📣🔥 Video-MME-v2 is here!
🎯 Tackling the saturation of video understanding benchmarks
🚀 Built with 3,300+ human-hours over nearly a year
🔍 Progressive tri-level hierarchy & group-based nonlinear scoring
👉 Human: 90.7 vs the best Gemini-3-Pro: 49.4
Project: video-mme-v2.netlify.app
Paper: arxiv.org/pdf/2604.05015
1 reply · 7 reposts · 31 likes · 3.3K views
Chaoyou Fu @brady202406
🔥🔥 Sharing our work: Video-MME-v2!
A team of 60+ amazing colleagues spent nearly a year building Video-MME-v2.
🤔 Due to the existing saturation problem!
🚀 3,300+ human-hours
👉 Human: 90.7 vs the best Gemini-3-Pro: 49.4
❗ A substantial gap!
Project: video-mme-v2.netlify.app
2 replies · 3 reposts · 6 likes · 172 views
Chaoyou Fu reposted
DailyPapers @HuggingPapers
Video-MME-v2: a new benchmark for video understanding featuring a progressive tri-level hierarchy and grouped non-linear scoring. Built with 3,300 human-hours across 800 videos to expose gaps between leaderboard scores and true model capabilities.
2 replies · 9 reposts · 30 likes · 2.2K views
Chaoyou Fu @brady202406
@_TobiasLee Video-MME v2 is about to be released. Brand-new eval pipeline, reflecting our in-depth thinking on video understanding. 😉
0 replies · 0 reposts · 0 likes · 7 views
Lei Li @_TobiasLee
MMMU-Pro and Video-MME are saturated now. We need some awesome agentic multimodal benchmarks. Any suggestions?
8 replies · 0 reposts · 20 likes · 3.2K views
Chaoyou Fu reposted
Yuhao Dong @dyhTHU
📢 ICLR2026 Acceptance Prediction is out!
🚀 Find your paper's predicted acceptance in advance: paperdecision.netlify.app
🛠️ Code for the multi-agent framework and benchmark is available: github.com/PaperDecision/…
🎯 Our goal is to understand the how and why behind paper decisions.
12 replies · 15 reposts · 154 likes · 31.6K views
Chaoyou Fu reposted
AK @_akhaliq
VITA: Towards Open-Source Interactive Omni Multimodal LLM
discuss: huggingface.co/papers/2408.05…

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience.

Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary, followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks.

Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupts in an MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still a lot of work to be done for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.
2 replies · 40 reposts · 153 likes · 24K views
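The abstract pins down only the ordering of the training recipe: expand Mixtral 8x7B's Chinese vocabulary, run bilingual instruction tuning, then do two-stage multi-task learning for multimodal alignment and multimodal instruction tuning. The Python skeleton below encodes just that ordering; every function is an inert stub, not VITA's actual code, and freezing/data details are deliberately left out because the abstract does not give them.

    # Schematic of the VITA training recipe as ordered in the abstract.
    # All functions are placeholder stubs standing in for real training steps.

    def expand_vocabulary(llm, new_tokens):
        return llm  # stub: resize embeddings for the added Chinese tokens

    def instruction_tune(llm, text_data):
        return llm  # stub: bilingual (zh/en) text instruction tuning

    def attach_encoders(llm):
        return llm  # stub: connect vision and audio encoders to the LLM

    def multitask_train(model, data, stage):
        return model  # stub: multi-task training for the named stage

    def build_vita(mixtral_8x7b, zh_tokens, text_data, align_data, mm_data):
        llm = expand_vocabulary(mixtral_8x7b, zh_tokens)
        llm = instruction_tune(llm, text_data)
        model = attach_encoders(llm)
        model = multitask_train(model, align_data, stage="multimodal_alignment")
        model = multitask_train(model, mm_data, stage="multimodal_instruction_tuning")
        return model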
Chaoyou Fu reposted
AK @_akhaliq
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus…
2 replies · 35 reposts · 113 likes · 69.8K views
Chaoyou Fu reposted
Aran Komatsuzaki @arankomatsuzaki
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Gemini 1.5 Pro far outperforms other models, including GPT-4o
proj: video-mme.github.io
abs: arxiv.org/abs/2405.21075
4 replies · 57 reposts · 213 likes · 86.3K views
Chaoyou Fu reposted
AK @_akhaliq
paper page: buff.ly/4bKGD6M
1 reply · 4 reposts · 8 likes · 7.2K views