Chaoyou Fu

18 posts


@brady202406

Assistant Professor @NJU1902. Lead MME, VITA, Awesome-MLLM.

Joined June 2024
6 Following · 14 Followers
Chaoyou Fu reposted
Chaoyou Fu @brady202406
@Chain_AlphaX @_akhaliq 😉 Yeap. We aim to tackle the problem (benchmark scores are getting higher, but real understanding is not).
0 replies · 0 reposts · 0 likes · 2 views
Chain Alpha @Chain_AlphaX
@_akhaliq woah, video understanding is getting serious now.
1 reply · 0 reposts · 1 like · 6 views
Chaoyou Fu reposted
AK @_akhaliq
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
paper: huggingface.co/papers/2604.05…
2 replies · 9 reposts · 33 likes · 6.6K views
Chaoyou Fu @brady202406
@khalide_f1 @_akhaliq Exactly — that’s also what we observed. Current reasoning models rely heavily on textual cues. With subtitles, “thinking” tends to help. But without text, adding “thinking” can even hurt performance. It suggests that visual reasoning is still far from mature.
0 replies · 0 reposts · 0 likes · 8 views
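Neither the exact harness nor the model API appears in this thread; the sketch below (Python) shows the kind of subtitles × "thinking" ablation the observation implies. query_model and the sample fields are hypothetical placeholders, not code from the paper.

    # Hypothetical ablation harness for the observation above: accuracy is
    # compared across the four (subtitles on/off) x (thinking on/off) settings.
    from itertools import product

    def query_model(video_id, question, use_subtitles, use_thinking):
        # Placeholder stub so the sketch runs end-to-end; swap in a real
        # video-MLLM call here. Not an API from the paper or any library.
        return "A"

    def run_ablation(samples):
        # samples: dicts with "video_id", "question", "answer" (hypothetical schema)
        results = {}
        for use_subtitles, use_thinking in product([False, True], repeat=2):
            correct = sum(
                query_model(s["video_id"], s["question"], use_subtitles, use_thinking)
                == s["answer"]
                for s in samples
            )
            results[(use_subtitles, use_thinking)] = correct / len(samples)
        return results

    samples = [{"video_id": "v1", "question": "Q?", "answer": "A"}]
    print(run_ablation(samples))  # one accuracy per condition

If the pattern in the tweet holds, accuracy at (subtitles=True, thinking=True) would exceed (True, False), while (False, True) could fall below (False, False).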
AI Expert Khalid @khalide_f1
@_akhaliq That subtitle dependency is the real tell. If performance drops without text cues, the model is not really understanding the video yet. Feels like we are still early in true visual reasoning for AI 🤔
1 reply · 0 reposts · 1 like · 54 views
Chaoyou Fu reposted
Yuhao Dong @dyhTHU
🔥 Excited to share Video-MME-v2! 🔥
We built it to tackle a growing issue: video understanding benchmarks are getting saturated.
🏃🏻 Over 3,300 human-hours, nearly a year of effort
🌟 A new design with a progressive hierarchy + group-based nonlinear evaluation
What we found:
👉 Human: 90.7 vs 👉 Gemini-3-Pro: 49.4
The gap is still huge.
Explore more at:
Page: video-mme-v2.netlify.app
Paper: arxiv.org/pdf/2604.05015
2 replies · 8 reposts · 24 likes · 5.9K views
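The posts name "group-based nonlinear evaluation" without defining it. A common nonlinear rule in recent benchmarks is all-or-nothing credit per question group; the Python sketch below assumes that rule, and the field names are hypothetical, so the actual Video-MME-v2 scoring may differ.

    # Minimal sketch of one group-based nonlinear scoring rule: a group earns
    # credit only if every question in it is answered correctly.
    from collections import defaultdict

    def group_score(records):
        # records: dicts with "group_id" and boolean "correct" (hypothetical schema)
        groups = defaultdict(list)
        for r in records:
            groups[r["group_id"]].append(r["correct"])
        # Nonlinear step: per-group all-or-nothing instead of per-question average.
        return sum(all(v) for v in groups.values()) / len(groups)

    records = [
        {"group_id": "g1", "correct": True},
        {"group_id": "g1", "correct": True},
        {"group_id": "g2", "correct": True},
        {"group_id": "g2", "correct": False},  # one miss zeroes the group
    ]
    print(group_score(records))  # 0.5, vs 0.75 per-question accuracy

Under such a rule, a model that gets individual questions right without consistent understanding scores much lower, which would help explain the large human-model gap (90.7 vs 49.4).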
Chaoyou Fu reposted
Lei Li @_TobiasLee
📣🔥 Video-MME-v2 is here!
🎯 Tackling the saturation of video understanding benchmarks
🚀 Built with 3,300+ human-hours over nearly a year
🔍 Progressive tri-level hierarchy & group-based nonlinear scoring
👉 Human: 90.7 vs the best Gemini-3-Pro: 49.4
Project: video-mme-v2.netlify.app
Paper: arxiv.org/pdf/2604.05015
1 reply · 7 reposts · 31 likes · 3.3K views
Chaoyou Fu @brady202406
🔥🔥 Sharing our work: Video-MME-v2!
A team of 60+ amazing colleagues spent nearly a year building Video-MME-v2.
🤔 Due to the existing saturation problem!
🚀 3,300+ human-hours
👉 Human: 90.7 vs the best Gemini-3-Pro: 49.4
❗ A substantial gap!
Project: video-mme-v2.netlify.app
2 replies · 3 reposts · 6 likes · 172 views
Chaoyou Fu reposted
DailyPapers @HuggingPapers
Video-MME-v2: a new benchmark for video understanding featuring a progressive tri-level hierarchy and grouped non-linear scoring. Built with 3,300 human-hours across 800 videos to expose gaps between leaderboard scores and true model capabilities.
2 replies · 9 reposts · 30 likes · 2.2K views
Chaoyou Fu @brady202406
@_TobiasLee Video-MME v2 is about to be released. Brand-new eval pipeline, reflecting our in-depth thinking on video understanding. 😉
0 replies · 0 reposts · 0 likes · 7 views
Lei Li @_TobiasLee
MMMU-Pro and Video-MME are saturated now. We need some awesome agentic multimodal benchmarks. Any suggestions?
8 replies · 0 reposts · 20 likes · 3.2K views
Chaoyou Fu reposted
Yuhao Dong @dyhTHU
📢 ICLR2026 Acceptance Prediction is out!
🚀 Find your paper's predicted acceptance in advance: paperdecision.netlify.app
🛠️ Code for the multi-agent framework and benchmark is available: github.com/PaperDecision/…
🎯 Our goal is to understand the how and why behind paper decisions.
12 replies · 15 reposts · 154 likes · 31.6K views
Chaoyou Fu reposted
AK @_akhaliq
VITA: Towards Open-Source Interactive Omni Multimodal LLM
discuss: huggingface.co/papers/2408.05…

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience.

Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary, followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks.

Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupts in an MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still a lot of work to be done for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.
2 replies · 40 reposts · 153 likes · 24K views
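The abstract pins down only the ordering of the training recipe: expand Mixtral 8x7B's Chinese vocabulary, run bilingual instruction tuning, then do two-stage multi-task learning for multimodal alignment and multimodal instruction tuning. The Python skeleton below encodes just that ordering; every function is an inert stub, not VITA's actual code, and freezing/data details are deliberately left out because the abstract does not give them.

    # Schematic of the VITA training recipe as ordered in the abstract.
    # All functions are placeholder stubs standing in for real training steps.

    def expand_vocabulary(llm, new_tokens):
        return llm  # stub: resize embeddings for the added Chinese tokens

    def instruction_tune(llm, text_data):
        return llm  # stub: bilingual (zh/en) text instruction tuning

    def attach_encoders(llm):
        return llm  # stub: connect vision and audio encoders to the LLM

    def multitask_train(model, data, stage):
        return model  # stub: multi-task training for the named stage

    def build_vita(mixtral_8x7b, zh_tokens, text_data, align_data, mm_data):
        llm = expand_vocabulary(mixtral_8x7b, zh_tokens)
        llm = instruction_tune(llm, text_data)
        model = attach_encoders(llm)
        model = multitask_train(model, align_data, stage="multimodal_alignment")
        model = multitask_train(model, mm_data, stage="multimodal_instruction_tuning")
        return model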
Chaoyou Fu reposted
AK @_akhaliq
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus…
2 replies · 35 reposts · 113 likes · 69.8K views
Chaoyou Fu reposted
Aran Komatsuzaki @arankomatsuzaki
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Gemini 1.5 Pro far outperforms other models, including GPT-4o
proj: video-mme.github.io
abs: arxiv.org/abs/2405.21075
4 replies · 57 reposts · 213 likes · 86.3K views
Chaoyou Fu reposted
AK @_akhaliq
paper page: buff.ly/4bKGD6M
1 reply · 4 reposts · 8 likes · 7.2K views