Ce Zhang

32 posts

@cezhhh

CS phd student at UNC Chapel Hill.

Chapel Hill, NC · Joined September 2023
78 Following · 100 Followers
Ce Zhang retweeted
Gedas Bertasius @gberta227
Hard to believe it's been almost 5 years since I started at UNC. 2025 was an exciting year for our group!
🎓 My two PhD students—who joined me when I had an empty group—are graduating. Watching them grow into experts has been the best part of the job.
🏀 We are branching into Robotics & Sports (combining my personal passions with work!).
🎥 Our new video systems, BIMBA & SiLVR, achieved excellent performance across many challenging benchmarks.
🏆 Grateful for the awards we received across academia and industry this year.
I used to worry about making it in academia. Now, I'm just happy to be here. Huge thanks to my group for an incredible 2025. Here is a snapshot of what we accomplished! 📸
Ce Zhang retweeted
Gedas Bertasius @gberta227
Is language a "terrible abstraction" for video understanding? Many in the video community dismiss language-driven approaches in favor of complex, video-native solutions. However, I believe this resistance stems more from internal bias—validating a research identity as a "vision/video researcher"—than from empirical reality. Simple, language-driven systems often dominate complex, video-native solutions.

Over the last two years, our group at UNC has developed a series of such language-driven frameworks (LLoVi, VideoTree, VidAssist, and SiLVR) based on a simple, modular pipeline:

Video Input → Dense Captioning → LLM Reasoning

Empirically, these simple systems frequently outperform sophisticated video-focused solutions across numerous benchmarks while offering significant advantages:

Scaling: By decoupling vision and reasoning, we can leverage the most powerful LLMs (even 1T+ parameters). Video-native approaches hit GPU memory walls, often limited to <256 frames, making long-form video analysis very difficult.

Adaptability & Flexibility: Better captioners or LLMs (released very frequently these days) instantly improve the system with minimal effort.

Training-Free: They work out-of-the-box and easily incorporate diverse data (bboxes, audio/speech, etc.) as textual captions. This avoids the complex, resource-intensive training regimes used by many recent approaches (particularly RL-based video reasoning systems).

While end-to-end systems may eventually prevail, these language-driven frameworks currently offer the most performant and practical approach to video reasoning. They shouldn't be dismissed just because they are driven by language; they should be used as strong baselines to advance the field.

SiLVR: arxiv.org/pdf/2505.24869
LLoVi: arxiv.org/pdf/2312.17235
VideoTree: arxiv.org/pdf/2405.19209
VidAssist: arxiv.org/pdf/2409.20557
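The modular pipeline described above (Video Input → Dense Captioning → LLM Reasoning) can be sketched roughly as follows. This is a minimal illustration, not the actual LLoVi/VideoTree/SiLVR code: `caption_frame` and `llm_answer` are hypothetical placeholders for any off-the-shelf captioning model and LLM API.

```python
# Minimal sketch of a language-driven video reasoning pipeline:
# sample frames -> caption each frame -> concatenate captions -> ask an LLM.
# caption_frame() and llm_answer() are hypothetical stand-ins for real models.

def caption_frame(frame):
    """Placeholder: return a short text caption for one frame."""
    return f"frame showing {frame}"

def llm_answer(prompt):
    """Placeholder: return an LLM's answer to a text prompt."""
    return f"answer based on: {prompt[:40]}..."

def answer_video_question(frames, question, stride=2):
    # 1) Dense captioning: describe sampled frames as text.
    captions = [caption_frame(f) for f in frames[::stride]]
    # 2) Build a purely textual prompt. Because vision and reasoning are
    #    decoupled, the LLM can be swapped for a stronger one at any time.
    prompt = "Video captions:\n" + "\n".join(captions) + f"\nQuestion: {question}"
    # 3) LLM reasoning over the captions alone (no video tokens, no GPU wall).
    return llm_answer(prompt)

frames = ["a person cracking eggs", "eggs in a bowl", "a pan on the stove"]
print(answer_video_question(frames, "What is the person cooking?"))
```

The training-free property falls out of the structure: upgrading either placeholder to a better captioner or LLM improves the system with no retraining.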
Ce Zhang @cezhhh
Our framework offers several benefits.
1) Simplicity. No complex RL-based optimization or specialized modules for different tasks.
2) Generalizability. Can be applied to a wide range of complex video-language tasks.
3) Modularity. Enables seamless use of visual captioning models and LLMs.
Ce Zhang @cezhhh
Recent advances in test-time optimization have led to remarkable reasoning capabilities in LLMs. However, the reasoning capabilities of multimodal LLMs (MLLMs) still lag significantly behind, especially on complex video-language tasks. We present SiLVR, a Simple Language-based Video Reasoning framework.
Ce Zhang retweeted
Yulu Pan @YuluPan_00
🚨 New #CVPR2025 Paper 🚨
🏀 BASKET: A Large-Scale Dataset for Fine-Grained Basketball Skill Estimation 🎥
4,477 hours of videos ⏱️ | 32,232 players ⛹️ | 20 fine-grained skills 🎯
We present a new video dataset for skill estimation with unprecedented scale and diversity! A thread 👇
Ce Zhang retweeted
Kevin Zhao @KevinZ8866
(0/7) #ICLR2024 How can LLMs benefit video action forecasting? Excited to share our ICLR 2024 paper: AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
AK @_akhaliq

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
paper page: huggingface.co/papers/2307.16…

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)?

The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the goal of the actor and plans the procedure needed to accomplish it.

We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge on likely next actions and infer the goal from the observed part of a procedure.

To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure via chain-of-thought prompting.

Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks, and qualitative analysis shows it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction.

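AntGPT's two-stage idea (recognize the observed actions, then have an LLM continue the sequence bottom-up or plan toward an inferred goal top-down) can be sketched as below. This is an illustrative outline under stated assumptions, not the paper's implementation: `recognize_actions` and `llm_complete` are hypothetical placeholders for an action recognizer and an LLM call.

```python
# Rough sketch of AntGPT's two-stage formulation:
# Stage 1: recognize (verb, noun) actions in the observed video.
# Stage 2: prompt an LLM to either (a) autoregressively predict the next
# actions (bottom-up), or (b) infer the actor's goal and plan the remaining
# steps (top-down, chain-of-thought style).
# recognize_actions() and llm_complete() are hypothetical placeholders.

def recognize_actions(video):
    """Placeholder action recognizer: returns (verb, noun) pairs."""
    return [("crack", "egg"), ("mix", "egg")]

def llm_complete(prompt):
    """Placeholder LLM call returning a predicted action sequence."""
    return "fry rice; add egg; stir rice"

def anticipate_bottom_up(video, horizon=3):
    # Bottom-up: continue the action sequence directly.
    observed = recognize_actions(video)
    history = "; ".join(f"{v} {n}" for v, n in observed)
    prompt = (f"Observed actions: {history}.\n"
              f"Predict the next {horizon} actions as 'verb noun' pairs.")
    return llm_complete(prompt)

def anticipate_top_down(video, horizon=3):
    # Top-down: infer the goal first, then plan toward it.
    observed = recognize_actions(video)
    history = "; ".join(f"{v} {n}" for v, n in observed)
    prompt = (f"Observed actions: {history}.\n"
              "First infer the actor's goal, then plan the remaining "
              f"{horizon} steps needed to accomplish it.")
    return llm_complete(prompt)

print(anticipate_bottom_up("kitchen_clip.mp4"))
```

Editing the goal in the top-down prompt is what enables the goal-conditioned "counterfactual" prediction the abstract mentions.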