Ce Zhang

32 posts

@cezhhh

CS phd student at UNC Chapel Hill.

Chapel Hill, NC · Joined September 2023
78 Following · 100 Followers
Ce Zhang retweeted
Gedas Bertasius @gberta227
Hard to believe it's been almost 5 years since I started at UNC. 2025 was an exciting year for our group!
🎓 My two PhD students—who joined me when I had an empty group—are graduating. Watching them grow into experts has been the best part of the job.
🏀 We are branching into Robotics & Sports (combining my personal passions with work!).
🎥 Our new video systems, BIMBA & SiLVR, achieved excellent performance across many challenging benchmarks.
🏆 Grateful for the awards we received across academia and industry this year.
I used to worry about making it in academia. Now, I'm just happy to be here. Huge thanks to my group for an incredible 2025. Here is a snapshot of what we accomplished! 📸
Ce Zhang retweeted
Gedas Bertasius @gberta227
Is language a "terrible abstraction" for video understanding? Many in the video community dismiss language-driven approaches in favor of complex, video-native solutions. However, I believe this resistance stems more from internal bias—validating a research identity as a "vision/video researcher"—than from empirical reality. Simple, language-driven systems often dominate complex, video-native solutions.

Over the last two years, our group at UNC has developed a series of such language-driven frameworks (LLoVi, VideoTree, VidAssist, and SiLVR) based on a simple, modular pipeline:

Video Input → Dense Captioning → LLM Reasoning

Empirically, these simple systems frequently outperform sophisticated video-focused solutions across numerous benchmarks while offering significant advantages:

Scaling: By decoupling vision and reasoning, we can leverage the most powerful LLMs (even 1T+ parameters). Video-native approaches hit GPU memory walls, often limited to <256 frames, making long-form video analysis very difficult.

Adaptability & Flexibility: Better captioners or LLMs (released very frequently these days) instantly improve the system with minimal effort.

Training-Free: They work out-of-the-box and easily incorporate diverse data (bboxes, audio/speech, etc.) as textual captions. This avoids the complex, resource-intensive training regimes used by many recent approaches (particularly RL-based video reasoning systems).

While end-to-end systems may eventually prevail, these language-driven frameworks currently offer the most performant and practical approach to video reasoning. They shouldn't be dismissed just because they are driven by language; they should be used as strong baselines to advance the field.

SiLVR: arxiv.org/pdf/2505.24869
LLoVi: arxiv.org/pdf/2312.17235
VideoTree: arxiv.org/pdf/2405.19209
VidAssist: arxiv.org/pdf/2409.20557
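The modular pipeline described above (Video Input → Dense Captioning → LLM Reasoning) can be sketched roughly as follows. This is a minimal illustration, not the actual LLoVi/VideoTree/SiLVR code: `caption_frame` and `llm_answer` are hypothetical placeholders for any off-the-shelf captioning model and LLM API.

```python
# Minimal sketch of a language-driven video reasoning pipeline:
# sample frames -> caption each frame -> concatenate captions -> ask an LLM.
# caption_frame() and llm_answer() are hypothetical stand-ins for real models.

def caption_frame(frame):
    """Placeholder: return a short text caption for one frame."""
    return f"frame showing {frame}"

def llm_answer(prompt):
    """Placeholder: return an LLM's answer to a text prompt."""
    return f"answer based on: {prompt[:40]}..."

def answer_video_question(frames, question, stride=2):
    # 1) Dense captioning: describe sampled frames as text.
    captions = [caption_frame(f) for f in frames[::stride]]
    # 2) Build a purely textual prompt. Because vision and reasoning are
    #    decoupled, the LLM can be swapped for a stronger one at any time.
    prompt = "Video captions:\n" + "\n".join(captions) + f"\nQuestion: {question}"
    # 3) LLM reasoning over the captions alone (no video tokens, no GPU wall).
    return llm_answer(prompt)

frames = ["a person cracking eggs", "eggs in a bowl", "a pan on the stove"]
print(answer_video_question(frames, "What is the person cooking?"))
```

The training-free property falls out of the structure: upgrading either placeholder to a better captioner or LLM improves the system with no retraining.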
Ce Zhang @cezhhh
Our framework offers several benefits.
1) Simplicity. No complex RL-based optimization or specialized modules for different tasks.
2) Generalizability. Can be applied to a wide range of complex video-language tasks.
3) Modularity. Enables seamless use of visual captioning models and LLMs.
Ce Zhang @cezhhh
Recent advances in test-time optimization have led to remarkable reasoning capabilities in LLMs. However, the reasoning capabilities of multimodal LLMs (MLLMs) still lag significantly behind, especially on complex video-language tasks. We present SiLVR, a Simple Language-based Video Reasoning framework.
Ce Zhang retweeted
Yulu Pan @YuluPan_00
🚨 New #CVPR2025 Paper 🚨
🏀 BASKET: A Large-Scale Dataset for Fine-Grained Basketball Skill Estimation 🎥
4,477 hours of videos ⏱️ | 32,232 players ⛹️ | 20 fine-grained skills 🎯
We present a new video dataset for skill estimation with unprecedented scale and diversity! A thread 👇
Ce Zhang retweeted
Kevin Zhao @KevinZ8866
(0/7) #ICLR2024 How can LLMs benefit video action forecasting? Excited to share our ICLR 2024 paper: AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
AK @_akhaliq

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
paper page: huggingface.co/papers/2307.16…

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)?

The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the goal of the actor and plans the procedure needed to accomplish it.

We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge on likely next actions and infer the goal from the observed part of a procedure.

To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure via chain-of-thought prompting.

Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks, and qualitative analysis shows it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction.

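AntGPT's two-stage idea (recognize the observed actions, then have an LLM continue the sequence bottom-up or plan toward an inferred goal top-down) can be sketched as below. This is an illustrative outline under stated assumptions, not the paper's implementation: `recognize_actions` and `llm_complete` are hypothetical placeholders for an action recognizer and an LLM call.

```python
# Rough sketch of AntGPT's two-stage formulation:
# Stage 1: recognize (verb, noun) actions in the observed video.
# Stage 2: prompt an LLM to either (a) autoregressively predict the next
# actions (bottom-up), or (b) infer the actor's goal and plan the remaining
# steps (top-down, chain-of-thought style).
# recognize_actions() and llm_complete() are hypothetical placeholders.

def recognize_actions(video):
    """Placeholder action recognizer: returns (verb, noun) pairs."""
    return [("crack", "egg"), ("mix", "egg")]

def llm_complete(prompt):
    """Placeholder LLM call returning a predicted action sequence."""
    return "fry rice; add egg; stir rice"

def anticipate_bottom_up(video, horizon=3):
    # Bottom-up: continue the action sequence directly.
    observed = recognize_actions(video)
    history = "; ".join(f"{v} {n}" for v, n in observed)
    prompt = (f"Observed actions: {history}.\n"
              f"Predict the next {horizon} actions as 'verb noun' pairs.")
    return llm_complete(prompt)

def anticipate_top_down(video, horizon=3):
    # Top-down: infer the goal first, then plan toward it.
    observed = recognize_actions(video)
    history = "; ".join(f"{v} {n}" for v, n in observed)
    prompt = (f"Observed actions: {history}.\n"
              "First infer the actor's goal, then plan the remaining "
              f"{horizon} steps needed to accomplish it.")
    return llm_complete(prompt)

print(anticipate_bottom_up("kitchen_clip.mp4"))
```

Editing the goal in the top-down prompt is what enables the goal-conditioned "counterfactual" prediction the abstract mentions.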