

Yanshu Zhang




🎥 Video is already a tough modality for reasoning. Egocentric video? Even tougher! It is longer, messier, and harder. 💡 How do we tackle these extremely long, information-dense sequences without exhausting GPU memory or hitting API limits? We introduce 👓 Ego-R1: a framework for reasoning over ultra-long egocentric videos (i.e., spanning days to weeks), built on Chain-of-Tool-Thought (CoTT) reasoning, which decomposes complex reasoning tasks into modular steps. At its core is Ego-R1-Agent-3B, an orchestrating language model trained to dynamically invoke a specialized tool at each step, based on its previous actions and observations, so that it gathers the necessary information and solves the task step by step. All code and data are fully open-sourced :) 🌐 Project: egolife-ai.github.io/Ego-R1 📄 Paper: arxiv.org/abs/2506.13654 💻 Code: github.com/egolife-ai/Ego…
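For readers who want a feel for the step-by-step tool invocation described in the post, here is a rough, minimal sketch of such a loop in Python. It is not the Ego-R1 code: `run_agent`, `Step`, `toy_policy`, `TOY_TOOLS`, and the `"search_logs"` tool are all hypothetical names made up for illustration, and the "policy" is a hand-written stand-in for the orchestrating language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One action/observation pair in the reasoning trace (hypothetical)."""
    tool: str
    query: str
    observation: str


# A policy maps (question, history so far) to (tool name, tool input).
# The special tool name "answer" ends the loop (an assumption of this sketch).
Policy = Callable[[str, List[Step]], Tuple[str, str]]


def run_agent(
    question: str,
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Pick a tool at each step based on previous actions and observations."""
    history: List[Step] = []
    for _ in range(max_steps):
        tool, query = policy(question, history)
        if tool == "answer":
            return query, history
        observation = tools[tool](query)
        history.append(Step(tool, query, observation))
    # Step budget exhausted without a final answer.
    return None, history


# Toy stand-ins, purely illustrative: one fake tool and a two-step policy.
TOY_TOOLS: Dict[str, Callable[[str], str]] = {
    "search_logs": lambda query: "day 3: bought milk",
}


def toy_policy(question: str, history: List[Step]) -> Tuple[str, str]:
    if not history:
        return "search_logs", question
    return "answer", history[-1].observation


if __name__ == "__main__":
    answer, trace = run_agent("When was milk bought?", toy_policy, TOY_TOOLS)
    print(answer, len(trace))
```

In a real system the policy would be a trained model and the tools would query the video itself; the sketch only shows the control flow of deciding, acting, and observing until an answer is produced or the step budget runs out.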





NeRF reconstructs 3D scenes accurately, but editing them is hard. Introducing PAPR, a method that learns a point cloud from multiple views from scratch and enables zero-shot editing. Details at zvict.github.io/papr/. Joint work w/ @yszhang170, @PengShichong & @Moazeni_Alireza

