sriya

Excited to share our new survey paper: the first comprehensive survey on Vision World Models (VWMs), a joint effort by researchers from BJTU, ByteDance, Tencent, NUS, and more.

🌟 From Seeing to Knowing the World: A Survey of Vision World Models

🚀 Our core message is a paradigm shift toward vision-centric world modeling: vision should not be treated merely as an input modality. It should be the primary driver of how world models are represented, learned, and evaluated.

🌈 This is also the longstanding view behind our #VideoWorld series: learning directly from visual observation and interaction offers a scalable path for AI agents to acquire world knowledge, laying the foundation for higher machine intelligence.

🤔 Why Vision World Models? From biological evolution to human intelligence, vision has been central to learning about the world through observation and interaction. AI should have this capability too. This motivates Vision World Models: models that learn world knowledge from visual data and simulate future world states conditioned on interaction.

🤖 In this survey, we thoroughly review 400+ recent papers and provide a vision-centric roadmap for Vision World Models, covering architectures, functional roles, applications, evaluation protocols, datasets, benchmarks, and future outlook.

Key takeaways:

1️⃣ Vision is a fundamental basis of intelligence and a rich source of world knowledge. We advocate vision-centric world modeling, where AI learns the physical and causal principles behind world evolution from visual data.

2️⃣ We propose a unified framework that decomposes Vision World Models into three core components (Vision Encoding → Knowledge Learning → Controllable Simulation) and organize current methods into 4 major families and 7 representative architectures.

3️⃣ We review evaluation at three levels (Visual Quality, Physical Plausibility, and Task Performance) and group datasets and benchmarks into foundational world modeling and domain-specific world modeling.

4️⃣ We outline three directions for next-generation world models: Re-grounding in physical and causal knowledge, Re-evaluating beyond visual appearance, and Re-scaling toward generalist, reliable, and interaction-aware world models.

Check out our paper and the continuously updated curated list of Vision World Model papers for more details!

📄 Paper: aiworldlab.github.io/survey/preprin…
🌐 Project Page: aiworldlab.github.io/survey/
📚 Curated VWM Paper List: github.com/AIWorldLab/Awe

#VisionWorldModel #WorldModel #Survey #VideoWorld #EmbodiedAI #Robotics #AI #CV
