adil meric





📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! ziyaerkoc.com/worldagents youtu.be/Mj2FqqhurdI Great work by @ErkocZiya @angelaqdai

🎬 2026 will be the year of autoregressive video models. As we wrap up 2025, we ask a critical question: How far can a diffusion model distilled with Self-Forcing on only 5-second, 16-FPS videos be pushed into long-form video generation without any supervision? ✨We introduce Infinity-RoPE, a training-free, plug-and-play relativistic RoPE formulation compatible with any Self-Forcing variant performing self-rollouts. ⏳Infinity-RoPE enables long video generation far beyond the base model’s temporal RoPE limit and supports full action control and dynamic scene changes, including scene cuts, within a single continuous generation stream.

🚀 Announcing Echo — our new frontier model for 3D world generation. Echo turns a simple text prompt or image into a fully explorable, 3D-consistent world. Instead of disconnected views, the result is a single, coherent spatial representation you can move through freely. This is part of a bigger shift in AI: from generating pixels and tokens to generating spaces. Echo predicts a geometry-grounded 3D scene at metric scale, meaning every novel view, depth map, and interaction comes from the same underlying world — not independent hallucinations. Once generated, the world is interactive in real time. You control the camera, explore from any angle, and render instantly — even on low-end hardware, directly in the browser. High-quality 3D world exploration is no longer gated by expensive equipment. Under the hood, Echo infers a physically grounded 3D representation and converts it into a renderable format. For our web demo, we use 3D Gaussian Splatting (3DGS) for fast, GPU-friendly rendering — but the representation itself is flexible and can be easily adapted. Why this matters: consistent 3D worlds unlock real workflows — digital twins, 3D design, game environments, robotics simulation, and more. From a single photo or a line of text, Echo builds worlds that are reliable, editable, and spatially faithful. Echo also enables scene editing and restyling. Change materials, remove or add objects, explore design variations — all while preserving global 3D consistency. Editing no longer breaks the world. This is only the beginning. Echo is the foundation for future world models with dynamics, physical reasoning, and richer interaction — environments that don’t just look right, but behave right. Explore the generated worlds on our website and sign up for the closed beta. The era of spatial intelligence starts here. 🌍 #Echo #WorldModels #SpatialAI #3DFoundationModels Check it out: spaitial.ai




Thrilled to announce that at #ICCV2025 we will host the first workshop on 𝐆𝐞𝐨𝐦𝐞𝐭𝐫𝐲-𝐅𝐫𝐞𝐞 𝐍𝐨𝐯𝐞𝐥 𝐕𝐢𝐞𝐰 𝐒𝐲𝐧𝐭𝐡𝐞𝐬𝐢𝐬 𝐚𝐧𝐝 𝐂𝐨𝐧𝐭𝐫𝐨𝐥𝐥𝐚𝐛𝐥𝐞 𝐕𝐢𝐝𝐞𝐨 𝐌𝐨𝐝𝐞𝐥𝐬 geofreenvs.github.io a.k.a. "3D Computer Vision in the era of Video Models" 😅



Make no mistake, Google's new Veo 3 video generation model is absolutely exceptional. But gymnastics is still pure nightmare fuel, and is the Turing test for video models!






G3DST: Generalizing 3D Style Transfer with NeRF across Scenes and Styles! Given a style latent, our hypernetwork estimates MLP params that transform aggregated ray features. mericadil.github.io/G3DST/ Great work by our MA student @adilmeric12 U. Kocasari @barbara_roessle #GCPR24






