

Vincent Sitzmann
@vincesitzmann
Teaching AI to model, see, and interact with our world. Assistant Professor @ MIT, leading the Scene Representation Group (https://t.co/h5gvhLYrtw).


Why do diffusion models produce new images instead of just memorizing the dataset? We show that they learn pixel correlation patterns from the data and therefore denoise locally, which promotes generalization. To test this idea, we compare trained diffusion models with a training-free algorithm that mixes local patches from the dataset. Surprisingly, this simple procedure already reproduces many properties of the trained models. 🧵 Check out this thread for more details about our Spotlight NeurIPS paper with @yuancy, @JustinMSolomon and @vincesitzmann.
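A minimal sketch of what such a training-free patch-mixing baseline could look like, assuming plain L2 nearest-neighbour matching between patches (the function name and matching rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def patch_mixture_denoise(noisy, dataset, patch=8, stride=4):
    """Training-free local "denoiser" (illustrative sketch).

    For each patch of the noisy image, substitute the nearest-neighbour
    patch drawn from the dataset, then blend overlapping patches by
    averaging. Everything happens locally: no model, no training.
    """
    H, W = noisy.shape
    # Collect all dataset patches into one bank (assumes same-size images).
    bank = []
    for img in dataset:
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                bank.append(img[y:y + patch, x:x + patch].ravel())
    bank = np.stack(bank)  # shape (N, patch*patch)

    out = np.zeros_like(noisy, dtype=float)
    weight = np.zeros_like(noisy, dtype=float)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            q = noisy[y:y + patch, x:x + patch].ravel()
            # Nearest dataset patch in L2 distance -- purely local matching.
            idx = np.argmin(((bank - q) ** 2).sum(axis=1))
            out[y:y + patch, x:x + patch] += bank[idx].reshape(patch, patch)
            weight[y:y + patch, x:x + patch] += 1.0
    return out / np.maximum(weight, 1.0)
```

Because every patch is matched independently, the output mixes patches from many different training images rather than reproducing any single one, which is the intuition for why local denoising promotes generalization over memorization.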


Because we support long-context visual memory, our robots can learn on the fly. Show the robot a single human demonstration, and it understands both the intent and the motion. It can even extrapolate to novel objects and environments it's never seen before. 🧺✍️

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA).


If you're at @wacv_official tomorrow come see our poster on Snapmoji! It's our system to generate and animate 3D avatars of yourself. I'll be at poster location 51 from 4:00-5:45 on Sunday!


Excited to share some surprising inventions in generative multiplayer games we made at Google with Stanford. We call the work MultiGen.

I've always been inspired by early studios like id Software with Doom, or Blizzard with Warcraft, bringing networked video games to the next level. We are at the point in history where we can make strides like theirs, but for generative games. It's a strange feeling to be in the age of generative video games while still discovering exactly how to train the models and design the tools that make them useful.

All of the tools invented for classic game engines need to be redesigned for generative games. For example, level and world design is not fully possible with existing technology. We introduce editable memory for diffusion game engines, which allows new levels to be designed via a minimap, and we can easily imagine expanding this with other creation tools. The end goal of this research direction is to let game designers guide the generation process of their world at whatever granularity they prefer.

Editable memory also lets us add multiplayer to generative Doom. We were amazed when we saw GameNGen some years ago, and now you can play it live with friends in real time, on your couch or even online. Shared representations like our editable memory seem like the future for this type of experience: models are, in some cases, expensive and approximate encoders, but great interpolators and extrapolators, and leveraging their strengths enables completely new experiences that can be realized now rather than in the distant future.

This work was started at my previous team and continued in collaboration with Stanford. Congratulations to all for the discoveries.
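One way to picture a minimap-editable memory is as a shared spatial grid of features that a generator conditions on. This is a hypothetical sketch under my own assumptions (class and method names are invented for illustration, not the MultiGen implementation):

```python
import numpy as np

class EditableMemory:
    """Toy editable spatial memory keyed by minimap cells (illustrative).

    A generative game engine could condition each frame on the cells
    around the player; a designer edits cells directly via the minimap.
    """

    def __init__(self, grid=(64, 64), dim=16):
        # One feature vector per minimap cell.
        self.cells = np.zeros((*grid, dim), dtype=np.float32)

    def paint(self, y, x, feature):
        """Designer edit: write a feature vector into one minimap cell."""
        self.cells[y, x] = feature

    def local_context(self, y, x, radius=2):
        """Context window around a player, to be fed to the generator."""
        ys = slice(max(0, y - radius), y + radius + 1)
        xs = slice(max(0, x - radius), x + radius + 1)
        return self.cells[ys, xs]
```

Because the memory is a plain shared structure, two players' generators can read the same cells, which is what makes a consistent shared world plausible in a generative engine.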

4) Relative attention scales! Lastly, we found that the new relative attention schemes (PRoPE, GTA) bring significant scaling benefits in many-view settings (>= 4 views): they change the slope of the Pareto frontier for both model families. (6/n)
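The core idea behind relative attention schemes, in the spirit of rotary-style encodings, is that the attention logit should depend only on the *relative* offset between two elements, never their absolute positions. A toy 2-D illustration of that invariance (my own sketch, not PRoPE or GTA themselves):

```python
import numpy as np

def rotate(v, theta):
    # 2-D rotation, the basic building block of rotary-style encodings.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

def rel_score(q, k, pos_q, pos_k, freq=1.0):
    """Attention logit that depends only on pos_k - pos_q.

    Rotating q and k each by their own position makes the dot product
    a function of the relative offset alone: R(a)q . R(b)k = q . R(b-a)k.
    """
    return rotate(q, freq * pos_q) @ rotate(k, freq * pos_k)
```

Shifting both positions by the same amount leaves the score unchanged, which is the property that lets such schemes generalize across absolute poses and view counts.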


How do you train compute-optimal novel view synthesis models? In our CVPR '26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling laws and careful ablations, and along the way train a new SoTA model with 3x less compute. (1/n)