Vincent Sitzmann

893 posts

@vincesitzmann

Teaching AI to model, see, and interact with our world. Assistant Professor @ MIT, leading the Scene Representation Group (https://t.co/h5gvhLYrtw).

Cambridge, Massachusetts · Joined February 2016
308 Following · 18.3K Followers
Pinned Tweet
Vincent Sitzmann @vincesitzmann
In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision - mapping imagery to intermediate representations (3D, flow, segmentation...) - is about to go away. vincentsitzmann.com/blog/bitter_le…
43 replies · 157 reposts · 1K likes · 365.7K views
Seungwook Han @seungwookh
Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)
48 replies · 259 reposts · 1.7K likes · 239.9K views
Vincent Sitzmann @vincesitzmann
I.e., can we analyze what the neural network learns as a function of the dataset statistics? This paper is one such example, and it's really fun :)
0 replies · 0 reposts · 4 likes · 553 views
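To make the "pre-pre-training on fully synthetic, language-free data" idea concrete, here is a minimal sketch of generating such a corpus. Note this uses an elementary cellular automaton (rule 110) as a stand-in for the neural cellular automata in the actual work, and the token vocabulary, sequence sizes, and separator scheme are illustrative assumptions on my part, not the blog post's setup.

```python
# Hedged sketch: fully synthetic, language-free sequences from a cellular
# automaton, standing in for the neural-CA data the thread describes.
import numpy as np

def elementary_ca(rule, width, steps, rng):
    """Roll out a 1D elementary cellular automaton; returns a (steps, width) binary grid."""
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    state = rng.integers(0, 2, size=width, dtype=np.uint8)
    rows = [state]
    for _ in range(steps - 1):
        left, right = np.roll(state, 1), np.roll(state, -1)
        state = table[(left << 2) | (state << 1) | right]  # lookup by 3-cell neighborhood
        rows.append(state)
    return np.stack(rows)

def make_corpus(n_docs=1000, width=64, steps=16, rule=110, seed=0):
    """Flatten CA rollouts into token streams for ordinary next-token prediction.
    Vocabulary is just {0, 1} plus a row-separator token 2 - zero language."""
    rng = np.random.default_rng(seed)
    docs = []
    for _ in range(n_docs):
        grid = elementary_ca(rule, width, steps, rng)
        docs.append(np.concatenate([np.append(row, 2) for row in grid]).astype(np.int64))
    return docs

# A transformer would be trained with the usual causal LM loss on these
# streams before ever seeing natural language.
corpus = make_corpus()
print(len(corpus), corpus[0][:20])
```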
Vincent Sitzmann @vincesitzmann
There are lots of open questions in the quest to understand generative models and diffusion models. One of our core lessons is that "mechanistic" interpretability is difficult; instead, I would advocate for "information-theoretic" interpretability... (3/n)
1 reply · 0 reposts · 7 likes · 1.2K views
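As one toy illustration of what an "information-theoretic" (rather than mechanistic) analysis can look like - my own example, not the method from this thread - one can ask how much information about a label survives the forward diffusion process at each noise level, using linear-probe accuracy as a crude proxy for the mutual information I(x_t; y).

```python
# Hedged toy probe: label information vs. diffusion noise level.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: two Gaussian classes in 16-D, standing in for image features x_0.
x0 = np.concatenate([rng.normal(-1.0, 1.0, (500, 16)), rng.normal(1.0, 1.0, (500, 16))])
y = np.array([0] * 500 + [1] * 500)
perm = rng.permutation(1000)                 # shuffle before the train/test split
x0, y = x0[perm], y[perm]

# Forward diffusion: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise.
for alpha_bar in [0.99, 0.9, 0.5, 0.1, 0.01]:
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * rng.normal(size=x0.shape)
    probe = LogisticRegression(max_iter=1000).fit(xt[:800], y[:800])
    print(f"alpha_bar={alpha_bar:.2f}  probe accuracy={probe.score(xt[800:], y[800:]):.2f}")
```

Probe accuracy decays toward chance as alpha_bar shrinks, quantifying what the noising process destroys without opening up any network internals.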
Vincent Sitzmann retweeted
Eric Chan @ericryanchan
Today, we announce our team’s progress in pursuing a different type of foundation model for robotics: the Direct Video Action Model (DVA), our best attempt at turning robotics into a generative modeling problem we can scale. Technical blog: rhoda.ai/research/direc…
12 replies · 29 reposts · 197 likes · 18.9K views
Vincent Sitzmann @vincesitzmann
Disclaimer: I'm not associated with Rhoda, but have lots of friends there, and may join in an advisory capacity to push video generative modeling for robotics further!
0 replies · 0 reposts · 3 likes · 651 views
Vincent Sitzmann @vincesitzmann
These are very impressive results! The Rhoda team has decisively gotten "video models for robotics" to work. They train a generalist, real-time, causal video model that they then quickly fine-tune on task-specific data to generate video plans (1/n)
Rhoda AI @rhoda_ai_

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

1 reply · 1 repost · 39 likes · 7.1K views
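For readers unfamiliar with the recipe, here is a minimal sketch of the general "video model as robot policy" pattern described above: a causal video model proposes future frames as a plan, and an inverse-dynamics head decodes the actions connecting consecutive planned frames. The GRU stand-in, module shapes, and 7-DoF action space are illustrative assumptions on my part, not Rhoda's actual DVA architecture.

```python
# Hedged sketch of the video-as-policy recipe (illustrative, not Rhoda's DVA).
import torch
import torch.nn as nn

class CausalVideoPlanner(nn.Module):
    """Stand-in for a pretrained causal video model: predicts the next frame
    latent from the history of frame latents (a GRU here, purely for brevity)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, history):           # history: (B, T, D) past frame latents
        out, _ = self.rnn(history)
        return self.head(out[:, -1])      # (B, D) next planned frame latent

class InverseDynamics(nn.Module):
    """Decodes the action that connects two consecutive (planned) frames."""
    def __init__(self, latent_dim=128, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, z_t, z_next):
        return self.net(torch.cat([z_t, z_next], dim=-1))

planner, inv_dyn = CausalVideoPlanner(), InverseDynamics()
history = torch.randn(1, 8, 128)          # 8 observed frame latents
z_next = planner(history)                 # plan one step of future video
action = inv_dyn(history[:, -1], z_next)  # decode the action that realizes it
print(action.shape)                       # torch.Size([1, 7])
```

The appeal of this split is that the planner can be pretrained on action-free video at scale, while only the small inverse-dynamics head needs robot data.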
Vincent Sitzmann retweeted
Nataniel Ruiz @natanielruizg
Excited to show some surprising inventions on generative multiplayer games we made at Google with Stanford. We call the work MultiGen. I've always been inspired by early studios like id Software with Doom or Blizzard with Warcraft bringing networked video games to the next level. We are at the point in history where we can make strides like them, but for generative games.

It's a strange feeling to be in the age of generative video games while still discovering how exactly to train the models and design the tools that make them useful. All of the tools that have been invented for classic game engines need to be redesigned for generative games. For example, level and world design is not entirely possible with existing technology. We introduce editable memory for diffusion game engines that allows for the design of new levels via a minimap. But we can easily imagine how this can be expanded with different creation tools. The end goal of this research direction is to allow game designers to guide the generation process of their world, at the granularity that they prefer.

Editable memory also allows us to add multiplayer to Generative Doom. We were amazed when we saw GameNGen some years ago, and now you can play it live with friends in real-time, on your couch or even online. Shared representations like our editable memory seem like the future for this type of experience. Models are, in some cases, expensive and approximate encoders, but great interpolators and extrapolators. Leveraging their strengths lets you have completely new experiences that can be realized now, not in the distant future.

This work was started at my previous team and continued in collaboration with Stanford. Congratulations to all for the discoveries.
32 replies · 79 reposts · 570 likes · 98.2K views
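A minimal sketch of how an "editable shared memory" can drive both level design and multiplayer, as I read the announcement (illustrative only, not the actual MultiGen model): a single minimap-like world memory conditions per-player frame generation, so editing the map redesigns the level, and all players decode views from the same shared state. The decoder below is a trivial MLP stand-in for a diffusion game engine.

```python
# Hedged sketch: one shared, editable world memory conditioning two players' frames.
import torch
import torch.nn as nn

class FrameDecoder(nn.Module):
    """Stand-in for one step of a generative game engine: renders a player's
    next frame latent from (shared minimap memory, that player's pose)."""
    def __init__(self, map_cells=16 * 16, pose_dim=4, frame_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(map_cells + pose_dim, 256),
                                 nn.ReLU(), nn.Linear(256, frame_dim))

    def forward(self, minimap, pose):
        return self.net(torch.cat([minimap.flatten(1), pose], dim=-1))

decoder = FrameDecoder()
minimap = torch.zeros(1, 16, 16)   # shared, editable world memory
minimap[0, 4:8, 4:8] = 1.0         # "level design": paint a room onto the map
poses = torch.randn(2, 4)          # two players at different poses

# Multiplayer: both players' frames are decoded from the SAME edited memory,
# so a minimap edit changes the world consistently for everyone.
frames = [decoder(minimap, pose.unsqueeze(0)) for pose in poses]
print(frames[0].shape, frames[1].shape)
```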
Vincent Sitzmann @vincesitzmann
If you're at WACV, catch my student Eric - he is absolutely brilliant :)
Eric Ming Chen @ericmchen1

If you're at @wacv_official tomorrow come see our poster on Snapmoji! It's our system to generate and animate 3D avatars of yourself. I'll be at poster location 51 from 4:00-5:45 on Sunday!

0 replies · 0 reposts · 34 likes · 5.4K views
Vincent Sitzmann @vincesitzmann
@natanielruizg Ha, also I didn't take a look at the author list at first - this was clearly a group effort! Really cool work, Nataniel!!
0 replies · 0 reposts · 0 likes · 108 views
Vincent Sitzmann @vincesitzmann
Very cool work from Gordon's group (where I did my PhD)! We have also been thinking about how to use video generative models for video games, and believe that while we have to re-think how editability, control, etc. work, there is space for entirely new workflows here!
Nataniel Ruiz @natanielruizg

[Quoted tweet: the MultiGen announcement, reproduced in full above.]
2 replies · 3 reposts · 92 likes · 12.1K views
Vincent Sitzmann @vincesitzmann
Evan worked closely on this with @RyuHyunwoooo, who had discovered the basis of this idea in a course project in our vision class!
1 reply · 0 reposts · 6 likes · 1.9K views
Vincent Sitzmann @vincesitzmann
Evan is an undergraduate researcher in my group, and in less than a year he put together a really cool paper on the scaling laws of novel view synthesis - surprisingly, he found an encoder-decoder model that actually scales *better* than a decoder-only LVSM model!
Evan Kim @evnkimm

How do you train compute-optimal novel view synthesis models? In our CVPR ‘26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling and careful ablations - and along the way train a new SoTA with 3x less compute. (1/n)

2 replies · 6 reposts · 191 likes · 21.3K views
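For context on what a "scaling laws" comparison like this involves, here is a toy sketch of the standard methodology: fit a saturating power law loss(C) = a * C^(-b) + L_inf to loss-vs-compute points for each model family and compare the exponents. All numbers below are made up for illustration; they are not the paper's data or its conclusions.

```python
# Hedged sketch: comparing scaling exponents of two model families (toy numbers).
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b, l_inf):
    """Saturating power law: loss = a * compute^(-b) + irreducible loss."""
    return a * c ** (-b) + l_inf

compute = np.array([1.0, 10.0, 100.0, 1000.0])        # relative training compute
loss_enc_dec  = np.array([0.40, 0.30, 0.23, 0.18])    # hypothetical encoder-decoder losses
loss_dec_only = np.array([0.38, 0.31, 0.26, 0.22])    # hypothetical decoder-only losses

for name, losses in [("encoder-decoder", loss_enc_dec),
                     ("decoder-only", loss_dec_only)]:
    (a, b, l_inf), _ = curve_fit(power_law, compute, losses, p0=(0.3, 0.3, 0.1))
    print(f"{name}: scaling exponent b = {b:.3f} (larger = scales better)")
```

A larger fitted exponent means loss falls faster per unit of extra compute, which is the sense in which one architecture can "scale better" than another even if both start at similar loss.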