Vincent Sitzmann

905 posts

@vincesitzmann

Building AI that learns by interacting with the world. Assistant Professor @ MIT, leading the Scene Representation Group (https://t.co/h5gvhLYZj4).

Cambridge, Massachusetts · Joined February 2016
311 Following · 18.7K Followers
Pinned Tweet
Vincent Sitzmann @vincesitzmann
In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision, mapping imagery to intermediate representations (3D, flow, segmentation...), is about to go away. vincentsitzmann.com/blog/bitter_le…
43
164
1K
383.4K
Vincent Sitzmann @vincesitzmann
@keenanisalive @rms80 @jon_barron ...there could be so many upsides! So many things are difficult to do with meshes and game engines that could be easy with these models. It also seems that "speed" as an advantage ultimately is eroded by compute and architecture advances...?
1
0
5
414
Vincent Sitzmann @vincesitzmann
@keenanisalive @rms80 @jon_barron Ah sorry, I misunderstood! For content delivery, I am more amenable to 3D reps :) However, I don't think it's crazy that pixel-generating models may surprise us in the end. They need not be "general", they could be game-specific, interacting with a small, hard-coded game state...
1
0
9
467
Jon Barron @jon_barron
@rms80 We're still in the Crash Bandicoot stage right now, but it's hard to imagine the tech saturating at this level.
2
1
13
2.7K
Vincent Sitzmann @vincesitzmann
@keenanisalive @rms80 @jon_barron I think it is unlikely that self-driving will leverage explicit 3D even in the relatively short-term (5 years from now). Further, robotics has tried exactly that forever, and it never yielded generality or robustness. Why would it be different now?
1
0
6
723
Keenan Crane @keenanisalive
Correct. This is the usual pendulum of dogma, just like we saw with autonomous driving.
First, the claim is everything is going to work end-to-end. The pure video era. Just believe in the bitter lesson, and wait 18 months. We promise.
When that fails, we go hybrid. All we need is a little bit of 3D. But honestly this is just to boost performance of a model that we still believe would have worked with enough scaling. Just give us another 18 months.
Then investors get impatient, and the smaller startups close up shop or get acqui-hired. 18 months pass. The survivors keep grinding, and stop caring about the dogma of how things get built, focusing more on delivering something of value, by any means. The Waymo era.
My prediction: the Waymo era of video-based (and even 3D Gaussian-based) world models will be to simply distill the output of these models back down to conventional, explicit 3D representations like animated meshes and textures, with a few splats here and there when you really need them. That’s how you deliver at scale, and for any specific use case, you don’t need to be running the full foundation model in real time.
5
15
95
8.5K
Pulkit Agrawal @pulkitology
Eka means unity -- “one,” in Sanskrit and “first” in Finnish. We’re building intelligence for the physical world in its native language: forces. Until now, robotics faced a tradeoff — generality or speed. The real world requires both. Robotics also faced a data problem. Our Vision–Force–Action (VFA) model — the first of its kind — breaks the generality-speed tradeoff and the data barrier. It's a new foundation uniting performance, generality, and safety for putting capable robots in everyone's hands. Today, I am excited to share our journey of pushing robots beyond human limits. Today, dexterity becomes scalable. Today, I welcome you to the Era of Eka. Co-founded with @haarnoja, and so thrilled and grateful to be working with a dream team at @EkaRobotics. Learn more: ekarobotics.com
65
221
2K
315.6K
Vincent Sitzmann retweeted
Jon Barron @jon_barron
I’m biased of course, but I’m particularly pleased with the depth <-> RGB bijection we came up with for monocular depth estimation, which arose from thinking about space filling curves and the power transform work that fell out of Zip-NeRF. And how cool is this figure?
Jon Barron tweet media
7
9
174
10.2K
Jon Barron @jon_barron
We have an important result to share: if you reduce multiple dense vision tasks into a single RGB-image-prediction task, fine-tuning a strong image generator (in our case Nano Banana Pro) matches or beats all specialized models for monodepth, normals, and semantic segmentation.
Songyou Peng @songyoupeng

Yay, finally! Introducing Vision Banana🍌 from @GoogleDeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: vision-banana.github.io (1/5)

11
45
488
66.4K
Seungwook Han @seungwookh
Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)
Seungwook Han tweet media
47
261
1.7K
253.7K
Vincent Sitzmann @vincesitzmann
I.e., can we analyze what the neural network learns as a function of the dataset statistics? This paper is one such example, and it's really fun :)
0
1
5
875
Vincent Sitzmann @vincesitzmann
There are lots of open questions in the quest for understanding generative models and diffusion models. One of our core lessons is that "mechanistic" interpretability is difficult; instead, I would advocate for "information-theoretic" interpretability... (3/n)
1
1
7
1.6K
Vincent Sitzmann retweeted
Eric Chan @ericryanchan
Today, we announce our team’s progress in pursuing a different type of foundation model for robotics: the Direct Video Action Model (DVA), our best attempt to take robotics and turn it into a generative modeling problem we can scale. Technical blog: rhoda.ai/research/direc…
12
26
197
20.2K