

Vincent Sitzmann
@vincesitzmann
Teaching AI to model, see, and interact with our world. Assistant Professor @ MIT, leading the Scene Representation Group (https://t.co/h5gvhLYrtw).


Why do diffusion models produce new images instead of just memorizing the dataset? We show that they learn pixel correlation patterns from the data and therefore denoise locally, which promotes generalization. To test this idea, we compare trained diffusion models with a training-free algorithm that mixes local patches from the dataset. Surprisingly, this simple procedure already reproduces many properties of the trained models. 🧵 Check out this thread for more details about our Spotlight NeurIPS paper with @yuancy, @JustinMSolomon and @vincesitzmann.
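A minimal sketch of what such a training-free patch-mixing baseline could look like, assuming plain L2 nearest-neighbour matching between patches (the function name and matching rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def patch_mixture_denoise(noisy, dataset, patch=8, stride=4):
    """Training-free local "denoiser" (illustrative sketch).

    For each patch of the noisy image, substitute the nearest-neighbour
    patch drawn from the dataset, then blend overlapping patches by
    averaging. Everything happens locally: no model, no training.
    """
    H, W = noisy.shape
    # Collect all dataset patches into one bank (assumes same-size images).
    bank = []
    for img in dataset:
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                bank.append(img[y:y + patch, x:x + patch].ravel())
    bank = np.stack(bank)  # shape (N, patch*patch)

    out = np.zeros_like(noisy, dtype=float)
    weight = np.zeros_like(noisy, dtype=float)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            q = noisy[y:y + patch, x:x + patch].ravel()
            # Nearest dataset patch in L2 distance -- purely local matching.
            idx = np.argmin(((bank - q) ** 2).sum(axis=1))
            out[y:y + patch, x:x + patch] += bank[idx].reshape(patch, patch)
            weight[y:y + patch, x:x + patch] += 1.0
    return out / np.maximum(weight, 1.0)
```

Because every patch is matched independently, the output mixes patches from many different training images rather than reproducing any single one, which is the intuition for why local denoising promotes generalization over memorization.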


Because we support long-context visual memory, our robots can learn on the fly. Show the robot a single human demonstration, and it understands both the intent and the motion. It can even extrapolate to novel objects and environments it's never seen before. 🧺✍️

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA).


If you're at @wacv_official tomorrow come see our poster on Snapmoji! It's our system to generate and animate 3D avatars of yourself. I'll be at poster location 51 from 4:00-5:45 on Sunday!


Excited to share some surprising inventions in generative multiplayer games we made at Google with Stanford. We call the work MultiGen.

I've always been inspired by early studios like id Software with Doom, or Blizzard with Warcraft, bringing networked video games to the next level. We are at the point in history where we can make strides like theirs, but for generative games. It's a strange feeling to be in the age of generative video games while still discovering exactly how to train the models and design the tools that make them useful.

All of the tools invented for classic game engines need to be redesigned for generative games. For example, level and world design is not fully possible with existing technology. We introduce editable memory for diffusion game engines, which allows new levels to be designed via a minimap, and we can easily imagine expanding this with other creation tools. The end goal of this research direction is to let game designers guide the generation process of their world at whatever granularity they prefer.

Editable memory also lets us add multiplayer to generative Doom. We were amazed when we saw GameNGen some years ago, and now you can play it live with friends in real time, on your couch or even online. Shared representations like our editable memory seem like the future for this type of experience: models are, in some cases, expensive and approximate encoders, but great interpolators and extrapolators, and leveraging their strengths enables completely new experiences that can be realized now rather than in the distant future.

This work was started at my previous team and continued in collaboration with Stanford. Congratulations to all for the discoveries.
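One way to picture a minimap-editable memory is as a shared spatial grid of features that a generator conditions on. This is a hypothetical sketch under my own assumptions (class and method names are invented for illustration, not the MultiGen implementation):

```python
import numpy as np

class EditableMemory:
    """Toy editable spatial memory keyed by minimap cells (illustrative).

    A generative game engine could condition each frame on the cells
    around the player; a designer edits cells directly via the minimap.
    """

    def __init__(self, grid=(64, 64), dim=16):
        # One feature vector per minimap cell.
        self.cells = np.zeros((*grid, dim), dtype=np.float32)

    def paint(self, y, x, feature):
        """Designer edit: write a feature vector into one minimap cell."""
        self.cells[y, x] = feature

    def local_context(self, y, x, radius=2):
        """Context window around a player, to be fed to the generator."""
        ys = slice(max(0, y - radius), y + radius + 1)
        xs = slice(max(0, x - radius), x + radius + 1)
        return self.cells[ys, xs]
```

Because the memory is a plain shared structure, two players' generators can read the same cells, which is what makes a consistent shared world plausible in a generative engine.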

4) Relative attention scales! Lastly, we found that the new relative attention schemes (PRoPE, GTA) bring significant scaling benefits in many-view settings (>= 4 views): they change the slope of the Pareto frontier for both model families. (6/n)
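The core idea behind relative attention schemes, in the spirit of rotary-style encodings, is that the attention logit should depend only on the *relative* offset between two elements, never their absolute positions. A toy 2-D illustration of that invariance (my own sketch, not PRoPE or GTA themselves):

```python
import numpy as np

def rotate(v, theta):
    # 2-D rotation, the basic building block of rotary-style encodings.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

def rel_score(q, k, pos_q, pos_k, freq=1.0):
    """Attention logit that depends only on pos_k - pos_q.

    Rotating q and k each by their own position makes the dot product
    a function of the relative offset alone: R(a)q . R(b)k = q . R(b-a)k.
    """
    return rotate(q, freq * pos_q) @ rotate(k, freq * pos_k)
```

Shifting both positions by the same amount leaves the score unchanged, which is the property that lets such schemes generalize across absolute poses and view counts.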


How do you train compute-optimal novel view synthesis models? In our CVPR '26 paper Scaling View Synthesis Transformers, we uncover key design choices through scaling laws and careful ablations, and along the way train a new SoTA model with 3x less compute. (1/n)