Nick Stracke

38

Nurvai - The Data Layer for Physical AI@nurvai_ai·1d

Really interesting idea. Compressing motion instead of pixels feels like a much more natural representation for planning. Curious how well these motion embeddings hold up in long-horizon tasks where small errors compound. Do they remain stable, or do you still need periodic re-grounding from observations?

English

0

2

283

Nick Stracke@rmsnorm·1d

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

English

9

47

304

38.6K

Nick Stracke retweetledi

Simo Ryu@cloneofsimo·14h

Cool stuff

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

English

4

27

5.5K

Nick Stracke retweetledi

Nick Stracke@rmsnorm·1d

Stop predicting motion step-by-step. Model the whole motion in a compact representation for efficient planning. 📄 Paper: arxiv.org/abs/2604.11737 💻 Models: compvis.github.io/long-term-moti… Joint work with @KoljaBauer, @StefanABaumann, @itsbautistam, Josh Susskind, and Björn Ommer.

English

6

30

2.3K

Nick Stracke@rmsnorm·1d

@itsbautistam @KoljaBauer @CVPR Thanks Miguel!

English

1

73

Nick Stracke retweetledi

Miguel Angel Bautista@itsbautistam·1d

Amazing work led by @rmsnorm @KoljaBauer and our collaborators at LMU, to be presented at @CVPR! Personally, I find this question of "what's the right level of abstraction for planning in physical space?" to be very intriguing. Pixels over time are very low SNR (ie. the argument behind JEPA) but motion/trajectories carries a lot on information while being extremely compressible. I believe there's a lot more to uncover from this direction. Very glad to be part of this one!

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

English

3

10

1.3K

Nick Stracke retweetledi

Brian Roemmele@BrianRoemmele·1d

A massive step forward for AI video!

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

English

3

27

5K

Nick Stracke@rmsnorm·1d

@Frid45 Thanks! 🇪🇺 Supporting European research, I see 👀

English

4

357

Frid 🇪🇺🦌@Frid45·1d

@rmsnorm Very good thread 🫡!

English

0

4

426

Nick Stracke retweetledi

atharva ☆@k7agar·1d

I have been saying

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

English

Neerja Thakkar@neerjathakkar

1

23

5K

Nick Stracke@rmsnorm·1d

@KoljaBauer @StefanABaumann @itsbautistam 1️⃣x.com/neerjathakkar/… Also, shoutout to two other recent works that explore how to use point tracks for world modeling. 👇...

What’s the right representation for a world model? 3D, pixels, or something else? Excited to release our new paper “Forecasting Motion in the Wild” where we propose point tracks as tokens for generating complex non-rigid motion and behavior From @GoogleDeepmind @Berkeley_AI @TTIC_Connect

English

Stefan Baumann@StefanABaumann

4

20

1.7K

Nick Stracke@rmsnorm·1d

@KoljaBauer @StefanABaumann @itsbautistam 2️⃣x.com/StefanABaumann…

You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step. We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models. Myriad, accepted at @CVPR 2026

QME

2

15

867

Nick Stracke retweetledi

Kolja Bauer@KoljaBauer·1d

Do we really need pixel generation to model motion? 🤔 We show how directly representing motion in a compact space enables efficient, scalable planning. 10,000× faster than video models, enabling planning and reasoning in open-world and robotics settings. Check it out ⬇️

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

English

4

29

3.4K

Nick Stracke retweetledi

Stefan Baumann@StefanABaumann·2d

You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step. We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models. Myriad, accepted at @CVPR 2026

English

4

54

344

24.3K

Nick Stracke@rmsnorm·21 Oca

@dooartlabs This is also a really cool idea!

English

0

1

12

unsupervised@dooartlabs·20 Oca

@rmsnorm huh! this is really interesting i'm doing something similar, but it's more like generating the entire level at once based on the player's actions in the previous level

English