
Video diffusion models learn motion indirectly, through pixels. But motion itself is far lower-dimensional. We introduce 64× temporally compressed motion embeddings that capture scene dynamics directly. This enables efficient planning: 10,000× faster than video models. 🧵👇
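To make "64× temporal compression" concrete, here's a minimal sketch: chunk a per-frame feature sequence into 64-frame windows and project each window to one motion embedding. All names and sizes here are illustrative assumptions, and a random matrix stands in for whatever learned encoder the actual model uses.

```python
import numpy as np

# Illustrative sketch only: compress a T-frame feature sequence 64x along
# time by grouping frames into windows of 64 and projecting each window to
# a single embedding vector. A real model would learn this projection;
# here W is just random weights.
rng = np.random.default_rng(0)

T, feat_dim, emb_dim = 256, 128, 64          # hypothetical sizes
frames = rng.standard_normal((T, feat_dim))  # per-frame features

W = rng.standard_normal((64 * feat_dim, emb_dim))  # stand-in projection
windows = frames.reshape(T // 64, 64 * feat_dim)   # 64-frame chunks
motion_emb = windows @ W                           # (4, 64)

print(motion_emb.shape)  # 256 frames -> 4 motion embeddings: 64x shorter
```

A planner that searches over these 4 embedding steps instead of 256 frames of pixels is where the claimed speedup would come from.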
