
Video diffusion models learn motion indirectly, through pixels. But motion itself is far lower-dimensional. We introduce 64× temporally compressed motion embeddings that capture scene dynamics directly. This enables efficient planning: 10,000× faster than video models. 🧵👇
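To make "64× temporal compression" concrete, here's a minimal sketch: chunk a per-frame feature sequence into 64-frame windows and project each window to one motion embedding. All names and sizes here are illustrative assumptions, and a random matrix stands in for whatever learned encoder the actual model uses.

```python
import numpy as np

# Illustrative sketch only: compress a T-frame feature sequence 64x along
# time by grouping frames into windows of 64 and projecting each window to
# a single embedding vector. A real model would learn this projection;
# here W is just random weights.
rng = np.random.default_rng(0)

T, feat_dim, emb_dim = 256, 128, 64          # hypothetical sizes
frames = rng.standard_normal((T, feat_dim))  # per-frame features

W = rng.standard_normal((64 * feat_dim, emb_dim))  # stand-in projection
windows = frames.reshape(T // 64, 64 * feat_dim)   # 64-frame chunks
motion_emb = windows @ W                           # (4, 64)

print(motion_emb.shape)  # 256 frames -> 4 motion embeddings: 64x shorter
```

A planner that searches over these 4 embedding steps instead of 256 frames of pixels is where the claimed speedup would come from.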
