jo.schb

29 posts


@jo_schb

PhD student @ CompVis Group, LMU Munich

Joined January 2021
497 Following · 114 Followers
Pinned Tweet
jo.schb@jo_schb·
🤔 What if you could generate an entire image using just one continuous token? 💡 It works if we leverage a self-supervised representation! Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵👇
[image]
8 replies · 23 reposts · 109 likes · 16.9K views
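The single-token idea in the pinned tweet can be sketched in a few lines. This is a toy illustration only, not RepTok's actual code: every shape, name, and the linear encoder/decoder stand-ins are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the RepTok idea (hypothetical shapes, not the paper's code):
# an entire image is summarized by ONE continuous latent vector ("token"),
# and a decoder maps that single token back to pixel space.
H, W, C = 16, 16, 3      # tiny toy image
d_latent = 32            # dimensionality of the single continuous token

def encode(image, W_enc):
    """Collapse the whole image into one d_latent-dim token
    (a linear stand-in for an SSL encoder plus pooling)."""
    return image.reshape(-1) @ W_enc          # (H*W*C,) @ (H*W*C, d) -> (d,)

def decode(token, W_dec):
    """Map the single token back to an image."""
    return (token @ W_dec).reshape(H, W, C)   # (d,) @ (d, H*W*C) -> image

W_enc = rng.normal(size=(H * W * C, d_latent)) / np.sqrt(H * W * C)
W_dec = rng.normal(size=(d_latent, H * W * C)) / np.sqrt(d_latent)

img = rng.random((H, W, C))
z = encode(img, W_enc)        # the "one continuous token"
recon = decode(z, W_dec)

print(z.shape)      # (32,): the whole image as a single latent vector
print(recon.shape)  # (16, 16, 3)
```

The point of the sketch is the bottleneck shape: generation then only needs to model a single d-dimensional vector per image, which is what makes very cheap (e.g. MLP-based) generators plausible.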
jo.schb retweeted
Nick Stracke@rmsnorm·
Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇
9 replies · 48 reposts · 315 likes · 40.6K views
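The "64× temporally compressed" claim above can be illustrated with a toy pooling sketch. All numbers and the mean-pooling choice are assumptions for illustration; the actual embeddings are learned, not pooled.

```python
import numpy as np

# Hypothetical sketch of 64x temporal compression of motion features:
# T per-frame motion vectors are collapsed into T/64 embeddings.
T, d = 256, 8          # 256 frames, 8-dim motion feature per frame (toy sizes)
compression = 64

motion = np.random.default_rng(1).random((T, d))

# Group frames into windows of 64 and pool each window into one embedding.
embeddings = motion.reshape(T // compression, compression, d).mean(axis=1)

print(embeddings.shape)  # (4, 8): 256 frames -> 4 embeddings, 64x fewer steps
```

A planner operating on 4 embeddings instead of 256 frames takes far fewer sequential steps, which is the intuition behind the claimed speedup over rolling out a full video model.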
jo.schb retweeted
Kosta Derpanis@CSProfKGD·
Going to miss lunch times hanging out with the mensa-table tennis crew 🏓
[image]
2 replies · 1 repost · 28 likes · 2.1K views
jo.schb retweeted
Tao HU@vtaohu·
One of the best ways to spot new research trends is to look at which papers get cited the fastest. I recently found rleak.com, which tracks citation rankings across top conferences like AAAI. I also found: DepthFM ranks #7 among the most-cited AAAI papers in 3k🚀
[image]
0 replies · 1 repost · 6 likes · 811 views
jo.schb@jo_schb·
@ma_sc_ We will release our training and inference code soon :)
0 replies · 0 reposts · 2 likes · 145 views
jo.schb retweeted
Pingchuan Ma@PingchuanMa4·
I'm happy to share that I’ll be presenting two first-authored papers at #ICCV2025 🌺 in Honolulu, together with @MingGui725184! 🏝️ (Thread 🧵👇)
1 reply · 7 reposts · 9 likes · 1.1K views
jo.schb retweeted
Miguel Angel Bautista@itsbautistam·
There has been quite a lot of talk recently about SSL representations in generative models. IMHO if you are training an image generative model in latent space you should aim for as much compute efficiency as possible (otherwise what's the point?). The amazing @jo_schb and @MingGui725184 + collaborators at LMU have really cracked this problem with RepTok, please check the thread! A common drawback of most works in this direction (even the most recent ones) is that they show viability for ImageNet only, which has its issues (especially if using DINOv2 features). @jo_schb and @MingGui725184 found that RepTok allows you to compress images so much that you can use a pure MLP-based architecture for the more general T2I problem setting, obtaining really good results while drastically reducing training compute. I am super grateful to have had the chance to advise the team on this one!
jo.schb@jo_schb

🤔 What if you could generate an entire image using just one continuous token? 💡 It works if we leverage a self-supervised representation! Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵👇

0 replies · 4 reposts · 21 likes · 3.9K views
jo.schb retweeted
Stefan Baumann@StefanABaumann·
🤔 What happens when you poke a scene — and your model has to predict how the world moves in response? We built the Flow Poke Transformer (FPT) to model multi-modal scene dynamics from sparse interactions. It learns to predict the 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of motion itself 🧵👇
Stefan Baumann tweet media
[image]
5 replies · 15 reposts · 38 likes · 6.4K views
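The key phrase in the FPT tweet is predicting the *distribution* of motion rather than a single flow. A toy sketch of that output format, with every function name, shape, and the two-component mixture being pure assumptions for illustration:

```python
import numpy as np

# Toy illustration of multi-modal motion prediction from a sparse "poke"
# (hypothetical stand-in, not the Flow Poke Transformer itself):
# the model returns a mixture over 2D motion outcomes, not one vector.
rng = np.random.default_rng(2)

def predict_motion_distribution(poke_xy):
    """Return mixture weights and means over possible 2D motions.
    Two toy outcomes: the point follows the poke, or it stays put."""
    means = np.array([poke_xy, [0.0, 0.0]])   # (2, 2)
    weights = np.array([0.7, 0.3])            # mixture probabilities
    return weights, means

def sample_motion(weights, means, n=5):
    """Draw n motion samples from the predicted mixture."""
    idx = rng.choice(len(weights), size=n, p=weights)
    return means[idx] + 0.05 * rng.normal(size=(n, 2))  # add small spread

w, m = predict_motion_distribution(np.array([1.0, 0.5]))
samples = sample_motion(w, m)
print(samples.shape)  # (5, 2): several distinct motion outcomes per poke
```

Returning a distribution lets the same poke yield qualitatively different futures, which a single regressed flow field cannot represent.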