Bowei Chen

Too many REPA / RAE / representation alignment papers lately? I was lost too, so I wrote a blog post that organizes the space into phases and zooms in on what actually matters for general/molecular ML. Curious what folks think - link below! 🔗 Blog: kdidi.netlify.app/blog/ml/2025-1…

1

84

Vivek Jayaram@vivjay30·12 Şub

Overdue life update: I recently joined @sesame where I lead AI safety for the real-time conversational systems! Smart glasses + voice is the future. After trying Sesame’s upcoming glasses, I was blown away. It’s also the most realistic conversational AI I’ve seen. Real-time voice AI introduces entirely new safety problems and I'm glad to be focused on making our AI safe and aligned. We're hiring like crazy, so if you're interested in conversational voice systems or safety research then reach out!

English

5

0

14

679

Bowei Chen@bowei_chen_19·13 Şub

Nice blog, highly recommend!

Kieran Didi@DidiKieran

English

5

386

Bowei Chen retweetledi

Jingwei Ma@JingweiMa2·11 Ara

Excited to present UltraZoom at SIGGRAPH Asia next Tuesday (Dec.16)! UltraZoom converts sparse phone captures of an object into a single gigapixel-resolution image that you can seamlessly explore. Threads below. Website: ultra-zoom.github.io Paper: arxiv.org/abs/2506.13756

English

3

12

1K

Bowei Chen retweetledi

Hansheng Chen@HanshengCh·17 Eki

Excited to announce a new track of accelerating Generative AI: pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation github.com/Lakonik/piFlow Distill 20B flow models now using just an L2 loss via imitation learning for SOTA diversity and teacher-aligned quality.

English

27

155

36K

Bowei Chen@bowei_chen_19·15 Eki

The Representation Autoencoders (RAE) by @sainingxie's team is fascinating — a brilliant demonstration that high-dimensional diffusion is indeed feasible. In our latest work on semantic encoders, we align a pretrained foundation encoder (e.g., DINOv2) as a visual tokenizer, achieving better reconstruction quality while preserving semantic consistency. Instead of freezing the encoder, we introduce a semantics-preserving fine-tuning strategy that significantly improves reconstruction quality. I can see great potential in combining RAE with our approach to build semantically rich tokenizers with large channel dimension and strong reconstruction fidelity.

Bowei Chen@bowei_chen_19

We found that visual foundation encoder can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.

English

20

228

23.6K

Bowei Chen@bowei_chen_19·15 Eki

@SwayStar123 @sainingxie Yes! I can see great potential in combining RAE with our approach to build semantically rich tokenizers with large channel dimension and strong reconstruction fidelity (we fine-tuned the encoder for better reconstruction).

English

1

35

sway@SwayStar123·14 Eki

@sainingxie Very similar/related work btw: x.com/bowei_chen_19/…

Bowei Chen@bowei_chen_19

We found that visual foundation encoder can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.

English

0

14

5.2K

Saining Xie@sainingxie·14 Eki

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

English

57

329

1.9K

413.9K

Bowei Chen@bowei_chen_19·3 Eki

@Jacoed Yes, this is shown in both our work and previous work like VA-VAE.

English

0

1

18

Ed@Jacoed·3 Eki

@bowei_chen_19 "hence better diffusability" are we sure better semantic grounding implies better diffusability ?

English

0

66

Bowei Chen@bowei_chen_19·30 Eyl

We found that visual foundation encoder can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.

English

7

71

523

80.7K

Bowei Chen@bowei_chen_19·30 Eyl

We hope our findings inspire a rethinking of tokenizer design in generative modeling. 🙏 Huge shoutout to my amazing co-authors and collaborators @KaiZhang9546,@Sai__Bi,@HaoTan5,@zhanghesprinter ,@tianyuanzhang99 ,@zhengqi_li, @bitxiong,@jianming_zhang_. [9/N]

English

1

14

1.1K

Bowei Chen@bowei_chen_19·30 Eyl

On LAION 2B dataset, we train a text-to-image diffusion model on our tokenizer, which converges faster and surpasses the FLUX-VAE baseline. Check out more details and results in our paper! [8/N]

English

13

1.2K

Bowei Chen@bowei_chen_19·20 Haz

@somebobcat8327 Thanks!

English

1

32

Bobcat@somebobcat8327·20 Haz

@bowei_chen_19 Tremendous idea

Français

0

53

Bowei Chen@bowei_chen_19·15 Haz

#CVPR2024 Arm-captured selfies only capture your partial body. Instead, what if you could capture a full-body photo that someone else would take of you in the scene? We present Total Selfie, which generates full-body selfies from photographs originally taken at arms length. 1/n

English

7

1.9K

Bowei Chen@bowei_chen_19·20 Haz

We will be presenting Total Selfie at Arch 4A-E #185 this afternoon. Come and talk with us!

Bowei Chen@bowei_chen_19

#CVPR2024 Arm-captured selfies only capture your partial body. Instead, what if you could capture a full-body photo that someone else would take of you in the scene? We present Total Selfie, which generates full-body selfies from photographs originally taken at arms length. 1/n

English