

Yuren Cong

@CongYuren
Research Scientist @Meta | exploring multimodal GenAI systems🤖

2/🚀 How: We first derive Tuna-R, a pixel-space model that relies solely on a representation encoder. Tuna-2 then streamlines the design further, dropping the representation encoder entirely and feeding raw image pixels through direct patch embedding layers.
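A rough sketch of what a direct patch embedding layer looks like in general (PyTorch; the class name, patch size, and hidden width are assumptions for illustration, not the released Tuna-2 code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Embed raw pixels directly into transformer tokens, with no VAE or
    representation encoder in front. Illustrative sketch; hyperparameters
    are assumptions, not the actual Tuna-2 configuration."""
    def __init__(self, patch_size=16, in_channels=3, dim=1024):
        super().__init__()
        # A strided convolution splits the image into non-overlapping patches
        # and linearly projects each patch to the model width.
        self.proj = nn.Conv2d(in_channels, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels):               # pixels: (B, 3, H, W)
        x = self.proj(pixels)                # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```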


1/🚀 Excited to announce Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation! We built an omni model that uses direct patch embedding layers on raw image inputs and achieves SOTA in multimodal understanding AND generation. Paper: huggingface.co/papers/2604.24… Code: github.com/facebookresear… Thanks to all the co-authors! @__Johanan, @wmren993, @xiaoke_shawn_h, @ShoufaChen, @TianhongLi6, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, @WenhuChen, Ping Luo, @LukeZettlemoyer!



Glad to share we have 3 papers accepted to #CVPR2026🥳🥳🥳: (1/3) TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models arxiv.org/abs/2512.02014


🔥 We present Tuna, a native unified multimodal model built on a unified continuous visual representation, enabling diverse multimodal understanding and generation capabilities (T2I/I2T/I2I/T2V/V2T):
1️⃣ Unified Representation: we build an effective unified representation space by cascading a VAE encoder with a representation encoder (sketch below).
2️⃣ Mutual Benefit: within this unified representation space, joint training enables understanding and generation to mutually benefit each other.
3️⃣ Representation Encoders Matter: stronger representation encoders consistently yield better performance across all multimodal tasks.
🌟 Our method generalizes beyond the image domain to the video domain as well. Notably, TUNA-1.5B demonstrates outstanding performance on both video generation and understanding tasks.
🐟 Project page: tuna-ai.org
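A minimal sketch of the cascaded encoding described in 1️⃣, assuming generic stand-in vae_encoder and repr_encoder modules (PyTorch; the class name, shapes, and the toy instantiation are illustrative, not the released TUNA code):

```python
import torch
import torch.nn as nn

class UnifiedVisualEncoder(nn.Module):
    """Illustrative cascade: pixels -> VAE latents -> representation encoder.
    Both sub-encoders are stand-ins; the real model's components differ."""
    def __init__(self, vae_encoder: nn.Module, repr_encoder: nn.Module):
        super().__init__()
        self.vae_encoder = vae_encoder    # compresses pixels to continuous latents
        self.repr_encoder = repr_encoder  # maps latents to unified tokens

    def forward(self, pixels):                 # pixels: (B, 3, H, W)
        latents = self.vae_encoder(pixels)     # e.g. (B, C_lat, H/8, W/8)
        return self.repr_encoder(latents)      # tokens shared by und. and gen.

# Toy instantiation with stand-in encoders, just to show the data flow.
enc = UnifiedVisualEncoder(nn.Conv2d(3, 16, 8, 8), nn.Flatten(2))
print(enc(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 16, 1024])
```

The point of the cascade is that a single continuous token sequence feeds both the understanding and the generation sides, which is what lets joint training benefit both directions.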