Yuren Cong
@CongYuren

51 posts

Research Scientist @Meta | exploring multimodal GenAI systems🤖

London · Joined February 2021
111 Following · 79 Followers
Pinned Tweet
Yuren Cong @CongYuren ·
1/🚀 Excited to announce Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation! We built an omni model that uses direct patch embedding layers for raw image inputs and achieves SOTA in multimodal understanding AND generation. Paper: huggingface.co/papers/2604.24… Code: github.com/facebookresear… Thanks to all the co-authors! @__Johanan, @wmren993, @xiaoke_shawn_h, @ShoufaChen, @TianhongLi6, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, @WenhuChen, Ping Luo, @LukeZettlemoyer!
[image]
11 replies · 11 reposts · 88 likes · 84.8K views
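A minimal sketch of what such a direct patch embedding layer typically looks like, assuming the standard ViT-style patchify-and-project design; module names and sizes here are illustrative guesses, not taken from the Tuna-2 code:

```python
# Hypothetical sketch of a direct patch embedding input layer: raw pixels are
# split into patches and linearly projected to transformer tokens, with no
# pretrained vision encoder in front. Sizes are illustrative, not Tuna-2's.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, dim=1024):
        super().__init__()
        # A strided conv is equivalent to cutting the image into
        # non-overlapping patches and applying one shared linear layer.
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels):                   # (B, 3, H, W), raw pixels
        x = self.proj(pixels)                    # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))   # -> (1, 256, 1024)
```

Per the replies below in the thread, layers of this kind are initialized and trained from scratch rather than bootstrapped from a pretrained tokenizer.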
Yuren Cong @CongYuren ·
@ducha_aiki I think it's worth trying, but I wouldn't like to make a rough guess🤔 What is sure is that you need more patience (data)
0 replies · 0 reposts · 0 likes · 34 views
Dmytro Mishkin 🇺🇦 @ducha_aiki ·
@CongYuren I am more asking if I replace DINO in VGGT with this model, would it be better at point cloud and camera prediction. Or using it inside Roma v2 for image matching:)
2 replies · 0 reposts · 3 likes · 145 views
Klemen Kotar @KlemenKotar ·
@CongYuren This is really cool! Did you experiment with the size of the flow matching pixel prediction head? Is there a noticeable improvement in prediction quality as you scale it?
1 reply · 0 reposts · 0 likes · 108 views
Yuren Cong @CongYuren ·
@_Suresh2 These simple layers are initialized and trained from scratch.
0 replies · 0 reposts · 0 likes · 121 views
Suresh @_Suresh2 ·
@CongYuren are the patch embeddings trained from scratch, or bootstrapped from a pretrained tokenizer?
1 reply · 0 reposts · 0 likes · 135 views
Yuren Cong @CongYuren ·
3/🚀 Why does this matter: pretrained vision encoders may be limiting your multimodal model. By using patch embedding layers, Tuna-2 enables fully end-to-end optimization from raw pixels and can reach a higher upper limit! Just need a bit of patience...
[image]
0 replies · 0 reposts · 1 like · 259 views
Yuren Cong @CongYuren ·
2/🚀 How: We first derive Tuna-R, a pixel-space model that relies solely on a representation encoder. Tuna-2 then streamlines the design by bypassing the representation encoder entirely and using direct patch embedding layers for raw image inputs.
[image]
1 reply · 0 reposts · 1 like · 293 views
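A rough sketch of the design shift described in 2/ and 3/, assuming both variants feed the same multimodal backbone; the function names and interfaces below are hypothetical, not from the paper:

```python
# Hypothetical contrast between the two input designs described above.
import torch.nn as nn

def tuna_r_input_stack(repr_encoder: nn.Module) -> nn.Module:
    # Tuna-R-style: raw pixels first pass through a representation
    # encoder (e.g. a pretrained ViT) before the multimodal backbone,
    # so the encoder's representation caps what the backbone can see.
    return repr_encoder

def tuna_2_input_stack(dim: int = 1024, patch: int = 16) -> nn.Module:
    # Tuna-2-style: the representation encoder is bypassed entirely;
    # a single trainable projection maps raw pixel patches straight to
    # backbone tokens, so gradients from the task loss reach the very
    # first layer and the whole stack is optimized end-to-end.
    return nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
```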
Yuren Cong @CongYuren ·
@ducha_aiki If you were asking whether using the patch embedding works better than using DINO, then yes:)
1 reply · 0 reposts · 0 likes · 249 views
Ziwei Liu @liuziwei7 ·
🚫 No Vision Encoder (VE)
🚫 No Variational Autoencoder (VAE)
✅ Just one end-to-end model directly engages with native signals, pixels and words, for both understanding and generation.
💊NEO-unify💊 is the first step toward **truly end-to-end unified models**, learning directly from near-lossless inputs via a representation space shaped by the model itself.
[image]
11 replies · 84 reposts · 589 likes · 72.2K views
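A toy sketch of the "one end-to-end model on native signals" idea from this tweet: text tokens and raw pixel patches are embedded into a single shared sequence for one transformer. All names and sizes are illustrative, not from the NEO codebase:

```python
# Hypothetical sketch: one backbone consumes text tokens and raw pixel
# patches as a single sequence, with no vision encoder and no VAE.
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    def __init__(self, vocab=32000, dim=1024, patch=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)            # words
        self.pixel_embed = nn.Conv2d(3, dim,                  # pixels
                                     kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, pixels):
        t = self.text_embed(text_ids)                             # (B, T, dim)
        v = self.pixel_embed(pixels).flatten(2).transpose(1, 2)   # (B, P, dim)
        # One shared sequence: the representation space is shaped by
        # the model itself rather than by a frozen external encoder.
        return self.backbone(torch.cat([t, v], dim=1))

out = UnifiedBackbone()(torch.randint(0, 32000, (1, 8)),
                        torch.randn(1, 3, 64, 64))            # (1, 24, 1024)
```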
Yuren Cong @CongYuren ·
Glad to share we have 3 papers accepted to #CVPR2026🥳🥳🥳: (1/3) TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models arxiv.org/abs/2512.02014
0 replies · 0 reposts · 1 like · 292 views
Yuren Cong @CongYuren ·
@giffmana @cloneofsimo Great insight! 🧐 If we extend this to multimodal learning, do you think initializing the modality encoders randomly would also bring benefits?
0 replies · 0 reposts · 0 likes · 62 views
Lucas Beyer (bl16) @giffmana ·
Not saying that's what they see, but if you have data and compute, in the long run, from scratch always wins. Here is one striking example from our distillation paper; blue is random init, yellow is init with a SOTA model *on the same target task*. And I've seen more variants of this over the years.
[image]
21 replies · 29 reposts · 517 likes · 131.3K views
Simo Ryu @cloneofsimo ·
SAM3 was pretrained from scratch? I wonder why they didn't init from dinov3? I have a hypothesis that SSL-pretrained models are actively hurtful when you have a large compute budget; I wonder if that's what they saw
8 replies · 6 reposts · 215 likes · 28.5K views
Yuren Cong retweeted
Alexandr Wang @alexandr_wang ·
Excited to announce that @ManusAI has joined Meta to help us build amazing AI products! The Manus team in Singapore is world-class at exploring the capability overhang of today's models to scaffold powerful agents. Looking forward to working with you, @Red_Xiao_!
396 replies · 428 reposts · 5.5K likes · 2.4M views
Yuren Cong @CongYuren ·
6/📷 Attribute-Centric T2I: attribute-aware compositional T2I generation. Better control, better disentanglement. Also done during my PhD 📌 accepted at IJCV. link.springer.com/article/10.100…
0 replies · 0 reposts · 2 likes · 58 views
Yuren Cong @CongYuren ·
✨ My 2025 Research Wrap: pushing boundaries in multimodal research, thanks for every collaboration! 1/🐟 TUNA: unified representation for generation/understanding (image + video). No encoder mismatch! See more at tuna-ai.org! #GenAI #multimodal #LLMs
[image]
1 reply · 0 reposts · 2 likes · 126 views