Xichen Pan
@xichen_pan
50 posts

PhD Student @NYU_Courant, Researcher @metaai; Multimodal Generation | Prev: @MSFTResearch, @AlibabaGroup, @sjtu1896; More at https://t.co/yyS8q316AV

New York, USA · Joined August 2022
529 Following · 684 Followers
Pinned Tweet
Xichen Pan@xichen_pan·
We find that training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. The MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
Xichen Pan tweet media
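The frozen-MLLM recipe described above (later released as MetaQuery) can be sketched in a few lines: a small set of learnable query tokens attends to the hidden states of a frozen MLLM, and a trainable connector maps the result into the conditioning space of a diffusion decoder. The shapes, the single attention layer, and the NumPy stand-ins below are illustrative assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16     # hidden size (illustrative)
n_q = 4    # number of learnable query tokens (illustrative)

# Stand-in for a pretrained multimodal LLM. In the recipe described in the
# tweet, this part stays FROZEN: it only produces hidden states for the prompt.
W_mllm = rng.standard_normal((d, d)) / np.sqrt(d)

def mllm_hidden_states(prompt_emb):
    return np.tanh(prompt_emb @ W_mllm)  # (seq, d), no gradient flows here

# Trainable parts: the query tokens and a connector mapping the attended
# output into the diffusion decoder's conditioning space.
queries = rng.standard_normal((n_q, d)) * 0.02
W_connector = rng.standard_normal((d, d)) * 0.02

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditioning(prompt_emb):
    h = mllm_hidden_states(prompt_emb)          # frozen features
    attn = softmax(queries @ h.T / np.sqrt(d))  # queries attend to MLLM states
    return (attn @ h) @ W_connector             # (n_q, d) condition tokens

prompt = rng.standard_normal((10, d))  # toy prompt embeddings
cond = conditioning(prompt)
print(cond.shape)  # (4, 16)
```

Only `queries` and `W_connector` would receive gradients; the point of the tweet is that this is enough to transfer the MLLM's capabilities to pixel output.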
Xichen Pan@xichen_pan·
There has been a lot of debate around the choice of denoising space. But it’s hard to get both semantics/diffusability and strong low-level reconstruction at the same time. REPA and VA-VAE are great explorations of adding semantics into the VAE space. After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe. The final results show improvements over vanilla JiT and outperform simply applying REPA. Thanks @hanlin_hl for leading this project!
Han Lin@hanlin_hl

🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO). We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs. 🧵 👇
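Of the recipe points above, the RMS-based feature rescaling is the simplest to illustrate: pretrained semantic features (e.g., from DINO) usually sit at a very different magnitude than pixels, so they are rescaled to a target RMS before being denoised jointly. The target value and shapes below are made up for illustration; this is a minimal sketch, not V-Co's actual calibration code:

```python
import numpy as np

def rms(x):
    """Root-mean-square magnitude of an array."""
    return float(np.sqrt(np.mean(x ** 2)))

def rms_rescale(features, target_rms, eps=1e-6):
    # Rescale pretrained semantic features so their RMS matches the scale
    # of the pixel branch before the two streams are co-denoised.
    return features * (target_rms / (rms(features) + eps))

rng = np.random.default_rng(0)
feats = rng.standard_normal((256, 768)) * 7.3  # raw DINO-like features (toy)
scaled = rms_rescale(feats, target_rms=0.5)
print(abs(rms(scaled) - 0.5) < 1e-3)  # True
```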

Xichen Pan@xichen_pan·
Thanks @hanlin_hl for leading this project, it's super cool to collaborate with UNC ppl.
Xichen Pan@xichen_pan·
After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe that decently improves over the vanilla JiT baseline.
Xichen Pan retweeted
Saining Xie@sainingxie·
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
Saining Xie tweet media
Xichen Pan retweeted
XuDong Wang@XDWang101·
🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models 🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs: GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75 Code: github.com/HorizonWind200… 1/n
XuDong Wang tweet media
Xichen Pan retweeted
Saining Xie@sainingxie·
Thanks for bringing this to my attention. I honestly wasn’t aware of the situation until the recent posts started going viral. I would never encourage my students to do anything like this—if I were serving as an Area Chair, any paper with this kind of prompt would be desk-rejected right away. That said, for any problematic submission, co-authors all share the responsibility, no excuse here. And this has been a good reminder for me, as a PI, to not just check the final PDF but also look through the full submission files. I wasn’t aware of this kind of need before. Let me take a moment to share what we found after doing a full internal review this past week—everything’s backed up by logs and screenshots, available if needed.

1. Background
In November 2024, a researcher @jonLorraine9 tweeted this: x.com/jonLorraine9/s…. That was the first time I saw this kind of idea, and I think it was also when people realized that LLM prompts could be embedded in papers. Note that such injection only works if the reviewer uploads the PDF to an LLM directly. At that time, one thing we all agreed on was that LLMs should NOT be used for reviewing. It’s a real threat to the integrity of the process. That’s why conferences like CVPR and NeurIPS have now explicitly and strictly banned LLM reviewing (e.g., “LLMs are NOT allowed to be used for writing the reviews nor the meta-reviews at any step.”). If you've published at AI conferences, you probably know how frustrating it is to receive a review that was clearly written by an AI. It’s nearly impossible to respond to, and often just as hard to definitively prove that an LLM wrote it. While the original post might have been made partly as a joke, we all felt that trying to “fight fire with fire” isn’t the right defense—it raises more ethical issues than it solves. A better path is to address these concerns through official conference policies, not through individual hacks that can backfire.

2. What happened in our case
The student author—who was visiting our group briefly from Japan—took that tweet a bit too literally and used the idea in an EMNLP submission. They copied the format exactly, not realizing it was partly a joke and could come across as manipulative or misleading. They also didn’t fully grasp how this might impact public trust in science or the integrity of peer review. On top of that, they included the same thing in the arXiv version without thinking twice. I missed it too—partly because this goes beyond the usual checks I have in place to catch anything ethically questionable as a coauthor.

3. Next steps
The student has since updated the paper and reached out to ARR for formal guidance. We'll follow whatever steps they recommend.

4. Bigger picture
This has been a teaching moment for me. Students under pressure don’t always think through all the ethical implications—especially in newer areas like this. My job is to guide them through these gray zones, not just react to their mistakes. Rather than punishment, what’s really needed is better education around these issues. I was upset with the student at first too. But after thinking it through, I don’t think the students should be punished beyond having the paper rejected. I’ve told them clearly this can’t happen in the future, and we’re also planning additional training around AI ethics and responsible research practices (which to me is more about having some common sense).

I’ll be honest—it hasn’t been a good feeling being at the center of this kind of public shaming. These conversations should be thoughtful and constructive, not about singling people out. And honestly, the students feel the pressure even more. I've actually been keeping up with the public conversations around this, and in a recent poll, 45.4% of people said they think this kind of thing is actually okay. Sure, it’s just a poll and there could be bias—but it still says something about the nature of this problem. x.com/gabriberton/st…

The real issue here is the current system—it creates space for things like this to happen. And this isn’t traditional academic misconduct like faking data; it’s something newer, and it calls for a deeper, more nuanced conversation about how research ethics are evolving in the age of AI. In that sense, I don’t feel too bad—I feel confident I could explain the context honestly to any ethics board. And to circle back to the original post’s question—this whole situation really highlights why we need to rethink how the game is played in academia. That’s really the main point I was trying to make in my talk. I’m going to continue doing my best to help students learn how to do solid research.

(This post was written by me, with help from ChatGPT-4o on editing.)
Xichen Pan retweeted
Saining Xie@sainingxie·
metaquery is now open-source — with both the data and code available.
Xichen Pan@xichen_pan

The code and instruction-tuning data for MetaQuery are now open-sourced! Code: github.com/facebookresear… Data: huggingface.co/collections/xc… Two months ago, we released MetaQuery, a minimal training recipe for SOTA unified understanding and generation models. We showed that tuning a few learnable queries can transfer the world knowledge, strong reasoning, and in-context learning capabilities inherent in MLLMs to image generation. With the training code now available, you can train MetaQuery yourself almost as easily as fine-tuning a diffusion model. We have also open-sourced our 2.4M instruction-tuning dataset. Sourced from web corpora, it offers diverse supervision beyond copy-pasting and unlocks many new exciting capabilities. Thanks @metaai for their support in making it open source!

Xichen Pan@xichen_pan·
The code and instruction-tuning data for MetaQuery are now open-sourced! Code: github.com/facebookresear… Data: huggingface.co/collections/xc… Two months ago, we released MetaQuery, a minimal training recipe for SOTA unified understanding and generation models. We showed that tuning a few learnable queries can transfer the world knowledge, strong reasoning, and in-context learning capabilities inherent in MLLMs to image generation. With the training code now available, you can train MetaQuery yourself almost as easily as fine-tuning a diffusion model. We have also open-sourced our 2.4M instruction-tuning dataset. Sourced from web corpora, it offers diverse supervision beyond copy-pasting and unlocks many new exciting capabilities. Thanks @metaai for their support in making it open source!
Xichen Pan@xichen_pan

We find that training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. The MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!

Xichen Pan@xichen_pan·
@Dionilomu Yeah sure, I think it's roughly 5K A100 hours for the base model.
BenderL@Dionilomu·
@xichen_pan Thx, that helps a lot! Can you share how many GPUs (hours) you used for SANA 1.6B in the pretraining stage? 🙏
Xichen Pan@xichen_pan·
@Dionilomu But I think the only difference for FLUX is that they have multiple text encoders. Given the signal from EMU2, they can successfully replace the two text encoders with one vision encoder; I believe replacing the FLUX encoders will still work.
BenderL@Dionilomu·
@xichen_pan Thx for your reply! I have another question I'd like to ask you. Is the original text encoder discarded? If so, would text encoders like those in SD and FLUX be compatible with this method?
Xichen Pan@xichen_pan·
@Dionilomu No worries! Yeah, the original text encoder will be discarded. We have tested SD v1.5 in our final table; the performance is pretty good. We haven't tried FLUX training since it's too slow and needs too many resources.
John Nguyen@__JohnNguyen__·
Nice work, I like the interleaving data curation part. Do/will you consider doing a controlled comparison of another method (e.g., MetaMorph) against MetaQueries? It's hard to interpret the results from Table 4, where the base LM, LM sizes, data, and vision encoders are different.
Xichen Pan@xichen_pan·
@Dionilomu Thanks for your interest! We are working on legal review; hopefully we can at least open-source the code in the future.
Xichen Pan@xichen_pan·
@kirptempest For advanced generation, I think the new data pipeline really helps the model generalize to various tasks beyond known transformations.
Yilin Jia@kirptempest·
@xichen_pan Nice work. Compared with SEED-X, are there any insights on multiple/multi-turn/interleaved generation? Since both works use learnable queries, why does this perform better on single-image generation? (Data, SANA, e2e training?)
Xichen Pan@xichen_pan·
@kirptempest I think we are using less data than SEED-X. For single-image generation, I think the key is to: 1) use diffusion loss instead of regression loss; 2) freeze the MLLM and focus on tuning image generation; 3) scale up the number of tokens and use a better connector; 4) use a SOTA diffusion model.
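The first point, diffusion loss instead of regression loss, is the difference between regressing a single target and learning to denoise: an MSE regression pushes the model toward the mean of all plausible images, while a denoising objective trains it to predict the noise at each noise level, so it learns a distribution rather than a point estimate. A toy NumPy sketch of the two objectives (the noising schedule and shapes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_loss(pred, target):
    # Plain MSE regression toward one target image: the optimum is the
    # mean of all plausible targets, which tends to look blurry.
    return float(np.mean((pred - target) ** 2))

def diffusion_loss(model, x0, t):
    # Denoising objective: corrupt x0 with noise at level t and train the
    # model to predict that noise.
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(1.0 - t) * x0 + np.sqrt(t) * noise  # toy noising schedule
    pred = model(x_t, t)
    return float(np.mean((pred - noise) ** 2))

x0 = rng.standard_normal((8, 32))            # toy "image" batch
zero_model = lambda x, t: np.zeros_like(x)   # dummy model predicting no noise
print(regression_loss(x0, x0) == 0.0)        # True
print(diffusion_loss(zero_model, x0, 0.5) > 0.0)  # True
```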
Xichen Pan@xichen_pan·
@ms802x Hi! Thanks for your interest. We are looking forward to open-sourcing this model; we need to pass legal review at Meta first. Stay tuned for open-source progress!
Ali@ms802x·
@xichen_pan Nice work! Would it be possible to share the code for inference and training?
Xichen Pan@xichen_pan·
@zhuole1025 @sainingxie Yeah, thanks to the great MagicLens @DrogoKhal4. It was proposed for retrieval, but we found it also works very well if you have a very good base model.
Le Zhuo@zhuole1025·
@xichen_pan @sainingxie Thanks for your explanation! I really like the insights of collecting instruction tuning image pairs from in-the-wild data. Maybe this is the key to letting the model learn some "intelligence" beyond simple conditional generation tasks (like controlnet tasks), in a scalable way.
Saining Xie@sainingxie·
Our take on a 4o-style AR + diffusion unified model: Transferring knowledge from an AR LLM to generation is easier than expected--you don't even need to touch the LLM. The right bridge between output modalities can unlock cool capabilities like knowledge-augmented generation!
Xichen Pan@xichen_pan

We find that training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. The MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!

Xichen Pan@xichen_pan·
@jimwinkens @jiang_zhengkai We already tuned it, and it's not that accurate. One simple engineering hack can be shortcutting VAE features to the diffusion model.
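The "shortcut VAE features to the diffusion model" hack mentioned above isn't detailed in the thread; one plausible reading is to project the source image's VAE latents into the conditioning space and append them to the condition tokens, so low-level appearance detail can bypass the learned query bottleneck. Everything below (shapes, the linear projection, concatenation) is an illustrative assumption, not the actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def with_vae_shortcut(cond_tokens, vae_latents, w_proj):
    # Project the source image's VAE latents into the conditioning space and
    # append them, so low-level appearance detail reaches the diffusion model
    # without passing through the query bottleneck.
    vae_tokens = vae_latents @ w_proj
    return np.concatenate([cond_tokens, vae_tokens], axis=0)

d = 16
cond = rng.standard_normal((4, d))       # e.g. query-based condition tokens
latents = rng.standard_normal((64, 8))   # flattened VAE latents (toy shape)
w = rng.standard_normal((8, d)) * 0.02   # hypothetical projection matrix
tokens = with_vae_shortcut(cond, latents, w)
print(tokens.shape)  # (68, 16)
```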
Xichen Pan@xichen_pan·
@__JohnNguyen__ @sainingxie Yeah, I think it’s a great idea to start from a frozen MLLM to build unified models. Either approach works smoothly, as there’s no need to adjust the recipe or data mix anymore. The good thing is that the capabilities can still be transferred from the frozen MLLM.