Xichen Pan

Xichen Pan@xichen_pan·3d

Thanks @hanlin_hl for leading this project, it's super cool to collaborate with UNC ppl.

Xichen Pan@xichen_pan·3d

After JiT came out, we started thinking about adding semantics directly into pixel space to improve generation. We explore co-denoising as another form of visual representation alignment and provide a detailed training recipe that decently improves over the vanilla JiT baseline.

Xichen Pan@xichen_pan·Dec 16

(M)LLMs are effective at grounding and planning spatiotemporal layouts. Yet, they are mostly used as 1D conditional encoders in current generative and unified models. We explore an explicit interface to transfer these abilities to image editing and video generation.

Han Lin@hanlin_hl

Multimodal LLMs (MLLMs) excel at reasoning, layout understanding, and planning—yet in diffusion-based generation, they are often reduced to simple multimodal encoders. What if MLLMs could reason directly in latent space and guide diffusion generation with fine-grained, spatiotemporal control? 🤔 Introducing MetaCanvas 🎨 A lightweight framework that translates MLLM reasoning into structured spatiotemporal conditions for diffusion models. 🧵 👇

Xichen Pan retweeted

Saining Xie@sainingxie·Oct 14

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

Xichen Pan retweeted

XuDong Wang@XDWang101·Sep 11

🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models 🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs: GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75 Code: github.com/HorizonWind200… 1/n

Xichen Pan retweeted

Saining Xie@sainingxie·Jul 7

Thanks for bringing this to my attention. I honestly wasn’t aware of the situation until the recent posts started going viral. I would never encourage my students to do anything like this; if I were serving as an Area Chair, any paper with this kind of prompt would be desk-rejected right away. That said, for any problematic submission, co-authors all share the responsibility, no excuse here. And this has been a good reminder for me, as a PI, to not just check the final PDF but also look through the full submission files. I wasn’t aware of this kind of need before.

Let me take a moment to share what we found after doing a full internal review this past week--everything’s backed up by logs and screenshots, available if needed.

1. Background

In November 2024, a researcher @jonLorraine9 tweeted this: x.com/jonLorraine9/s…. That was the first time I saw this kind of idea, and I think it was also when people realized that LLM prompts could be embedded in papers. Note that such injection only works if the reviewer uploads the PDF to an LLM directly.

At that time, one thing we all agreed on was that LLMs should NOT be used for reviewing. It’s a real threat to the integrity of the process. That’s why conferences like CVPR and NeurIPS have now explicitly and strictly banned LLM reviewing (e.g., “LLMs are NOT allowed to be used for writing the reviews nor the meta-reviews at any step.”). If you've published at AI conferences, you probably know how frustrating it is to receive a review that was clearly written by an AI. It’s nearly impossible to respond to, and often just as hard to definitively prove that an LLM wrote it.

While the original post might have been made partly as a joke, we all felt that trying to “fight fire with fire” isn’t the right defense--it raises more ethical issues than it solves. A better path is to address these concerns through official conference policies, not through individual hacks that can backfire.

2. What happened in our case

The student author, who was visiting our group briefly from Japan, took that tweet a bit too literally and used the idea in an EMNLP submission. They copied the format exactly, not realizing it was partly a joke and could come across as manipulative or misleading. They also didn’t fully grasp how this might impact public trust in science or the integrity of peer review. On top of that, they included the same thing in the arXiv version without thinking twice. I missed it too, partly because this goes beyond the usual checks I have in place to catch anything ethically questionable as a coauthor.

3. Next steps

The student has since updated the paper and reached out to ARR for formal guidance. We'll follow whatever steps they recommend.

4. Bigger picture

This has been a teaching moment for me. Students under pressure don’t always think through all the ethical implications, especially in newer areas like this. My job is to guide them through these gray zones, not just react to their mistakes. Rather than punishment, what’s really needed is better education around these issues. I was upset with the student at first too. But after thinking it through, I don’t think the students should be punished beyond having the paper rejected. I’ve told them clearly this can’t happen in the future, and we’re also planning additional training around AI ethics and responsible research practices (which to me is more about having some common sense).

I’ll be honest: it hasn’t been a good feeling being at the center of this kind of public shaming. These conversations should be thoughtful and constructive, not about singling people out. And honestly, the students feel the pressure even more.

I've actually been keeping up with the public conversations around this, and in a recent poll, 45.4% of people said they think this kind of thing is actually okay. Sure, it’s just a poll and there could be bias, but it still says something about the nature of this problem. x.com/gabriberton/st…

The real issue here is the current system: it creates space for things like this to happen. And this isn’t traditional academic misconduct like faking data; it’s something newer, and it calls for a deeper, more nuanced conversation about how research ethics are evolving in the age of AI. In that sense, I don’t feel too bad; I feel confident I could explain the context honestly to any ethics board.

And to circle back to the original post’s question: this whole situation really highlights why we need to rethink how the game is played in academia. That’s really the main point I was trying to make in my talk. I’m going to continue doing my best to help students learn how to do solid research.

(This post was written by me, with help from ChatGPT-4o on editing.)

Xichen Pan retweeted

Saining Xie@sainingxie·Jun 27

metaquery is now open-source — with both the data and code available.

Xichen Pan@xichen_pan·Jun 27

The code and instruction-tuning data for MetaQuery are now open-sourced! Code: github.com/facebookresear… Data: huggingface.co/collections/xc… Two months ago, we released MetaQuery, a minimal training recipe for SOTA unified understanding and generation models. We showed that tuning a few learnable queries can transfer the world knowledge, strong reasoning, and in-context learning capabilities inherent in MLLMs to image generation. With the training code now available, you can train MetaQuery yourself almost as easily as fine-tuning a diffusion model. We have also open-sourced our 2.4M instruction-tuning dataset. Sourced from web corpora, it offers diverse supervision beyond copy-pasting and unlocks many exciting new capabilities. Thanks @metaai for their support in making it open source!

Xichen Pan@xichen_pan·Apr 20

@Dionilomu Yeah, sure. I think it's roughly 5k A100 hours for the base model.

BenderL@Dionilomu·Apr 19

@xichen_pan Thx, that helps a lot! Can you share how many GPUs (hours) you used for SANA1.6B in the pretraining stage? 🙏

Xichen Pan@xichen_pan·Apr 12

We find training unified multimodal understanding and generation models is so easy that you do not need to tune MLLMs at all. An MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!

Xichen Pan@xichen_pan·Apr 18

@Dionilomu But I think the only difference for FLUX is that it has multiple text encoders. Given the signal from EMU2, which successfully replaced the two text encoders with one vision encoder, I believe replacing the FLUX encoders will still work.

BenderL@Dionilomu·Apr 18

@xichen_pan Thx for your reply! I have another question I'd like to ask you. Is the original text encoder discarded? If so, would text encoders like those in SD and Flux be compatible with this method?

Xichen Pan@xichen_pan·Apr 18

@Dionilomu No worries! Yeah, the original text encoder will be discarded. We tested SD v1.5 in our final table, and the performance is pretty good. We haven't tried FLUX training since it's too slow and needs too many resources.

Xichen Pan@xichen_pan·Apr 17

@__JohnNguyen__ @sainingxie @j1h0u @ImSNShukla @felixudr @iam_aashusingh @zhuokaiz @shlokkkk @zhiyangx11 @JiuhaiC Thank you! I think it's pretty hard to do a comparison with other models, as most of them jointly tune the LLM backbone (it's hard to assess the mutual impact between generation and understanding), whereas MetaQuery is only tuned on image generation.

John Nguyen@__JohnNguyen__·Apr 16

Nice work, I like the interleaving data curation part. Do/will you consider doing a controlled comparison of another method (e.g., MetaMorph) against MetaQueries? It's hard to interpret the results from Table 4 where the base LM, LM sizes, data, and vision encoders are different.

Xichen Pan@xichen_pan·Apr 17

@Dionilomu Thanks for your interest! We are working on legal review and hopefully can at least open source the code in the future.

BenderL@Dionilomu·Apr 16

@xichen_pan will it be open-source?

Xichen Pan@xichen_pan·Apr 17

@kirptempest For advanced generation, I think the new data pipeline really helps the model generalize to various tasks beyond known transformations.

Yilin Jia@kirptempest·Apr 16

@xichen_pan Nice work. Compared with SEED-X, are there any insights on multiple/multi-turn/interleaved generation? Since both works use learnable queries, why does this perform better on single image generation? (Data, SANA, e2e training?)

Xichen Pan@xichen_pan·Apr 17

@kirptempest I think we are using less data than SEED-X. For single image generation, I think the key is to 1) use a diffusion loss instead of a regression loss, 2) freeze the MLLM and focus on tuning image generation, 3) scale up the number of tokens and use a better connector, and 4) use a SOTA diffusion model.

Xichen Pan@xichen_pan·Apr 14

@ms802x Hi! Thanks for your interest. We are looking forward to open-sourcing this model, but we need to pass legal review at Meta first. Stay tuned for the open source progress!

Ali@ms802x·Apr 13

@xichen_pan Nice work! Would it be possible to share the code for inference and training?

Xichen Pan@xichen_pan·Apr 13

@zhuole1025 @sainingxie Yeah, thanks to the great MagicLens @DrogoKhal4. It was proposed for retrieval, but we found it also works very well if you have a very good base model.

Le Zhuo@zhuole1025·Apr 13

@xichen_pan @sainingxie Thanks for your explanation! I really like the insights of collecting instruction tuning image pairs from in-the-wild data. Maybe this is the key to letting the model learn some "intelligence" beyond simple conditional generation tasks (like controlnet tasks), in a scalable way.

Saining Xie@sainingxie·Apr 12

Our take on a 4o-style AR + diffusion unified model: Transferring knowledge from an AR LLM to generation is easier than expected--you don't even need to touch the LLM. The right bridge between output modalities can unlock cool capabilities like knowledge-augmented generation!

Xichen Pan@xichen_pan·Apr 13

@jimwinkens @jiang_zhengkai We already tried tuning and it's not that accurate. One simple engineering hack could be to shortcut VAE features to the diffusion model.

Jim Winkens@jimwinkens·Apr 13

@xichen_pan @jiang_zhengkai What is the path forward for pixel-aligned editing in such models? Would simply tuning on specialized data work?

Xichen Pan@xichen_pan·Apr 12

@__JohnNguyen__ @sainingxie Yeah, I think it’s a great idea to start from a frozen MLLM to build unified models. Either approach works smoothly, as there’s no need to adjust the recipe or data mix anymore. The good thing is that the capabilities can still be transferred from the frozen MLLM.

John Nguyen@__JohnNguyen__·Apr 12

@sainingxie LMFusion found something similar: by freezing an MLLM you can preserve the image understanding ability, then tune the decoder to do generation. arxiv.org/pdf/2412.15188
