Di Chang

83 posts


@DiChang10

PhD @CSatUSC | VSR @Stanford | Intern @MetaAI | Prev. ByteDance Seed @TikTok_US, @EPFL, @TU_Muenchen | Working on Multimodal Generative Models

Palo Alto, CA · Joined June 2021
1.3K Following · 1.1K Followers
Di Chang retweeted
Fangjinhua Wang@FangjinhuaWang·
Excited to share our new survey on 3D Reconstruction, accepted to IEEE T-PAMI! We cover everything from depth estimation and NeRF to 3DGS and 3D foundation models. If you're interested in 3D reconstruction, you won't want to miss it. arxiv.org/pdf/2408.15235 #3D #ComputerVision
Di Chang retweeted
Prime (Shengqu) Cai@prime_cai·
Some random thoughts I've been having about video world models/long video generation since working on Mixture of Contexts (whose title could also be "Learnable Sparse Attention for Long Video Generation"): 🚨Semi-long Post Alert🚨

1. Learnable sparse attention is still underrated for video, 3D/4D, and world models.
- Different from text: text often hinges on single-token dependencies; video almost never does. Visual signals of interest form patch/tube structures that persist and evolve across frames.
- The wrong mental model: the "needle-in-a-haystack" token recall test for LLMs doesn't map to video. Long video rarely needs to recall a lone token from ages ago, and because viewpoint, lighting, scale, occlusion, articulation, motion blur, and even edits change substantially, there is no invariant single "token" to recover.
- Visual content is physically structured: continuity, locality, bounded acceleration, and limited parallax drastically shrink the search space. Targets reappear across multiple frames and move predictably.
- Compression vs. sparsity? Compression is blunt for space-time recurrence. Learnable sparsity directly routes computation to the recurring, structured signal instead of risking the loss of fine but persistent cues. For visual domains, learnable sparsity might be more suitable than compression-centric strategies. But they are not orthogonal; we use a naive "attention sink" in MoC, which is a form of compression, and it helps. (A rough illustrative sketch of chunk-level routing follows after this post.)

2. What should "memory/context/state/history" mean for long video generation or video world models?
- We want context that supports a self-evolving world state (in the spirit of @ylecun 's view).
- After scaling up, merely achieving scene/character consistency becomes a relatively trivial task. Our MoC works, Context-as-Memory works, TTT/LaCT works, nano-banana also works.
- What we need is a more expressive kind of context, like the following simple behavioral test: a car enters from the left; the camera looks away; when it returns, the car should have advanced plausibly. That requires a state that evolves off-screen and enables deduction of what is happening.
- This requires something beyond 3D caches. Pure 3D memory (geometry/appearance) doesn't carry ongoing events through occlusions or FOV changes. We need an evolving 4D latent state tracking identity, pose, momentum, interactions, and constraints, i.e., "what's going on" even when unseen. It also means we need more than a memory bank: consistency of characters/assets isn't enough; we need state transitions that continue even while unobserved.
- This doesn't mean using an SSM; it means placing a deductive step in the model. Full attention can do this well given sufficient data, since attention maps are essentially dynamic graphs, but it becomes intractable at long contexts, so learnable sparsity matters. It's a core motivation for us to do MoC.

3. Algorithms are not the bottleneck for handling "memory" in video world models; (video-action paired) data is.
- We largely know how to represent and route long-range visual context. The hard part is data: we need video-action/interaction-paired data that stresses long-horizon prediction: persistent identity, occlusions, off-screen dynamics, multi-agent interactions.
- This mirrors the difficult VLA challenge: scalable, high-quality interaction data is the real rate limiter for grounded state evolution and robust deduction. Luckily, we may not have that much of a Sim2Real gap in the context of Video World Models.

4. What is the role of explicit/3D representations then? I side with purely implicit, data-driven approaches, so explicit/3D structure will live in data and alignment, not in the model's foundation.

5. The future is a unified model.
- A unified model is the most direct way to put that deduction step in the right place, the semantic representation space, and train it end-to-end.
- Borrow more, borrow better: shared representations let the model transfer motion priors, physics, and identity persistence across tasks/modalities. And it will be easier to borrow MORE stuff BETTER from the years of effort in the LLM community ;)
- Consistent routing/compression: unified training yields stable sparsity policies (what to attend to, when, and how) across tasks.
- Richer supervision: multi-task signals sharpen the evolving latent state and improve long-horizon deduction.

There is still much to be done.
Gordon Wetzstein@GordonWetzstein

How do we generate videos on the scale of minutes, without drifting or forgetting about the historical context? We introduce Mixture of Contexts. Every minute-long video below is the direct output of our model in a single pass, with no post-processing, stitching, or editing. 1/4
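Sketch referenced in the post above: a minimal illustration of chunk-level learnable sparse attention, assuming mean-pooled chunk key descriptors, hard top-k chunk selection, and an always-kept sink chunk. All names, shapes, and the routing heuristic here are hypothetical stand-ins for the general idea, not the actual Mixture of Contexts implementation.

```python
# Illustrative chunk-level sparse attention (hypothetical names; not the MoC code).
import torch
import torch.nn.functional as F


def sparse_context_attention(q, k, v, chunk_size=64, top_k=4):
    """q: (T_q, D) queries of the current chunk; k, v: (T_h, D) full history."""
    T_h, D = k.shape
    n_chunks = T_h // chunk_size
    if n_chunks <= 1:
        # History too short to route over: fall back to dense attention.
        attn = F.softmax(q @ k.T / D**0.5, dim=-1)
        return attn @ v

    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, D)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, D)

    # Route by similarity between queries and mean-pooled chunk descriptors.
    descriptors = k_chunks.mean(dim=1)               # (n_chunks, D)
    chunk_scores = (q @ descriptors.T).mean(dim=0)   # (n_chunks,) shared across queries

    # Chunk 0 is always kept as a crude "attention sink"; add the top-k other chunks.
    top = torch.topk(chunk_scores[1:], k=min(top_k, n_chunks - 1)).indices + 1
    selected = torch.unique(torch.cat([top.new_zeros(1), top]))

    # Dense attention restricted to the selected chunks only.
    k_sel = k_chunks[selected].reshape(-1, D)
    v_sel = v_chunks[selected].reshape(-1, D)
    attn = F.softmax(q @ k_sel.T / D**0.5, dim=-1)
    return attn @ v_sel


# Example: 16 frames x 256 tokens of history, one 256-token chunk of current queries.
q = torch.randn(256, 128)
k = torch.randn(16 * 256, 128)
v = torch.randn(16 * 256, 128)
out = sparse_context_attention(q, k, v, chunk_size=256, top_k=4)  # (256, 128)
```

In this form the top-k selection is hard, so gradients only flow through the chunks that were picked; a genuinely learnable router would additionally let the routing scores shape the output (for example by weighting the selected chunks' contributions) so that the selection itself can be trained.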

Di Chang retweeted
AK@_akhaliq·
ByteDance just announced Seaweed-7B on Hugging Face: Cost-Effective Training of Video Generation Foundation Model
Di Chang retweeted
AK@_akhaliq·
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
Jiawei Yang@JiaweiYang118·
🙌 Honored and grateful to receive the 2025-2026 @NVIDIAAI Graduate Fellowship as a PhD student at @CSatUSC @USCViterbi. This award is a testament to the incredible mentorship, collaboration, and support I’ve been fortunate to receive from @yuewang314, @drmapavone, @iamborisi, and many others. Grateful to be part of this journey, and excited to keep learning and contributing. 🙏 Congrats to other recipients as well! 🚀 blogs.nvidia.com/blog/graduate-…
Di Chang retweeted
Ruiqi Gao@RuiqiGao·
A common question nowadays: Which is better, diffusion or flow matching? 🤔 Our answer: They’re two sides of the same coin. We wrote a blog post to show how diffusion models and Gaussian flow matching are equivalent. That’s great: It means you can use them interchangeably.
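A minimal sketch of the equivalence being claimed, assuming the standard Gaussian probability path x_t = α_t x_0 + σ_t ε (the notation is mine, not necessarily the blog post's):

```latex
% Flow matching regresses the conditional velocity; diffusion regresses \epsilon (the score):
\[
  u_t(x_t \mid x_0) = \dot{\alpha}_t x_0 + \dot{\sigma}_t \epsilon,
  \qquad
  \nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{\epsilon}{\sigma_t}.
\]
% Substituting x_0 = (x_t - \sigma_t \epsilon)/\alpha_t:
\[
  u_t(x_t \mid x_0)
  = \frac{\dot{\alpha}_t}{\alpha_t}\, x_t
    + \Bigl(\dot{\sigma}_t - \frac{\dot{\alpha}_t \sigma_t}{\alpha_t}\Bigr)\epsilon
  = \frac{\dot{\alpha}_t}{\alpha_t}\, x_t
    - \sigma_t\Bigl(\dot{\sigma}_t - \frac{\dot{\alpha}_t \sigma_t}{\alpha_t}\Bigr)
      \nabla_{x_t} \log p_t(x_t \mid x_0).
\]
```

So an ε-/score-prediction network and a velocity-prediction network are affine reparameterizations of each other and can be converted between at training or inference time; the frameworks differ mainly in how the regression target and loss weighting are chosen.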
Di Chang@DiChang10·
First time back at @EPFL_en after two years! Glad to be invited to give a guest talk on human-centric generative vision models at the EPFL School of Computer and Communication Sciences @ICepfl, hosted by IVRL and CVLab. Huge thanks to my great researcher friends @TongZhang1024 and @chenzhao0220 for hosting 🥰 It has been such a wonderful experience to share and discuss my recent works on 2D Human Animation, Talking-Head Video Generation and Dynamic 3D Motion Generation with the talented researchers here. The talk covered recent advances in generative models for human behaviors as well as key challenges and future research problems to be explored (slides shared in comments). As always, the beautiful view of Lake Geneva and the vision research community make me love EPFL 😋 Heading to Milan for ECCV 2024 @eccvconf today, see ya all at the MiCo🥸 Feel free to find me at my poster on 10.4, 10:30am-12:30pm, at #283 for some juicy discussions 😎 🥲Sad that the ICLR deadline is approaching right in the middle of ECCV and I'm still not done with the submission 😅
Di Chang@DiChang10·
@UUUUUsher I enjoy doing research in this community too much.😊
Di Chang@DiChang10·
Back in Munich and at the Department of Informatics @TU_Muenchen for the first time after two years of my PhD @CSatUSC. I have fond memories of studying and conducting research here😇 I'm glad to see the PhD students from my previous group are still working on their projects/rebuttals until 8PM on a Friday🤣 to make TUM great again!
Di Chang@DiChang10·
Arrived in Vienna😇 Feels so good to be back in the EU, and I finally have a chance to speak German after two years🤣 Please find me at my poster on Wednesday, Hall C 4-9 #101, 11:30am-1:00pm🤝
Di Chang@DiChang10

I'm thrilled to share that MagicPose (formerly known as MagicDance) will be presented at #ICML2024! MagicPose (MagicDance) is a diffusion-based model for 2D human pose and facial expression retargeting. It enables robust appearance control over generated human images, including body, facial attributes, and background. By leveraging the prior knowledge of image diffusion models, MagicPose generalizes well to unseen human identities and complex poses without the need for additional fine-tuning. Moreover, the proposed model is easy to use and can be considered a plug-in module/extension to Stable Diffusion. Our code has been fully open-sourced; please kindly give the repo a star 🌟 if you find our project interesting🥰 💻 paper: arxiv.org/abs/2311.12052 🔗 website: boese0601.github.io/magicdance/ ⌨️ code: github.com/Boese0601/Magi… MagicPose was my summer internship project at @tiktok_us. A huge thanks to my mentors Yichun Shi, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, and Xiao Yang from TikTok. I also appreciate the help from my friend Quankai Gao @UUUUUsher and my faculty advisor @msoleymani at @CSatUSC. See you in Vienna😘
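For readers curious about the "plug-in module/extension to Stable Diffusion" pattern the post describes, here is a hedged sketch using an off-the-shelf OpenPose ControlNet in diffusers. This is a generic stand-in for pose-conditioned generation on a frozen Stable Diffusion backbone, not MagicPose itself or its appearance-control module; the model IDs and the pose-image path are assumptions that may need adjusting.

```python
# Illustrative pose-conditioned Stable Diffusion via a plug-in ControlNet branch.
# NOTE: this is NOT the MagicPose model; model IDs and the pose image path are assumed.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

pose_image = Image.open("pose_skeleton.png")  # hypothetical OpenPose skeleton rendering

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD 1.5 checkpoint should work here
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The frozen SD backbone generates; the ControlNet branch injects pose conditioning.
result = pipe("a dancer in a studio, photorealistic", image=pose_image).images[0]
result.save("posed_output.png")
```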

Di Chang@DiChang10·
Glad to share that our new work on speaker and listener head generation, Dyadic Interaction Modeling for Social Behavior Generation, has been accepted to #ECCV2024! TL;DR: We propose Dyadic Interaction Modeling, a pre-training strategy that jointly models speakers' and listeners' motions and learns representations that capture the dyadic context. Our code has been fully open-sourced; please kindly give the repo a star 🌟 if you find our project interesting🥰 💻 paper: arxiv.org/abs/2403.09069 🔗 website: boese0601.github.io/dim/ ⌨️ code: github.com/Boese0601/Dyad… A huge thanks to my amazing collaborators for their hard work, including my lab mates Minh Tran and Maksim Siniukov and my faculty advisor @msoleymani at @CSatUSC and @USC_ICT. See you in Milan, Italy! 😘 #eccv #genai #computervision #videogeneration #behaviorgeneration #talkinghead
Di Chang@DiChang10·
Heading to CVPR in Seattle 🤩 During the conference, I'll 1) present our CVPR highlight work DiffPortrait3D 👹 on Thursday, Poster #77, from 10:30am to noon, and 2) present our most recent works and demos from the TikTok Digital Human💃 & Avatar👶 Group on Friday from 10am to 3pm, including Realistic/Stylized Avatars, Audio-Driven Emotional Animation, Highly Expressive Motion Transfer, and Virtual Try-On, and share technical details. Please come and find me during these sessions if you share an interest in these areas😛 I'm also actively looking for collaborators 😋 from industry/academia to work on 4D/Video Generation with diffusion models. Always open to coffee ☕️ chats anytime. See one of our recent demos on Avatar Animation below: