
Hsin-Ying Lee

Can pretrained diffusion models be connected for cross-modal generation? 📢 Introducing AV-Link ♾ Bridging unimodal diffusion models in one framework to enable: 📽️ ➡️ 🔊 Video-to-Audio 🔊 ➡️ 📽️ Audio-to-Video 🌐: snap-research.github.io/AVLink/ 📄: hf.co/papers/2412.15… ⤵️ Results
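For intuition only, here is a minimal sketch of the bridging idea the post describes: intermediate activations from one frozen unimodal diffusion model condition the other via cross-attention. The module name, projection, and shapes below are illustrative assumptions, not the AV-Link implementation.

```python
# Hypothetical sketch: bridge two frozen unimodal diffusion backbones by
# feeding intermediate activations from one modality into the other's
# denoiser. Names and dimensions are illustrative, not AV-Link's code.
import torch
import torch.nn as nn

class CrossModalBridge(nn.Module):
    """Injects features from a source modality into a target denoiser."""
    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, tgt_dim)
        self.attn = nn.MultiheadAttention(tgt_dim, num_heads=4, batch_first=True)

    def forward(self, tgt_feats: torch.Tensor, src_feats: torch.Tensor) -> torch.Tensor:
        # Cross-attend target tokens (e.g. audio latents) to projected
        # source tokens (e.g. video latents), then fuse residually.
        src = self.proj(src_feats)
        fused, _ = self.attn(tgt_feats, src, src)
        return tgt_feats + fused

# Toy shapes: 8 video tokens (dim 512) conditioning 16 audio tokens (dim 256).
video = torch.randn(1, 8, 512)
audio = torch.randn(1, 16, 256)
bridge = CrossModalBridge(src_dim=512, tgt_dim=256)
print(bridge(audio, video).shape)  # torch.Size([1, 16, 256])
```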

📢📢 PrEditor3D: Fast and Precise 3D Shape Editing 📢📢 We propose a training-free 3D shape editing approach that rapidly and precisely edits the regions the user intends and keeps the rest as is. Given a quickly brushed mask and a text prompt, we first apply multi-view editing in the 2D domain and then run our merging algorithm in 3D feature space to ensure the edited shape stays faithful to the input shape (see the sketch below). Project Page: ziyaerkoc.com/preditor3d/ Video: youtube.com/watch?v=Ty2xXa… Great work by @ErkocZiya @cangumeli Chaoyang Wang @angelaqdai @peter_wonka @hyjameslee @PeiyeZ
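For readers curious how that pipeline fits together, here is a toy sketch of the mask-then-edit-then-merge flow: render multiple views, edit each in 2D inside the brushed mask, and merge the edits back into 3D while leaving unmasked regions untouched. Every function name and the voxel representation are stand-in assumptions, not the PrEditor3D code.

```python
# Toy, hypothetical sketch of a mask-guided multi-view 3D editing flow.
# All functions are trivial stand-ins; a real system would use pretrained
# 2D diffusion editors and a learned 3D feature-space merge.
import numpy as np

def render_views(voxels: np.ndarray, n_views: int = 4) -> list[np.ndarray]:
    # Stand-in renderer: orthographic max-projections at rotated angles.
    return [np.rot90(voxels.max(axis=0), k) for k in range(n_views)]

def edit_view_2d(view: np.ndarray, mask_2d: np.ndarray, prompt: str) -> np.ndarray:
    # Stand-in for a 2D diffusion editor: here we just overwrite masked
    # pixels; a real editor would inpaint them according to `prompt`.
    edited = view.copy()
    edited[mask_2d] = 1.0
    return edited

def merge_in_3d(voxels, edited_views, mask_3d):
    # Keep unmasked geometry as-is; fill the masked region from the edits.
    merged = voxels.copy()
    merged[mask_3d] = edited_views[0].mean()  # toy aggregation of 2D edits
    return merged

# Toy 16^3 occupancy grid with a brushed mask over one corner.
vox = np.random.rand(16, 16, 16)
mask3 = np.zeros_like(vox, dtype=bool)
mask3[:4, :4, :4] = True
views = render_views(vox)
edits = [edit_view_2d(v, np.ones(v.shape, dtype=bool), "add a hat") for v in views]
out = merge_in_3d(vox, edits, mask3)
assert np.allclose(out[~mask3], vox[~mask3])  # untouched outside the mask
```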

Now HEAR this (don't just watch): we've got audio covered for generated videos 🔊 Introducing Movie Gen Audio, which adds 48kHz synced SFX and aligned music to amazing videos from Movie Gen Video (and other sources!). Super honored to work with this amazing team! More to come 🔥🔥

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models. Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism.
