

Haoyu Ma
@HaoyumaU
MTS @ Microsoft AI | Former Research Scientist @ Meta Superintelligence Labs | PhD in CS, UC Irvine | Ex-intern at Meta/Adobe/Tencent/Baidu




🚀 Thrilled to introduce ☕️ MoCha: Towards Movie-Grade Talking Character Synthesis. Please unmute to hear the demo audio.
✨ We define a novel task, Talking Characters, which aims to generate character animations directly from natural language and speech input.
✨ We propose MoCha, a first-of-its-kind DiT model capable of movie-grade talking character generation.
✨ MoCha enables, for the first time, multi-character conversations with turn-based dialogue generation, pushing the boundaries of automated filmmaking.
Paper: arxiv.org/pdf/2503.23307
Project website: congwei1230.github.io/MoCha/
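As a rough illustration of the task as defined in the post (speech plus natural-language direction in, character animation out, with turn-based multi-character dialogue), here is a minimal sketch of what such an interface could look like. The data class and `generate_talking_characters` function are hypothetical placeholders, not MoCha's actual API; stubs keep the example runnable.

```python
# Hypothetical sketch of the "talking characters" task: natural language + speech -> animation.
# All names below are illustrative placeholders, not MoCha's real interface.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DialogueTurn:
    speaker: str            # which character speaks in this turn
    speech: np.ndarray      # waveform for this line of dialogue (assumed 16 kHz mono)
    direction: str          # natural-language direction, e.g. "leans forward, smiles"

def generate_talking_characters(scene_prompt: str, turns: List[DialogueTurn]) -> np.ndarray:
    """Placeholder generator: a DiT-style model would condition on the scene text and the
    ordered speech turns to produce video frames of shape (T, H, W, 3)."""
    total_samples = sum(t.speech.shape[0] for t in turns)
    num_frames = max(1, int(total_samples / 16_000 * 24))  # assume 24 fps output
    return np.zeros((num_frames, 256, 256, 3), dtype=np.float32)  # stub output

# Toy two-character, turn-based exchange
turns = [
    DialogueTurn("Ava", np.zeros(32_000, np.float32), "looks up from the map, curious"),
    DialogueTurn("Ben", np.zeros(48_000, np.float32), "shrugs, gestures toward the door"),
]
video = generate_talking_characters("two hikers in a dim cabin, handheld camera", turns)
print(video.shape)  # (120, 256, 256, 3): 5 seconds at 24 fps
```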

🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.
More details and examples of what Movie Gen can do ➡️ go.fb.me/kx1nqm
🛠️ Movie Gen models and capabilities
Movie Gen Video: A 30B parameter transformer model that generates high-quality, high-definition images and videos from a single text prompt.
Movie Gen Audio: A 13B parameter transformer model that takes a video input, along with optional text prompts for controllability, and generates high-fidelity audio synced to the video. It can produce ambient sound, instrumental background music and foley sound, delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.
Precise video editing: Using a generated or existing video and accompanying text instructions as input, the model can perform localized edits such as adding, removing or replacing elements, or global changes like a new background or style.
Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement.
We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.
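A hedged sketch of the workflow the post describes: a text prompt drives the text-to-video model, and the resulting clip (plus an optional text prompt) drives the video-to-audio model. The function names, shapes, and frame rate below are assumptions for illustration, not Meta's released API; stubs keep the example runnable.

```python
# Illustrative chaining of text-to-video and video-to-audio generation, per the post.
# movie_gen_video / movie_gen_audio are placeholder names, not an official API.
from typing import Optional
import numpy as np

def movie_gen_video(prompt: str, num_frames: int = 48) -> np.ndarray:
    """Placeholder for the 30B text-to-video transformer: returns frames (T, H, W, 3)."""
    return np.zeros((num_frames, 256, 256, 3), dtype=np.float32)  # stub output

def movie_gen_audio(video: np.ndarray, prompt: Optional[str] = None,
                    sample_rate: int = 16_000) -> np.ndarray:
    """Placeholder for the 13B video-to-audio transformer: returns a mono waveform
    whose length matches the video duration (assumed 24 fps here)."""
    duration_s = video.shape[0] / 24.0
    return np.zeros(int(duration_s * sample_rate), dtype=np.float32)  # stub output

clip = movie_gen_video("a koala surfing at golden hour")
soundtrack = movie_gen_audio(clip, prompt="gentle waves, ambient ukulele")
print(clip.shape, soundtrack.shape)  # (48, 256, 256, 3) (32000,)
```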



🆕 Research paper from GenAI at Meta: Imagine yourself: Tuning-Free Personalized Image Generation.
Research paper ➡️ go.fb.me/wre8f0
Want to try it? The feature is available now as a beta in Meta AI for users in the US.

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
Paper page: huggingface.co/papers/2312.12…
Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, which makes deployment in practical applications challenging. This study addresses this challenge by breaking the text-based video editing process into two stages. In the first stage, we leverage an existing text-to-image diffusion model to edit a few keyframes simultaneously, without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, built on non-autoregressive masked generative transformers, which specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.
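A minimal sketch of the two-stage pipeline described in the abstract: sparse keyframes are edited with an off-the-shelf text-to-image diffusion model, then the remaining frames are filled in by interpolation. The helper names (`edit_keyframe`, `MaskedInterpolator`, `edit_video`) are hypothetical stand-ins rather than the paper's code, and stubs replace the diffusion editor and the masked-transformer interpolator so the example runs end to end.

```python
# Hypothetical sketch of the two-stage MaskINT pipeline (placeholder names, runnable stubs).
from typing import List
import numpy as np

def edit_keyframe(frame: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 1 (stub): edit one keyframe with a pretrained text-to-image
    diffusion model, with no additional fine-tuning."""
    return frame  # a real system would call a T2I editing model here

class MaskedInterpolator:
    """Stage 2 (stub): non-autoregressive masked generative transformer that
    fills in the frames between two edited keyframes."""
    def interpolate(self, start: np.ndarray, end: np.ndarray, n: int) -> List[np.ndarray]:
        # a linear blend stands in for iterative masked-token decoding
        return [(1 - t) * start + t * end
                for t in np.linspace(0.0, 1.0, n + 2)[1:-1]]

def edit_video(frames: List[np.ndarray], prompt: str, keyframe_stride: int = 8) -> List[np.ndarray]:
    """End-to-end sketch: edit sparse keyframes, then interpolate the rest."""
    key_ids = list(range(0, len(frames), keyframe_stride))
    if key_ids[-1] != len(frames) - 1:
        key_ids.append(len(frames) - 1)
    edited = {i: edit_keyframe(frames[i], prompt) for i in key_ids}

    interp = MaskedInterpolator()
    out: List[np.ndarray] = []
    for a, b in zip(key_ids, key_ids[1:]):
        out.append(edited[a])
        out.extend(interp.interpolate(edited[a], edited[b], b - a - 1))
    out.append(edited[key_ids[-1]])
    return out

# Toy usage: 17 dummy frames of shape (H, W, 3)
video = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(17)]
edited_video = edit_video(video, "make it look like a watercolor painting")
print(len(edited_video))  # 17 frames, only 3 of which touch the diffusion editor
```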
