

Haoyu Ma
@HaoyumaU
MTS @ Microsoft AI | Former Research Scientist @ Meta Superintelligence Labs | PhD in CS, UC Irvine | Ex-intern at Meta/Adobe/Tencent/Baidu




🚀 Thrilled to introduce ☕️ MoCha: Towards Movie-Grade Talking Character Synthesis. Please unmute to hear the demo audio.
✨ We define a novel task, Talking Characters, which aims to generate character animations directly from natural language and speech input.
✨ We propose MoCha, a first-of-its-kind DiT model capable of movie-grade talking character generation.
✨ MoCha enables, for the first time, multi-character conversations with turn-based dialogue generation, pushing the boundaries of automated filmmaking.
Paper: arxiv.org/pdf/2503.23307
Project website: congwei1230.github.io/MoCha/
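As a rough illustration of the task as defined in the post (speech plus natural-language direction in, character animation out, with turn-based multi-character dialogue), here is a minimal sketch of what such an interface could look like. The data class and `generate_talking_characters` function are hypothetical placeholders, not MoCha's actual API; stubs keep the example runnable.

```python
# Hypothetical sketch of the "talking characters" task: natural language + speech -> animation.
# All names below are illustrative placeholders, not MoCha's real interface.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DialogueTurn:
    speaker: str            # which character speaks in this turn
    speech: np.ndarray      # waveform for this line of dialogue (assumed 16 kHz mono)
    direction: str          # natural-language direction, e.g. "leans forward, smiles"

def generate_talking_characters(scene_prompt: str, turns: List[DialogueTurn]) -> np.ndarray:
    """Placeholder generator: a DiT-style model would condition on the scene text and the
    ordered speech turns to produce video frames of shape (T, H, W, 3)."""
    total_samples = sum(t.speech.shape[0] for t in turns)
    num_frames = max(1, int(total_samples / 16_000 * 24))  # assume 24 fps output
    return np.zeros((num_frames, 256, 256, 3), dtype=np.float32)  # stub output

# Toy two-character, turn-based exchange
turns = [
    DialogueTurn("Ava", np.zeros(32_000, np.float32), "looks up from the map, curious"),
    DialogueTurn("Ben", np.zeros(48_000, np.float32), "shrugs, gestures toward the door"),
]
video = generate_talking_characters("two hikers in a dim cabin, handheld camera", turns)
print(video.shape)  # (120, 256, 256, 3): 5 seconds at 24 fps
```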

🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.
More details and examples of what Movie Gen can do ➡️ go.fb.me/kx1nqm
🛠️ Movie Gen models and capabilities
Movie Gen Video: A 30B parameter transformer model that generates high-quality, high-definition images and videos from a single text prompt.
Movie Gen Audio: A 13B parameter transformer model that takes a video input, along with optional text prompts for controllability, and generates high-fidelity audio synced to the video. It can produce ambient sound, instrumental background music and foley sound, delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.
Precise video editing: Using a generated or existing video and accompanying text instructions as input, the model can perform localized edits such as adding, removing or replacing elements, or global changes like a new background or style.
Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement.
We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.
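A hedged sketch of the workflow the post describes: a text prompt drives the text-to-video model, and the resulting clip (plus an optional text prompt) drives the video-to-audio model. The function names, shapes, and frame rate below are assumptions for illustration, not Meta's released API; stubs keep the example runnable.

```python
# Illustrative chaining of text-to-video and video-to-audio generation, per the post.
# movie_gen_video / movie_gen_audio are placeholder names, not an official API.
from typing import Optional
import numpy as np

def movie_gen_video(prompt: str, num_frames: int = 48) -> np.ndarray:
    """Placeholder for the 30B text-to-video transformer: returns frames (T, H, W, 3)."""
    return np.zeros((num_frames, 256, 256, 3), dtype=np.float32)  # stub output

def movie_gen_audio(video: np.ndarray, prompt: Optional[str] = None,
                    sample_rate: int = 16_000) -> np.ndarray:
    """Placeholder for the 13B video-to-audio transformer: returns a mono waveform
    whose length matches the video duration (assumed 24 fps here)."""
    duration_s = video.shape[0] / 24.0
    return np.zeros(int(duration_s * sample_rate), dtype=np.float32)  # stub output

clip = movie_gen_video("a koala surfing at golden hour")
soundtrack = movie_gen_audio(clip, prompt="gentle waves, ambient ukulele")
print(clip.shape, soundtrack.shape)  # (48, 256, 256, 3) (32000,)
```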



🆕 Research paper from GenAI at Meta: Imagine yourself: Tuning-Free Personalized Image Generation.
Research paper ➡️ go.fb.me/wre8f0
Want to try it? The feature is available now as a beta in Meta AI for users in the US.

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
Paper page: huggingface.co/papers/2312.12…
Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, which makes deployment in practical applications challenging. This study addresses this challenge by breaking the text-based video editing process into two stages. In the first stage, we leverage an existing text-to-image diffusion model to edit a few keyframes simultaneously, without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, built on non-autoregressive masked generative transformers, which specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.
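A minimal sketch of the two-stage pipeline described in the abstract: sparse keyframes are edited with an off-the-shelf text-to-image diffusion model, then the remaining frames are filled in by interpolation. The helper names (`edit_keyframe`, `MaskedInterpolator`, `edit_video`) are hypothetical stand-ins rather than the paper's code, and stubs replace the diffusion editor and the masked-transformer interpolator so the example runs end to end.

```python
# Hypothetical sketch of the two-stage MaskINT pipeline (placeholder names, runnable stubs).
from typing import List
import numpy as np

def edit_keyframe(frame: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 1 (stub): edit one keyframe with a pretrained text-to-image
    diffusion model, with no additional fine-tuning."""
    return frame  # a real system would call a T2I editing model here

class MaskedInterpolator:
    """Stage 2 (stub): non-autoregressive masked generative transformer that
    fills in the frames between two edited keyframes."""
    def interpolate(self, start: np.ndarray, end: np.ndarray, n: int) -> List[np.ndarray]:
        # a linear blend stands in for iterative masked-token decoding
        return [(1 - t) * start + t * end
                for t in np.linspace(0.0, 1.0, n + 2)[1:-1]]

def edit_video(frames: List[np.ndarray], prompt: str, keyframe_stride: int = 8) -> List[np.ndarray]:
    """End-to-end sketch: edit sparse keyframes, then interpolate the rest."""
    key_ids = list(range(0, len(frames), keyframe_stride))
    if key_ids[-1] != len(frames) - 1:
        key_ids.append(len(frames) - 1)
    edited = {i: edit_keyframe(frames[i], prompt) for i in key_ids}

    interp = MaskedInterpolator()
    out: List[np.ndarray] = []
    for a, b in zip(key_ids, key_ids[1:]):
        out.append(edited[a])
        out.extend(interp.interpolate(edited[a], edited[b], b - a - 1))
    out.append(edited[key_ids[-1]])
    return out

# Toy usage: 17 dummy frames of shape (H, W, 3)
video = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(17)]
edited_video = edit_video(video, "make it look like a watercolor painting")
print(len(edited_video))  # 17 frames, only 3 of which touch the diffusion editor
```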
