Mannat Singh

67 posts

@mannat_singh

Research Engineer @ Meta Superintelligence Labs. Researching and building multimodal models with a focus on media generation.

Manhattan, NY · Joined December 2010
206 Following · 382 Followers
Aishwarya Kamath @ashkamath20
We released Gemma 4 last week, and seeing the community's response has been amazing! 🚀 Honored to have led the vision efforts, where we made huge performance leaps over Gemma 3. I wanted to help you make the most of the new capabilities. Deep dive 🧵
[image]
27 replies · 108 reposts · 907 likes · 45.8K views
Mannat Singh @mannat_singh
The US is in a state of absolute tyranny, and executions are occurring in front of our eyes. This needs to stop; everyone, from ICE officers to the President, must be brought to justice. As non-citizens, we are told to steer clear of politics, but I have to speak up - we all do!
0 replies · 0 reposts · 1 like · 167 views
Devi Parikh @deviparikh
Excited to share a sneak peek into what we've been building at Yutori! What you see below is our trained model and internal prototype: multiple agents running in parallel in the background, completing tasks of varying complexity, with relevant information and cues to step in surfaced to the user. More examples 👇 This is barely scratching the surface of what agents can do for you day-to-day. Follow along at @yutori_ai; more to come soon!
45 replies · 51 reposts · 464 likes · 221.1K views
Mannat Singh @mannat_singh
@koval_alvi Indeed, this is another advantage of the text VE: we don't simply learn a 1:1 mapping!
0 replies · 0 reposts · 1 like · 64 views
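The "text VE" in this reply is a variational encoder over the caption (see the CrossFlow thread below). A minimal sketch of why that avoids a 1:1 mapping, with hypothetical module and parameter names rather than the paper's actual code: the encoder outputs a Gaussian per caption, so each sample decodes to a different image latent.

```python
import torch
import torch.nn as nn

class VariationalTextEncoder(nn.Module):
    """Illustrative variational encoder: maps pooled text features to a
    Gaussian (mu, logvar) rather than a single point, so one caption
    corresponds to a distribution of latents instead of a 1:1 mapping."""

    def __init__(self, text_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(text_dim, latent_dim)
        self.logvar = nn.Linear(text_dim, latent_dim)

    def forward(self, text_feats: torch.Tensor):
        mu, logvar = self.mu(text_feats), self.logvar(text_feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # Standard VAE-style KL term keeps the latent close to N(0, I).
        kl = 0.5 * (logvar.exp() + mu.pow(2) - 1 - logvar).sum(-1).mean()
        return z, kl
```

Sampling z, rather than always taking mu, is what lets one caption map to many images.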
Mannat Singh @mannat_singh
Flow matching can transform one distribution to another. So why do text-to-image models map noise to images instead of directly mapping text to images? Wouldn't it be cool to directly connect modalities together? CrossFlow accomplishes exactly that! cross-flow.github.io
[image]
2 replies · 41 reposts · 321 likes · 32.8K views
Mannat Singh @mannat_singh
In fact, we find that this simple design scales *even better* than conventional FM with both model size and training steps. Lots of other details in the paper (arxiv.org/abs/2412.15213), like enabling CFG and the importance of the Variational Encoder.
[image]
1 reply · 0 reposts · 11 likes · 1.3K views
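A minimal sketch of the idea in this thread, assuming PyTorch and a hypothetical `velocity_model`; conventional flow matching would draw `x0` from N(0, I), and the CrossFlow-style change is simply to use a text latent with the same shape as the image latent `x1` (an illustration, not the paper's implementation):

```python
import torch

def flow_matching_loss(velocity_model, x0, x1):
    """One flow-matching training step between paired endpoints.

    Conventional FM: x0 ~ N(0, I), i.e. noise-to-image.
    CrossFlow-style: x0 is a text latent with the same shape as the
    image latent x1, so the model transports text directly to images.
    """
    b = x0.shape[0]
    # Sample a time t per example and broadcast over the remaining dims.
    t = torch.rand(b, device=x0.device).view(b, *([1] * (x0.dim() - 1)))
    xt = (1 - t) * x0 + t * x1   # point on the straight path x0 -> x1
    target = x1 - x0             # constant velocity along that path
    pred = velocity_model(xt, t.flatten())
    return ((pred - target) ** 2).mean()
```

Because only the source distribution changes, the training loop, sampler, and architecture can stay exactly as in standard FM, which is consistent with the "simple design" claim above.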
Mannat Singh reposted
AI at Meta @AIatMeta
As detailed in the Meta Movie Gen technical report, today we're open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help enable the AI research community to progress work on more capable audio and video generation models.

Movie Gen Video Bench is the largest and most comprehensive benchmark ever released for evaluating text-to-video generation. It includes a collection of 1,000+ prompts that cover concepts ranging from detailed human activity to animals, physics, unusual subjects and more, with broad coverage across different motion levels.

Movie Gen Audio Bench is a first-of-its-kind benchmark aimed at evaluating video-to-audio and (text+video)-to-audio generation. It includes 527 generated videos and associated sound effects and music prompts covering a diverse set of ambient environments and sound effects.

To enable fair and easy comparison to our models for future works, these new benchmarks include non-cherry-picked generated videos and audio from Movie Gen. In releasing these new benchmarks we hope to promote fair and extensive evaluations in media generation research to enable greater progress in this field.
44 replies · 217 reposts · 1K likes · 156.1K views
Mannat Singh @mannat_singh
Finally, @_rohitgirdhar_ and I can talk about our detour into Llama 3 video understanding. You need to understand videos (and caption them 💬) to generate good-quality videos! 🐨
0 replies · 0 reposts · 4 likes · 244 views
Mannat Singh @mannat_singh
Check out Movie Gen 🎥 Our latest media generation models for video generation, editing, and personalization, with audio generation! 16-second 1080p videos generated with a simple Llama-style 30B transformer. Demo + detailed 92-page technical report 📝⬇️
AI at Meta @AIatMeta

🎥 Today we're premiering Meta Movie Gen: the most advanced media foundation models to date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We're excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.

More details and examples of what Movie Gen can do ➡️ go.fb.me/kx1nqm

🛠️ Movie Gen models and capabilities:

Movie Gen Video: A 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt.

Movie Gen Audio: A 13B parameter transformer model that can take a video input, along with optional text prompts for controllability, to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound, delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.

Precise video editing: Using a generated or existing video and accompanying text instructions as input, it can perform localized edits such as adding, removing or replacing elements, or global changes like background or style changes.

Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement.

We're continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.
1 reply · 1 repost · 16 likes · 1K views
Mannat Singh @mannat_singh
Llama 3.1 is out! Through adapters we've made it multimodal, supporting images, videos, and speech! It was a fun journey adding video understanding capabilities with @_rohitgirdhar_, @filipradenovic, @imisra_ and the whole MM team! P.S. The MM models are a WIP (not part of this release).
AI at Meta @AIatMeta

Starting today, open source is leading the way. Introducing Llama 3.1: our most capable models yet.

Today we're releasing a collection of new Llama 3.1 models, including our long-awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context window and improved support for 8 languages, among other improvements. Llama 3.1 405B rivals leading closed source models on state-of-the-art capabilities across a range of tasks in general knowledge, steerability, math, tool use and multilingual translation.

The models are available to download now directly from Meta or @huggingface. With today's release the ecosystem is also ready to go, with 25+ partners rolling out our latest models, including @awscloud, @nvidia, @databricks, @groqinc, @dell, @azure and @googlecloud ready on day one.

More details in the full announcement ➡️ go.fb.me/tpuhb6
Download Llama 3.1 models ➡️ go.fb.me/vq04tr

With these releases we're setting the stage for unprecedented new opportunities and we can't wait to see the innovation our newest models will unlock across all levels of the AI community.
1 reply · 5 reposts · 23 likes · 2K views
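The tweet above says multimodality was added "through adapters" but gives no architecture details. Below is a minimal sketch of one common adapter pattern, a gated cross-attention layer feeding encoder features into a frozen LLM; the class, gating, and wiring here are assumptions for illustration, not Llama 3.1's actual design:

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Illustrative gated cross-attention adapter. The frozen LLM's hidden
    states (query) attend to features from another modality (key/value),
    e.g. image, video, or speech encoder outputs projected to d_model.
    Only the adapter parameters are trained; the LLM stays frozen."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: no-op at start

    def forward(self, hidden: torch.Tensor, modality_feats: torch.Tensor):
        # hidden: (B, T, d_model); modality_feats: (B, S, d_model)
        attended, _ = self.attn(self.norm(hidden), modality_feats, modality_feats)
        # Tanh-gated residual: at init the adapter leaves the LLM unchanged.
        return hidden + torch.tanh(self.gate) * attended
```

Zero-initializing the gate makes the adapter a no-op at the start of training, so the pretrained text model's behavior is preserved while the multimodal pathway is learned.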