Yuwei Fang

16 posts

@studyfang_

Principal AI Scientist @ Zoom AI. Building agentic memory & LLM post-training. Previously Snap Research & Microsoft Azure AI.

Bellevue, WA · Joined May 2016
388 Following · 158 Followers
Yuwei Fang retweeted
Ziyi Wu@Dazitu_616·
📢MinT: Temporally-Controlled Multi-Event Video Generation📢 mint-video.github.io TL;DR: We identify a fundamental failure mode of existing video generators: they cannot produce videos with sequential events. MinT unlocks this capability with temporal grounding of events. 🧵
12 replies · 52 reposts · 189 likes · 33K views
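To make "temporal grounding of events" concrete, here is a minimal, hypothetical sketch in Python of what a temporally grounded multi-event prompt could look like (not MinT's actual interface): each event carries a caption plus a start/end time, and the spans are mapped onto the frames they cover so a generator could condition each frame on its active event.

from dataclasses import dataclass

@dataclass
class Event:
    caption: str  # text description of a single event
    start: float  # event start time in seconds
    end: float    # event end time in seconds

def events_to_frame_schedule(events, num_frames, fps):
    # For each frame, list the indices of the events whose time span covers it.
    schedule = [[] for _ in range(num_frames)]
    for idx, ev in enumerate(events):
        first = max(0, int(ev.start * fps))
        last = min(num_frames - 1, int(ev.end * fps))
        for f in range(first, last + 1):
            schedule[f].append(idx)
    return schedule

prompt = [
    Event("a man pours coffee into a mug", 0.0, 2.0),
    Event("he takes a sip and smiles", 2.0, 4.0),
]
print(events_to_frame_schedule(prompt, num_frames=32, fps=8))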
Yuwei Fang@studyfang_·
Thanks @_akhaliq for sharing our work! Excited to share our latest work VIMI for grounded video generation! This is a great collaboration with @WilliMenapace @siarohin9013 @tsaishien_chen @kcjacksonwang @isskoro @gneubig @SergeyTulyakov ! Project page: snap-research.github.io/VIMI/
AK@_akhaliq

Snap presents VIMI: Grounding Video Generation through Multi-modal Instruction. Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. Secondly, we finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions. This process further refines the model's ability to handle diverse inputs and tasks, ensuring seamless integration of multi-modal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, producing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous visually grounded video generation methods, VIMI can synthesize consistent and temporally coherent videos with large motion while retaining semantic control. Lastly, VIMI also achieves state-of-the-art text-to-video generation results on the UCF101 benchmark.

1 reply · 5 reposts · 22 likes · 11.5K views
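The abstract's key data step, pairing text prompts with retrieved in-context examples, boils down to nearest-neighbor search over embeddings. A minimal sketch, assuming precomputed embeddings and cosine similarity (the actual VIMI retrieval pipeline may differ):

import numpy as np

def retrieve_in_context_examples(prompt_emb, example_embs, k=3):
    # Rank candidate examples by cosine similarity to the prompt embedding.
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    example_embs = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = example_embs @ prompt_emb
    return np.argsort(-sims)[:k]

# Toy usage: random vectors stand in for a real text/image encoder's outputs.
rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 512))   # embeddings of candidate visual examples
query = rng.normal(size=(512,))       # embedding of the text prompt
print(retrieve_in_context_examples(query, bank, k=3))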
Yuwei Fang@studyfang_·
We are also excited to share our new work on minute-long video editing, VIA: via-video.github.io. Excited about video generation? Come have a chat with us.
Sergey Tulyakov@SergeyTulyakov

Stop by our @CVPR posters to meet the team! We present 7 posters today; 2 papers are highlights. Video generation, 3D scene generation, 4D generation, improving the quality of synthesized images, and more!

0 replies · 1 repost · 2 likes · 273 views
Yuwei Fang@studyfang_·
Thanks @_akhaliq for sharing our work! Excited to present LoCoMo for comprehensively evaluating conversational memory with our curated very long-term conversation datasets. Full thread with all dataset/evaluation framework/methods/analysis details at: twitter.com/adyasha10/stat…
AK@_akhaliq

Snap presents Evaluating Very Long-Term Conversational Memory of LLM Agents. Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.

0 replies · 4 reposts · 22 likes · 11.1K views
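As an illustration of the RAG-style baseline the abstract mentions, here is a minimal, hypothetical sketch: score past dialogue turns against the question (word overlap stands in for a dense retriever) and assemble the top-k turns into a prompt for question answering over the conversation history.

def retrieve_turns(question, turns, k=5):
    # Score each past turn by word overlap with the question; a real system
    # would use a dense retriever over turn or session embeddings.
    q = set(question.lower().split())
    return sorted(turns, key=lambda t: len(q & set(t.lower().split())), reverse=True)[:k]

def build_qa_prompt(question, turns, k=5):
    memory = "\n".join(retrieve_turns(question, turns, k))
    return f"Relevant conversation history:\n{memory}\n\nQuestion: {question}\nAnswer:"

history = [
    "Session 1, Alice: I adopted a puppy named Milo last spring.",
    "Session 7, Alice: Milo just finished obedience school.",
    "Session 12, Bob: I finally moved to Seattle for the new job.",
]
print(build_qa_prompt("When did Alice adopt her puppy?", history, k=2))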
Yuwei Fang retweeted
Adyasha Maharana@adyasha10·
Can LLMs keep track of very long conversations? We evaluate 'conversational memory' of LLMs via 3 tasks on our dataset of multi-session multimodal dialogs --> LLMs struggle to remember, reason over history, draw long-range temporal/causal connections arxiv.org/abs/2402.17753 🧵
5 replies · 59 reposts · 182 likes · 30.4K views
Yuwei Fang@studyfang_·
Plus, we're still looking for summer research interns in 2024. Please send your resume to yfang3@snapchat.com!
0 replies · 0 reposts · 1 like · 154 views
Yuwei Fang@studyfang_·
Thanks @_akhaliq for sharing our work! Excited to share our latest creation ‘Snap Video’! Dive into our project page for more fun stories we’re creating. Project page: snap-research.github.io/snapvideo/ ArXiv: arxiv.org/abs/2402.14797
AK@_akhaliq

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis. Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5× faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods.

1 reply · 6 reposts · 21 likes · 11.6K views
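The core architectural idea, replacing the U-Net with a transformer over spatiotemporal tokens, can be illustrated with a toy joint space-time attention layer. This is a generic PyTorch sketch of that idea, not the actual Snap Video architecture:

import torch
import torch.nn as nn

class JointSpaceTimeAttention(nn.Module):
    # Self-attention over all space-time tokens of a clip at once, to show
    # the general idea of treating a video as one sequence of patches.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = x.shape
        seq = x.reshape(b, t * n, d)            # flatten space and time into one sequence
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        return (seq + out).reshape(b, t, n, d)  # residual connection, restore shape

x = torch.randn(2, 16, 64, 128)                 # 16 frames of 8x8 latent patches, 128-dim tokens
print(JointSpaceTimeAttention(128)(x).shape)    # torch.Size([2, 16, 64, 128])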
Yuwei Fang@studyfang_·
We are hiring research interns at Snap for 2024! Our research topics range from multi-modal LLMs and efficient DL to image/video/3D generation, personalization, and editing. Please feel free to reach us directly or submit your application here: snap-research.github.io/cv-call-for-in….
0 replies · 0 reposts · 6 likes · 530 views
Prithviraj (Raj) Ammanabrolu@rajammanabrolu·
Soon™, I'll be an Asst Prof @UCSanDiego @UCSD_CSE focusing on interactive & grounded AI, RL, and NLP. I will also be a research scientist @MosaicML, helping lead efforts to make tech like RLHF more accessible. Looking for PhD students & research eng/scientists to join me in ☀️SoCal🏖️
71 replies · 40 reposts · 528 likes · 190.9K views
Yuwei Fang@studyfang_·
@bebensee_ I was at the conference today. Please DM me directly tomorrow. See you soon!
0 replies · 0 reposts · 1 like · 18 views
Björn Bebensee@bebensee_·
@studyfang_ When would be a good time for you to chat? Are you at the conference today? Would love to learn more about NLP research at Snap :)
1 reply · 0 reposts · 0 likes · 17 views
Yuwei Fang@studyfang_·
It will be my first time attending #ACL2023 in person! So excited! Anyone interested in having a chat? Let's meet there! Besides, we are hiring research scientists and interns to work on multimodal AI and NLP at Snap Research. If you are interested, let's talk about it!
4 replies · 1 repost · 24 likes · 3.3K views
🇺🇦 Alex Polozov@Skiminok·
A week ago, I signed out of my Google corp account, departing on a 3-month personal leave. I'm writing here to thank Google - and my entire management chain - wholeheartedly for this opportunity. It's rare in the US. Especially being at the company less than 2 years!
5 replies · 0 reposts · 109 likes · 25.2K views
Björn Bebensee@bebensee_·
@studyfang_ First time attending ACL in person for me too :-) Would love to have a chat!
1 reply · 0 reposts · 0 likes · 54 views