Kevin Lin

🚀🚀Excited to introduce GenXD: Generating Any 3D and 4D Scenes! A joint framework for general 3D and 4D generation, supporting both object-level and scene-level generation. Project Page: gen-x-d.github.io Arxiv: arxiv.org/abs/2411.02319


Kevin Lin@linkeyun2·Nov 6

Happy to share our recent work on GenXD!

Yuyang Zhao@yuyangzhao_


Kevin Lin retweeted

Yan@AnYan_ai·Oct 6

I am attending #COLM2024 in Philly! Will present our paper “List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs” on Monday morning ⏰ Come and chat if you are interested in multimodal LLMs, synthetic data and training recipes!


Kevin Lin retweeted

OpenAI@OpenAI·Aug 8

We’re rolling out the ability for ChatGPT Free users to create up to two images per day with DALL·E 3. Just ask ChatGPT to create an image for a slide deck, personalize a card for a friend, or show you what something looks like.


Kevin Lin retweeted

OpenAI@OpenAI·Apr 3

You can now edit DALL·E images in ChatGPT across web, iOS, and Android.


Kevin Lin@linkeyun2·Jan 16

Our work was accepted at ICLR 2024! Paper and reviews: openreview.net/forum?id=J44Hf… Code & data: github.com/FuxiaoLiu/LRV-…

Aligning Large Multi-Modal Model with Robust Instruction Tuning paper page: huggingface.co/papers/2306.14… Despite the promising progress in multi-modal tasks, current large multi-modal models (LMM) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach to evaluate visual instruction tuning without the need for human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly with Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets using less training data compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model.


Kevin Lin@linkeyun2·Nov 14

Thanks for sharing! @_akhaliq 📲We present MM-Navigator, an agent system built on GPT-4V for smartphone GUI navigation. 📲MM-Navigator incorporates action histories and set-of-mark tags to produce precise executable actions. Project page: github.com/zzxslp/MM-Navi…

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation paper page: huggingface.co/papers/2311.07… We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through its advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task.


Kevin Lin@linkeyun2·Nov 1

@YamanKSingla @_akhaliq Great work! Thank you for sharing @YamanKSingla


Yaman Kumar Singla@YamanKSingla


Yaman Kumar Singla@YamanKSingla·Nov 1

@linkeyun2 @_akhaliq Hey Kevin! Nice work... Do check out this related work by our team in a similar setting: twitter.com/YamanKSingla/s…

Accepted at EMNLP23.


Kevin Lin@linkeyun2·Oct 31

Thanks for featuring our work! @_akhaliq We explore GPT-4V for many interesting video tasks. Check out our recent works: 📢 MM-VID: multimodal-vid.github.io 📢 DEsignBench: design-bench.github.io 📢 Idea2Img: idea2img.github.io 📢 The Dawn of LMMs: arxiv.org/abs/2309.17421

MM-VID: Advancing Video Understanding with GPT-4V(ision) paper page: huggingface.co/papers/2310.19… We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses a video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphic user interfaces.


Kevin Lin retweeted

AK@_akhaliq·Oct 31

MM-VID: Advancing Video Understanding with GPT-4V(ision) paper page: huggingface.co/papers/2310.19… We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses a video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphic user interfaces.


Can we MERGE weights of different MODALITIES? With naive merging, the answer is no. However, we find an effective recipe for improving merging results significantly in “An Empirical Study of Multimodal Model Merging” arxiv.org/abs/2304.14933 🧵👇 @linjiefun @zhegan4 @mohitban47


Kevin Lin retweeted

AK@_akhaliq·Oct 13

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation paper page: huggingface.co/papers/2310.08… We introduce “Idea to Image,” a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models via iterative explorations. This enables them to efficiently convert their high-level generation ideas into effective T2I prompts that can produce good images. We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable exploring unknown models or environments via self-refining tries. Idea2Img cyclically generates revised T2I prompts to synthesize draft images, and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model's characteristics. The iterative self-refinement brings Idea2Img various advantages over vanilla T2I models. Notably, Idea2Img can process input ideas with interleaved image-text sequences, follow ideas with design instructions, and generate images of better semantic and visual qualities. The user preference study validates the efficacy of multimodal iterative self-refinement on automatic image design and generation.


Kevin Lin retweeted

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·Oct 2

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) Link: arxiv.org/abs/2309.17421 A 166-page report from Microsoft qualitatively exploring GPT-4V capabilities and usage. Describes visual+text prompting techniques, few-shot learning, reasoning, etc. Looks like it will be a must-read for GPT-4V power users 👀


Kevin Lin retweeted

Yi Lin Sung@yilin_sung·Oct 12

🚨 Multimodal Model Merging is accepted to #EMNLP2023 findings! Check out our camera-ready version with exps on more tasks & architectures, and a new metric to better predict whether two models are mergeable --> arxiv.org/abs/2304.14933 @linjiefun @linkeyun2 @zhegan4 @mohitban47

Yi Lin Sung@yilin_sung


Kevin Lin retweeted

AK@_akhaliq·Aug 7

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities paper page: huggingface.co/papers/2308.02… We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.


Kevin Lin retweeted

Tan Wang@Wangt97·Oct 2

Excited to share our EqBen/EqSim, a new benchmark/algorithm focusing on evaluating and improving the similarity measure of V&L foundation models, to be presented in Oral session at #ICCV2023 in Paris! Joint work w/ @linkeyun2 @LINJIEFUN CC Lin, ZY Yang, HW Zhang, ZC Liu, LJ Wang.

Paris, France 🇫🇷


Kevin Lin retweeted

Tan Wang@Wangt97·Jul 9

I will be at @icvss for the upcoming week presenting our DisCo (disco-dance.github.io), with an interactive demo to turn static images into human dancing videos! Big thanks to @GMFarinella, @robertocipolla, @sebattiato for organizing ICVSS.

DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: huggingface.co/papers/2307.00… Generative AI has made significant strides in computer vision, particularly in image/video synthesis conditioned on text descriptions. Despite the advancements, it remains challenging especially in the generation of human-centric content such as dance synthesis. Existing dance synthesis methods struggle with the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting: Referring Human Dance Generation, which focuses on real-world dance scenarios with three important properties: (i) Faithfulness: the synthesis should retain the appearance of both human subject foreground and background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: it should allow for composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce a novel approach, DISCO, which includes a novel model architecture with disentangled control to improve the faithfulness and compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions.


Kevin Lin retweeted

Linjie (Lindsey) Li@LINJIEFUN·Jul 7

We are hiring full-time/part-time research interns all year round. If you are interested, please send your resume to linjli@microsoft.com.


Kevin Lin@linkeyun2·Jul 4

Thanks @_akhaliq for sharing our work! Please find more details below. Github Page: github.com/Wangt-CN/DisCo Arxiv: arxiv.org/abs/2307.00040 Huggingface: huggingface.co/papers/2307.00… Youtube: youtu.be/alJKsj3JpBo




Kevin Lin retweeted

Tan Wang@Wangt97·Jul 4

Thx @_akhaliq! Check out our DisCo at disco-dance.github.io. 🔥🔥🔥 🧙‍♂️ High generalizability. No need for human-specific fine-tuning! 💃 Extensive human-related applications with disentangled control! 👨‍💻 Easy-to-follow framework and fully open-source code!