Kevin Lin

22 posts

Kevin Lin

@linkeyun2

Senior Researcher @Microsoft; PhD @UW; Computer Vision, Machine Learning

Seattle, Washington · Joined June 2022
131 Following · 106 Followers
Kevin Lin retweeted
Yan@AnYan_ai·
I am attending #COLM2024 in Philly! Will present our paper “List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs” on Monday morning ⏰ Come and chat if you are interested in multimodal LLMs, synthetic data and training recipes!
[image]
1 reply · 4 retweets · 24 likes · 2.6K views
Kevin Lin retweeted
OpenAI@OpenAI·
We’re rolling out the ability for ChatGPT Free users to create up to two images per day with DALL·E 3. Just ask ChatGPT to create an image for a slide deck, personalize a card for a friend, or show you what something looks like.
383 replies · 515 retweets · 3.2K likes · 661.1K views
Kevin Lin retweeted
OpenAI@OpenAI·
You can now edit DALL·E images in ChatGPT across web, iOS, and Android.
297 replies · 1K retweets · 5.8K likes · 1.6M views
Kevin Lin@linkeyun2·
Our work was accepted at ICLR 2024! Paper and reviews: openreview.net/forum?id=J44Hf… Code & data: github.com/FuxiaoLiu/LRV-…
AK@_akhaliq

Aligning Large Multi-Modal Model with Robust Instruction Tuning paper page: huggingface.co/papers/2306.14… Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without requiring human-annotated ground-truth answers and adapts to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly with Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets using less training data than state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model.

0 replies · 3 retweets · 16 likes · 2.1K views
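The two negative-instruction levels are easiest to see with a concrete example. Below is a minimal illustrative sketch; the questions, answers, and assumed scene are hypothetical stand-ins, not samples from LRV-Instruction.

```python
# Hypothetical examples of the two negative-instruction levels described
# above; illustrative stand-ins, not LRV-Instruction samples.
# Assumed scene: a brown dog catching a red frisbee near a tree.

def nonexistent_element_example():
    # Level (i) Nonexistent Element Manipulation: the instruction
    # references an object that is not in the image at all.
    instruction = "Describe the cat sitting under the tree."
    robust_answer = "There is no cat in the image."
    return instruction, robust_answer

def existent_element_example():
    # Level (ii) Existent Element Manipulation: the instruction references
    # a real object but attaches a wrong attribute to it.
    instruction = "Why is the dog chasing a blue frisbee?"
    robust_answer = "The frisbee in the image is red, not blue."
    return instruction, robust_answer

for example in (nonexistent_element_example, existent_element_example):
    instruction, robust_answer = example()
    print(f"Q: {instruction}\nA: {robust_answer}\n")
```

A robustly tuned model should give the corrective answer rather than hallucinate the nonexistent cat or the blue frisbee.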
Kevin Lin@linkeyun2·
Thanks for sharing! @_akhaliq 📲 We present MM-Navigator, an agent system built on GPT-4V for smartphone GUI navigation. 📲 MM-Navigator incorporates action histories and set-of-mark tags to produce precise executable actions. Project page: github.com/zzxslp/MM-Navi…
AK@_akhaliq

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation paper page: huggingface.co/papers/2311.07… We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task.

0 replies · 5 retweets · 32 likes · 16K views
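To make the agent loop in the abstract concrete, here is a minimal sketch of one MM-Navigator-style step: a GPT-4V-class model is shown a screenshot annotated with numbered set-of-mark tags plus the action history, and asked for the next executable action. The prompt wording, the `ACTION: tap(<id>)` output format, and the `gpt-4o` model name are assumptions, not the paper's exact setup.

```python
import base64
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def next_action(screenshot_path: str, instruction: str, history: list[str]) -> str:
    """Ask a GPT-4V-class model for the next GUI action.

    Assumes `screenshot_path` points to a screenshot already annotated with
    numbered set-of-mark tags; the prompt and the 'ACTION: tap(<tag id>)'
    output format are illustrative assumptions.
    """
    prompt = (
        f"Goal: {instruction}\n"
        f"Actions so far: {history}\n"
        "The screenshot is annotated with numeric tags on interactable "
        "elements. Reply with exactly one line: ACTION: tap(<tag id>)."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4V endpoint used in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(screenshot_path)}"}},
            ],
        }],
    )
    action = resp.choices[0].message.content.strip()
    history.append(action)  # action history conditions the next step
    return action
```

The set-of-mark tags matter because they turn "where to tap" into "which numbered tag to name", sidestepping pixel-coordinate localization.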
Kevin Lin@linkeyun2·
Thanks for featuring our work! @_akhaliq We explore GPT-4V for many interesting video tasks. Check out our recent works: 📢 MM-VID: multimodal-vid.github.io 📢 DEsignBench: design-bench.github.io 📢 Idea2Img: idea2img.github.io 📢 The Dawn of LMMs: arxiv.org/abs/2309.17421
AK@_akhaliq

MM-VID: Advancing Video Understanding with GPT-4V(ision) paper page: huggingface.co/papers/2310.19… We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphical user interfaces.

2 replies · 14 retweets · 57 likes · 20.3K views
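A minimal sketch of the video-to-script stage described above: sample frames, then ask a GPT-4V-class model to write a textual script that a downstream LLM can reason over. The sampling interval, prompt, and `gpt-4o` model name are assumptions, and the audio/speech tools MM-VID also uses are omitted here.

```python
import base64
import cv2  # assumes opencv-python is installed
from openai import OpenAI

client = OpenAI()

def sample_frames(video_path: str, every_n_sec: float = 10.0) -> list[bytes]:
    # Uniformly sample JPEG-encoded frames; the 10 s interval is an
    # arbitrary assumption, not MM-VID's actual sampling rate.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(fps * every_n_sec) == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(buf.tobytes())
        idx += 1
    cap.release()
    return frames

def video_to_script(video_path: str) -> str:
    # Ask a GPT-4V-class model to turn the sampled frames into a textual
    # script; the real system also folds in audio/ASR tools, omitted here.
    images = [{"type": "image_url", "image_url": {
        "url": "data:image/jpeg;base64," + base64.b64encode(b).decode()}}
        for b in sample_frames(video_path)]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text":
             "Write a scene-by-scene script for these video frames, "
             "describing characters, actions, expressions, and dialogue."},
            *images,
        ]}],
    )
    return resp.choices[0].message.content
```

Once the video is a script, any text-only LLM can answer questions about hour-long content without ever seeing a pixel.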
Kevin Lin retweeted
AK@_akhaliq·
MM-VID: Advancing Video Understanding with GPT-4V(ision) paper page: huggingface.co/papers/2310.19… (full abstract quoted above)
[image]
1 reply · 66 retweets · 267 likes · 73K views
Kevin Lin retweeted
AK@_akhaliq·
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation paper page: huggingface.co/papers/2310.08… We introduce "Idea to Image," a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models via iterative explorations, which enables them to efficiently convert their high-level generation ideas into effective T2I prompts that produce good images. We investigate whether systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable exploring unknown models or environments via self-refining attempts. Idea2Img cyclically generates revised T2I prompts to synthesize draft images, and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model's characteristics. The iterative self-refinement gives Idea2Img various advantages over vanilla T2I models. Notably, Idea2Img can process input ideas with interleaved image-text sequences, follow ideas with design instructions, and generate images of better semantic and visual quality. A user preference study validates the efficacy of multimodal iterative self-refinement for automatic image design and generation.
[image]
4 replies · 82 retweets · 342 likes · 69.7K views
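The iterative self-refinement loop reads naturally as code. The sketch below cycles prompt → draft image → GPT-4V critique → revised prompt while keeping a memory of earlier attempts; the `dall-e-3` call stands in for whatever black-box T2I model Idea2Img probes, and the prompts and `gpt-4o` model name are assumptions.

```python
from openai import OpenAI  # assumes the openai Python package

client = OpenAI()

def draft_image(prompt: str) -> str:
    # Stand-in for the probed text-to-image model; Idea2Img treats the
    # T2I model as a black box whose quirks it learns by probing.
    resp = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    return resp.data[0].url

def idea_to_image(idea: str, rounds: int = 3) -> str:
    memory: list[str] = []  # what has been learned about the T2I model so far
    prompt = idea
    for _ in range(rounds):
        url = draft_image(prompt)
        revised = client.chat.completions.create(
            model="gpt-4o",  # stand-in for GPT-4V
            messages=[{"role": "user", "content": [
                {"type": "text", "text":
                 f"Idea: {idea}\nCurrent T2I prompt: {prompt}\n"
                 f"Notes on this T2I model so far: {memory}\n"
                 "Critique the attached draft image against the idea, "
                 "then reply with only a revised T2I prompt."},
                {"type": "image_url", "image_url": {"url": url}},
            ]}],
        ).choices[0].message.content.strip()
        memory.append(f"tried: {prompt!r} -> revised to: {revised!r}")
        prompt = revised  # directional feedback drives the next draft
    return prompt
```

The memory is what distinguishes this from blind re-rolling: each revision is conditioned on everything the loop has learned about the probed model's behavior.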
Kevin Lin retweeted
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) Link: arxiv.org/abs/2309.17421 A 166-page report from Microsoft qualitatively exploring GPT-4V capabilities and usage. Describes visual+text prompting techniques, few-shot learning, reasoning, etc. Looks like it will be a must-read for GPT-4V power users 👀
[two images]
13 replies · 155 retweets · 653 likes · 180.5K views
Kevin Lin retweeted
Yi Lin Sung@yilin_sung·
🚨 Multimodal Model Merging is accepted to #EMNLP2023 findings! Check out our camera-ready version with experiments on more tasks & architectures, and a new metric to better predict whether two models are mergeable --> arxiv.org/abs/2304.14933 @linjiefun @linkeyun2 @zhegan4 @mohitban47
Yi Lin Sung@yilin_sung

Can we MERGE the weights of different MODALITIES? With naive merging, the answer is no. However, we find an effective recipe that significantly improves merging results in "An Empirical Study of Multimodal Model Merging" arxiv.org/abs/2304.14933 🧵👇 @linjiefun @zhegan4 @mohitban47

1 reply · 21 retweets · 70 likes · 7.2K views
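For reference, the "naive merging" the thread says fails is plain weight-space interpolation of two fine-tuned checkpoints. A minimal PyTorch sketch of that baseline is below; the paper's effective recipe goes beyond this simple average.

```python
import torch

def naive_merge(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Naive weight-space interpolation of two fine-tuned checkpoints.

    This is the baseline the thread says fails across modalities; the
    paper's effective recipe goes beyond this simple average.
    """
    assert sd_a.keys() == sd_b.keys(), "architectures must match"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Usage: merge a vision-tuned and a language-tuned copy of the same backbone.
# merged = naive_merge(vision_model.state_dict(), language_model.state_dict())
# model.load_state_dict(merged)
```

Interpolation only makes sense when both checkpoints were fine-tuned from the same initialization, which is why the sketch asserts matching architectures.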
Kevin Lin retweeted
AK@_akhaliq·
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities paper page: huggingface.co/papers/2308.02… We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.
[image]
1 reply · 21 retweets · 108 likes · 28.7K views
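A minimal sketch of the LLM-based evaluator idea: a judge model compares an open-ended answer against a reference and returns a single number, giving one unified metric across question types and answer styles. The grading prompt and `gpt-4o` model name here are assumptions; MM-Vet's actual evaluator uses a few-shot GPT-4 prompt with worked scoring examples.

```python
import re
from openai import OpenAI

client = OpenAI()

def llm_score(question: str, gold: str, prediction: str) -> float:
    """Score an open-ended answer in [0, 1] with an LLM judge.

    The grading prompt is an illustrative assumption; MM-Vet's evaluator
    uses a few-shot prompt with worked scoring examples instead.
    """
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            f"Question: {question}\nReference answer: {gold}\n"
            f"Model answer: {prediction}\n"
            "How correct is the model answer? Reply with a single number "
            "between 0.0 and 1.0."}],
    )
    match = re.search(r"\d*\.?\d+", resp.choices[0].message.content)
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0
```

Because the judge reads free-form text, the same scorer works whether the benchmark question expects a number, a phrase, or a paragraph.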
Kevin Lin retweeted
Tan Wang@Wangt97·
Excited to share our EqBen/EqSim, a new benchmark/algorithm for evaluating and improving the similarity measures of V&L foundation models, to be presented in an Oral session at #ICCV2023 in Paris! Joint work with @linkeyun2 @LINJIEFUN, CC Lin, ZY Yang, HW Zhang, ZC Liu, LJ Wang.
[image]
Paris, France 🇫🇷 · 1 reply · 2 retweets · 9 likes · 1.2K views
Kevin Lin retweeted
Tan Wang@Wangt97·
I will be at @icvss for the upcoming week presenting our DisCo (disco-dance.github.io), with an interactive demo that turns static images into human dancing videos! Big thanks to @GMFarinella, @robertocipolla, @sebattiato for organizing ICVSS.
AK@_akhaliq

DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: huggingface.co/papers/2307.00… Generative AI has made significant strides in computer vision, particularly in image/video synthesis conditioned on text descriptions. Despite these advancements, the generation of human-centric content such as dance synthesis remains challenging. Existing dance synthesis methods struggle with the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting: Referring Human Dance Generation, which focuses on real-world dance scenarios with three important properties: (i) Faithfulness: the synthesis should retain the appearance of both the human subject foreground and the background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: it should allow for the composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce a novel approach, DISCO, which includes a novel model architecture with disentangled control to improve the faithfulness and compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions.

1 reply · 2 retweets · 8 likes · 3.4K views
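Schematically, "disentangled control" means separate conditioning pathways for subject foreground, background, and pose, so they can be recombined from different sources. The PyTorch sketch below illustrates only that idea; the layer shapes are arbitrary and the real DisCo architecture (ControlNet-style branches over a diffusion U-Net, plus human attribute pre-training) is considerably more involved.

```python
import torch
import torch.nn as nn

class DisentangledControl(nn.Module):
    # Schematic only: three independent condition encoders whose fused
    # features modulate a denoising backbone, so foreground, background,
    # and pose can each come from a different source image.
    def __init__(self, dim: int = 256):
        super().__init__()
        def encoder() -> nn.Module:
            return nn.Sequential(
                nn.Conv2d(3, dim, kernel_size=4, stride=4),
                nn.GroupNorm(8, dim),
                nn.SiLU(),
            )
        self.fg_enc = encoder()    # subject appearance from a reference image
        self.bg_enc = encoder()    # background, possibly from another image
        self.pose_enc = encoder()  # target pose, e.g. a skeleton map
        self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

    def forward(self, latent, fg_ref, bg_ref, pose_map):
        # latent: (B, dim, H/4, W/4); fg_ref/bg_ref/pose_map: (B, 3, H, W)
        cond = torch.cat(
            [self.fg_enc(fg_ref), self.bg_enc(bg_ref), self.pose_enc(pose_map)],
            dim=1)
        # A real model would inject `cond` via cross-attention or ControlNet
        # residuals; a plain additive injection keeps the sketch minimal.
        return latent + self.fuse(cond)
```

Keeping the three pathways separate is what enables the compositionality property: swap in a new pose map or background without touching the subject encoder.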
Kevin Lin retweeted
Linjie (Lindsey) Li@LINJIEFUN·
We are hiring full-time/part-time research interns all year round. If you are interested, please send your resume to linjli@microsoft.com.
21 replies · 62 retweets · 498 likes · 97.9K views
Kevin Lin@linkeyun2·
Thanks @_akhaliq for sharing our work! Please find more details below. GitHub: github.com/Wangt-CN/DisCo arXiv: arxiv.org/abs/2307.00040 Hugging Face: huggingface.co/papers/2307.00… YouTube: youtu.be/alJKsj3JpBo
[YouTube video]
AK@_akhaliq

DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: huggingface.co/papers/2307.00… (full abstract quoted above)

0 replies · 0 retweets · 0 likes · 115 views
Kevin Lin retweeted
Tan Wang@Wangt97·
Thx @_akhaliq! Check out our DisCo at disco-dance.github.io. 🔥🔥🔥 🧙‍♂️ High generalizability: no human-specific fine-tuning needed! 💃 Extensive human-related applications with disentangled control! 👨‍💻 Easy-to-follow framework and fully open-source code!
AK@_akhaliq

DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: huggingface.co/papers/2307.00… (full abstract quoted above)

2 replies · 6 retweets · 8 likes · 3.8K views
Kevin Lin retweeted
AK@_akhaliq·
DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: huggingface.co/papers/2307.00… (full abstract quoted above)
5 replies · 116 retweets · 449 likes · 161.4K views