Yifan Yang

48 posts

@Yif_Yang

Multimodality researcher from Microsoft Research Asia.

Shanghai · Joined July 2017
88 Following · 30 Followers
Pinned Tweet
Yifan Yang@Yif_Yang·
🌟 Tired of CLIP's limitations and short input windows? ✨ Meet LLM2CLIP: our secret to making the SOTA CLIP model even more SOTA! By enabling LLMs to act as CLIP's "teacher," we achieve significant performance gains with minimal data and training.

We found that raw LLM output features struggle to distinguish image captions, which makes them confusing supervision for CLIP training. But with our caption-to-caption contrastive fine-tuning, LLMs reveal their text comprehension in their output features, becoming the ideal mentor for CLIP. 🧑‍🏫

LLM2CLIP overcomes CLIP's weaknesses: limited text understanding, bag-of-words-like behavior, short input windows, and structural challenges with dense, complex captions. With the LLM's open-world knowledge, we maximize CLIP's capacity on dense captions, achieving efficient and robust training across various text retrieval and LLaVA benchmarks. 🚀

📄 Paper: huggingface.co/papers/2411.04… (reviewed and already accepted at the NeurIPS 2024 SSL Workshop)
🔗 Models & Code: aka.ms/llm2clip

Ready to give your CLIP a "super private tutor," or just use our top-performing CLIP model! #LLM2CLIP #CLIP #MachineLearning #AI #NeurIPS2024 #ContrastiveLearning #DeepLearning #AIResearch
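A minimal PyTorch sketch of the caption-to-caption contrastive fine-tuning idea from this thread; the model identifier, pooling strategy, and temperature are illustrative assumptions, not the released LLM2CLIP training code.

```python
# Sketch only: two captions of the same image form a positive pair, captions of other
# images in the batch are negatives, so the LLM's output features become discriminative
# enough to supervise a CLIP text tower.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "your/decoder-llm"  # assumption: any HF language model exposing hidden states
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
llm = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def embed(captions):
    """Masked mean-pool of the last hidden state -> one unit vector per caption."""
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = llm(**batch).last_hidden_state                      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # masked mean pooling
    return F.normalize(pooled, dim=-1)

def caption_contrastive_loss(captions_a, captions_b, temperature=0.05):
    """Symmetric InfoNCE: captions_a[i] should match only captions_b[i]."""
    za, zb = embed(captions_a), embed(captions_b)
    logits = za @ zb.T / temperature
    labels = torch.arange(len(captions_a), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

Once fine-tuned this way, the LLM can be frozen and used as the text-side teacher for CLIP, as the thread describes.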
Yifan Yang@Yif_Yang·
Thanks AK for sharing our work! 🚀 Excited to share BizGenEval, the first systematic benchmark for commercial visual content generation.

📄 Paper: huggingface.co/papers/2603.25…
🌐 Project: aka.ms/bizGenEval
💻 Code: github.com/microsoft/BizG…

Unlike prior benchmarks focusing on natural images, we evaluate real-world design tasks:
🧩 5 domains: Slides, Webpages, Charts, Posters, Scientific Figures
🧠 4 capabilities: Text Rendering, Layout Control, Attribute Binding, Knowledge Reasoning
➡️ 20 tasks in total

We curated:
- 400 prompts (300 real-world + 100 knowledge-intensive)
- 8,000 fine-grained checklist questions
- Evaluations of 26 SOTA models (Nano-Banana, GPT-Image, Seedream, QwenImage, etc.)

📊 Key finding: strong performance on natural-image benchmarks ≠ real commercial design ability. Even top models still struggle with:
- precise layout control
- attribute binding
- multi-constraint reasoning

💡 As generative models become design tools, measuring these capabilities becomes critical. Hope this benchmark helps push the next wave of practical multimodal generation 🙌
Yifan Yang retweeted
AK@_akhaliq·
BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
paper: huggingface.co/papers/2603.25…
Yifan Yang@Yif_Yang·
🚀 Excited to share our latest work: BizGenEval, the first systematic benchmark for commercial visual content generation.

📄 Paper: huggingface.co/papers/2603.25…
🌐 Project: aka.ms/bizGenEval
💻 Code: github.com/microsoft/BizG…

Unlike prior benchmarks focusing on natural images, we evaluate real-world design tasks:
🧩 5 domains: Slides, Webpages, Charts, Posters, Scientific Figures
🧠 4 capabilities: Text Rendering, Layout Control, Attribute Binding, Knowledge Reasoning
➡️ 20 tasks in total

We curated:
- 400 prompts (300 real-world + 100 knowledge-intensive)
- 8,000 fine-grained checklist questions
- Evaluations of 26 SOTA models (Nano-Banana, GPT-Image, Seedream, QwenImage, etc.)

📊 Key finding: strong performance on natural-image benchmarks ≠ real commercial design ability. Even top models still struggle with:
- precise layout control
- attribute binding
- multi-constraint reasoning

💡 As generative models become design tools, measuring these capabilities becomes critical. Hope this benchmark helps push the next wave of practical multimodal generation 🙌
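For readers curious how checklist-style scoring works in practice, here is a toy sketch; the data layout, field names, and judge callable are hypothetical illustrations rather than the released BizGenEval evaluation harness.

```python
# Toy sketch: each prompt carries yes/no checklist questions, a judge answers them for
# the generated image, and a model's score is the fraction of checks passed.
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class BenchmarkItem:
    prompt: str            # e.g. a slide / poster / chart design instruction
    checklist: List[str]   # fine-grained yes/no questions about the required output

def evaluate_model(
    items: List[BenchmarkItem],
    generate: Callable[[str], Any],       # text-to-image model under test
    judge: Callable[[Any, str], bool],    # e.g. a VLM answering one checklist question
) -> float:
    """Fraction of checklist questions the generated images satisfy."""
    passed, total = 0, 0
    for item in items:
        image = generate(item.prompt)
        for question in item.checklist:
            passed += int(judge(image, question))
            total += 1
    return passed / max(total, 1)
```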
Microsoft Research@MSFTResearch·
Microsoft researchers received the AAAI-26 Outstanding Paper Award for LLM2CLIP, a vision-language framework that uses large language models as “teachers” to help CLIP better understand long, complex captions and achieve state-of-the-art multimodal performance. msft.it/6017QHtL1
Yifan Yang@Yif_Yang·
@MSFTResearch 🏆 Honored to receive the AAAI Outstanding Paper Award! LLM2CLIP achieves new SOTA with CLIP-level fine-tuning cost, boosting long/short-text & cross-lingual image retrieval, and strengthening the visual encoder itself — with strong gains on SigLIP2.
Yifan Yang@Yif_Yang·
🏆 Honored to receive the AAAI Outstanding Paper Award! LLM2CLIP achieves new SOTA with CLIP-level fine-tuning cost, boosting long/short-text & cross-lingual image retrieval, and strengthening the visual encoder itself — with strong gains on SigLIP2.
Microsoft Research@MSFTResearch

Microsoft researchers received the AAAI-26 Outstanding Paper Award for LLM2CLIP, a vision-language framework that uses large language models as “teachers” to help CLIP better understand long, complex captions and achieve state-of-the-art multimodal performance. msft.it/6017QHtL1

Yifan Yang@Yif_Yang·
🧠🎨 ReasonGen-R1: the first-ever end-to-end framework unlocking Thinking + Generation in autoregressive image generation! Human artists think before they create, so shouldn't generative models do the same? 🤔➡️🖼️

How we did it:
1️⃣ Built a rich Instruct → Thinking → Generation dataset & an SFT method enabling models to explore various textual Chain-of-Thought (CoT) paths.
2️⃣ Leveraged Qwen-VL-2.5-7B as our reward model.
3️⃣ Used GRPO reinforcement learning to teach the model how to select optimal thoughts for each prompt.

Impact: ✨ "Thinking first" significantly improves image fidelity & text alignment:
GenEval: 📈 +6%
DPG-Bench: 📈 +1.7%
T2I-Benchmark: 📈 +13.4%

Autoregressive models that plan before creating produce sharper, more aligned visuals, just like real artists! 🚀

🚨 Open-sourced now! Explore our code, dataset, and models 👇
🔗 huggingface.co/papers/2505.24…
🔗 aka.ms/reasongen

Follow for updates, ablations, and future insights! 🙌 #AI #GenerativeAI #ImageGeneration #DeepLearning #MachineLearning #ChainOfThought #ReinforcementLearning #OpenSource #QwenVL #LLM #VisionAI #SFT #AIArt #TechTwitter #MLCommunity
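A rough sketch of the GRPO-style update described in step 3, assuming placeholder `sample_rollout`, `reward_model`, and `log_prob` interfaces; it omits the clipping and KL-regularization terms a full implementation would include, so treat it as an illustration rather than the ReasonGen-R1 code.

```python
# Sketch: sample a group of (chain-of-thought, image) rollouts for one prompt, score them
# with a reward model, normalize rewards within the group into advantages, and weight the
# policy log-likelihood by those advantages.
import torch

def grpo_step(policy, reward_model, prompt, optimizer, group_size=8):
    rollouts = [policy.sample_rollout(prompt) for _ in range(group_size)]  # each: (cot, image)
    rewards = torch.tensor([reward_model(prompt, cot, img) for cot, img in rollouts])

    # Group-relative advantages: no value network, just normalize within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy gradient: increase the likelihood of above-average thoughts + images.
    loss = torch.zeros(())
    for (cot, img), adv in zip(rollouts, advantages):
        loss = loss - adv * policy.log_prob(prompt, cot, img)
    loss = loss / group_size

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```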
Yifan Yang@Yif_Yang·
🧠→🖼️ Why shouldn't image generators think before they create? We built an Instruct→CoT→Gen dataset and trained Janus Pro with SFT + GRPO RL (Qwen-VL-2.5-7B rewards): +6% GenEval, +1.7% DPG, +13.4% T2I. Code & data are open! 🔓 aka.ms/reasongen #GenAI #ChainOfThought
Yifan Yang retweeted
AK@_akhaliq·
Microsoft just released Phi-4-mini on Hugging Face
Yifan Yang@Yif_Yang·
@zer0int1 Let me know how we can help. I think what you are exploring is quite meaningful. You're welcome to collaborate with us if you like, and maybe we can support you in some aspects.
zer0int (it·its)@zer0int1·
@Yif_Yang PS: Thank you so much for LLM2CLIP! I am currently experimenting with my own CLIP ViT in LLM2CLIP; re-trained adapter & projection. But it's a real struggle to get all those text embeddings; I just use 1x RTX4090 for all my doings.🙃 Let me know if you have any other questions!
zer0int (it·its)@zer0int1·
#HunyuanVideo experiment: Guidance by unaligned #AI. I plugged #LLM2CLIP into this. 🫢 Above: #LLM only guidance. Fascinating how "T-Rex" is still comprehensive (in a very glitchy way). Even more fascinating how adding #CLIP saves the day (at least it's a cinematic video!)
Yifan Yang retweeted
Microsoft Research@MSFTResearch·
Can a new SOS-RMT protocol enable more efficient CL-MPC?; A fair-by-design, cloud-based algorithmic trading platform; LLM2CLIP unlocks richer visual representation; New technique enhances Low-Rank Adaptation’s expressiveness, generalization capabilities: msft.it/6017oE8px
Yifan Yang retweeted
Rohan Paul@rohanpaul_ai·
LLM2CLIP makes LLMs teach CLIP how to see the world better, bridging the gap between vision and language by using LLMs as teachers.

🎯 Original Problem:
CLIP, while powerful for multimodal tasks, has limitations in processing long and complex text due to its simple text encoder. Modern LLMs have advanced language capabilities but can't be directly used to improve CLIP due to their poor feature discriminability.

🛠️ Solution in this Paper:
→ LLM2CLIP transforms LLMs into effective CLIP text encoders through caption contrastive fine-tuning
→ Uses LoRA to efficiently fine-tune LLM output features for better caption discrimination
→ Freezes LLM gradients during training to preserve knowledge while adding adapter layers
→ Pre-extracts text features to maintain computational efficiency similar to regular CLIP

💡 Key Insights:
→ Native LLM features have poor discriminability (18.4% caption retrieval accuracy)
→ Caption contrastive fine-tuning boosts LLM discriminability to 73% accuracy
→ Freezing LLM gradients preserves knowledge while reducing computational costs
→ Adding adapter layers enables effective vision-language alignment

📊 Results:
→ Improved the previous SOTA EVA02 model by 16.5% on text retrieval tasks
→ Transformed English-only CLIP into a state-of-the-art cross-lingual model
→ Enhanced performance when integrated with multimodal models like LLaVA 1.5
→ Maintained training costs similar to regular CLIP fine-tuning
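A compact sketch of the efficiency recipe summarized above (frozen LLM, cached caption features, small trainable adapter); the dimensions and layer choices are assumptions for illustration, not the actual LLM2CLIP architecture.

```python
# Sketch: the fine-tuned LLM stays frozen and its caption features are pre-extracted once,
# so CLIP training only updates the vision encoder plus a small adapter over cached text
# features, keeping the cost close to regular CLIP fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAdapter(nn.Module):
    """Small trainable head mapping frozen-LLM caption features into the joint space."""
    def __init__(self, llm_dim: int = 4096, embed_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, cached_llm_feats: torch.Tensor) -> torch.Tensor:
        # cached_llm_feats: (B, llm_dim), computed offline from the frozen LLM
        return F.normalize(self.net(cached_llm_feats), dim=-1)

def clip_loss(image_feats: torch.Tensor, text_feats: torch.Tensor, temperature: float = 0.05):
    """Standard symmetric image-text contrastive loss over matched pairs."""
    logits = F.normalize(image_feats, dim=-1) @ text_feats.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```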