Yifan Yang

48 posts

@Yif_Yang

Multimodality researcher from Microsoft Research Asia.

Shanghai · Joined July 2017
88 Following · 30 Followers
Pinned Tweet
Yifan Yang@Yif_Yang·
🌟 Tired of CLIP's limitations and short input windows? ✨ Meet LLM2CLIP: our secret to making the SOTA CLIP model even more SOTA! By enabling LLMs to act as CLIP's "teacher," we achieve significant performance gains with minimal data and training.

We found that raw LLM output features struggle to distinguish image captions, which makes them confusing supervision for CLIP training. But with our caption-to-caption contrastive fine-tuning, LLMs reveal their text comprehension in their output features, becoming the ideal mentor for CLIP. 🧑‍🏫

LLM2CLIP overcomes CLIP's weaknesses: limited text understanding, bag-of-words-like behavior, short input windows, and structural challenges with dense, complex captions. With the LLM's open-world knowledge, we maximize CLIP's capacity on dense captions, achieving efficient and robust training across various text retrieval and LLaVA benchmarks. 🚀

📄 Paper: huggingface.co/papers/2411.04… (reviewed and already accepted at the NeurIPS 2024 SSL Workshop)
🔗 Models & Code: aka.ms/llm2clip

Ready to give your CLIP a "super private tutor," or just use our top-performing CLIP model! #LLM2CLIP #CLIP #MachineLearning #AI #NeurIPS2024 #ContrastiveLearning #DeepLearning #AIResearch
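A minimal PyTorch sketch of the caption-to-caption contrastive fine-tuning idea from this thread; the model identifier, pooling strategy, and temperature are illustrative assumptions, not the released LLM2CLIP training code.

```python
# Sketch only: two captions of the same image form a positive pair, captions of other
# images in the batch are negatives, so the LLM's output features become discriminative
# enough to supervise a CLIP text tower.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "your/decoder-llm"  # assumption: any HF language model exposing hidden states
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
llm = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def embed(captions):
    """Masked mean-pool of the last hidden state -> one unit vector per caption."""
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = llm(**batch).last_hidden_state                      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # masked mean pooling
    return F.normalize(pooled, dim=-1)

def caption_contrastive_loss(captions_a, captions_b, temperature=0.05):
    """Symmetric InfoNCE: captions_a[i] should match only captions_b[i]."""
    za, zb = embed(captions_a), embed(captions_b)
    logits = za @ zb.T / temperature
    labels = torch.arange(len(captions_a), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

Once fine-tuned this way, the LLM can be frozen and used as the text-side teacher for CLIP, as the thread describes.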
Yifan Yang@Yif_Yang·
Thanks AK for sharing our work! 🚀 Excited to share BizGenEval, the first systematic benchmark for commercial visual content generation.

📄 Paper: huggingface.co/papers/2603.25…
🌐 Project: aka.ms/bizGenEval
💻 Code: github.com/microsoft/BizG…

Unlike prior benchmarks focusing on natural images, we evaluate real-world design tasks:
🧩 5 domains: Slides, Webpages, Charts, Posters, Scientific Figures
🧠 4 capabilities: Text Rendering, Layout Control, Attribute Binding, Knowledge Reasoning
➡️ 20 tasks in total

We curated:
- 400 prompts (300 real-world + 100 knowledge-intensive)
- 8,000 fine-grained checklist questions
- Evaluations of 26 SOTA models (Nano-Banana, GPT-Image, Seedream, QwenImage, etc.)

📊 Key finding: strong performance on natural-image benchmarks ≠ real commercial design ability. Even top models still struggle with:
- precise layout control
- attribute binding
- multi-constraint reasoning

💡 As generative models become design tools, measuring these capabilities becomes critical. Hope this benchmark helps push the next wave of practical multimodal generation 🙌
Yifan Yang retweeted
AK@_akhaliq·
BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
paper: huggingface.co/papers/2603.25…
Yifan Yang@Yif_Yang·
🚀 Excited to share our latest work: BizGenEval, the first systematic benchmark for commercial visual content generation.

📄 Paper: huggingface.co/papers/2603.25…
🌐 Project: aka.ms/bizGenEval
💻 Code: github.com/microsoft/BizG…

Unlike prior benchmarks focusing on natural images, we evaluate real-world design tasks:
🧩 5 domains: Slides, Webpages, Charts, Posters, Scientific Figures
🧠 4 capabilities: Text Rendering, Layout Control, Attribute Binding, Knowledge Reasoning
➡️ 20 tasks in total

We curated:
- 400 prompts (300 real-world + 100 knowledge-intensive)
- 8,000 fine-grained checklist questions
- Evaluations of 26 SOTA models (Nano-Banana, GPT-Image, Seedream, QwenImage, etc.)

📊 Key finding: strong performance on natural-image benchmarks ≠ real commercial design ability. Even top models still struggle with:
- precise layout control
- attribute binding
- multi-constraint reasoning

💡 As generative models become design tools, measuring these capabilities becomes critical. Hope this benchmark helps push the next wave of practical multimodal generation 🙌
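For readers curious how checklist-style scoring works in practice, here is a toy sketch; the data layout, field names, and judge callable are hypothetical illustrations rather than the released BizGenEval evaluation harness.

```python
# Toy sketch: each prompt carries yes/no checklist questions, a judge answers them for
# the generated image, and a model's score is the fraction of checks passed.
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class BenchmarkItem:
    prompt: str            # e.g. a slide / poster / chart design instruction
    checklist: List[str]   # fine-grained yes/no questions about the required output

def evaluate_model(
    items: List[BenchmarkItem],
    generate: Callable[[str], Any],       # text-to-image model under test
    judge: Callable[[Any, str], bool],    # e.g. a VLM answering one checklist question
) -> float:
    """Fraction of checklist questions the generated images satisfy."""
    passed, total = 0, 0
    for item in items:
        image = generate(item.prompt)
        for question in item.checklist:
            passed += int(judge(image, question))
            total += 1
    return passed / max(total, 1)
```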
Microsoft Research@MSFTResearch·
Microsoft researchers received the AAAI-26 Outstanding Paper Award for LLM2CLIP, a vision-language framework that uses large language models as “teachers” to help CLIP better understand long, complex captions and achieve state-of-the-art multimodal performance. msft.it/6017QHtL1
Yifan Yang@Yif_Yang·
@MSFTResearch 🏆 Honored to receive the AAAI Outstanding Paper Award! LLM2CLIP achieves new SOTA with CLIP-level fine-tuning cost, boosting long/short-text & cross-lingual image retrieval, and strengthening the visual encoder itself — with strong gains on SigLIP2.
Yifan Yang@Yif_Yang·
🏆 Honored to receive the AAAI Outstanding Paper Award! LLM2CLIP achieves new SOTA with CLIP-level fine-tuning cost, boosting long/short-text & cross-lingual image retrieval, and strengthening the visual encoder itself — with strong gains on SigLIP2.
Microsoft Research@MSFTResearch

Microsoft researchers received the AAAI-26 Outstanding Paper Award for LLM2CLIP, a vision-language framework that uses large language models as “teachers” to help CLIP better understand long, complex captions and achieve state-of-the-art multimodal performance. msft.it/6017QHtL1

Yifan Yang@Yif_Yang·
🧠🎨 ReasonGen-R1: the first-ever end-to-end framework unlocking Thinking + Generation in autoregressive image generation! Human artists think before they create, so shouldn't generative models do the same? 🤔➡️🖼️

How we did it:
1️⃣ Built a rich Instruct → Thinking → Generation dataset & an SFT method enabling models to explore various textual Chain-of-Thought (CoT) paths.
2️⃣ Leveraged Qwen-VL-2.5-7B as our reward model.
3️⃣ Used GRPO reinforcement learning to teach the model how to select optimal thoughts for each prompt.

Impact: ✨ "Thinking first" significantly improves image fidelity & text alignment:
GenEval: 📈 +6%
DPG-Bench: 📈 +1.7%
T2I-Benchmark: 📈 +13.4%

Autoregressive models that plan before creating produce sharper, more aligned visuals, just like real artists! 🚀

🚨 Open-sourced now! Explore our code, dataset, and models 👇
🔗 huggingface.co/papers/2505.24…
🔗 aka.ms/reasongen

Follow for updates, ablations, and future insights! 🙌 #AI #GenerativeAI #ImageGeneration #DeepLearning #MachineLearning #ChainOfThought #ReinforcementLearning #OpenSource #QwenVL #LLM #VisionAI #SFT #AIArt #TechTwitter #MLCommunity
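A rough sketch of the GRPO-style update described in step 3, assuming placeholder `sample_rollout`, `reward_model`, and `log_prob` interfaces; it omits the clipping and KL-regularization terms a full implementation would include, so treat it as an illustration rather than the ReasonGen-R1 code.

```python
# Sketch: sample a group of (chain-of-thought, image) rollouts for one prompt, score them
# with a reward model, normalize rewards within the group into advantages, and weight the
# policy log-likelihood by those advantages.
import torch

def grpo_step(policy, reward_model, prompt, optimizer, group_size=8):
    rollouts = [policy.sample_rollout(prompt) for _ in range(group_size)]  # each: (cot, image)
    rewards = torch.tensor([reward_model(prompt, cot, img) for cot, img in rollouts])

    # Group-relative advantages: no value network, just normalize within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy gradient: increase the likelihood of above-average thoughts + images.
    loss = torch.zeros(())
    for (cot, img), adv in zip(rollouts, advantages):
        loss = loss - adv * policy.log_prob(prompt, cot, img)
    loss = loss / group_size

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```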
Yifan Yang@Yif_Yang·
🧠→🖼️ Why shouldn't image generators think before they create? We built an Instruct→CoT→Gen dataset and trained Janus Pro with SFT + GRPO RL (Qwen-VL-2.5-7B rewards): +6% GenEval, +1.7% DPG, +13.4% T2I. Code & data are open! 🔓 aka.ms/reasongen #GenAI #ChainOfThought
Yifan Yang retweeted
AK@_akhaliq·
Microsoft just released Phi-4-mini on Hugging Face
Yifan Yang@Yif_Yang·
@zer0int1 Let me know how we can help. I think what you are exploring is quite meaningful. You're welcome to collaborate with us if you like, and maybe we can support you in some aspects.
zer0int (it·its)@zer0int1·
@Yif_Yang PS: Thank you so much for LLM2CLIP! I am currently experimenting with my own CLIP ViT in LLM2CLIP; re-trained adapter & projection. But it's a real struggle to get all those text embeddings; I just use 1x RTX4090 for all my doings.🙃 Let me know if you have any other questions!
zer0int (it·its)@zer0int1·
#HunyuanVideo experiment: Guidance by unaligned #AI. I plugged #LLM2CLIP into this. 🫢 Above: #LLM only guidance. Fascinating how "T-Rex" is still comprehensive (in a very glitchy way). Even more fascinating how adding #CLIP saves the day (at least it's a cinematic video!)
Yifan Yang retweeted
Microsoft Research@MSFTResearch·
Can a new SOS-RMT protocol enable more efficient CL-MPC?; A fair-by-design, cloud-based algorithmic trading platform; LLM2CLIP unlocks richer visual representation; New technique enhances Low-Rank Adaptation’s expressiveness, generalization capabilities: msft.it/6017oE8px
Yifan Yang retweeted
Rohan Paul@rohanpaul_ai·
LLM2CLIP makes LLMs teach CLIP how to see the world better, bridging the gap between vision and language by using LLMs as teachers.

🎯 Original Problem:
CLIP, while powerful for multimodal tasks, has limitations in processing long and complex text due to its simple text encoder. Modern LLMs have advanced language capabilities but can't be directly used to improve CLIP due to their poor feature discriminability.

🛠️ Solution in this Paper:
→ LLM2CLIP transforms LLMs into effective CLIP text encoders through caption contrastive fine-tuning
→ Uses LoRA to efficiently fine-tune LLM output features for better caption discrimination
→ Freezes LLM gradients during training to preserve knowledge while adding adapter layers
→ Pre-extracts text features to maintain computational efficiency similar to regular CLIP

💡 Key Insights:
→ Native LLM features have poor discriminability (18.4% caption retrieval accuracy)
→ Caption contrastive fine-tuning boosts LLM discriminability to 73% accuracy
→ Freezing LLM gradients preserves knowledge while reducing computational costs
→ Adding adapter layers enables effective vision-language alignment

📊 Results:
→ Improved the previous SOTA EVA02 model by 16.5% on text retrieval tasks
→ Transformed English-only CLIP into a state-of-the-art cross-lingual model
→ Enhanced performance when integrated with multimodal models like LLaVA 1.5
→ Maintained training costs similar to regular CLIP fine-tuning
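A compact sketch of the efficiency recipe summarized above (frozen LLM, cached caption features, small trainable adapter); the dimensions and layer choices are assumptions for illustration, not the actual LLM2CLIP architecture.

```python
# Sketch: the fine-tuned LLM stays frozen and its caption features are pre-extracted once,
# so CLIP training only updates the vision encoder plus a small adapter over cached text
# features, keeping the cost close to regular CLIP fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAdapter(nn.Module):
    """Small trainable head mapping frozen-LLM caption features into the joint space."""
    def __init__(self, llm_dim: int = 4096, embed_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, cached_llm_feats: torch.Tensor) -> torch.Tensor:
        # cached_llm_feats: (B, llm_dim), computed offline from the frozen LLM
        return F.normalize(self.net(cached_llm_feats), dim=-1)

def clip_loss(image_feats: torch.Tensor, text_feats: torch.Tensor, temperature: float = 0.05):
    """Standard symmetric image-text contrastive loss over matched pairs."""
    logits = F.normalize(image_feats, dim=-1) @ text_feats.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```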