Yunxin Li

294 posts


@LyxTg

PhD from HITsz. Currently focusing on multimodal reasoning and planning with large models. Past research internships: HKUST, ByteDance Seed, Tencent PCG/AILab.

Joined April 2021
538 Following, 1.3K Followers
Pinned Tweet
Yunxin Li
Yunxin Li@LyxTg·
🚀 We're thrilled to announce Uni-MoE-2.0-Omni - a groundbreaking omnimodal large model that evolves from multimodal understanding to seamless understanding AND generation!
✨ What's New: We explored how to transform dense LLMs into efficient MoE-driven omnimodal large models through progressive architecture evolution and training strategies.
🧠 Architecture Innovations:
1️⃣ Novel Omnimodality 3D RoPE + Dynamic Capacity MoE
Unifies alignment across speech, text, images, and video in spatiotemporal dimensions, better for omnimodal inputs
Adaptive computation allocation based on task complexity
2️⃣ Deeply fused multimodal encoder-decoder design
Supports any combination of input/output modalities
Enables true omnimodal interaction & generation
🛠️ Training Breakthroughs:
1️⃣ Progressive training strategy: Cross-modal alignment → Expert warm-up → MoE fine-tuning & RL → Generative training
Efficiently scales dense LLMs to omnimodal MoE-based large models with a total of 75B tokens
Ensures stable convergence with less data, especially for RL
2️⃣ Language-anchored mixed understanding-and-generation training
Unifies understanding & generation tasks under a language generation framework
Breaks down barriers between modalities
🎨 Capabilities:
✅ Speech generation & interaction
✅ Image generation & editing
✅ Image/Video understanding
✅ Audio-visual reasoning
✅ And 10+ multimodal tasks!
🔥 Key Results: Outperformed Qwen2.5-Omni (1.2T tokens) in 50+ out of 76 comparable tasks with only 75B tokens!
📈 Video understanding (8): +5%
📈 Omnimodal understanding (4): +7%
📈 Speech QA: +4.3% lead
📈 Image processing: +7% lead
🌍 Now Open Source!
Model: huggingface.co/collections/HI…
Code: github.com/HITsz-TMG/Uni-…
Homepage: idealistxy.github.io/Uni-MoE-v2.git…
AK@_akhaliq

Uni-MoE-2.0-Omni Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

English
3
36
237
41.3K
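The announcement mentions "adaptive computation allocation" without giving the routing rule. One common way to get a variable number of experts per token is top-p routing, sketched below. This is an assumption chosen for illustration, not the Uni-MoE-2.0-Omni implementation; `route_top_p`, the threshold `p`, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_p(router_logits, p=0.7, max_experts=None):
    """Pick the smallest set of experts whose cumulative probability reaches p.

    Hypothetical helper: a token with a peaked router distribution activates
    few experts, while an ambiguous token activates more.
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
        if max_experts is not None and len(chosen) >= max_experts:
            break
    # Renormalise the gate weights over the selected experts only.
    z = sum(probs[i] for i in chosen)
    return [(i, probs[i] / z) for i in chosen]


# A "confident" token and an "uncertain" token (made-up router logits).
confident = route_top_p([6.0, 0.5, 0.2, 0.1])
uncertain = route_top_p([1.0, 0.9, 0.8, 0.7])
print(len(confident), len(uncertain))
```

With these made-up logits the confident token is served by a single expert and the uncertain one by three, which is the kind of per-token compute variation the tweet describes.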
Yunxin Li
Yunxin Li@LyxTg·
🚀 WindowsWorld | ACL 2026
Thrilled to share WindowsWorld, our new benchmark for autonomous GUI agents in real-world cross-application workflows, accepted to #ACL2026 Findings! 🎉
Most GUI benchmarks focus on single-app tasks. We target professional multi-app workflows that mirror real office work.
📌 Key stats:
181 tasks across 17 desktop apps
77.9% multi-app tasks
4 difficulty levels (L1–L4)
~5 intermediate checkpoints per task for process-centric evaluation
Grounded in 16 professional personas
🔍 Key findings:
State-of-the-art agents struggle badly on multi-app tasks (<21% success)
They fail at cross-app reasoning across ≥3 apps
Low efficiency even with over-human step budgets
Full code, benchmark, and VM env are open-sourced!
GitHub: github.com/HITsz-TMG/Wind…
Paper: arxiv.org/abs/2604.27776
English
0
0
2
141
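To make "process-centric evaluation" concrete, here is a minimal sketch of how intermediate checkpoints could be scored alongside final success. The `TaskResult` fields, the scoring rule (fraction of checkpoints passed), and the sample data are assumptions for illustration only; the actual WindowsWorld evaluator may work differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Hypothetical record of one agent run on one task."""
    task_id: str
    num_apps: int
    checkpoints_passed: List[bool]
    final_success: bool


def checkpoint_score(result: TaskResult) -> float:
    """Fraction of intermediate checkpoints the agent reached."""
    if not result.checkpoints_passed:
        return 0.0
    return sum(result.checkpoints_passed) / len(result.checkpoints_passed)


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate outcome-level and process-level metrics."""
    multi_app = [r for r in results if r.num_apps >= 2]
    return {
        "success_rate": mean([float(r.final_success) for r in results]),
        "mean_checkpoint_score": mean([checkpoint_score(r) for r in results]),
        "multi_app_success_rate": mean([float(r.final_success) for r in multi_app]),
    }


# Made-up runs, not benchmark data.
runs = [
    TaskResult("t1", 1, [True, True, True, True, True], True),
    TaskResult("t2", 3, [True, True, False, False, False], False),
    TaskResult("t3", 2, [True, True, True, True, False], False),
]
summary = summarize(runs)
print(summary)
```

Reporting the checkpoint score next to the success rate separates agents that fail early from agents that nearly finish, which a single pass/fail number cannot do.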
asatoucan
asatoucan@asatoucan·
@LyxTg Hi, I found the repo while looking for a way to improve my agent at being a director. I am curious whether it is still maintained, or whether you have other research ongoing. So excited to learn!
English
1
0
0
21
Yunxin Li
Yunxin Li@LyxTg·
@JustinLin610 Similar ideas appear in our survey of Native Large Multimodal Reasoning Models (N-LMRMs): scalable, agentic & adaptive reasoning/planning for real-world complex environments! arxiv.org/abs/2505.04921
English
0
2
16
9.3K
Yunxin Li
Yunxin Li@LyxTg·
Last year our survey explored the very same ideas around Native Large Multimodal Reasoning Models (N-LMRMs): scalable, agentic & adaptive reasoning/planning for real-world complex environments! We also present some technical prospects for Native LMRMs. 🧠 Check out this impactful work: arxiv.org/abs/2505.04921 My take: Kimi 2.5 is the closest agentic large model to this N-LMRM vision right now.
Junyang Lin@JustinLin610

x.com/i/article/2037…

English
0
1
3
368
Yunxin Li
Yunxin Li@LyxTg·
Just crossed 1,000 citations! 🎓📚 Started from my very first days as a PhD student (2022-2025), now this journey continues post-PhD. Grateful for all the researchers who engaged with my work. Here's to many more years of curiosity-driven research! 🥂🔬
English
0
0
6
481
Yunxin Li
Yunxin Li@LyxTg·
Very happy to see our omnimodal large model Uni-MoE-2.0-Omni has been evaluated on a new comprehensive audio-video understanding benchmark, achieving top-3 overall performance and best-in-class tone recognition. @VectorInst
English
1
1
4
999
Yunxin Li
Yunxin Li@LyxTg·
🚀 Our new work (accepted to IEEE TIP): MKS² — Vision-Enhancing LLMs is here! Current MLLMs use LLMs for multimodal tasks, i.e., "LLMs for Vision". But what if vision could enhance LLMs instead? We propose the "Vision-Enhancing LLM" paradigm—where LLMs learn from visual knowledge to boost their own reasoning and understanding. 🔧 Introducing MKS²: a framework for Multimodal Knowledge Storage & Sharing in LLMs, powered by: 1) Modular Visual Memory (MVM) – stores open-world visual information 2) Mixture of Multimodal Experts (MoME) – enables cross-modal knowledge collaboration The result? Stronger commonsense and physical reasoning by letting LLMs "see to learn." 📄 Paper: arxiv.org/abs/2311.15759
English
0
0
1
302
Yunxin Li
Yunxin Li@LyxTg·
@FanqingMengAI The original model weights act as a shared static expert across the multiple LoRA fine-tunes on different tasks, so the final RL training may be more efficient after merging the different, small transfer-matrix weights.
English
1
0
7
2.8K
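The point about a shared base with small per-task matrices can be illustrated with the simplest possible merge: adding each adapter's low-rank update onto the frozen base weight, W = W0 + scale * sum_i(B_i A_i). This is a generic sketch under that assumption, not the merging method used in either tweet; `merge_lora` and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(base, adapters, scale=1.0):
    """Return base + scale * sum(B @ A) over all (B, A) adapter pairs.

    The base weight is shared and untouched; only the low-rank deltas differ
    per task, so merging is a sum of small updates.
    """
    merged = [row[:] for row in base]
    for b_mat, a_mat in adapters:
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


# Toy 2x2 base weight and two rank-1 adapters for two made-up tasks.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
task_a = ([[1.0], [0.0]], [[0.0, 2.0]])  # B is 2x1, A is 1x2
task_b = ([[0.0], [1.0]], [[3.0, 0.0]])
merged_weight = merge_lora(base_weight, [task_a, task_b])
print(merged_weight)  # [[1.0, 2.0], [3.0, 1.0]]
```

Because every adapter is expressed relative to the same frozen base, the deltas live in a common coordinate system, which is one plausible reason merged LoRA experts interfere less than independently full-fine-tuned ones.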
Fanqing Meng
Fanqing Meng@FanqingMengAI·
I feel like I've discovered something strange: My original intention was to train expert models for each sub-task individually, then perform model merging, and continue with multi-task RL to achieve better performance. However, based on my experiments on ReasonGym, I found that LoRA seems to have an advantage over Full in this regard. I don't quite understand why. Has anyone observed something similar? (I mean, it seems like LoRA performs better than full RL in certain scenarios).
Thinking Machines@thinkymachines

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

English
30
38
525
97.8K
Yunxin Li retweeted
机器之心 JIQIZHIXIN
机器之心 JIQIZHIXIN@jiqizhixin·
Can a single open model truly understand and generate across all modalities? Uni MoE 2.0 from the Lychee family shows it can, with a new dynamic capacity MoE design, progressive multimodal training, and curated data across text, images, speech, and video. Trained on 75B tokens, it outperforms Qwen2.5 Omni on most benchmarks and posts strong gains in video understanding, omnimodal reasoning, audiovisual tasks, speech WER, and controllable image generation. Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Harbin Institute of Technology, Shenzhen Paper: arxiv.org/abs/2511.12609 Project: idealistxy.github.io/Uni-MoE-v2.git… Code: github.com/HITsz-TMG/Uni-… Model: huggingface.co/collections/HI… Our report: mp.weixin.qq.com/s/zV_PRdQ-sWgx… 📬 #PapersAccepted by Jiqizhixin
English
4
11
72
4.2K
Yunxin Li
Yunxin Li@LyxTg·
🎉 New Release! We open-source VideoVista-CoTs, a long-CoT dataset for training "thinking" Video-LLMs, also used for training Uni-MoE-2.0-Omni. Perfect for RL cold-start! 🧠 Part of our comprehensive VideoVista series, covering pretrain, finetune, RL, and evaluation. 🔗 Grab the data now and level up your video models: huggingface.co/datasets/HIT-T… #VideoLLM #MultimodalAI #Dataset #OpenSource #MachineLearning
Yunxin Li@LyxTg

Our VideoVista-CulturalLingo has been accepted to the #ACL2025 main conference. You are welcome to use these 360-Horizons video evaluation benchmarks. Huggingface: huggingface.co/datasets/Uni-M…

English
0
1
4
572
Yunxin Li retweeted
Luci Pars
Luci Pars@parsluci·
Uni-MoE-2.0-Omni: The Model That Unites the Multimodal World. Uni-MoE-2.0-Omni is a new step in the AI world that takes multimodal models, i.e. systems that process different types such as text, images, video, and audio together, beyond mere understanding and into generation. The model uses an approach called MoE to make older dense AI models efficient. Put simply, think of it as a system that allocates resources intelligently according to the task: it spends more compute on complex jobs and saves on simple ones. In its architecture, an innovation called 3D RoPE unifies audio, text, images, and video along the time and space dimensions, so it captures every kind of input better. It also has a deeply fused decoder design, which lets you mix inputs and outputs however you like, for example generating audio from a video or adding text to an image. With a total of only 75 billion tokens of data, far below the 1.2 trillion used by competitors, it beats Qwen2.5-Omni on more than 50 of 76 tasks. It leads by 5 percent in video understanding, 7 percent in general multimodal understanding, 4.3 percent in spoken question answering, and 7 percent in image processing. Training uses a language-centered mixture, meaning everything is unified under a text-generation framework, which removes the barriers between modalities. With this model you can generate speech and chat, create and edit images, analyze images and video, do audio-visual reasoning, and handle more than 10 multimodal tasks. Best of all, it is all open source. Note: this is not an ad; I shared it because it is a nice development.
Yunxin Li@LyxTg

🚀 We're thrilled to announce Uni-MoE-2.0-Omni - a groundbreaking omnimodal large model that evolves from multimodal understanding to seamless understanding AND generation!
✨ What's New: We explored how to transform dense LLMs into efficient MoE-driven omnimodal large models through progressive architecture evolution and training strategies.
🧠 Architecture Innovations:
1️⃣ Novel Omnimodality 3D RoPE + Dynamic Capacity MoE
Unifies alignment across speech, text, images, and video in spatiotemporal dimensions, better for omnimodal inputs
Adaptive computation allocation based on task complexity
2️⃣ Deeply fused multimodal encoder-decoder design
Supports any combination of input/output modalities
Enables true omnimodal interaction & generation
🛠️ Training Breakthroughs:
1️⃣ Progressive training strategy: Cross-modal alignment → Expert warm-up → MoE fine-tuning & RL → Generative training
Efficiently scales dense LLMs to omnimodal MoE-based large models with a total of 75B tokens
Ensures stable convergence with less data, especially for RL
2️⃣ Language-anchored mixed understanding-and-generation training
Unifies understanding & generation tasks under a language generation framework
Breaks down barriers between modalities
🎨 Capabilities:
✅ Speech generation & interaction
✅ Image generation & editing
✅ Image/Video understanding
✅ Audio-visual reasoning
✅ And 10+ multimodal tasks!
🔥 Key Results: Outperformed Qwen2.5-Omni (1.2T tokens) in 50+ out of 76 comparable tasks with only 75B tokens!
📈 Video understanding (8): +5%
📈 Omnimodal understanding (4): +7%
📈 Speech QA: +4.3% lead
📈 Image processing: +7% lead
🌍 Now Open Source!
Model: huggingface.co/collections/HI…
Code: github.com/HITsz-TMG/Uni-…
Homepage: idealistxy.github.io/Uni-MoE-v2.git…

Turkish
0
1
10
1.5K
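For readers unfamiliar with the idea of a rotary position embedding over time and space, the sketch below shows one generic way to build it: split each head vector into three chunks and apply standard RoPE to each chunk using the token's time, height, and width index. The equal three-way split, the `base` value, and the function names are assumptions for illustration; the source does not describe how Uni-MoE-2.0-Omni lays out its 3D RoPE.

```python
import math


def rotate_pairs(vec, pos, base=10000.0):
    """Standard RoPE on one chunk: rotate consecutive pairs by pos * theta_k."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2.0 * k / d)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_3d(vec, t, h, w):
    """Apply RoPE with separate positions for time, height and width.

    Assumes the vector length is divisible by 6 so each of the three chunks
    holds whole (x, y) pairs.
    """
    if len(vec) % 6 != 0:
        raise ValueError("vector length must be divisible by 6")
    c = len(vec) // 3
    return (
        rotate_pairs(vec[:c], t)
        + rotate_pairs(vec[c:2 * c], h)
        + rotate_pairs(vec[2 * c:], w)
    )


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up query/key vectors for one attention head of width 12.
q = [0.1 * (i + 1) for i in range(12)]
k = [0.05 * (12 - i) for i in range(12)]

# The attention score depends only on the (t, h, w) offsets between tokens.
score_shifted = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 4, 6, 8))
score_origin = dot(rope_3d(q, 0, 0, 0), rope_3d(k, 3, 4, 5))
print(abs(score_shifted - score_origin) < 1e-9)
```

The check at the end illustrates the property that makes this attractive for mixed inputs: the score between two tokens depends on their relative offset along each axis, not on their absolute positions.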
Yunxin Li
Yunxin Li@LyxTg·
Thank you so much for covering our first version of Uni-MoE! We recently open-sourced Uni-MoE-2.0-Omni, which supports full-modal understanding and generation, including speech, image, and text. We'd love to hear your thoughts and engage in discussion! HF Link: huggingface.co/collections/HI… Code: github.com/HITsz-TMG/Uni-…
English
0
0
1
102
Philipp Schmid
Philipp Schmid@_philschmid·
Is this the architecture behind @OpenAI GPT-4o? Uni-MoE proposes an MoE-based unified Multimodal Large Language Model (MLLM) that can handle audio, speech, image, text, and video. 👂👄👀💬🎥
Uni-MoE is a native multimodal Mixture of Experts (MoE) architecture with a three-phase training strategy that includes cross-modality alignment, expert activation, and fine-tuning with Low-Rank Adaptation (LoRA). 🤔
TL;DR:
🚀 Uni-MoE uses modality-specific encoders with connectors for a unified multimodal representation.
💡 Utilizes sparse MoE architecture for efficient training and inference
🧑‍🏫 Three-phase training:
1) Train connectors for different modalities
2) Modality-specific expert training with cross-modality instruction data.
3) Fine-tuning with LoRA on mixed multimodal data.
📊 Uni-MoE matches or outperforms other MLLMs on 10 tested vision and audio tasks
🏆 Outperforms existing unified multimodal models on comprehensive benchmarks
Paper: huggingface.co/papers/2405.11…
Github: github.com/HITsz-TMG/UMOE…
English
17
114
530
55K
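As a reminder of what "sparse MoE" means in the summary above, here is a minimal top-k gating sketch: a router scores every expert, only the k best run, and their outputs are mixed by softmax-normalised gate weights. The router weights, the toy experts, and `moe_forward` are hypothetical and are not taken from the Uni-MoE code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, k=2):
    """Run only the top-k experts for input x and mix their outputs."""
    # One router logit per expert: dot product of its router row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top


def make_scaler(factor):
    """Toy expert: multiplies every input feature by a constant."""
    return lambda v: [factor * vi for vi in v]


# Four toy experts and a made-up router for 2-dimensional inputs.
toy_experts = [make_scaler(1.0), make_scaler(10.0), make_scaler(100.0), make_scaler(1000.0)]
toy_router = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]

output, selected = moe_forward([1.0, 0.0], toy_router, toy_experts, k=2)
print(selected, output)
```

Only two of the four experts are evaluated for this input, so compute per token stays roughly constant as more experts (and parameters) are added.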
Yunxin Li
Yunxin Li@LyxTg·
@casper_hansen_ Routing is the bottleneck in large-scale distributed training of MoE models, no doubt. But the path to massive scale isn't through dense models. The efficient use of parameters is key, and that's why MoE is the way.
English
0
0
1
358
Casper Hansen
Casper Hansen@casper_hansen_·
Don’t get me wrong, I love everyone contributing to open-source but why would you train a larger dense model instead of MoE in 2025?
English
21
9
173
30.2K
Yunxin Li
Yunxin Li@LyxTg·
Achieving such high scores on both GPQA Diamond and Video-MMMU demonstrates that Gemini's knowledge base and visually-driven multimodal Q&A have reached a true state-of-the-art level.
English
0
0
2
455
Yunxin Li
Yunxin Li@LyxTg·
@codewithimanshu You're spot on! Figuring out how to efficiently scale existing LLMs and VLMs into powerful omnimodal models is really about unlocking new skills like perception and reasoning. This field has so much potential - can't wait to see what comes next!
English
0
0
0
205
Himanshu Kumar
Himanshu Kumar@codewithimanshu·
@LyxTg Congratulations on the launch, Yunxin! Transforming dense LLMs is indeed a game changer, right?
English
1
0
2
247
Yunxin Li
Yunxin Li@LyxTg·
@skalskip92 Do Qwen3-VL-32B and 8B share a training corpus of the same scale?
English
1
0
1
226