Yunxin Li (@LyxTg) - Twitter Profile

Pinned Tweet

Yunxin Li@LyxTg·18 Nov

🚀 We're thrilled to announce Uni-MoE-2.0-Omni - a groundbreaking omnimodal large model that evolves from multimodal understanding to seamless understanding AND generation!

✨ What's New: We explored how to transform dense LLMs into efficient MoE-driven omnimodal large models through progressive architecture evolution and training strategies.

🧠 Architecture Innovations:
1️⃣ Novel Omnimodality 3D RoPE + Dynamic Capacity MoE
Unifies alignment across speech, text, images, video in spatiotemporal dimensions, better for omnimodal inputs
Adaptive computation allocation based on task complexity
2️⃣ Deeply fused multimodal encoder-decoder design
Supports any combination of input/output modalities
Enables true omnimodal interaction & generation

🛠️ Training Breakthroughs:
1️⃣ Progressive training strategy: Cross-modal alignment → Expert warm-up → MoE fine-tuning & RL → Generative training
Efficiently scales dense LLMs to omnimodal MoE-based large models with a total of 75B tokens
Ensures stable convergence with less data, especially for RL
2️⃣ Language-anchored mixed understanding and generating training
Unifies understanding & generation tasks under a language generation framework
Breaks down barriers between modalities

🎨 Capabilities:
✅ Speech generation & interaction
✅ Image generation & editing
✅ Image/Video understanding
✅ Audio-visual reasoning
✅ And 10+ multimodal tasks!

🔥 Key Results: Outperformed Qwen2.5-Omni (1.2T tokens) in 50+ out of 76 comparable tasks with only 75B tokens!
📈 Video understanding (8): +5%
📈 Omnimodal understanding (4): +7%
📈 Speech QA: +4.3% lead
📈 Image processing: +7% lead

🌍 Now Open Source!
Model: huggingface.co/collections/HI…
Code: github.com/HITsz-TMG/Uni-…
Homepage: idealistxy.github.io/Uni-MoE-v2.git…

AK@_akhaliq

Uni-MoE-2.0-Omni Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

English

3

36

237

41.3K
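As a quick sanity check on the data-efficiency figures quoted in the pinned announcement, the ratios can be worked out directly. This is only a sketch: the variable names are illustrative, and reading "50+" as a lower bound of exactly 50 wins is an assumption, not a number from the post.

```python
# Figures quoted in the announcement.
UNI_MOE_TOKENS = 75_000_000_000        # 75B training tokens
QWEN_OMNI_TOKENS = 1_200_000_000_000   # 1.2T training tokens
COMPARABLE_TASKS = 76
MIN_WINS = 50                          # assumption: "50+" read as at least 50

# How many times more tokens the baseline used.
token_ratio = QWEN_OMNI_TOKENS / UNI_MOE_TOKENS

# Lower bound on the share of comparable tasks won, in percent.
min_win_rate_pct = round(MIN_WINS / COMPARABLE_TASKS * 100, 1)

print(f"Baseline used {token_ratio:.0f}x more tokens")
print(f"Win rate is at least {min_win_rate_pct}% of comparable tasks")
```

Under these assumptions the baseline used 16 times as many tokens, and the win rate is at least about two thirds of the comparable tasks.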

Yunxin Li@LyxTg·5 May

🚀 WindowsWorld | ACL 2026

Thrilled to share WindowsWorld, our new benchmark for autonomous GUI agents in real-world cross-application workflows, accepted to #ACL2026 Findings! 🎉

Most GUI benchmarks focus on single‑app tasks. We target professional multi‑app workflows that mirror real office work.

📌 Key stats:
181 tasks across 17 desktop apps
77.9% multi‑app tasks
4 difficulty levels (L1–L4)
~5 intermediate checkpoints per task for process‑centric evaluation
Grounded in 16 professional personas

🔍 Key findings:
State‑of‑the‑art agents struggle badly on multi‑app tasks (<21% success)
They fail at cross‑app reasoning across ≥3 apps
Low efficiency even with over‑human step budgets

Full code, benchmark, and VM env are open‑sourced!
GitHub: github.com/HITsz-TMG/Wind…
Paper: arxiv.org/abs/2604.27776

English

0

2

141

Yunxin Li@LyxTg·29 Apr

@asatoucan Thanks for your interest! You can see the ongoing project (second-generation AI director system) at github.com/HITsz-TMG/AIGC…

English

0

18

asatoucan@asatoucan·28 Apr

@LyxTg Hi, I found the repo while looking for a way to improve my agent at being a director. I'm curious whether it is still maintained, or whether you have other research ongoing. So excited to learn!

English

1

0

21

Yunxin Li@LyxTg·22 Feb

We are in the process of completely updating our Anim-Director based on Vidu 2.0. Here is an excellent animation sample featuring The Little Prince. Click the link below to watch the video! 👇 Github: github.com/HITsz-TMG/Anim… Youtube: youtube.com/watch?v=txj6Gm…

YouTube

English

1

5

9

460

Yunxin Li@LyxTg·26 Mar

@JustinLin610 similar ideas about our survey of Native Large Multimodal Reasoning Models (N-LMRMs) – scalable, agentic & adaptive reasoning/planning for real-world complex environments! arxiv.org/abs/2505.04921

English

0

2

16

9.3K

Junyang Lin@JustinLin610·26 Mar

x.com/i/article/2037…

ZXX

89

596

3K

862.8K

Yunxin Li@LyxTg·26 Mar

Last year our survey dived into the very same ideas around Native Large Multimodal Reasoning Models (N-LMRMs) – scalable, agentic & adaptive reasoning/planning for real-world complex environments! We also present some technical prospects for Native LMRMs. 🧠 Check out this impactful work: arxiv.org/abs/2505.04921 My take: Kimi 2.5 is the closest agentic large model to this N-LMRM vision right now.

Junyang Lin@JustinLin610

x.com/i/article/2037…

English

0

1

3

368

Yunxin Li@LyxTg·17 Mar

This amazing paper is from our Lychee Team. We hope to see long-horizon memory testing in real applications.

AK@_akhaliq

LMEB Long-horizon Memory Embedding Benchmark paper: huggingface.co/papers/2603.12…

English

0

8

502

Yunxin Li@LyxTg·17 Feb

Just crossed 1,000 citations! 🎓📚 Started from my very first days as a PhD student (2022-2025), now this journey continues post-PhD. Grateful for all the researchers who engaged with my work. Here's to many more years of curiosity-driven research! 🥂🔬

English

0

6

481

Yunxin Li@LyxTg·2 Feb

github.com/HITsz-TMG/Uni-…

ZXX

0

1

154

Yunxin Li@LyxTg·31 Jan

Very happy to see our omnimodal large model Uni-MoE-2.0-Omni has been evaluated on a new comprehensive audio-video understanding benchmark, achieving top-3 overall performance and best-in-class tone recognition. @VectorInst

English

1

4

999

Yunxin Li@LyxTg·31 Dec

🚀 Our new work (accepted to IEEE TIP): MKS² (Vision-Enhancing LLMs) is here!

Current MLLMs use LLMs for multimodal tasks, i.e., "LLMs for Vision". But what if vision could enhance LLMs instead? We propose the "Vision-Enhancing LLM" paradigm, where LLMs learn from visual knowledge to boost their own reasoning and understanding.

🔧 Introducing MKS²: a framework for Multimodal Knowledge Storage & Sharing in LLMs, powered by:
1) Modular Visual Memory (MVM) – stores open-world visual information
2) Mixture of Multimodal Experts (MoME) – enables cross-modal knowledge collaboration

The result? Stronger commonsense and physical reasoning by letting LLMs "see to learn."

📄 Paper: arxiv.org/abs/2311.15759

English

0

1

302

Yunxin Li@LyxTg·14 Dec

@FanqingMengAI The original model weights act as the shared static expert for the multiple LoRA fine-tunes on different tasks, so the final RL training may be more efficient after merging the different small transfer-matrix weights.

English

1

0

7

2.8K
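The merging idea in the reply above can be illustrated with the standard LoRA update, where each task adds a low-rank product B·A on top of one frozen base matrix. The sketch below is hypothetical: `merge_lora`, the tiny matrices, and the uniform `scale` are illustrative assumptions, not the setup used in either author's experiments.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, adapters, scale=1.0):
    """Return base + scale * sum(B @ A) over all (B, A) adapter pairs.

    The base matrix is copied, so the shared weights stay untouched.
    """
    merged = [row[:] for row in base]
    for b, a in adapters:
        delta = matmul(b, a)  # low-rank update for one task
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


# Shared base weight (2x2) and two rank-1 task adapters (B is 2x1, A is 1x2).
base = [[1.0, 0.0], [0.0, 1.0]]
task_a = ([[1.0], [0.0]], [[0.5, 0.0]])
task_b = ([[0.0], [1.0]], [[0.0, 0.25]])

merged = merge_lora(base, [task_a, task_b])
print(merged)
```

Because every adapter is a delta against the same base, the merge is a simple sum; merging fully fine-tuned models has no such shared reference point, which may be part of what the reply is pointing at.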

Fanqing Meng@FanqingMengAI·14 Dec

I feel like I've discovered something strange: My original intention was to train expert models for each sub-task individually, then perform model merging, and continue with multi-task RL to achieve better performance. However, based on my experiments on ReasonGym, I found that LoRA seems to have an advantage over Full in this regard. I don't quite understand why. Has anyone observed something similar? (I mean, it seems like LoRA performs better than full RL in certain scenarios).

Thinking Machines@thinkymachines

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

English

30

38

525

97.8K

Yunxin Li retweeted

机器之心 JIQIZHIXIN@jiqizhixin·4 Dec

Can a single open model truly understand and generate across all modalities? Uni MoE 2.0 from the Lychee family shows it can, with a new dynamic capacity MoE design, progressive multimodal training, and curated data across text, images, speech, and video. Trained on 75B tokens, it outperforms Qwen2.5 Omni on most benchmarks and posts strong gains in video understanding, omnimodal reasoning, audiovisual tasks, speech WER, and controllable image generation. Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Harbin Institute of Technology, Shenzhen Paper: arxiv.org/abs/2511.12609 Project: idealistxy.github.io/Uni-MoE-v2.git… Code: github.com/HITsz-TMG/Uni-… Model: huggingface.co/collections/HI… Our report: mp.weixin.qq.com/s/zV_PRdQ-sWgx… 📬 #PapersAccepted by Jiqizhixin

English

4

11

72

4.2K

Yunxin Li@LyxTg·10 Dec

🎉 New Release! We open-source VideoVista-CoTs, a long CoTs dataset for training "thinking" Video-LLMs, also used for training Uni-MoE-2.0-Omni. Perfect for RL cold-start! 🧠 Part of our comprehensive VideoVista series, covering pretrain, finetune, RL, and evaluation. 🔗 Grab the data now and level up your video models: huggingface.co/datasets/HIT-T… #VideoLLM #MultimodalAI #Dataset #OpenSource #MachineLearning

Yunxin Li@LyxTg

Our VideoVista-CulturalLingo has been accepted to the #ACL2025 main conference. You are welcome to use these 360-Horizons video evaluation benchmarks. Huggingface: huggingface.co/datasets/Uni-M…

English

0

1

4

572

Yunxin Li retweeted

Luci Pars@parsluci·24 Nov

Uni-MoE-2.0-Omni: The Model That Unifies the Multimodal World

Uni-MoE-2.0-Omni is a new step in the AI world that takes multimodal models, that is, systems that process different types of data such as text, images, video, and audio together, beyond mere understanding and into generation. The model uses an approach called MoE to make older dense AI models efficient. Put simply, think of it as a system that allocates resources intelligently by task: it spends more power on complex jobs and saves on simple ones.

With an architectural innovation such as 3D RoPE, it unifies audio, text, images, and video along the time and space dimensions, so it captures every kind of input better. It also has a deeply fused decoder design, which lets you mix inputs and outputs however you like, for example generating audio from a video or adding text to an image.

With only 75 billion tokens of data in total, far below the 1.2 trillion used by competitors, it beat Qwen2.5-Omni on more than 50 of 76 tasks. It leads by 5 percent in video understanding, 7 percent in general multimodal understanding, 4.3 percent in spoken question answering, and 7 percent in image processing.

Training used a language-centered mix, meaning everything was unified under a text-generation framework, which removed the barriers between modalities. With this model you can generate speech and chat, create and edit images, analyze images and video, perform audio-visual reasoning, and handle more than 10 multimodal tasks. Best of all, it is all open source.

Note: this is not an ad; I shared it because it is a nice development.

Yunxin Li@LyxTg

🚀 We're thrilled to announce Uni-MoE-2.0-Omni - a groundbreaking omnimodal large model that evolves from multimodal understanding to seamless understanding AND generation!

✨ What's New: We explored how to transform dense LLMs into efficient MoE-driven omnimodal large models through progressive architecture evolution and training strategies.

🧠 Architecture Innovations:
1️⃣ Novel Omnimodality 3D RoPE + Dynamic Capacity MoE
Unifies alignment across speech, text, images, video in spatiotemporal dimensions, better for omnimodal inputs
Adaptive computation allocation based on task complexity
2️⃣ Deeply fused multimodal encoder-decoder design
Supports any combination of input/output modalities
Enables true omnimodal interaction & generation

🛠️ Training Breakthroughs:
1️⃣ Progressive training strategy: Cross-modal alignment → Expert warm-up → MoE fine-tuning & RL → Generative training
Efficiently scales dense LLMs to omnimodal MoE-based large models with a total of 75B tokens
Ensures stable convergence with less data, especially for RL
2️⃣ Language-anchored mixed understanding and generating training
Unifies understanding & generation tasks under a language generation framework
Breaks down barriers between modalities

🎨 Capabilities:
✅ Speech generation & interaction
✅ Image generation & editing
✅ Image/Video understanding
✅ Audio-visual reasoning
✅ And 10+ multimodal tasks!

🔥 Key Results: Outperformed Qwen2.5-Omni (1.2T tokens) in 50+ out of 76 comparable tasks with only 75B tokens!
📈 Video understanding (8): +5%
📈 Omnimodal understanding (4): +7%
📈 Speech QA: +4.3% lead
📈 Image processing: +7% lead

🌍 Now Open Source!
Model: huggingface.co/collections/HI…
Code: github.com/HITsz-TMG/Uni-…
Homepage: idealistxy.github.io/Uni-MoE-v2.git…

Turkish

0

1

10

1.5K

Yunxin Li@LyxTg·23 Nov

Thank you so much for covering our first version of Uni-MoE! We recently open-sourced Uni-MoE-2.0-Omni, which supports full-modal understanding and generation—including speech, image, and text. We’d love to hear your thoughts and engage in discussion! HF Link: huggingface.co/collections/HI… Codes: github.com/HITsz-TMG/Uni-…

English

0

1

102

Philipp Schmid@_philschmid·24 May

Is this the architecture behind @OpenAI GPT-4o? Uni-MoE proposes an MoE-based unified Multimodal Large Language Model (MLLM) that can handle audio, speech, image, text, and video. 👂👄👀💬🎥 Uni-MoE is a native multimodal Mixture of Experts (MoE) architecture with a three-phase training strategy that includes cross-modality alignment, expert activation, and fine-tuning with Low-Rank Adaptation (LoRA). 🤔 TL;DR: 🚀 Uni-MoE uses modality-specific encoders with connectors for a unified multimodal representation. 💡 Utilizes sparse MoE architecture for efficient training and inference 🧑‍🏫 Three-phase training: 1) Train connectors for different modalities 2) Modality-specific expert training with cross-modality instruction data. 3) Fine-tuning with LoRA on mixed multimodal data. 📊 Uni-MoE matches or outperforms other MLLMs on 10 tested vision and audio tasks 🏆 Outperforms existing unified multimodal models on comprehensive benchmarks Paper: huggingface.co/papers/2405.11… Github: github.com/HITsz-TMG/UMOE…

English

17

114

530

55K

Yunxin Li@LyxTg·23 Nov

@casper_hansen_ Routing is the bottleneck in large-scale distributed training of MoE models, no doubt. But the path to massive scale isn't through dense models. The efficient use of parameters is key, and that's why MoE is the way.

English

0

1

358

Casper Hansen@casper_hansen_·22 Nov

Don’t get me wrong, I love everyone contributing to open-source but why would you train a larger dense model instead of MoE in 2025?

English

21

9

173

30.2K

Yunxin Li@LyxTg·21 Nov

@FanqingMengAI Sora may adopt a similar approach.

English

1

0

69

Fanqing Meng@FanqingMengAI·21 Nov

See: arxiv.org/abs/2410.05363 I think there is an LLM that augments the prompt, and its output prompt is then used in banana. So it makes sense that this T2I model seems to have a lot of world knowledge.

Jeff Dean@JeffDean

Create an illustrated explainer, detailing the physics of the fluid dynamics that are caught in this image and what happens next.

English

1

6

1.5K

Yunxin Li@LyxTg·19 Nov

Achieving such high scores on both GPQA Diamond and Video-MMMU demonstrates that Gemini's knowledge base and visually-driven multimodal Q&A have reached a true state-of-the-art level.

English

0

2

455

Yunxin Li@LyxTg·18 Nov

Code Repository: github.com/HITsz-TMG/Uni-…

Romanian

0

5

331

Yunxin Li@LyxTg·18 Nov

@codewithimanshu You're spot on! Figuring out how to efficiently scale existing LLMs and VLMs into powerful omnimodal models is really about unlocking new skills like perception and reasoning. This field has so much potential - can't wait to see what comes next!

English

0

205

Himanshu Kumar@codewithimanshu·18 Nov

@LyxTg Congratulations on the launch, Yunxin! Transforming dense LLMs is indeed a game changer, right?

English

1

0

2

247

Yunxin Li@LyxTg·18 Nov

@skalskip92 Do Qwen3-VL-32B and 8B have training corpora of the same scale?

English

1

0

1

226

SkalskiP@skalskip92·17 Nov

Qwen3-VL 32B missed only 2 taxis (8B missed 4). Here's the notebook if you want to test it on different open-vocabulary prompts: colab.research.google.com/github/roboflo…

SkalskiP@skalskip92

how many taxis do you see in this image? Qwen3-VL is so good at recognition, multi-target grounding, and understands spatial relations. it blows my mind.

English

9

25

330

28.6K
