Mingfei Chen

32 posts

@lasiafly

PhD Student @UW | PhD Fellow @Google | Multi-Modal LLMs, Spatial AI, XR, Robotics | Previously @Meta & @NUSingapore

Seattle, WA · Joined December 2021
130 Following · 61 Followers
Mingfei Chen @lasiafly
AI-native may be the next style of open-source repo. Traditional frameworks handle complexity with massive configuration files. An interesting repo that was just released, NanoClaw, suggests a different paradigm: software that evolves by rewriting itself, rather than just changing settings. Key features from the repo:
• Zero config: the code is the state. The AI refactors the source to add features instead of toggling flags.
• Skills > features: you teach the system how to build a feature (via Skill MDs) rather than merging bloated PRs for every use case.
• True isolation: runtimes as sandboxed containers, not just permission checks.
We are moving from building static tools to designing self-evolving seeds. Fascinating experiment. #AIAgents #AgenticAI #OpenSource #OpenClaw github.com/gavrielc/nanoc…
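For readers curious what "skills > features" could look like in practice, here is a minimal, purely illustrative Python sketch: load per-skill markdown files and feed them to an agent prompt instead of shipping each feature as code. The directory layout, function name, and prompt wording are assumptions of mine, not NanoClaw's actual interface.

```python
# Hypothetical sketch of the "Skills > Features" idea: load skill markdown
# files and hand them to an agent as instructions, rather than adding a
# config flag per feature. Layout and names are assumptions, not NanoClaw's.
from pathlib import Path

def load_skills(skills_dir: str = "skills") -> str:
    """Concatenate every skill description (one markdown file per skill)."""
    parts = []
    for md in sorted(Path(skills_dir).glob("*.md")):
        parts.append(f"## Skill: {md.stem}\n{md.read_text()}")
    return "\n\n".join(parts)

system_prompt = (
    "You maintain this repository. Apply the following skills by rewriting "
    "the source code directly rather than toggling configuration flags.\n\n"
    + load_skills()
)
```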
Mingfei Chen @lasiafly
Nice work, great spatial consistency! I wonder how easily the explicit 3D memory can be adapted to dynamic-scene video generation. Also, in terms of the 3D memory, could we simplify the full detailed representation (e.g., a point cloud) down to a few sparse cues, since the video generation head can infill the details?
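To make the "sparse cues" suggestion concrete, here is a minimal illustrative sketch (not from either paper): collapse a dense point cloud into a much sparser set by keeping one representative point per occupied voxel, leaving appearance details for the generation head to infill.

```python
# Illustrative only: reduce a dense point cloud to sparse spatial cues by
# voxel-grid downsampling (one point per occupied voxel).
import numpy as np

def sparse_cues(points: np.ndarray, voxel: float = 0.25) -> np.ndarray:
    """points: (N, 3) array in meters; returns one point per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    # index of the first point that landed in each unique voxel
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first_idx)]

dense = np.random.rand(100_000, 3) * 10.0   # stand-in for a reconstructed cloud
print(sparse_cues(dense).shape)             # far fewer points than 100k
```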
Kwang Moo Yi @kwangmoo_yi
Zhao and Wei et al., "Spatia: Video Generation with Updatable Spatial Memory": "memory" for video models via point-cloud-conditioned video generation. I am obviously still biased towards having this kind of "explicit" 3D representation.
Mingfei Chen retweeted
Xiao Fu @lemonaddie0909
Video generation, but 4D, dynamic, scene-consistent, and very long at the same time?! Introducing PlenopticDreamer: multi-view video generation with long-term spatio-temporal memory! The scaling secret is very simple: an autoregressive paradigm with minimal 3D inductive bias, aided by a spatially grounded memory retrieval mechanism. 🌐 Project page: research.nvidia.com/labs/dir/pleno… 🌐 Paper: arxiv.org/pdf/2601.05239
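A hedged sketch of what a "spatially grounded memory retrieval mechanism" could look like: store per-frame features keyed by camera position and fetch the nearest entries to condition the next autoregressive generation step. The class, names, and nearest-position criterion are assumptions for illustration, not PlenopticDreamer's implementation.

```python
# Hypothetical spatially grounded memory: write (camera position, feature)
# pairs, retrieve the k entries closest to the current pose.
import numpy as np

class SpatialMemory:
    def __init__(self):
        self.positions = []   # list of (3,) camera positions
        self.features = []    # list of per-frame feature arrays

    def write(self, cam_pos: np.ndarray, feat: np.ndarray) -> None:
        self.positions.append(cam_pos)
        self.features.append(feat)

    def retrieve(self, cam_pos: np.ndarray, k: int = 4) -> list:
        if not self.positions:
            return []
        dists = np.linalg.norm(np.stack(self.positions) - cam_pos, axis=1)
        nearest = np.argsort(dists)[:k]
        return [self.features[i] for i in nearest]
```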
Mingfei Chen retweeted
Wenlong Huang @wenlong_huang
What if we could simulate an *interactive 3D world*, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts environment dynamics given an RGB-D capture and robot actions. 🌐 point-world.github.io from @Stanford @nvidia
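A rough sketch of the action-conditioned rollout loop such a 3D world model implies: predict the next observation from the current RGB-D capture and a robot action, then feed the prediction back in. The stub class and method names below are placeholders, not the released PointWorld-1B interface.

```python
# Hedged sketch of an action-conditioned world-model rollout. The stub just
# echoes the observation so the loop runs; a real model predicts dynamics.
import numpy as np

class WorldModelStub:
    def predict(self, rgbd: np.ndarray, action: np.ndarray) -> np.ndarray:
        return rgbd  # placeholder for predicted next RGB-D observation

def rollout(model, rgbd0: np.ndarray, actions: list) -> list:
    obs, traj = rgbd0, []
    for a in actions:
        obs = model.predict(obs, a)
        traj.append(obs)
    return traj

frames = rollout(WorldModelStub(), np.zeros((480, 640, 4)), [np.zeros(7)] * 8)
```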
Mingfei Chen retweeted
Qwen @Alibaba_Qwen
🚀 Introducing Qwen3-VL-Embedding and Qwen3-VL-Reranker – advancing the state of the art in multimodal retrieval and cross-modal understanding!

✨ Highlights:
✅ Built upon the robust Qwen3-VL foundation model
✅ Processes text, images, screenshots, videos, and mixed-modality inputs
✅ Supports 30+ languages
✅ Achieves state-of-the-art performance on multimodal retrieval benchmarks
✅ Open source and available on Hugging Face, GitHub, and ModelScope
✅ API deployment on Alibaba Cloud coming soon!

🎯 Two-stage retrieval architecture:
📊 Embedding Model – generates semantically rich vector representations in a unified embedding space
🎯 Reranker Model – computes fine-grained relevance scores for enhanced retrieval accuracy

🔍 Key application scenarios: image-text retrieval, video search, multimodal RAG, visual question answering, multimodal content clustering, multilingual visual search, and more!

🌟 Developer-friendly capabilities:
• Configurable embedding dimensions
• Task-specific instruction customization
• Embedding quantization support for efficient and cost-effective downstream deployment

Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
GitHub: github.com/QwenLM/Qwen3-V…
Blog: qwen.ai/blog?id=qwen3-…
Tech Report: github.com/QwenLM/Qwen3-V…
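As a minimal sketch of the two-stage retrieval pattern described above (embedding recall followed by reranking), here is an illustrative Python example. `embed` and `rerank_score` are hypothetical placeholders, not the actual Qwen3-VL-Embedding/Reranker API.

```python
# Two-stage retrieval sketch: (1) embed query and corpus into one vector
# space and take a shallow top-k by cosine similarity, (2) rerank the
# shortlist with a finer-grained relevance score.
import numpy as np

def embed(items: list) -> np.ndarray:
    # Placeholder: a real embedding model returns semantically meaningful vectors.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(items), 512))

def rerank_score(query: str, candidate: str) -> float:
    # Placeholder for a fine-grained relevance score (e.g., a cross-encoder).
    return float(len(set(query.split()) & set(candidate.split())))

def retrieve(query: str, corpus: list, k: int = 10, final: int = 3) -> list:
    q = embed([query])[0]
    docs = embed(corpus)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-8)
    shortlist = [corpus[i] for i in np.argsort(-sims)[:k]]               # stage 1: embedding recall
    return sorted(shortlist, key=lambda c: -rerank_score(query, c))[:final]  # stage 2: rerank
```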
Mingfei Chen @lasiafly
Happy New Year! 🎉 Over the past two years, I spent nearly half of each year as a research intern at @Meta, working with Reality Labs Research in Pittsburgh and Redmond. As #2026year begins, I’ve now wrapped up this chapter. These experiences deeply shaped how I think about research and system building, from problem formulation to multimodal modeling and long-term vision. Grateful for the mentorship, collaboration, and research culture. I’m glad to share that two major projects from this time are now fully wrapped up and released, both centered on multimodal 3D learning. 👇 #MultimodalAI #3DReasoning #SpatialAI #AudioVisual #EgocentricVision #RealityLabs #Research
Mingfei Chen retweeted
Homanga Bharadhwaj @mangahomanga
Motion cues from human videos + Reasoning from VLMs enables 3D hand trajectory prediction in-the-wild for novel tasks in novel scenes! Very excited to share the EgoMan project, a huge effort led by @lasiafly during an internship with us at Meta. Video summary ⬇️
Mingfei Chen @lasiafly
🚀 Excited to share our new work EgoMAN: Reasoning-to-Motion for Egocentric 6-DoF Hand Trajectory Prediction 🤖🖐️

During my 6-month internship at @Meta, I explored how to connect high-level intent reasoning with physically grounded 6-DoF hand motion in 3D space from noisy egocentric human videos.

Our contributions and takeaways:
• EgoMAN Dataset: a large-scale egocentric corpus with stage-aware 6-DoF hand trajectories + 3M semantic–spatial–motion QA pairs
• EgoMAN Model: a modular reasoning-to-motion framework that bridges VLM reasoning and a flow-matching motion expert via compact trajectory tokens → 27.5% lower ADE than strong trajectory baselines and 70× faster than affordance-based waypoint predictors
• Scalability: ≥4B models perform best; 4B offers the optimal speed–accuracy tradeoff
• Applications: proactive assistants, goal-directed hand trajectory synthesis, and generalizable robotic manipulation

Why it matters: explicit reasoning tokens can guide motion generation, enabling stage-aware interaction learning from real egocentric videos and unlocking new potential for robots and intelligent assistants.

📄 Paper: lnkd.in/g6C4in25
🌐 Project: lnkd.in/gfp_7y8S
💻 Code coming after internal review

Huge thanks to my mentor and amazing coauthors at @Meta 🙏 Happy to chat about multimodal perception, spatial AI, and learning manipulation from human videos.

#MultimodalAI #SpatialAI #EgocentricVision #3DReasoning #HumanObjectInteraction
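For reference, the ADE metric cited above (average displacement error) is the mean Euclidean distance between predicted and ground-truth waypoints. A minimal sketch follows, scoring only the 3D translation part of a 6-DoF trajectory, which is a simplifying assumption on my part rather than the paper's exact protocol.

```python
# Average displacement error over a predicted trajectory, translation only.
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 6) arrays of [x, y, z, roll, pitch, yaw] per timestep."""
    return float(np.linalg.norm(pred[:, :3] - gt[:, :3], axis=1).mean())
```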
Mingfei Chen retweeted
UW ECE @uw_ece
Congrats to UW ECE doctoral student Mingfei Chen! She has received a 2025 Google PhD Fellowship for her work in spatially aware multimodal AI, helping machines understand 3D spaces. Her research could transform robotics, AR, accessibility tech, & more. ece.uw.edu/spotlight/ming… #AI
Mingfei Chen @lasiafly
📄 Paper: arxiv.org/abs/2506.05414
🌐 Project: zijuncui02.github.io/SAVVY/
Our work systematically explores how multi-modal LLMs perform spatial reasoning in dynamic 3D environments using combined visual and auditory information. Key findings:
Mingfei Chen @lasiafly
👁️👂 Can LLMs truly reason about 3D space through vision and sound? Unfortunately, current audio-visual LLMs only support monaural audio input, ignoring the rich spatial cues that humans naturally rely on for spatial perception and reasoning. Yet spatial reasoning in dynamic 3D scenes is critical for real-world applications like AR and robotics, both in terms of utility and safety. The gap between human-level spatial understanding and what current AV-LLMs can do is significant. Our new paper explores this challenge head-on: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing. More details in the thread.
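As one example of the spatial cues that monaural input discards: with binaural (two-channel) audio, the interaural time difference between the left and right channels indicates where a source sits laterally. A small illustrative sketch of estimating it follows; this is standard signal processing, not SAVVY's actual pipeline.

```python
# Estimate the interaural time difference (ITD) from stereo audio via
# cross-correlation; monaural input has no such cue.
import numpy as np

def itd_seconds(left: np.ndarray, right: np.ndarray, sr: int = 48_000) -> float:
    """Estimate the left/right delay in seconds from two 1-D channel arrays."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # positive lag: sound reached the right ear first
    return lag / sr
```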
Mingfei Chen @lasiafly
📢 Benchmark dataset, evaluation toolkit, and code will be released soon!