Mingfei Chen

32 posts

@lasiafly

PhD Student @UW | PhD Fellow @Google | Multi-Modal LLMs, Spatial AI, XR, Robotics | Previously @Meta & @NUSingapore

Seattle, WA · Joined December 2021
130 Following · 61 Followers
Mingfei Chen @lasiafly
AI-native may be the next style of open-source repo. Traditional frameworks handle complexity with massive configuration files. An interesting repo that was just released, NanoClaw, suggests a different paradigm: software that evolves by rewriting itself, rather than just changing settings. Key features from the repo:
• Zero config: the code is the state. The AI refactors the source to add features instead of toggling flags.
• Skills > features: you teach the system how to build a feature (via Skill MDs) rather than merging bloated PRs for every use case.
• True isolation: runtimes as sandboxed containers, not just permission checks.
We are moving from building static tools to designing self-evolving seeds. Fascinating experiment. #AIAgents #AgenticAI #OpenSource #OpenClaw github.com/gavrielc/nanoc…
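For readers curious what "skills > features" could look like in practice, here is a minimal, purely illustrative Python sketch: load per-skill markdown files and feed them to an agent prompt instead of shipping each feature as code. The directory layout, function name, and prompt wording are assumptions of mine, not NanoClaw's actual interface.

```python
# Hypothetical sketch of the "Skills > Features" idea: load skill markdown
# files and hand them to an agent as instructions, rather than adding a
# config flag per feature. Layout and names are assumptions, not NanoClaw's.
from pathlib import Path

def load_skills(skills_dir: str = "skills") -> str:
    """Concatenate every skill description (one markdown file per skill)."""
    parts = []
    for md in sorted(Path(skills_dir).glob("*.md")):
        parts.append(f"## Skill: {md.stem}\n{md.read_text()}")
    return "\n\n".join(parts)

system_prompt = (
    "You maintain this repository. Apply the following skills by rewriting "
    "the source code directly rather than toggling configuration flags.\n\n"
    + load_skills()
)
```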
Mingfei Chen @lasiafly
Nice work, great spatial consistency! I wonder how easily the explicit 3D memory can be adapted to dynamic-scene video generation. Also, in terms of the 3D memory, could we simplify the full detailed representation (e.g., a point cloud) down to a few sparse cues, since the video generation head can infill the details?
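To make the "sparse cues" suggestion concrete, here is a minimal illustrative sketch (not from either paper): collapse a dense point cloud into a much sparser set by keeping one representative point per occupied voxel, leaving appearance details for the generation head to infill.

```python
# Illustrative only: reduce a dense point cloud to sparse spatial cues by
# voxel-grid downsampling (one point per occupied voxel).
import numpy as np

def sparse_cues(points: np.ndarray, voxel: float = 0.25) -> np.ndarray:
    """points: (N, 3) array in meters; returns one point per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    # index of the first point that landed in each unique voxel
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first_idx)]

dense = np.random.rand(100_000, 3) * 10.0   # stand-in for a reconstructed cloud
print(sparse_cues(dense).shape)             # far fewer points than 100k
```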
Kwang Moo Yi @kwangmoo_yi
Zhao and Wei et al., "Spatia: Video Generation with Updatable Spatial Memory": "memory" for video models via point-cloud-conditioned video generation. I am obviously still biased towards having this kind of "explicit" 3D representation.
Mingfei Chen retweeted
Xiao Fu @lemonaddie0909
Video generation, but 4D, dynamic, scene-consistent, and very long at the same time?! Introducing PlenopticDreamer: multi-view video generation with long-term spatio-temporal memory! The scaling secret is very simple: an autoregressive paradigm with minimal 3D inductive bias, aided by a spatially grounded memory retrieval mechanism. 🌐 Project page: research.nvidia.com/labs/dir/pleno… 🌐 Paper: arxiv.org/pdf/2601.05239
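A hedged sketch of what a "spatially grounded memory retrieval mechanism" could look like: store per-frame features keyed by camera position and fetch the nearest entries to condition the next autoregressive generation step. The class, names, and nearest-position criterion are assumptions for illustration, not PlenopticDreamer's implementation.

```python
# Hypothetical spatially grounded memory: write (camera position, feature)
# pairs, retrieve the k entries closest to the current pose.
import numpy as np

class SpatialMemory:
    def __init__(self):
        self.positions = []   # list of (3,) camera positions
        self.features = []    # list of per-frame feature arrays

    def write(self, cam_pos: np.ndarray, feat: np.ndarray) -> None:
        self.positions.append(cam_pos)
        self.features.append(feat)

    def retrieve(self, cam_pos: np.ndarray, k: int = 4) -> list:
        if not self.positions:
            return []
        dists = np.linalg.norm(np.stack(self.positions) - cam_pos, axis=1)
        nearest = np.argsort(dists)[:k]
        return [self.features[i] for i in nearest]
```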
Mingfei Chen retweeted
Wenlong Huang @wenlong_huang
What if we could simulate an *interactive 3D world*, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts environment dynamics given an RGB-D capture and robot actions. 🌐 point-world.github.io from @Stanford @nvidia
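A rough sketch of the action-conditioned rollout loop such a 3D world model implies: predict the next observation from the current RGB-D capture and a robot action, then feed the prediction back in. The stub class and method names below are placeholders, not the released PointWorld-1B interface.

```python
# Hedged sketch of an action-conditioned world-model rollout. The stub just
# echoes the observation so the loop runs; a real model predicts dynamics.
import numpy as np

class WorldModelStub:
    def predict(self, rgbd: np.ndarray, action: np.ndarray) -> np.ndarray:
        return rgbd  # placeholder for predicted next RGB-D observation

def rollout(model, rgbd0: np.ndarray, actions: list) -> list:
    obs, traj = rgbd0, []
    for a in actions:
        obs = model.predict(obs, a)
        traj.append(obs)
    return traj

frames = rollout(WorldModelStub(), np.zeros((480, 640, 4)), [np.zeros(7)] * 8)
```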
Mingfei Chen retweeted
Qwen @Alibaba_Qwen
🚀 Introducing Qwen3-VL-Embedding and Qwen3-VL-Reranker – advancing the state of the art in multimodal retrieval and cross-modal understanding!

✨ Highlights:
✅ Built upon the robust Qwen3-VL foundation model
✅ Processes text, images, screenshots, videos, and mixed-modality inputs
✅ Supports 30+ languages
✅ Achieves state-of-the-art performance on multimodal retrieval benchmarks
✅ Open source and available on Hugging Face, GitHub, and ModelScope
✅ API deployment on Alibaba Cloud coming soon!

🎯 Two-stage retrieval architecture:
📊 Embedding Model – generates semantically rich vector representations in a unified embedding space
🎯 Reranker Model – computes fine-grained relevance scores for enhanced retrieval accuracy

🔍 Key application scenarios: image-text retrieval, video search, multimodal RAG, visual question answering, multimodal content clustering, multilingual visual search, and more!

🌟 Developer-friendly capabilities:
• Configurable embedding dimensions
• Task-specific instruction customization
• Embedding quantization support for efficient and cost-effective downstream deployment

Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
GitHub: github.com/QwenLM/Qwen3-V…
Blog: qwen.ai/blog?id=qwen3-…
Tech Report: github.com/QwenLM/Qwen3-V…
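As a minimal sketch of the two-stage retrieval pattern described above (embedding recall followed by reranking), here is an illustrative Python example. `embed` and `rerank_score` are hypothetical placeholders, not the actual Qwen3-VL-Embedding/Reranker API.

```python
# Two-stage retrieval sketch: (1) embed query and corpus into one vector
# space and take a shallow top-k by cosine similarity, (2) rerank the
# shortlist with a finer-grained relevance score.
import numpy as np

def embed(items: list) -> np.ndarray:
    # Placeholder: a real embedding model returns semantically meaningful vectors.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(items), 512))

def rerank_score(query: str, candidate: str) -> float:
    # Placeholder for a fine-grained relevance score (e.g., a cross-encoder).
    return float(len(set(query.split()) & set(candidate.split())))

def retrieve(query: str, corpus: list, k: int = 10, final: int = 3) -> list:
    q = embed([query])[0]
    docs = embed(corpus)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-8)
    shortlist = [corpus[i] for i in np.argsort(-sims)[:k]]               # stage 1: embedding recall
    return sorted(shortlist, key=lambda c: -rerank_score(query, c))[:final]  # stage 2: rerank
```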
Mingfei Chen @lasiafly
Happy New Year! 🎉 Over the past two years, I spent nearly half of each year as a research intern at @Meta, working with Reality Labs Research in Pittsburgh and Redmond. As #2026year begins, I’ve now wrapped up this chapter. These experiences deeply shaped how I think about research and system building, from problem formulation to multimodal modeling and long-term vision. Grateful for the mentorship, collaboration, and research culture. I’m glad to share that two major projects from this time are now fully wrapped up and released, both centered on multimodal 3D learning. 👇 #MultimodalAI #3DReasoning #SpatialAI #AudioVisual #EgocentricVision #RealityLabs #Research
Mingfei Chen retweeted
Homanga Bharadhwaj @mangahomanga
Motion cues from human videos + Reasoning from VLMs enables 3D hand trajectory prediction in-the-wild for novel tasks in novel scenes! Very excited to share the EgoMan project, a huge effort led by @lasiafly during an internship with us at Meta. Video summary ⬇️
Mingfei Chen @lasiafly
🚀 Excited to share our new work EgoMAN: Reasoning-to-Motion for Egocentric 6-DoF Hand Trajectory Prediction 🤖🖐️

During my 6-month internship at @Meta, I explored how to connect high-level intent reasoning with physically grounded 6-DoF hand motion in 3D space from noisy egocentric human videos.

Our contributions and takeaways:
• EgoMAN Dataset: a large-scale egocentric corpus with stage-aware 6-DoF hand trajectories + 3M semantic–spatial–motion QA pairs
• EgoMAN Model: a modular reasoning-to-motion framework that bridges VLM reasoning and a flow-matching motion expert via compact trajectory tokens → 27.5% lower ADE than strong trajectory baselines and 70× faster than affordance-based waypoint predictors
• Scalability: ≥4B models perform best; 4B offers the optimal speed–accuracy tradeoff
• Applications: proactive assistants, goal-directed hand trajectory synthesis, and generalizable robotic manipulation

Why it matters: explicit reasoning tokens can guide motion generation, enabling stage-aware interaction learning from real egocentric videos and unlocking new potential for robots and intelligent assistants.

📄 Paper: lnkd.in/g6C4in25
🌐 Project: lnkd.in/gfp_7y8S
💻 Code coming after internal review

Huge thanks to my mentor and amazing coauthors at @Meta 🙏 Happy to chat about multimodal perception, spatial AI, and learning manipulation from human videos.

#MultimodalAI #SpatialAI #EgocentricVision #3DReasoning #HumanObjectInteraction
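For reference, the ADE metric cited above (average displacement error) is the mean Euclidean distance between predicted and ground-truth waypoints. A minimal sketch follows, scoring only the 3D translation part of a 6-DoF trajectory, which is a simplifying assumption on my part rather than the paper's exact protocol.

```python
# Average displacement error over a predicted trajectory, translation only.
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 6) arrays of [x, y, z, roll, pitch, yaw] per timestep."""
    return float(np.linalg.norm(pred[:, :3] - gt[:, :3], axis=1).mean())
```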
Mingfei Chen retweeted
UW ECE @uw_ece
Congrats to UW ECE doctoral student Mingfei Chen! She has received a 2025 Google PhD Fellowship for her work in spatially aware multimodal AI, helping machines understand 3D spaces. Her research could transform robotics, AR, accessibility tech, & more. ece.uw.edu/spotlight/ming… #AI
Mingfei Chen @lasiafly
📄 Paper: arxiv.org/abs/2506.05414
🌐 Project: zijuncui02.github.io/SAVVY/
Our work systematically explores how multi-modal LLMs perform spatial reasoning in dynamic 3D environments using combined visual and auditory information. Key findings:
Mingfei Chen @lasiafly
👁️👂 Can LLMs truly reason about 3D space through vision and sound? Unfortunately, current audio-visual LLMs only support monaural audio input, ignoring the rich spatial cues that humans naturally rely on for spatial perception and reasoning. Yet spatial reasoning in dynamic 3D scenes is critical for real-world applications like AR and robotics, both in terms of utility and safety. The gap between human-level spatial understanding and what current AV-LLMs can do is significant. Our new paper explores this challenge head-on: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing. More details in the thread.
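As one example of the spatial cues that monaural input discards: with binaural (two-channel) audio, the interaural time difference between the left and right channels indicates where a source sits laterally. A small illustrative sketch of estimating it follows; this is standard signal processing, not SAVVY's actual pipeline.

```python
# Estimate the interaural time difference (ITD) from stereo audio via
# cross-correlation; monaural input has no such cue.
import numpy as np

def itd_seconds(left: np.ndarray, right: np.ndarray, sr: int = 48_000) -> float:
    """Estimate the left/right delay in seconds from two 1-D channel arrays."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # positive lag: sound reached the right ear first
    return lag / sr
```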
Mingfei Chen @lasiafly
📢 Benchmark dataset, evaluation toolkit, and code will be released soon!