

Yishu Li

@LisaYishu
MSR @CMU_Robotics, Prev CS Undergrad @Tsinghua_Uni





🤖For embodied agents in household environments, we tackle two fundamental questions:
1️⃣ What is the optimal scene representation?
2️⃣ Can a VLM leveraging this representation actually improve spatial understanding and task planning?

Introducing MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning.
👉: hybridrobotics.github.io/MomaGraph/
🔗: arxiv.org/abs/2512.16909

Key Ideas: MomaGraph jointly models spatial AND functional relationships with part-level interactive nodes. MomaGraph is designed to be:
✅ Task-Relevant: Filters visual noise to keep only what matters for the instruction.
✅ Dynamic & State-Aware: MomaGraph adapts. 🔄 It explicitly models object states and dynamic changes in the environment.

We built MomaGraph to bridge the gap between the Spatial VLM and Robotics communities. 🌉 Our hope is that this work serves as a foundation for the next generation of intelligent, adaptive embodied agents. 🦾✨

Questions and feedback welcome. 🚀 #Robotics #EmbodiedAI #CV #LLM #SceneGraph
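To make the "unified, state-aware" idea concrete, here is a minimal Python sketch of what a scene graph with spatial and functional part-level relations and mutable object states could look like. The class names, relation labels, and example objects are hypothetical illustrations, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An object or an interactive object part (e.g., a cabinet handle)."""
    name: str
    state: dict = field(default_factory=dict)  # e.g., {"open": False}

@dataclass
class Edge:
    """A spatial or functional relation between two nodes."""
    src: str
    dst: str
    relation: str  # e.g., "on_top_of" (spatial) or "opens" (functional)

class SceneGraph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node):
        self.nodes[node.name] = node

    def relate(self, src: str, dst: str, relation: str):
        self.edges.append(Edge(src, dst, relation))

    def update_state(self, name: str, **changes):
        """Reflect a dynamic change in the environment, e.g., a door opening."""
        self.nodes[name].state.update(changes)

# Toy example: a mug on a cabinet whose handle opens the cabinet door.
g = SceneGraph()
g.add_node(Node("cabinet", {"open": False}))
g.add_node(Node("cabinet_handle"))
g.add_node(Node("mug"))
g.relate("mug", "cabinet", "on_top_of")         # spatial relation
g.relate("cabinet_handle", "cabinet", "opens")  # functional, part-level relation
g.update_state("cabinet", open=True)            # state-aware update after interaction
```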




🚨Introducing SPOT: Search over Point Cloud Object Transformations. SPOT is a combined learning-and-planning approach that searches in the space of object transformations.
Website: planning-from-point-clouds.github.io
Paper: arxiv.org/abs/2509.04645
Code: github.com/kallol-saha/SP…
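SPOT's actual learned components and search procedure are in the paper and code above. Purely to illustrate what "searching in the space of object transformations" means, here is a hypothetical greedy sketch: candidate rigid transforms are applied to an object's point cloud and scored by a cost function, where `cost_fn` stands in for a learned distance-to-goal model. Everything here is an assumption for illustration.

```python
import numpy as np

def apply_transform(points: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a 4x4 homogeneous transform to an (N, 3) point cloud."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    return (homog @ T.T)[:, :3]

def search_transforms(obj_points, candidate_transforms, cost_fn, depth=3):
    """Greedy search in transformation space: at each step, pick the
    candidate transform whose resulting point cloud scores lowest
    under cost_fn (e.g., a learned estimate of distance to the goal)."""
    current = obj_points
    plan = []
    for _ in range(depth):
        scored = [(cost_fn(apply_transform(current, T)), T)
                  for T in candidate_transforms]
        _, best_T = min(scored, key=lambda x: x[0])
        plan.append(best_T)
        current = apply_transform(current, best_T)
    return plan, current

# Hypothetical usage: candidates could be small translations/rotations
# of the object, and cost_fn a learned model scoring the arrangement.
```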

A closed door looks the same whether it pushes or pulls open. Two identical-looking boxes might have different centers of mass. How should robots act when a single visual observation isn't enough? Introducing HAVE 🤖, our method that reasons about past interactions online! #CORL2025
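HAVE's actual architecture is described in the paper. Just as a toy illustration of why past interactions resolve visually ambiguous properties, here is a hypothetical Bayesian-update sketch over the door example; the hypotheses and likelihood numbers are made up.

```python
import numpy as np

def update_belief(belief, likelihoods):
    """One Bayesian update: posterior ∝ prior × P(outcome | hypothesis)."""
    posterior = belief * likelihoods
    return posterior / posterior.sum()

# Two hypotheses about a closed door: it pushes open, or it pulls open.
belief = np.array([0.5, 0.5])  # uniform prior: the door looks identical either way

# The robot pushes and the door doesn't move. That outcome is unlikely
# under "push" and likely under "pull".
belief = update_belief(belief, likelihoods=np.array([0.1, 0.9]))
print(belief)  # -> [0.1, 0.9]: the interaction history now favors "pull"
```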





🚀 Excited to introduce SAFE, our work on multitask failure detection for Vision-Language-Action (VLA) models! 🔍 SAFE is a simple yet powerful detector that learns from VLAs' semantic-rich internal feature space and outputs a scalar score indicating the likelihood of task failure.
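SAFE's exact detector design is in the paper. As a rough sketch of the general idea only (a scalar failure score read off a policy's internal features), here is a minimal PyTorch probe; `FailureProbe`, `feature_dim`, and the threshold are all hypothetical, not SAFE's actual components.

```python
import torch
import torch.nn as nn

class FailureProbe(nn.Module):
    """Lightweight probe mapping a VLA's internal feature vector
    to a scalar failure score in [0, 1]."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(features))  # estimated failure likelihood

# Hypothetical usage: score the internal features from one policy step.
probe = FailureProbe(feature_dim=1024)
features = torch.randn(1, 1024)  # stand-in for the VLA's internal features
score = probe(features)
if score.item() > 0.8:
    print("High failure risk: flag this rollout for intervention.")
```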
