
I’m so lucky to have such amazing students! 🤩 🦾🧑🎓
Furong Huang

@furongh
Associate professor of @umdcs @umiacs @ml_umd at UMD. Researcher in #AI/#ML, AI #Alignment, #RLHF, #Trustworthy ML, #EthicalAI, AI #Democratization, AI for ALL.

I defended my PhD thesis! Also, a very (~4 month) late life update, but I've joined @OpenAI to work on safety research and pretraining safer language models! 📈 Thank you to my advisor @zicokolter and my committee: Matt Fredrikson, @andrew_ilyas, and @furongh! 🙏



Proud to introduce EgoScale: We pretrained a GR00T VLA model on 20K+ hours of egocentric human video and discovered that robot dexterity can be scaled, not with more robots, but with more human data. A thread 🧵 on what we learned. 👇

🚫 #Reasoningmodels improve AI capabilities (IMO, Olympiad), but degrade #Safety #Alignment.
❓ Are we doomed?
📢 Safety recovery is easier than you think (just a few steering steps away). A surprisingly simple safety recovery that maintains the utility of MLRMs: arxiv.org/pdf/2602.11096
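For readers unfamiliar with "steering": the generic technique the post alludes to is activation steering, i.e. nudging a model's hidden states along a learned direction at inference time. A minimal NumPy sketch of that generic idea (the paper's actual method, dimensions, and names may differ; everything here is illustrative):

```python
import numpy as np

# Toy sketch of activation steering (generic technique; not the paper's exact
# method). Idea: estimate a "safety direction" from contrastive hidden states,
# then add a small multiple of it to a hidden state at inference time.

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# Contrastive means: hidden states collected from safe vs. unsafe prompts.
h_safe = rng.normal(size=(16, d)) + 1.0
h_unsafe = rng.normal(size=(16, d)) - 1.0
direction = h_safe.mean(axis=0) - h_unsafe.mean(axis=0)
direction /= np.linalg.norm(direction)  # unit-norm steering direction

def steer(h, direction, alpha=2.0):
    """Shift a hidden state along the safety direction by strength alpha."""
    return h + alpha * direction

h = rng.normal(size=d) - 1.0        # a hidden state on the "unsafe" side
h_steered = steer(h, direction)

# The steered state projects further onto the safety direction.
print(h_steered @ direction > h @ direction)  # True
```

The appeal of this family of methods, and plausibly why the post calls recovery "easier than you think", is that it edits inference-time activations rather than retraining weights.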

This is a really crisp articulation of what “embodied intelligence” has been missing: a task-faithful interface between pixels and plans. For years we have argued about end-to-end policies vs. modular pipelines, VLM planners vs. classical task planning, “3D scene understanding” vs. “affordances”. But the real bottleneck is simpler: **robots fail in homes not because they can’t see, but because they can’t commit to the right structure of what they saw.**

**Why this matters**

A household is not a static 3D reconstruction problem. It is a stateful, interactive world:
• “Where” matters (spatial relations, occlusions, reachability),
• “How” matters (affordances, parts, functional constraints),
• and “What changed” matters most (open/closed, filled/empty, on/off, moved/blocked).

Most existing “scene graphs” choose one axis:
• spatial graphs: geometry-rich, action-poor
• functional graphs: affordance-rich, geometry-weak

MomaGraph’s key move is to unify both and make state first-class, with part-level interactive nodes. That’s not just a better representation; it’s the right abstraction layer for embodied reasoning.

**Graph-then-Plan is a field-defining direction**

The “Graph-then-Plan” paradigm is more than a technique; it’s a thesis: stop asking a VLM to hallucinate a plan directly from pixels. Force it to externalize the relevant world model first. This is exactly how we make VLM-based agents:
• more grounded (reduced free-form hallucination),
• more auditable (the graph is inspectable, editable, debuggable),
• more composable (graphs can be reused across tasks, skills, and time),
• more trainable (reward the intermediate structure, not just the final answer).
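To make the "unify both axes, state first-class" idea concrete, here is a toy sketch of what such a graph-then-plan interface could look like. All names (`Node`, `SceneGraph`, `plan_open`) are my own illustration, not MomaGraph's actual API:

```python
from dataclasses import dataclass, field

# Illustrative sketch (not MomaGraph's actual API): a scene graph that unifies
# the three axes above -- spatial relations ("where"), affordances ("how"),
# and mutable state ("what changed") -- with part-level nodes.

@dataclass
class Node:
    name: str                                        # part-level, e.g. "cabinet.left_door"
    affordances: list = field(default_factory=list)  # how it can be acted on
    state: dict = field(default_factory=dict)        # first-class, mutable state

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)    # (subject, predicate, object)

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, subj, pred, obj):
        self.relations.append((subj, pred, obj))

    def update_state(self, name, **changes):
        # "What changed" is an explicit event against the graph,
        # not a full re-perception of the scene.
        self.nodes[name].state.update(changes)

def plan_open(graph, target):
    """Toy Graph-then-Plan step: the planner reads structure from the
    graph, never from pixels."""
    node = graph.nodes[target]
    if "openable" not in node.affordances:
        return ["fail: no 'open' affordance"]
    if node.state.get("open"):
        return []  # already open: nothing to do
    # Spatial relations gate the plan (e.g. something blocking the door).
    blockers = [s for (s, p, o) in graph.relations if p == "blocks" and o == target]
    steps = [f"move({b})" for b in blockers]
    steps.append(f"open({target})")
    return steps

g = SceneGraph()
g.add(Node("cabinet.left_door", affordances=["openable"], state={"open": False}))
g.add(Node("chair", affordances=["movable"]))
g.relate("chair", "blocks", "cabinet.left_door")

print(plan_open(g, "cabinet.left_door"))  # ['move(chair)', 'open(cabinet.left_door)']
g.update_state("cabinet.left_door", open=True)
print(plan_open(g, "cabinet.left_door"))  # [] -- the state change makes the plan a no-op
```

Note how the graph is exactly the "contract" described above: it is inspectable (you can print it), editable (one `update_state` call), and the plan is auditable against it.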
I also like the RL angle (MomaGraph-R1 on top of a 7B VLM): it suggests a practical recipe for future embodied foundation models:
1. learn a structured latent that matches the environment’s causal affordances,
2. learn planning on top of that latent,
3. evaluate both separately, then jointly.

**Datasets and benchmarks are the leverage**

Releasing MomaGraph-Scenes + MomaGraph-Bench is arguably as important as the model:
• If we want progress, we need standardized targets for what structure is “correct” in a household.
• The six capability axes (from fine-grained affordance reasoning to long-horizon decomposition) are exactly the right shape of benchmark for embodied VLMs.

**The big picture**

If we zoom out, this is part of a broader convergence: embodied AI is becoming representation learning again, but not “representation” as a hidden vector. Representation as a contract:
• between perception and action,
• between language and physics,
• between what the agent believes now and what it will remember later.

In that view, MomaGraph is a step toward a future where robots carry persistent, state-aware, task-conditioned world models that can be updated, queried, and reasoned over, not just prompted. Very excited to see where this goes, especially as we push toward:
• temporal graphs (state updates as events),
• uncertainty-aware graphs (confidence as a first-class signal),
• active perception (asking for views to resolve graph ambiguity),
• and lifelong memory (graphs as the substrate of agent memory in real homes).

Kudos to the team: this feels like the kind of work that doesn’t just improve a leaderboard, it clarifies the roadmap.