

Mingkai Deng
168 posts

@mdeng34
PhD student @LTIatCMU | MSML @MLDCMU | BA Math-Stats + CS @Columbia | World models and agent models | @IFM_MBZUAI @MSFTResearch







We discussed foundational issues underlying how to make truly “agentive” systems. arxiv.org/abs/2606.23991

Robot learning is moving beyond policies built for one robot, one scene, one task. At MIT, we’re exploring a different path: turning video world models into embodiment-agnostic robot policies. Introducing VERA: a 14B video-to-action system that controls robots across embodiments, skills, and environments. From zero-shot pick-and-place on a real Panda arm to contact-rich cube reorientation with a 16-DoF robotic hand. Different robots. Different environments. Different tasks. Same video planner. Same weights. We’re open-sourcing everything so you can fine-tune VERA for your own robot setup too. Deep dive in the thread: 🔗 vera.csail.mit.edu 🧵 (1/7)








PPO: rejected from NIPS 2017

With the rise of LLM systems marketed as "coding agents", "AI co-scientists", etc. that promise to drive up productivity, and the simultaneous outcry over "existential" concerns that AI could escape human control and turn destructive power against humans under a speculative "machine agency", there has been a lot of confusion about “What is an agent?” and “What constitutes agency?” It has become essential to clarify where automation ends and agency begins. Recent developments in world models and action models also trend toward mixing future prediction/simulation and action/plan generation within a single architecture such as a VLM, which conflates reward-driven action selection with fidelity-driven next-state prediction and undermines the reliability of both planning and simulation. In this paper we analyze agent architectures along the axes of goal, identity, decision-making, self-regulation, and learning, and argue that genuine agency requires these structures to be internalized within the system itself rather than assembled through external scaffolding. We propose a “Goal-Identity-Configurator” (GIC) architecture for a general-purpose agent model, combining hierarchical goal decomposition, identity evolution, simulative reasoning grounded in a separately trained world model, learned self-regulation, and self-directed learning from both real and simulated experience. Systems that possess greater autonomy and "agency" but remain under human oversight can be made more auditable, controllable, and safe with the GIC architecture, which offers transparency, modularity, and checkpoints. @mdeng34, @jinyuhou0 openreview.net/forum?id=6fDZY…


This has been sitting on arxiv for a bit, but figured it's time to announce it properly. Introducing Behavior Cues: a way to make LLM reasoning more monitorable and controllable for scalable oversight.





test-time compute [ttc] in robotics isn't free & isn't always worth it. smart allocation of ttc recovers frontier-level planning at a fraction of the cost! coauthor @milanganai w/ Yasmina @ajaysridhar0 Mozghan @katielulula Clark Barrett @jiajunwu_cs @chelseabfinn @drmapavone 🧵






