Embodied AI Reading Notes

150 posts

@EmbodiedAIRead

Sharing daily personal notes on selected interesting Embodied AI papers, blogs and talks | Maintained by @yilun_chen_ | Opinions are my own.

California, USA · Joined July 2025
1 Following · 3.1K Followers
Embodied AI Reading Notes@EmbodiedAIRead·
The flavor of the bitter lesson for computer vision
Blog: vincentsitzmann.com/blog/bitter_le…
This blog offers an interesting perspective on the Bitter Lesson for computer vision: the author argues that traditional computer vision, built around task-specific intermediate representations (e.g., classes, masks, 3D reconstructions), is becoming obsolete and will dissolve, and that the real vision problem should be understood as end-to-end perception–action for embodied intelligence. This reflects the spirit of Rich Sutton's Bitter Lesson: simple, scalable, general methods that leverage massive computation outperform handcrafted, modular systems.
Embodied AI Reading Notes@EmbodiedAIRead·
World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
Project: showlab.github.io/World-VLA-Loop/
Paper: arxiv.org/pdf/2602.06508
This project proposes a closed-loop paradigm that jointly optimizes the world model and the VLA policy, iteratively improving the performance and grounding of both.
- Phase 1: Curate a large dataset via manual teleoperation and policy rollouts, covering both success and near-success cases.
- Phase 2: Pretrain an action-conditioned world model on the Phase 1 dataset with joint reward and video supervision.
- Phase 3: Execute VLA policy rollouts inside the world model and run GRPO optimization for RL post-training.
- Phase 4: Deploy the finetuned policy to collect new failure and success data, augmenting the Phase 1 dataset. Repeat the cycle for joint optimization.
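The four phases can be sketched as one toy loop. Everything here is a stand-in (a scalar task, a tabular "world model", grid search instead of GRPO); only the loop structure mirrors the paper.

```python
import random

def make_policy(gain):
    # Toy policy family: action = gain * state
    return lambda s: gain * s

def collect_rollouts(policy, n=50, seed=0):
    """Phases 1 & 4: roll out and record (state, action, success)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.uniform(0.2, 1.0)
        a = policy(s)
        data.append((s, a, abs(a - s) < 0.1))  # toy success criterion
    return data

def train_world_model(dataset):
    """Phase 2: action-conditioned reward predictor (stub 'fit' from data)."""
    return lambda s, a: 1.0 if abs(a - s) < 0.1 else 0.0

def rl_in_world_model(world_model, seed=1):
    """Phase 3: policy improvement inside the world model (stand-in for GRPO)."""
    rng = random.Random(seed)
    states = [rng.uniform(0.2, 1.0) for _ in range(100)]
    gains = [g / 10 for g in range(0, 21)]  # candidate policies a = g * s
    return max(gains, key=lambda g: sum(world_model(s, g * s) for s in states))

gain = 0.5          # initial (poor) policy
dataset = []
for _ in range(2):  # the closed loop: rollouts -> world model -> RL -> redeploy
    dataset += collect_rollouts(make_policy(gain))
    wm = train_world_model(dataset)
    gain = rl_in_world_model(wm)

final = collect_rollouts(make_policy(gain))
success_rate = sum(ok for *_, ok in final) / len(final)
```

The point of the sketch is the data flow: each cycle the deployed policy's own rollouts feed the world model, and the improved policy then generates the next round of data.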
Embodied AI Reading Notes@EmbodiedAIRead·
SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
Project: scenesmith.github.io
Paper: arxiv.org/abs/2602.09153
This project proposes a hierarchical, agentic framework for generating simulation-ready indoor environments from natural language.
- The scene creation task is decomposed into a sequence of decision stages over layout, furniture, walls & ceiling, manipuland object population, and refinement.
- Each stage is implemented as an agentic interaction among VLM agents: a designer, a critic, and an orchestrator, each equipped with specialized tools.
- Asset generation is integrated directly into scene construction via an asset router that resolves to either text-to-3D generative synthesis or retrieval from an articulated-asset library, validated by physical property estimation to ensure simulation readiness.
- The authors show an application that uses the generated scenes as simulation environments for automatic robot policy evaluation.
Embodied AI Reading Notes@EmbodiedAIRead·
HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos
Project: wyhuai.github.io/human-x/
Paper: arxiv.org/pdf/2602.02473
This project learns generalizable and smooth real-world object-interaction skills for humanoids, such as passing and shooting a basketball and kicking a football.
- XGen, the data generation pipeline from human video, has 3 stages: (1) extract human motion from video and retarget it to the robot (2) physics-based synthesis of object trajectories coupled with contact-aware refinement (3) data augmentation through object geometry scaling and trajectory variation to maximize coverage and improve generalization.
- XMimic, a unified imitation framework that learns interaction skills by mimicking behaviors synthesized by XGen, follows a 2-stage training pipeline: (1) a teacher learns with privileged state information under a unified interaction imitation reward (2) it is distilled into a student policy that operates under realistic perceptual constraints and deploys directly in the real world.
- Deployment achieves basic behaviors without explicit external sensing, but needs object sensing from a mocap system for sustained closed-loop interactions.
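XGen's stage (3), the augmentation step, is the easiest part to make concrete. A minimal sketch, assuming a trajectory is just a list of 3D points (the function name and parameters are hypothetical, not from the paper):

```python
import random

def augment_trajectories(traj, scales=(0.8, 1.0, 1.2), n_variations=3,
                         noise=0.02, seed=0):
    """Expand one object trajectory into many via geometry scaling and
    small trajectory perturbations, as in XGen's augmentation stage."""
    rng = random.Random(seed)
    out = []
    for scale in scales:                      # object geometry scaling
        scaled = [(x * scale, y * scale, z * scale) for x, y, z in traj]
        for _ in range(n_variations):         # trajectory variation
            jitter = rng.uniform(-noise, noise)
            out.append([(x + jitter, y + jitter, z) for x, y, z in scaled])
    return out

base = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.2), (0.2, 0.0, 1.0)]  # toy ball arc
augmented = augment_trajectories(base)  # 3 scales x 3 variations = 9 trajectories
```

In the real pipeline the perturbed trajectories would still pass through the physics-based, contact-aware refinement before being used for imitation.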
Embodied AI Reading Notes@EmbodiedAIRead·
Contact-Aware Neural Dynamics
Project: changwei-jing.github.io/neural-physics/
Paper: arxiv.org/pdf/2601.12796
This project proposes a sim-to-real alignment framework that learns to directly align a simulator's dynamics with real-world contact information.
- How it works: uses an off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model that refines simulated states using real-world observations.
- Why we need this: the authors show the learned forward dynamics model improves state prediction accuracy and can be used to effectively refine policies trained in simulation.
- Pipeline: (1) first train a neural forward dynamics model in simulation using large-scale rollouts of a dexterous hand interacting with diverse objects under extensive domain randomization. (2) then collect corresponding real-world trajectories, again including both successes and failures, augmented with tactile sensor readings, and fine-tune the simulation-only model on the real-world data.
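The "simulator as prior, refined by real data" idea can be illustrated with a one-dimensional toy: the simulator ignores contact friction, and a fitted correction (here a constant bias, standing in for the neural model) closes the gap. All names and dynamics are invented for illustration.

```python
def sim_step(state, action):
    """Base simulator prior: frictionless point mass (ignores contact)."""
    pos, vel = state
    vel = vel + action
    return (pos + vel, vel)

def real_step(state, action):
    """'Real world' with contact drag the simulator misses."""
    pos, vel = state
    vel = vel + action - 0.1
    return (pos + vel, vel)

def fit_residual(real_transitions):
    """Fit a contact-aware correction from real (s, a, s') data.
    Here: a constant velocity offset, standing in for a neural residual."""
    errs = [real_next[1] - sim_step(s, a)[1] for s, a, real_next in real_transitions]
    bias = sum(errs) / len(errs)
    def refined_step(state, action):
        pos, vel = sim_step(state, action)
        return (pos, vel + bias)   # refine the simulated state toward reality
    return refined_step

# Collect a short real-world trajectory
data, s = [], (0.0, 0.0)
for a in [0.5, 0.3, 0.2, 0.4]:
    data.append((s, a, real_step(s, a)))
    s = real_step(s, a)

refined = fit_residual(data)
err_sim = abs(sim_step((0.0, 0.0), 0.5)[1] - real_step((0.0, 0.0), 0.5)[1])
err_refined = abs(refined((0.0, 0.0), 0.5)[1] - real_step((0.0, 0.0), 0.5)[1])
```

The actual paper learns a full neural dynamics model conditioned on tactile readings; the sketch only shows why a cheap correction on top of a simulator prior can beat the simulator alone.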
Embodied AI Reading Notes@EmbodiedAIRead·
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Project: dreamdojo-world.github.io
Paper: arxiv.org/abs/2602.06949
Nvidia releases DreamDojo, an action-conditioned video-generation world model that learns diverse interactions and dexterous manipulation from 44k hours of egocentric human videos.
- Dataset: 44k hours of egocentric human daily activities, so far the largest and most diverse data corpus used for training a robotics world model.
- 3-stage pipeline: (1) pretraining on human videos, introducing continuous latent actions as conditioning for video prediction (2) post-training on target robots (3) distillation to improve real-time interactivity and context consistency.
- Action representation: continuous latent actions learned as chunks.
- Distillation: follows the Self Forcing paradigm, enabling autoregressive prediction of future frames at 640×480 resolution at 10.81 FPS.
Embodied AI Reading Notes@EmbodiedAIRead·
HuMI: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations
Project: …noid-manipulation-interface.github.io/#/
Paper: arxiv.org/abs/2602.06643
This project applies UMI to learning humanoid whole-body manipulation tasks across environments by enabling robot-free data collection with portable hardware.
- Challenges in directly applying UMI: (1) underspecified demonstrations with only grippers (2) feasibility gap (3) non-negligible tracking error in whole-body control.
- Adaptation of the UMI hardware setup: in addition to handheld sensorized grippers, trackers are added to the grippers, waist, and feet for real-time IK and kinematic adaptation.
- Hierarchical control policy framework: (1) a 5 Hz high-level diffusion policy processes camera images and proprioception to generate task-space SE(3) trajectories for keypoints of the hands, waist, and feet (2) a 50 Hz low-level whole-body controller tracks these keypoint targets while integrating the current robot state to compute precise joint commands.
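The two-rate hierarchy above reduces to a nested loop: each 5 Hz high-level target is tracked by ten 50 Hz low-level steps. A minimal sketch with scalar state and a proportional tracker standing in for both learned components:

```python
def high_level_policy(observation):
    """5 Hz diffusion-policy stand-in: emit the next task-space keypoint target."""
    return observation["goal"]

def low_level_controller(state, target, kp=0.4):
    """50 Hz whole-body-controller stand-in: proportional keypoint tracking."""
    return state + kp * (target - state)

def run(goal=1.0, seconds=2, high_hz=5, low_hz=50):
    state, trace = 0.0, []
    steps_per_target = low_hz // high_hz   # 10 low-level ticks per high-level tick
    for _ in range(seconds * high_hz):
        target = high_level_policy({"goal": goal, "state": state})
        for _ in range(steps_per_target):
            state = low_level_controller(state, target)
            trace.append(state)
    return state, trace

final_state, trace = run()  # state converges to the 1.0 keypoint target
```

The design point: the slow policy only has to be right about *where* keypoints should go; the fast controller absorbs tracking error between policy updates.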
Embodied AI Reading Notes@EmbodiedAIRead·
DreamZero: World Action Models are Zero-shot Policies
Project: dreamzero0.github.io
Paper: dreamzero0.github.io/DreamZero.pdf
Code: github.com/dreamzero0/dre…
Blog from lead author: joeljang.github.io/world-models-f…
Nvidia released a World Action Model that jointly predicts world states and actions, using a video diffusion model as a dense representation of how the world evolves. With model and system optimizations, they enable the 14B autoregressive model to perform real-time closed-loop control at 7 Hz.
- Model architecture: (1) one single end-to-end model with video and action prediction objectives (2) largely reuses the WAN backbone, introducing only minimal additional parameters in the state encoders, action encoders, and decoders (3) autoregressive prediction with teacher forcing as the training objective (4) leverages KV caching to support long context.
- Real-time execution optimizations: (1) asynchronous closed-loop execution: predict the next action chunk while the current one executes, decoupling inference from action execution (2) DiT caching: cache velocities when successive velocities are similar, reducing diffusion steps from 16 to 4 (3) DreamZero-Flash, which decouples video and action noise schedules during training, further reduces diffusion steps from 4 to 1 with reasonable performance (4) altogether, DreamZero achieves a 38× speedup on GB200, cutting latency from 5.7 s to 150 ms.
- The model shows impressive cross-embodiment behavior: (1) video-only demonstrations from another robot or a human help on unseen tasks with 10–20 minutes of data (2) 30 minutes of play data enables transfer to a new embodiment.
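Why asynchronous execution matters can be shown with simple timeline arithmetic: if inference for chunk k+1 overlaps with execution of chunk k, the robot only stalls when inference is slower than execution. The 5.7 s and 150 ms latencies are from the post; the 1.0 s chunk-execution time and chunk count are hypothetical.

```python
def simulate(n_chunks, exec_time_per_chunk, infer_latency):
    """Total wall-clock time under asynchronous closed-loop execution:
    the next chunk's inference runs while the current chunk executes."""
    t = infer_latency                       # first chunk must be inferred up front
    for _ in range(n_chunks - 1):
        t += max(exec_time_per_chunk, infer_latency)  # overlap, stall if infer slower
    t += exec_time_per_chunk                # last chunk just executes
    return t

slow = simulate(10, exec_time_per_chunk=1.0, infer_latency=5.7)   # pre-optimization
fast = simulate(10, exec_time_per_chunk=1.0, infer_latency=0.15)  # post-optimization
sync = 10 * (1.0 + 5.7)   # naive sequential infer-then-execute with the slow model
```

With 150 ms inference the pipeline is execution-bound (inference hides entirely inside execution), which is what makes 7 Hz closed-loop control feasible.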
Embodied AI Reading Notes@EmbodiedAIRead·
Introducing Helix 02: Full-Body Autonomy
Blog: figure.ai/news/helix-02
Figure AI today showed an impressive demo of a 4-minute autonomous loco-manipulation task: unloading and reloading a dishwasher across a kitchen. Here are the system details from the blog.
- The whole-body loco-manipulation VLA (Helix 02) has 3 systems: (1) System 2 reasons slowly about goals: interpreting scenes, understanding language, and sequencing behaviors (2) System 1 thinks fast, translating perception into full-body joint targets at 200 Hz (3) System 0 executes at 1 kHz, handling balance, contact, and coordination across the entire body.
- System 0: foundation motion tracker for whole-body control while maintaining balance and stability. (1) training data: 1000 hours of joint-level retargeted human motion data (2) model size: 10M parameters (3) inputs full-body joint state and base motion; outputs joint-level actuator commands (4) rate: 1 kHz (5) trained entirely in sim across 200k+ parallel environments.
- System 1: visuomotor policy that maps sensor inputs to joint commands. (1) inputs: head cameras, palm cameras, fingertip tactile sensors, full-body proprioception (2) output: joint-level control of the entire robot (3) model size: 80M parameters (4) architecture: transformer conditioned on System 2 latents (5) input rates: cameras at 30 Hz, tactile and proprioception at 200 Hz.
- System 2: scene understanding and language. (1) model size: 7B parameters (2) runs at a lower frequency (3) processes scenes, understands language, and produces latent goals for System 1.
- Helix 02 makes use of the new palm cameras and fingertip tactile sensors enabled by the Figure 03 hardware.
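The three-rate stack can be sketched as tick scheduling at the fastest rate. The 1 kHz and 200 Hz rates are from the blog; the 1 Hz System 2 rate is a guess (the blog only says it runs "at a lower frequency"), and all behavior is stubbed.

```python
def run_stack(duration_s=1, s2_hz=1, s1_hz=200, s0_hz=1000):
    """Simulate one second of the three-system hierarchy at the System 0 rate."""
    counts = {"system2": 0, "system1": 0, "system0": 0}
    latent_goal, joint_targets = None, None
    for tick in range(duration_s * s0_hz):
        if tick % (s0_hz // s2_hz) == 0:
            latent_goal = "latent_goal"          # System 2: scene + language -> goal
            counts["system2"] += 1
        if tick % (s0_hz // s1_hz) == 0:
            joint_targets = (latent_goal, tick)  # System 1: perception -> joint targets
            counts["system1"] += 1
        _torque = (joint_targets, tick)          # System 0: balance/contact every tick
        counts["system0"] += 1
    return counts

counts = run_stack()  # 1 System 2 tick, 200 System 1 ticks, 1000 System 0 ticks
```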
Embodied AI Reading Notes@EmbodiedAIRead·
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Project: research.nvidia.com/labs/dir/cosmo…
Paper: arxiv.org/pdf/2601.16163
Code: github.com/nvlabs/cosmos-…
In this new work, researchers from Nvidia & Stanford developed a way to finetune the Nvidia Cosmos video foundation model to serve as a policy model, world model, and RL value function for robot manipulation tasks with no architectural changes.
- Inputs and outputs: (1) inputs: task description text, current robot proprioception, current multi-view images (2) outputs: robot action chunk, future states (next robot proprioception, next image observations), and an RL value function (expected reward-to-go).
- Key idea for adapting the video backbone without architectural changes: represent the action chunk, future state, and value function as latent diffusion frames, so they fit naturally into the video model's latent diffusion process and harness the model's pretrained priors.
- Joint training in the latent diffusion scheme: at each training step, sample (s, a, s', V(s')). (1) 50% of batches for policy training: given s, predict a, s', V(s') (2) 25% for world model training: given s, a, predict s', V(s') (3) 25% for value function training: given s, a, s', predict V(s').
- Cosmos Policy can be deployed as (1) a direct control policy (use a and discard s', V(s')) or (2) a planning policy that uses s' and V(s') to search with best-of-N sampling.
- The authors show strong results on both simulated and real-world bimanual manipulation tasks.
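The two deployment modes differ only in what is done with the extra heads. A toy sketch (scalar states, a uniform sampler standing in for the diffusion policy, and a hand-written V standing in for the learned value head):

```python
import random

def policy_sample(state, rng):
    """One sample from the joint heads: action chunk, predicted s', predicted V(s')."""
    action = rng.uniform(-1.0, 1.0)
    next_state = state + action            # predicted s' (toy dynamics)
    value = -abs(next_state - 1.0)         # predicted reward-to-go: goal at 1.0
    return action, next_state, value

def act_direct(state, rng):
    """Deployment mode (1): direct control policy, discard s' and V(s')."""
    action, _, _ = policy_sample(state, rng)
    return action

def act_best_of_n(state, rng, n=16):
    """Deployment mode (2): best-of-N planning using the predicted s' and V(s')."""
    candidates = [policy_sample(state, rng) for _ in range(n)]
    action, _, _ = max(candidates, key=lambda c: c[2])
    return action

rng = random.Random(0)
direct_err = abs((0.0 + act_direct(0.0, rng)) - 1.0)
rng = random.Random(0)
planned_err = abs((0.0 + act_best_of_n(0.0, rng)) - 1.0)
```

Best-of-N can only help here because the first candidate is always among the N, so planning is never worse than acting directly (under a perfect value head; a learned V(s') can of course be wrong).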
Embodied AI Reading Notes@EmbodiedAIRead·
Action100M: A Large-scale Video Action Dataset
Paper: arxiv.org/pdf/2601.10592
Dataset: github.com/facebookresear…
Meta just released and open-sourced a large human action video dataset containing ~100 million action instances annotated by an automated pipeline.
- The videos come from online instructional videos capturing people interacting with the physical world across diverse activities.
- The auto-annotation pipeline creates dense action labels by generating a hierarchy of temporal segments with structured fields, including action descriptions, the actor, and video captions.
- The dataset includes 1.2 million YouTube videos, roughly 14.6 years of total duration, and 147 million segment-level annotations, roughly 21.3 billion words. Annotation took 1.3 million V100 GPU hours for segmentation and captioning, plus 0.3 million H100/H200 GPU hours for LLM aggregation.
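The headline numbers imply some useful per-video averages, worked out directly from the figures in the post:

```python
videos = 1_200_000
total_years = 14.6
segments = 147_000_000
words = 21_300_000_000

hours = total_years * 365.25 * 24          # ~128k hours of video
avg_video_minutes = hours / videos * 60    # ~6.4 minutes per video
segments_per_video = segments / videos     # ~122 dense segments per video
words_per_segment = words / segments       # ~145 words of annotation per segment
```

So the density is the notable part: over a hundred annotated segments per short instructional video, each carrying a paragraph-scale description.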
Embodied AI Reading Notes@EmbodiedAIRead·
From Generated Human Videos to Physically Plausible Robot Trajectories
Project: genmimic.github.io
Paper: arxiv.org/pdf/2512.05094
This paper showcases a pipeline in which a G1 humanoid robot performs novel actions taken from third-person videos produced by a video generation model.
- Overall pipeline: generated RGB video -> 4D human reconstruction as an SMPL trajectory -> retargeted humanoid 3D keypoint trajectory -> policy execution on the real robot.
- Policy training: a typical student–teacher framework trained in IsaacGym on retargeted AMASS trajectories, using weighted keypoint rewards and a symmetry loss to improve training efficiency.
- The authors show the robot can execute human actions from generated videos zero-shot, though the motion quality is limited.
Embodied AI Reading Notes@EmbodiedAIRead·
1X World Model | From Video to Action: A New Way Robots Learn
Blog: 1x.tech/discover/world…
1X describes and shows initial results for a potential new way of learning a robot policy: video-generation-based world modeling, in contrast to VLAs built on VLMs.
- How it works: at inference time, the system receives a text prompt and a starting frame. The World Model rolls out the intended future image frames, the Inverse Dynamics Model extracts the trajectory, and the robot executes the sequence in the real world.
- The World Model backbone: a text-conditioned diffusion model trained on web-scale video, mid-trained on 900 hours of egocentric human data of first-person manipulation tasks to capture general manipulation behaviors, and fine-tuned on 70 hours of NEO-specific sensorimotor logs to adapt to NEO's visual appearance and kinematics.
- The Inverse Dynamics Model: similar to the architecture used in DreamGen, trained on 400 hours of robot data of random play and motions.
- Results: the model generates videos that align well with real-world execution, and the robot can perform object grasping and manipulation with some degree of generalization.
- Current limitations: pipeline latency is high and the system is not closed-loop. Currently the WM takes 11 seconds to generate a 5-second video on a multi-GPU server, and the IDM takes another 1 second to extract actions.
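The imagine -> extract -> execute loop above can be sketched end to end. Frames are toy scalars and the "models" are stubs; only the data flow (and why the result is open-loop) matches the blog's description.

```python
def world_model(prompt, start_frame, horizon=5):
    """WM stand-in: roll out intended future frames from a prompt + start frame."""
    frames = [start_frame]
    for _ in range(horizon):
        frames.append(frames[-1] + 1)   # each 'frame' is a toy scalar state
    return frames

def inverse_dynamics_model(frames):
    """IDM stand-in: recover the action between each pair of imagined frames."""
    return [b - a for a, b in zip(frames, frames[1:])]

def execute(start_frame, actions):
    """Open-loop execution: replay the extracted trajectory with no feedback."""
    state = start_frame
    for a in actions:
        state += a
    return state

frames = world_model("pick up the cup", start_frame=0)
actions = inverse_dynamics_model(frames)
final_state = execute(0, actions)       # ends exactly where the WM imagined
```

The open-loop nature is visible in `execute`: nothing observed during execution feeds back into generation, which is why the 11 s WM latency is currently tolerable but also why the blog lists closing the loop as the limitation.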
Embodied AI Reading Notes@EmbodiedAIRead·
SOP: A Scalable Online Post-Training System for Vision-Language-Action Models
Project: agibot.com/research/sop
Paper: arxiv.org/pdf/2601.03044
This new work from AgiBot explores a scalable online post-training system for VLAs, with a distributed fleet of robots learning in the real world.
- Key idea: SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies.
- Why: tightly coupling learning and execution yields a unified feedback loop that enables timely on-policy correction, scales exploration through parallel experience, and preserves generality during adaptation. This supports the insight that static datasets cannot fully anticipate the state distribution induced by a deployed policy.
- SOP is agnostic to the choice of post-training algorithm: the authors tried HG-DAgger for interactive imitation learning and RECAP for reinforcement learning.
- SOP enables real-world post-training on the order of hours and scales near-linearly with the number of robots, substantially improving performance on tasks like grocery restocking, laundry folding, and box assembly.
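The fleet architecture is essentially two roles: robots stream experience up and asynchronously pull policy versions down; a central learner batches the stream into updates. A toy sketch (class names, update trigger, and payloads are all invented):

```python
class CloudLearner:
    """Centralized learner: ingests streamed experience, publishes policy versions."""
    def __init__(self):
        self.buffer, self.version = [], 0
    def ingest(self, experience):
        self.buffer.extend(experience)
        if len(self.buffer) >= 8:      # toy update trigger
            self.version += 1          # stand-in for an HG-DAgger/RECAP update
            self.buffer.clear()

class Robot:
    """Fleet member: executes, streams on-policy data + interventions, pulls updates."""
    def __init__(self, learner):
        self.learner, self.policy_version = learner, 0
    def step(self):
        experience = [("obs", "action", "maybe_human_intervention")]
        self.learner.ingest(experience)              # stream to the cloud
        self.policy_version = self.learner.version   # asynchronous policy pull

learner = CloudLearner()
fleet = [Robot(learner) for _ in range(4)]
for _ in range(6):                    # 6 rounds of parallel fleet execution
    for robot in fleet:
        robot.step()

versions = {r.policy_version for r in fleet}  # robots may lag behind the learner
```

The near-linear scaling claim falls out of the structure: doubling the fleet doubles the experience stream, so the learner hits its update trigger twice as often.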
Embodied AI Reading Notes@EmbodiedAIRead·
GR-Dexter Technical Report
Project: byte-dexter.github.io/gr-dexter/
Paper: arxiv.org/abs/2512.24210
ByteDance released its dexterous-hand VLA model along with its new 21-DoF hand design, achieving strong performance on long-horizon manipulation tasks.
- ByteDexter hand: 21 DoF with tactile fingertips.
- Data collection: a bimanual teleoperation interface comprising a Meta Quest VR headset for wrist pose tracking, two Manus Metagloves for hand motion capture, and foot pedals for arm control. Human motions are retargeted in real time to joint position commands.
- Training data recipe: vision-language data + human trajectory data + cross-embodiment data + robot trajectories. For cross-embodiment and human trajectory data, a preprocessing and retargeting pipeline ensures quality and aligns kinematics.
Embodied AI Reading Notes@EmbodiedAIRead·
PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Project: point-world.github.io
Paper: arxiv.org/abs/2601.03782
This new work explores a different kind of world modeling for robot manipulation: by representing both the environment and actions as 3D points, world modeling can be formulated as 3D point flow prediction.
- Task definition: given one or a few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel 3D displacements in response to the given actions.
- Action representation: instead of the commonly used joint angles/positions, the authors represent actions as 3D point flows, unifying state and action in a shared 3D space. This formulation is also universal across robot types.
- Dataset: 2M trajectories / 500 hours of robot manipulation in real and simulated environments with 3D information.
- The authors show a proof of concept by integrating the model into an MPC framework and demonstrate some real-world manipulation capabilities.
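The MPC proof of concept can be sketched in a few lines: forecast each candidate action's point flow, score the predicted point set against a goal configuration, and pick the cheapest. The flow model here is a trivial rigid translation standing in for the learned predictor; all names are hypothetical.

```python
def point_flow_model(points, action):
    """PointWorld stand-in: predict per-point 3D displacement for an action.
    Toy dynamics: every scene point translates by the action vector."""
    dx, dy, dz = action
    return [(x + dx, y + dy, z + dz) for x, y, z in points]

def cost(points, goal_points):
    """MPC cost: mean squared distance between predicted and goal point sets."""
    return sum((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2
               for (px, py, pz), (gx, gy, gz) in zip(points, goal_points)) / len(points)

def mpc_step(points, goal_points, candidates):
    """One MPC iteration: forecast each candidate's point flow, pick the cheapest."""
    return min(candidates, key=lambda a: cost(point_flow_model(points, a), goal_points))

scene = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
goal = [(0.5, 0.0, 0.0), (0.6, 0.0, 0.0)]
candidates = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (-0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]
best_action = mpc_step(scene, goal, candidates)   # the +0.5 x-translation
```

Because state, action, and goal all live in the same 3D point space, the same planner works regardless of which robot produced the action commands — the universality claim in the post.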
Embodied AI Reading Notes@EmbodiedAIRead·
Large Video Planner Enables Generalizable Robot Control
Project: boyuan.space/large-video-pl…
Paper: arxiv.org/abs/2512.15840
Code: github.com/buoyancy99/lar…
This paper explores a new way to use video generation models for robot policy learning: a finetuned video generation model produces zero-shot video plans for novel scenes and tasks, which are then post-processed to extract executable robot actions.
- Conditional video generation model: given text instructions and initial observation frame(s), generate video plans. Carefully finetuned on large-scale internet and robot videos, the model uses History Guidance and Diffusion Forcing to enhance temporal coherence and causal consistency in generation.
- The paper also open-sources the curated, diverse, high-quality dataset of 1.4M clips of humans/robots interacting with objects used for training.
- From video to action: generated video -> hand pose estimation -> wrist motion & finger retargeting -> robot execution.
- The authors show the resulting policy running on a G1 with a dexterous hand, performing real-world grasping and other tasks.
- Limitation: video generation takes minutes, making the policy open-loop and real-time deployment on robots intractable.
Embodied AI Reading Notes@EmbodiedAIRead·
Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
Project: tacthru.yuyang.li
Paper: arxiv.org/abs/2512.09851
This project developed a new type of see-through-skin sensor that combines tactile and visual perception, and integrated these signals into a transformer-based diffusion policy to improve contact-rich manipulation on real-world tasks.
- TacThru hardware: simultaneous tactile-visual perception with a transparent cover, persistent illumination, and robust keyline markers for tracking.
- TacThru-UMI imitation learning framework: extends UMI and Diffusion Policy with tactile-visual observations for data collection, processing, and policy deployment.
- Policy learning: a transformer-based diffusion policy maps multimodal observations to robot actions, attending simultaneously across visual, tactile, and proprioceptive signals.
- In experiments, the authors found TacThru's multimodal feedback allows policies to leverage detailed environmental cues for manipulation, e.g., when handling extremely thin and soft objects.
Embodied AI Reading Notes@EmbodiedAIRead·
TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Project: yanjieze.com/TWIST2/
Paper: yanjieze.com/TWIST2/
Code: github.com/amazon-far/TWI…
Data: twist-data.github.io
This project introduces a low-cost, mocap-free, easy-to-set-up teleoperation system for G1 humanoid whole-body data collection, and uses the system and its data to train a whole-body visuomotor policy with egocentric vision.
- System formulation: (1) low-level control: a task-agnostic general motion tracker running at 50 Hz (2) high-level control: generates motion commands conditioned on egocentric vision.
- Hardware setup: (1) a custom 2-DoF robot neck for egocentric vision (2) a PICO 4U headset and 2 PICO Motion Trackers for teleoperation motion tracking.
- The authors use an adapted version of GMR, a real-time motion retargeting method, for human-to-humanoid retargeting.
- Hierarchical visual policy framework: (1) the System 1 general motion tracker is trained on a dedicated motion dataset using RL in large-scale sim (2) the System 2 visuomotor policy is a diffusion policy that predicts whole-body joint positions from teleop-collected observation-action pairs.
- The system enables simple vision-based autonomous control of whole-body tasks such as whole-body pick & place and kicking a box to a target region.
- The entire system and collected dataset are open-sourced.
Embodied AI Reading Notes@EmbodiedAIRead·
EgoX: Egocentric Video Generation from a Single Exocentric Video
Project: keh0t0.github.io/EgoX/
Paper: arxiv.org/pdf/2512.08269
This paper achieves high geometric coherence and visual fidelity when generating egocentric videos from a single exocentric video.
- Why this matters: consistent egocentric videos open up new possibilities for robot learning. If egocentric videos can be generated with high quality, diversity, and quantity, synthetic data becomes a powerful complement to real-world data.
- Problem definition: given an exocentric video sequence and egocentric camera poses, generate a corresponding egocentric video sequence that depicts the same scene from a first-person viewpoint.
- Challenge: preserve the content visible in the exocentric view while synthesizing unseen regions in a geometrically consistent and realistic manner.
- Method: (1) the exocentric sequence is first lifted into a 3D point cloud and rendered from the target egocentric viewpoint, yielding an egocentric prior video (2) this prior video and the original exocentric video are provided as inputs to a LoRA-adapted pretrained video diffusion model that generates the egocentric video (3) geometry-guided self-attention in the DiT adaptively focuses on view-consistent regions and enhances feature coherence across perspectives.