

Roei Herzig
@roeiherzig
Researcher @IBMResearch. Postdoc @berkeley_ai. PhD @TelAvivUni. Working on Compositionality, Multimodal Foundation Models, and Structured Physical AI.



With Emmanuel Dupoux scp.net/persons/dupoux/ and Yann LeCun @ylecun, we consider a cognitive-science-inspired approach to AI. We analyse how autonomous learning works in living organisms, and propose a roadmap for reproducing it in artificial systems. lnkd.in/eNWDmuqT

The 5th edition of the MMFM Workshop is coming to @CVPR 2026! "What is Next in Multimodal Foundation Models?" exploring the frontiers of vision, language, and beyond. June 2026 | Denver, CO Details in thread 👇

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

Pretraining with dynamics models of motor behavior (aka world models) from video will be much more central to robotics than VLMs. There are multiple choices of representations (e.g. 3D? JEPA?) but we will figure this out by and by. Exciting times!



In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision - mapping imagery to intermediate representations (3D, flow, segmentation...) - is about to go away. vincentsitzmann.com/blog/bitter_le…




Academic relaxation ladder:
Undergrad: relaxes from homework by doing extracurriculars
PhD student: relaxes from research by doing homework
Professor: relaxes from admin by doing research






my advice for robot enthusiasts

don't go into the fancy stuff without the fundamentals. robots have been in the wild for roughly 60 years, and there are many bitter lessons piled up in their classical operations.

probably most "AI" advocates can't recite what a PID loop is or have never touched sensor fusion. this knowledge defines whether you're just here for the hype or here for the actual long game.

e.g. dynamics are crucial for implementing guardrails and safety procedures, MPC is actually used along with world models for sampling trajectories, sim2real requires realistic physics modeling and control modeling, etc.

determinism and formalities will be one of the biggest problems to solve in next-gen robotics, and it won't come from folks who are just jumping the gun.
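For readers who can't "recite what a PID loop is": a minimal sketch of a discrete PID controller driving a toy first-order plant. The gains, time step, and plant dynamics here are illustrative choices, not taken from any real robot.

```python
# Minimal discrete PID controller (illustrative gains and plant).
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt          # accumulate error over time
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Control output = P + I + D terms.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Drive a simple first-order plant (dx = (u - x) dt) toward setpoint 1.0.
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01)
x = 0.0
for _ in range(5000):
    u = pid.step(1.0, x)
    x += (u - x) * 0.01
print(round(x, 3))  # converges close to the setpoint 1.0
```

The integral term is what removes the steady-state offset: proportional control alone would settle at kp/(kp+1) of the setpoint for this plant.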




1X World Model | From Video to Action: A New Way Robots Learn
Blog: 1x.tech/discover/world…

1X describes and shows initial results for a new way of learning robot policies using video-generation-based world modeling, in contrast to VLAs, which are built on VLMs.

- How it works: at inference time, the system receives a text prompt and a starting frame. The World Model rolls out the intended future image frames, the Inverse Dynamics Model extracts the trajectory, and the robot executes the sequence in the real world.
- The World Model backbone: a text-conditioned diffusion model trained on web-scale video, mid-trained on 900 hours of egocentric human data of first-person manipulation tasks to capture general manipulation behaviors, and fine-tuned on 70 hours of NEO-specific sensorimotor logs to adapt to NEO's visual appearance and kinematics.
- The Inverse Dynamics Model: similar to the architecture used in DreamGen, and trained on 400 hours of robot data from random play and motions.
- Results: the generated videos align well with real-world execution, and the robot can perform object grasping and manipulation with some degree of generalization.
- Current limitations: pipeline latency is high and the system is not closed-loop. Currently the WM takes 11 seconds to generate a 5-second video on a multi-GPU server, and the IDM takes another 1 second to extract actions.
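The "how it works" pipeline (imagine frames, extract actions, execute) can be sketched as pseudocode. The `WorldModel` and `InverseDynamicsModel` classes below are stand-in stubs I made up to show the data flow; they are not 1X's actual models or API.

```python
# Sketch of a world-model + inverse-dynamics control pipeline (hypothetical stubs).
import numpy as np

class WorldModel:
    """Stub for a text-conditioned video model that rolls out future frames."""
    def rollout(self, prompt: str, start_frame: np.ndarray, n_frames: int):
        # A real model would generate imagined frames (e.g. via diffusion);
        # here we just repeat the start frame as a placeholder.
        return [start_frame.copy() for _ in range(n_frames)]

class InverseDynamicsModel:
    """Stub that maps consecutive frame pairs to robot actions."""
    def extract_actions(self, frames):
        # A real IDM would infer the action taking frame t to frame t+1;
        # here each action is a zero vector (e.g. a 7-DoF arm command).
        return [np.zeros(7) for _ in range(len(frames) - 1)]

def plan(prompt, camera_frame, wm, idm, n_frames=12):
    frames = wm.rollout(prompt, camera_frame, n_frames)  # 1) imagine the future
    actions = idm.extract_actions(frames)                # 2) recover the trajectory
    return actions                                       # 3) hand off for execution

frame = np.zeros((64, 64, 3))
actions = plan("pick up the cup", frame, WorldModel(), InverseDynamicsModel())
print(len(actions))  # one action per frame transition
```

Note the open-loop structure: the whole trajectory is planned from a single starting frame, which is exactly why the latency numbers above matter before this can run closed-loop.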


