

Ziqi Ma
@ziqi__ma
PhD student @caltech, research intern @AIatMeta, previously @microsoft. https://t.co/S1J9LItcXw

Today’s video world models “simulate” the world by generating pixel-frame observations🖼️. Can they continue to simulate the world when observations are interrupted, such as by occlusion, illumination dimming, or camera lookaway? To probe this question, we release STEVO-Bench, which holistically evaluates whether image-/text-to-video models and camera-controlled video models can correctly evolve world states under observation control. Check out our website, blog, and paper to see how they fail!


(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels.


Introducing Large Video Planner (LVP-14B) — a robot foundation model that actually generalizes. LVP is built on video generation, not VLA. In my final work at @MIT, all of LVP's eval tasks were proposed by third parties as a maximum stress test, and it excels!🤗 boyuan.space/large-video-pl…
