
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Zeyuan Yang

@miiche_yang
UMass PhD | Current Intern @ Samsung | Previous @ THU


Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

VLMs can think visually without generating pixels! 📢

We introduce Machine Mental Imagery (Mirage), a new framework that lets a VLM imagine with latent visual tokens, performing visual reasoning in latent space with no pixel rendering needed.

We achieve this through a two-stage training paradigm:
✅ Stage 1: Grounding latent tokens in the visual subspace (joint supervision)
✅ Stage 2: Anchoring grounded tokens for generation (text-only supervision)

Mirage demonstrates strong performance on a wide range of multimodal reasoning tasks!

📜 Paper: arxiv.org/abs/2506.17218
🧑‍💻 Code: github.com/UMass-Embodied…
📽️ Project Page: vlm-mirage.github.io
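
For readers who want a feel for the difference between joint and text-only supervision across the two training stages, here is a rough toy sketch in plain Python. It is not the Mirage implementation. The function names, the cosine-distance term for latent tokens, and the `latent_weight` parameter are all assumptions made up for illustration; the post does not say which losses the paper actually uses.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target_index])

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def training_loss(stage, text_probs, text_target,
                  latent_tokens, visual_targets, latent_weight=1.0):
    """Toy per-example loss (hypothetical, for illustration only).

    stage 1: joint supervision = text loss + latent-token alignment loss
    stage 2: text-only supervision = text loss alone
    """
    text_loss = cross_entropy(text_probs, text_target)
    if stage == 1:
        # Assumption: latent tokens are pulled toward target visual
        # embeddings with an average cosine distance.
        latent_loss = sum(
            cosine_distance(z, t) for z, t in zip(latent_tokens, visual_targets)
        ) / len(latent_tokens)
        return text_loss + latent_weight * latent_loss
    if stage == 2:
        # Latent tokens are still produced, but receive no direct supervision.
        return text_loss
    raise ValueError("stage must be 1 or 2")

# Toy inputs: one text token, two 2-d latent tokens.
probs = [0.5, 0.5]
latents = [[1.0, 0.0], [0.0, 1.0]]
targets = [[0.0, 1.0], [0.0, 1.0]]
stage1 = training_loss(1, probs, 0, latents, targets)
stage2 = training_loss(2, probs, 0, latents, targets)
```

In this toy setup the only thing that changes between stages is whether the latent tokens contribute to the loss, which is the distinction the two checkmarks above draw.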
