Rowan Zellers
@rown
multimodal @thinkymachines. I also like to climb rocks and throw pottery. https://t.co/5Er4j39K71 (he/him)


We are partnering with @nvidia to power our frontier model training and platforms delivering customizable AI. thinkingmachines.ai/news/nvidia-pa…

Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels?

We introduce TV2TV: a unified model that jointly learns
- language modeling (next-token prediction)
- video flow matching (next-frame prediction)

At inference, TV2TV dynamically alternates between textual thinking and video generation. Model generations below: interleaved text plans and video slices (~1–2 s) are co-generated over time, conditioned on a single frame per sport.

📖 arxiv.org/abs/2512.05103
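The alternation the post describes can be sketched as a simple inference loop. This is a hypothetical illustration, not the authors' code: the model names, the `generate_text_plan` / `generate_video_slice` helpers, and the string placeholders standing in for real token sampling and flow-matching denoising are all assumptions made for the sketch.

```python
# Hypothetical sketch of TV2TV-style interleaved inference.
# Real text sampling and video flow matching are replaced by string
# placeholders; only the alternation structure is illustrated.
from dataclasses import dataclass, field


@dataclass
class InterleavedGeneration:
    # Alternating ("text", plan) and ("video", frames) segments.
    segments: list = field(default_factory=list)


def generate_text_plan(context):
    # Placeholder for next-token language-model sampling,
    # conditioned on all previously generated segments.
    return f"plan-{len(context)}"


def generate_video_slice(context, num_frames=16):
    # Placeholder for flow-matching next-frame prediction
    # (a short ~1-2 s video slice).
    return [f"frame-{len(context)}-{i}" for i in range(num_frames)]


def tv2tv_inference(initial_frame, num_slices=3):
    """Alternate textual thinking and video generation,
    conditioned on a single initial frame."""
    gen = InterleavedGeneration(segments=[("video", [initial_frame])])
    for _ in range(num_slices):
        plan = generate_text_plan(gen.segments)
        gen.segments.append(("text", plan))
        frames = generate_video_slice(gen.segments)
        gen.segments.append(("video", frames))
    return gen


out = tv2tv_inference("conditioning-frame")
kinds = [kind for kind, _ in out.segments]
# kinds is the conditioning video segment followed by
# alternating text/video pairs.
```

The design point the post makes is that the text plan is regenerated before every video slice, so language-space reasoning can steer each pixel-space continuation rather than being fixed once up front.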