
David Chan
97 posts

@_dmchan
Postdoc at @berkeley_ai studying contextual grounding in multimodal AI. These are the voyages of the... Crap. I don't have a name for my own ship...

The 5th edition of the MMFM Workshop is coming to @CVPR 2026! "What is Next in Multimodal Foundation Models?" exploring the frontiers of vision, language, and beyond. June 2026 | Denver, CO Details in thread 👇

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ @junyi42 @aomaru_21490) 🌐 With 17 environments across multiple domains, we systematically show the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/8]
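
For context on what an "environment for multimodal agents" involves, here is a generic Gym-style interaction loop. This is purely an illustrative sketch with a toy environment and a random agent; none of the names below are claimed to be VisGym's actual API.

```python
import random

class ToyGridEnv:
    """Tiny stand-in environment: an agent on a 1-D strip must reach the goal."""
    def __init__(self, size=5):
        self.size, self.pos = size, 0

    def reset(self):
        self.pos = 0
        return self.render()

    def render(self):
        # Stand-in for an image observation: a textual "frame" of the strip.
        return "." * self.pos + "@" + "." * (self.size - self.pos - 1)

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.size - 1
        return self.render(), (1.0 if done else 0.0), done, {}

def random_agent(frame, instruction):
    # A real harness would send the rendered frame plus the instruction
    # to a VLM here and parse its reply into a legal action.
    return random.choice(["left", "right"])

env = ToyGridEnv()
obs, total = env.reset(), 0.0
for _ in range(100):
    obs, reward, done, _ = env.step(random_agent(obs, "walk to the right edge"))
    total += reward
    if done:
        break
print("episode return:", total)
```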

The Computer Science section of @arxiv is now requiring prior peer review for Literature Surveys and Position Papers. Details in a new blog post

Chinese DoorDash dropping MIT-licensed foundation video models??? “We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across Text-to-Video, Image-to-Video, and Video-Continuation generation tasks.” huggingface.co/meituan-longca…

Humans handle dynamic situations easily; what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉 Check out dynamic-lm.github.io

✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 echo-bench.github.io

Some problems can’t be rushed; they can only be done step by step, no matter how many people or processors you throw at them.

We’ve scaled AI by making everything bigger and more parallel: our models are parallel, our scaling is parallel, our GPUs are parallel. But what if the real bottleneck isn’t size, but depth? What if the model just didn’t have enough serial steps to get it right? Some problems need depth, not width. This is the Serial Scaling Hypothesis.

This is not the same as recent studies on scaling test-time compute, which focus on train vs. test and are agnostic to parallel vs. serial. For example, test-time majority voting increases compute by running models in parallel, but that doesn’t help when the task itself is serial. We argue that what really matters is how the compute is structured, and for many real-world problems, it must be serial.

Read more at arxiv.org/abs/2507.12549 or 🧵. (In collaboration with @layer07_yuxi, Kananart Kuwaranancharoen, and @YutongBAI1002)
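
To make the serial-vs-parallel distinction concrete, here is a toy sketch (mine, not from the paper): iterating a chaotic map. Each step depends on the previous one, so running many shallower copies in parallel and aggregating them, the analogue of test-time majority voting, cannot substitute for the missing serial steps. All names and numbers here are illustrative.

```python
import random

def step(x, r=3.9):
    # One step of the chaotic logistic map. The next value depends
    # entirely on the previous one, so computing step T is inherently serial.
    return r * x * (1.0 - x)

def run_serial(x0, n_steps):
    x = x0
    for _ in range(n_steps):
        x = step(x)
    return x

def run_parallel_shallow(x0, n_steps, replicas=1000, depth_frac=0.1):
    # "More parallel compute": many independent, slightly perturbed runs,
    # each with only a fraction of the serial depth, then averaged.
    shallow_depth = int(n_steps * depth_frac)
    outs = [run_serial(x0 + random.uniform(-1e-6, 1e-6), shallow_depth)
            for _ in range(replicas)]
    return sum(outs) / len(outs)

truth = run_serial(0.2, 1000)
approx = run_parallel_shallow(0.2, 1000)
# Typically large: 1000x more width doesn't recover the 10x missing depth.
print(abs(truth - approx))
```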

🚀 Call for Papers! 🚀 Excited to help organize the 4th Workshop on What is Next in Multimodal Foundation Models? at ICCV in Honolulu, Hawai'i 🌺 Submit work on vision, language, audio & more! 🗓️ Deadline: July 1, 2025 🔗 sites.google.com/view/mmfm4thwo… #MMFM4 #ICCV2025 #AI #multimodal
