Mu Cai

359 posts

@MuCai7

Research @thinkymachines | Previous: multimodal, agents @GoogleDeepMind

Mountain View · Joined May 2019
1.5K Following · 3.4K Followers
Mu Cai reposted
Harris Zhang
Harris Zhang@HyperStorm9682·
🚨 Your Embedding Model is SMARTer Than You Think! Single-vector models actually hide powerful multi-vector capabilities in their frozen hidden states. We introduce SMART, a framework that unlocks this ability for SoTA multimodal retrieval. 🧵👇 🔗 huggingface.co/papers/2605.24…
1
18
72
15.5K
Mu Cai
Mu Cai@MuCai7·
Wow, always high-quality papers from Xueyan and Yuheng! This could be a good measure for video generation.
Xueyan Zou@xyz2maureen

🔥Excited to share the first released work from our IEI lab! Congrats to @AnteaWu 🎉 This work is motivated by the lack of quantitative evaluation for physics alignment in video world models. With tools like MegaSam and CoTracker, we can directly reconstruct dynamic 3D scenes, enabling quantitative evaluation of physical alignment. Both code and data are released — feel free to try it out! It should work, but if it doesn’t, contact @AnteaWu directly : )

0
1
16
3.9K
Shilong Liu
Shilong Liu@atasteoff·
Career Update: I will join the Department of Electrical Engineering at Columbia University as a tenure-track Assistant Professor, starting in Fall 2027. My research will focus primarily on computer vision, self-evolving agents, and world models for embodied AI. I will be recruiting PhD students for Fall 2027. Motivated research interns, visiting students, and collaborators are also very welcome to reach out. More information: lsl.zone
38
15
586
55.1K
Yong Jae Lee
Yong Jae Lee@yong_jae_lee·
Great to be back in Madison last weekend for @yu_zhuoran32720’s PhD graduation! Zhuoran is my 9th PhD student and did really cool work on how data is processed in multimodal models + better ways to use synthetic & unlabeled data. Congrats again Zhuoran - see you in SF soon :)
4
2
52
5.6K
Mu Cai
Mu Cai@MuCai7·
My first share since joining @thinkymachines. Fun working with this team on real-time multimodal interaction. Vision in turn-based models felt like flipping through photos; continuous video is a different problem. Visual proactivity is essential. Grateful to have worked on this alongside @liliyu_lili, @rown, and the rest of the team!
Thinking Machines@thinkymachines

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…

6
6
159
10.5K
Mu Cai reposted
Thinking Machines
Thinking Machines@thinkymachines·
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…
460
1.9K
15.7K
7.6M
Yihe Deng
Yihe Deng@Yihe__Deng·
Last day at xAI. For a new grad, the past six months have been an irreplaceable experience. I feel fortunate to have made the decision to join this journey with xAI, and grateful for how much I was able to learn here in such a short, dense period of time. I'm proud of what we built, and what the multimodal team continues to build. I have deep faith in this team. No matter where I go next, I'll always look forward to seeing what my friends here pull off and bring into the world. I'm especially grateful to my captains along the way -- people I look up to, trust deeply, and who placed trust in my potential. I truly appreciate all the friends I met here, and the time we spent building together. And thanks xAI for the opportunity, and for giving me the space to learn, contribute, and grow. In the end, the greatest treasure is indeed the journey itself: the problems worth solving, and the people worth building with. Now, it's time to step into the uncertainty of what comes next.
49
5
549
36K
Mu Cai reposted
Logan Kilpatrick
Logan Kilpatrick@OfficialLoganK·
Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable open models in the world! Gemma 4 is built to run on your hardware: phones, laptops, and desktops. Frontier intelligence with a 26B MoE and a 31B Dense model!
287
593
6.2K
524.8K
Mu Cai
Mu Cai@MuCai7·
@CatGodSandHive Exactly! And this is why we think the computer vision community has ignored this important direction: multiscale in pixel space!
0
0
1
45
CatGod
CatGod@CatGodSandHive·
@MuCai7 So you're saying multiscale on pixels works better than on features? That's a plot twist, catnip for my curiosity, am I dreaming?
1
0
1
138
Mu Cai
Mu Cai@MuCai7·
🤯 Upgrade your pretrained visual encoder with <10 lines of code. This is what vision researchers have ignored: can you imagine that multiscale in pixel space works so well?! Remember, we are not doing multiscale in feature space! 🏠Project Page: MuRF-VFM.github.io 📷 Paper: arxiv.org/abs/2603.25744 Get uniform improvements on MLLM, Seg, and Depth at a similar computation cost.
Bocheng Zou@bochengzou

🔥 Upgrade your frozen vision encoders with <10 lines of code! Single-scale inference throws away vital details. Enter MuRF 🚀: a simple, training-free plug-in for instant, massive gains in MLLMs, Seg & Depth. 🤯 1/6

4
30
158
19.1K
Mu Cai
Mu Cai@MuCai7·
Good question, we have an efficiency analysis in the paper! And it is straightforward. For MLLMs: MuRF keeps the same number of tokens as the single-scale setup by design, leading to the same computation cost in the LLM part. Empirically, we observed that MuRF achieves similar VRAM usage, training time, and inference time compared to single-resolution for MLLMs. This works out because the visual encoder is much smaller than the LLM!
0
0
2
95
JJJYmmm
JJJYmmm@JJJYmmm2002·
@MuCai7 any flops analysis? 🧐
1
0
0
160
Mu Cai
Mu Cai@MuCai7·
Hi Thomas, thanks for the comment! Huge fan of S² and learned upsamplers like AnyUp! 🤝 While we share the goal of multi-scale representation, MuRF takes a fundamentally different path.

TL;DR: We show that simply resizing the whole image (no tiling!) and fusing features creates a universally stronger representation without any learned upsampling heuristics.

Here is the deeper dive into why we are different:

1️⃣ Motivation & Token Budget: We asked: Does higher resolution always mean better features? Surprisingly, no! Low-res provides crucial global context that actually improves high-res performance. For MLLMs, we lift the performance ceiling by a large margin while keeping the exact same number of visual tokens!

2️⃣ Approach (No Tiling, No Bells & Whistles): Unlike S², which cuts images into independent patches (breaking spatial layout and object continuity), we process the entire image at different scales. No complex layout engineering. As for AnyUp, learned upsamplers are great, but our parameter-free bilinear upsampling requires zero training. This guarantees extreme simplicity, maximum flexibility, and prevents generalizability issues.

3️⃣ Universal Application: We aren't just optimizing MLLM token budgets. MuRF is a fundamental, training-free enhancement for visual representations, generalizing flawlessly out-of-the-box across high-level reasoning (MLLMs), dense geometry (Seg/Depth), and even unsupervised anomaly detection.

We believe this simple, holistic multi-scale synergy is a highly promising direction. Let's push toward better visual representations together! 🚀
0
0
0
87
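The reply above describes the recipe in words: resize the whole image to several scales, run the same frozen encoder on each, bilinearly upsample the feature grids to a common size, and fuse. A minimal pure-Python sketch of that idea follows. It is not the authors' code: `bilinear_resize`, `toy_encoder`, and `multi_scale_fuse` are hypothetical names, the encoder is a mean-pooling stand-in, and fusing by channel concatenation is an assumption chosen because it keeps the token count fixed.

```python
from typing import Callable, List, Sequence

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def bilinear_resize(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Parameter-free bilinear resize of an H x W x C grid (corner-aligned)."""
    in_h, in_w, ch = len(grid), len(grid[0]), len(grid[0][0])
    out: Grid = []
    for i in range(out_h):
        # Map the output row back to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][c] + wx * grid[y0][x1][c])
                + wy * ((1 - wx) * grid[y1][x0][c] + wx * grid[y1][x1][c])
                for c in range(ch)
            ])
        out.append(row)
    return out


def toy_encoder(image: Grid, patch: int = 2) -> Grid:
    """Stand-in for a frozen vision encoder: mean-pools non-overlapping patches."""
    h, w, ch = len(image), len(image[0]), len(image[0][0])
    feats: Grid = []
    for i in range(0, h - patch + 1, patch):
        row = []
        for j in range(0, w - patch + 1, patch):
            row.append([
                sum(image[i + di][j + dj][c]
                    for di in range(patch) for dj in range(patch)) / (patch * patch)
                for c in range(ch)
            ])
        feats.append(row)
    return feats


def multi_scale_fuse(image: Grid, encoder: Callable[[Grid], Grid],
                     sizes: Sequence[int], out_grid: int) -> Grid:
    """Encode the whole image at each size, upsample features, concat channels."""
    fused = None
    for size in sizes:
        resized = bilinear_resize(image, size, size)        # whole image, no tiling
        feats = bilinear_resize(encoder(resized), out_grid, out_grid)
        if fused is None:
            fused = feats
        else:
            # Assumed fusion: list '+' concatenates the per-token channel vectors.
            fused = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(fused, feats)]
    return fused


# Usage: an 8x8 single-channel gradient image, fused from two scales.
image = [[[float(r + c)] for c in range(8)] for r in range(8)]
fused = multi_scale_fuse(image, toy_encoder, sizes=(4, 8), out_grid=4)
print(len(fused), len(fused[0]), len(fused[0][0]))  # grid height, width, channels
```

Under these assumptions the fused grid has the same number of spatial tokens as the single-scale 8x8 pass (a 4x4 grid); only the channel width grows. That is one way to read the "same number of visual tokens" claim, and a real pipeline would swap `toy_encoder` for an actual frozen backbone.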
Thomas Wimmer
Thomas Wimmer@wimmer_th·
@MuCai7 github.com/bfshi/scaling_… Isn't that pretty much what Shi et al. did in ECCV 2024? You're upsampling bilinearly (why not use a feature-agnostic learned upsampler like AnyUp?) instead of downsampling before aggregation, but that's about it, at first glance?
1
0
4
432
Mu Cai
Mu Cai@MuCai7·
Huge congrats to @bochengzou, who began working on this two years ago and made this magical technique happen!
0
0
4
617
Mu Cai reposted
Bocheng Zou
Bocheng Zou@bochengzou·
🔥 Upgrade your frozen vision encoders with <10 lines of code! Single-scale inference throws away vital details. Enter MuRF 🚀: a simple, training-free plug-in for instant, massive gains in MLLMs, Seg & Depth. 🤯 1/6
7
26
147
28.3K