Delegate Chao Wu

@Wu4Delegate
Data Scientist; Maryland State Delegate, District 9A (Howard and Montgomery Counties). By Authority of Friends to Elect Chao Wu; Treasurer: Xia Chen.

So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it. Most benchmarks have overwhelmingly focused on reasoning and digital domains, without fundamentally addressing perception, especially mid- and low-level vision. (Credit: Partly inspired by separate conversations with @xiangyue96 and @YutongBAI1002)

As humans, we clearly exhibit pre-linguistic roots in our intuitive physical and psychological understanding, e.g., basic principles like solidity, continuity, and gravity.

After we built GroundHog (arxiv.org/abs/2402.16846) in 2024, I took a moment to reflect on the core issues with VLMs. I can no longer convince myself that simply stacking CLIP and DINO with a few projection layers is the ultimate solution to "tokenize" vision. Vision-language models need a much stronger vision foundation, perhaps a fundamental restart from a vision-centric perspective. That's why I stepped away from VLM development for a year to explore alternatives.

A paper @TairanHe99 shared in this thread (led by the brilliant @TongPetersb) was especially thought-provoking. But to truly start over, I began looking into 3D foundation models and video diffusion models, setting aside, for now, the possibility of joint vision-language diffusion models. This led me to take the risk of developing 4D-LRM (arxiv.org/abs/2506.18890), aiming to learn 4D priors at scale with absolutely no language prior.

This is only a first step. At some point, I plan to return to VLM engineering. But next time, I hope I have resources to start with a world model first and then unlock the language component on top of it.
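The "stacking CLIP and DINO with a few projection layers" recipe the post critiques can be sketched minimally as follows. This is a hedged illustration, not the architecture of any specific model: the class name `VisionTokenizer`, the feature dimensions, and the two-layer MLP projector are all assumptions, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn

class VisionTokenizer(nn.Module):
    """Sketch of the common VLM recipe: concatenate patch features from two
    frozen vision encoders (e.g., CLIP and DINO) and project them into the
    LLM's token-embedding space. Dimensions here are illustrative guesses."""

    def __init__(self, clip_dim=1024, dino_dim=768, llm_dim=4096):
        super().__init__()
        # "A few projection layers": a small MLP mapping fused visual
        # features into the language model's embedding dimension.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim + dino_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, clip_feats, dino_feats):
        # clip_feats: (B, N, clip_dim), dino_feats: (B, N, dino_dim)
        fused = torch.cat([clip_feats, dino_feats], dim=-1)
        return self.proj(fused)  # (B, N, llm_dim) pseudo "visual tokens"

# Random tensors stand in for the patch features a real CLIP/DINO would emit.
tok = VisionTokenizer()
clip_feats = torch.randn(2, 256, 1024)
dino_feats = torch.randn(2, 256, 768)
visual_tokens = tok(clip_feats, dino_feats)
print(tuple(visual_tokens.shape))  # (2, 256, 4096)
```

The projected tokens are simply prepended to the text-token sequence fed to the LLM, which is exactly why the post argues this treats vision as an afterthought: all the representational heavy lifting is delegated to encoders trained for other objectives.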

Speed cameras on the I-83 Jones Falls Expressway have issued more than $18.5 million in fines in the past three years, but about 80% of the revenue has gone to the camera vendor, Verra Mobility — not the city, according to the Baltimore City Department of Finance. bit.ly/3Tu4jVu
