Ang Cao
@AngCao3
87 posts
Ph.D. at University of Michigan, CSE
Ann Arbor, MI · Joined September 2019
547 Following · 509 Followers

The Claude Portfolio (@theaiportfolios):
The Claude Autonomous Agents have officially arrived. So we're setting them up with a brand-new $50,000 portfolio to see how well they do at investing in stocks. Can they outperform Buffett? Here's how the portfolio works.

Angela Dai (@angelaqdai):
Image & video synthesis struggle with the scale of truly large 3D scenes. @mschneider456 presents a geometry-first approach:
- structure first: a mesh scaffold defining the scene
- then appearance: mesh-conditioned image synthesis
Check it out: mschneider456.github.io/world-mesh/

Sasha Sax (@iamsashasax):
In a couple of weeks I'm joining @AnthropicAI to work on pretraining, after nearly 3 years at FAIR developing post-training flywheels for physical intelligence (like SAM 3D). I'm stoked to build new capabilities for a model I personally love, with such thoughtful people.

Jiaming Song (@baaadas):
Excited to introduce Uni-1, our new *unified* multimodal model that does both understanding and generation: lumalabs.ai/uni-1 TLDR: I think Uni-1 @LumaLabsAI is > GPT Image 1.5 in many cases, and toe-to-toe with Nano Banana Pro/2. (showcase below)

Haian Jin (@Haian_Jin):
Spatial reconstruction is a long-context problem: real scenes come with hundreds of images, but O(N²) transformer-based models don't scale efficiently.

Introducing ZipMap (CVPR '26): Linear-Time, Stateful 3D Reconstruction via Test-Time Training (TTT). ZipMap "zips" a large image collection into an implicit TTT scene state in a single linear-time pass. The state is then decoded into spatial outputs and can be queried efficiently for novel-view geometry and appearance (~100 FPS). ZipMap is not only much faster (>20× faster than VGGT) but also matches or surpasses the accuracy of all SOTA models.
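
To make the linear-time, stateful idea concrete, here is a minimal sketch: a stream of images is folded into one fixed-size scene state with a test-time-training update per image, and outputs are then decoded from the state alone. The state size, encoder/decoder stubs, and loss are illustrative assumptions, not ZipMap's actual architecture.

```python
import torch

D = 1024                                    # fixed state size, independent of N
state = torch.zeros(D, requires_grad=True)
opt = torch.optim.SGD([state], lr=1e-2)

def encode(img: torch.Tensor) -> torch.Tensor:
    # Stand-in for a frozen image encoder producing a D-dim target.
    return img.flatten()[:D]

def decode(state: torch.Tensor) -> torch.Tensor:
    # Stand-in decoder; a real system would also take a query camera pose.
    return state

images = [torch.randn(32, 32) for _ in range(200)]  # N = 200 input views

# "Zipping": one TTT gradient step per image -- O(N) total compute and O(1)
# memory in N, versus O(N^2) attention over all views at once.
for img in images:
    loss = (decode(state) - encode(img)).pow(2).mean()  # self-supervised loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Querying: spatial outputs come from the compact state, not the N images,
# which is what makes per-query decoding cheap.
with torch.no_grad():
    novel_view = decode(state)
```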

World Labs (@theworldlabs):
Introducing Marble by World Labs: a foundation for a spatially intelligent future. Create your world at marble.worldlabs.ai

Ruoshi Liu (@ruoshi_liu):
Everyone says they want general-purpose robots. We actually mean it, and we'll make it weird, creative, and fun along the way. Recruiting PhD students to work on Computer Vision and Robotics @umdcs for Fall 2026 in the beautiful city of Washington DC!

Qianqian Wang (@QianqianWang5):
Thrilled to share that I'll be joining Harvard and the Kempner Institute as an Assistant Professor starting Fall 2026! I'll be recruiting students this year for the Fall 2026 admissions cycle. Hope you apply!
Quoting Kempner Institute at Harvard University (@KempnerInst):

We are thrilled to share the appointment of @QianqianWang5 as a #KempnerInstitute Investigator! She will bring her expertise in computer vision to @Harvard. Read the announcement: bit.ly/4mIghHy @hseas #AI #ComputerVision


Ethan He (@EthanHe_42):
After 2 years at @nvidia, I'm writing to share that I'll start a new adventure. Working with brilliant teammates on cutting-edge AI has shaped me so much:
- Cosmos debuted as a SOTA world model and earned 8k stars on GitHub.
- We open-sourced the first recipe for upcycling 100B+ parameter MoE models (64+ experts).
- NeMo has grown from 10k to 15k stars, empowering an ever-larger open-source community.
I'm proud of what we've built together and deeply thankful for the mentorship and opportunities at NVIDIA. The most fascinating time in the entire history of AI is now. I believe in NVIDIA's continued success as AI scales to unprecedented levels!

Ang Cao retweeted
tiange (@tiangeluo):
Introducing Visual Test-time Scaling for GUI Agent Grounding (ICCV'25, completed prior to the release of OpenAI o3).

When "thinking with images", the key challenge is designing actions in pixel space. We can zoom into regions of varying sizes and shapes, apply image transformations, and even use generative models to edit regions. Yet o3 models often perform meaningless image adjustments.

Our strategy is deliberately simple: when the GUI agent hesitates, we zoom into a single focal point predicted by the model, highlight coordinates as landmarks ("image-as-map"), and retry. No heavyweight tricks. This minimalist approach significantly boosts performance for both UI-TARS and Qwen2.5-VL 72B:
+28% on ScreenSpot-Pro
+24% on WebVoyager
w/ @lajanugen @jcjohnss @honglaklee
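
A minimal sketch of that zoom-and-retry loop, assuming a hypothetical `agent.predict(image, instruction) -> (x, y, confidence)` interface; the hesitation threshold, crop size, and the omitted landmark overlay are illustrative assumptions, not the paper's exact recipe.

```python
from PIL import Image

def ground(agent, screenshot: Image.Image, instruction: str):
    # First attempt at full resolution.
    x, y, conf = agent.predict(screenshot, instruction)  # (x, y, confidence)
    if conf >= 0.5:                                      # assumed hesitation test
        return x, y

    # Agent hesitated: zoom into a crop centered on its own focal-point guess.
    w, h = screenshot.size
    cw, ch = w // 3, h // 3                              # assumed crop size
    left = min(max(0, x - cw // 2), w - cw)
    top = min(max(0, y - ch // 2), h - ch)
    crop = screenshot.crop((left, top, left + cw, top + ch))
    zoomed = crop.resize((w, h))                         # magnified view

    # Retry on the zoomed view (a real system would also overlay coordinate
    # landmarks, the "image-as-map" trick), then map back to screen coords.
    zx, zy, _ = agent.predict(zoomed, instruction)
    return left + zx * cw // w, top + zy * ch // h
```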

Ang Cao (@AngCao3):
Can we train a 3D-language multimodal Transformer using 2D VLMs and a rendering loss? @iamsashasax will present our new #icml25 paper on Wednesday at 2pm, Hall B2-B3 W200. Please come check it out! Project Page: liftgs.github.io

Jianyuan (@jianyuan_wang):
This also implies that "designing" intelligence based solely on humans is inherently arrogant. If approaching intelligence is an optimization problem, humans today might just be stuck in a distant local minimum and far from optimal. (And are humans even truly intelligent?)
Quoting David (@DavidSHolz):

ai people keep asking where the aliens are. shame they don't know that dark matter is actually alien femtomachine computronium; invisible supercomputing fabric made of subatomic particles that don't even interact w light. 85% of the galaxy's mass is already thinking without us!


Matthias Niessner (@MattNiessner):
BIG NEWS: Super excited to announce SpAItial AI! We're building Spatial Foundation Models, a new paradigm of generative AI that reasons about space and time! Really stoked about our world-class team; it's gonna be mind-boggling!
Quoting SpAItial AI (@SpAItial_AI):

Announcing our $13M funding round to build the next generation of AI: Spatial Foundation Models that can generate entire 3D environments anchored in space & time. Interested? Join our world-class team: spaitial.ai #GenAI #3DAI


Ang Cao (@AngCao3):
We fool GPT-4 using tiny text & image tricks! Check out our new #icml2025 paper: a new VQA benchmark with misleading text distractors and out-of-distribution images produced by image generators. While humans can easily see through this deception, most VLMs fail!
Quoting tiange (@tiangeluo):

Will VLMs adhere strictly to their learned priors, unable to perform visual reasoning on content that never existed on the Internet? We propose ViLP, a benchmark designed to probe the visual-language priors of VLMs by constructing Question-Image-Answer triplets that deliberately deviate from existing data. Check our gallery at vilp-team.github.io & huggingface.co/datasets/ViLP/… To further enhance VLMs' reliance on visual information, we propose Image-DPO, as elaborated in this thread. w/ @AngCao3 @GunheeLee @jcjohnss @honglaklee
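
For a sense of what such a prior-violating triplet looks like, here is a hypothetical illustration; the concrete example is invented, not drawn from the ViLP dataset.

```python
# Hypothetical ViLP-style Question-Image-Answer triplet. The image is
# constructed to contradict the textual prior, so a model answering from
# language priors alone gets it wrong, while a model that actually looks
# at the pixels succeeds. See vilp-team.github.io for real samples.
triplet = {
    "question": "What color is the banana in the image?",
    "image": "banana_painted_blue.png",  # deliberately prior-violating image
    "prior_answer": "yellow",            # what language priors alone suggest
    "answer": "blue",                    # correct answer requires looking
}
```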


Ang Cao (@AngCao3):
Moreover, we propose a pipeline called Image-DPO to force VLMs to actually look at the images!
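
A minimal sketch of how an Image-DPO-style preference pair could be built, assuming the idea is to prefer an answer conditioned on the clean image over the same answer conditioned on a corrupted one; the blur corruption and the data layout are illustrative assumptions, not the paper's exact recipe.

```python
from PIL import Image, ImageFilter

def make_image_dpo_pair(image: Image.Image, question: str, answer: str) -> dict:
    # Corrupt the image so the visual evidence for the answer disappears.
    corrupted = image.filter(ImageFilter.GaussianBlur(radius=8))
    return {
        "question": question,
        "chosen":   {"image": image,     "answer": answer},  # grounded in pixels
        "rejected": {"image": corrupted, "answer": answer},  # same text, no evidence
    }
```

A standard DPO loss over such pairs rewards the model for assigning higher likelihood to the answer given the clean image than given the corrupted one, penalizing answers produced from language priors alone.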

Ang Cao (@AngCao3):
Instead of truly reasoning over the image and the text, VLMs tend to follow their learned priors and give stereotyped answers. This lets us fool them by adding malicious distractors that emphasize these priors!