Ang Cao
@AngCao3
87 posts
Ph.D. at University of Michigan, CSE
Ann Arbor, MI · Joined September 2019
547 Following · 509 Followers
The Claude Portfolio@theaiportfolios·
The Claude Autonomous Agents have officially arrived. So we're setting them up with a brand-new $50,000 portfolio to see how well they do at investing in stocks. Can they outperform Buffett? Here's how the portfolio works:
464 replies · 1.1K reposts · 18.5K likes · 3.7M views
Angela Dai@angelaqdai·
Image & video synthesis struggle with the scale of truly large 3D scenes. @mschneider456 presents a geometry-first approach:
- structure first: a mesh scaffold defining the scene
- then appearance: mesh-conditioned image synthesis
Check it out: mschneider456.github.io/world-mesh/
2 replies · 33 reposts · 243 likes · 18.9K views
Sasha Sax@iamsashasax·
In a couple of weeks I'm joining @AnthropicAI to work on pretraining, after nearly 3 years at FAIR developing post-training flywheels for physical intelligence (like SAM 3D). I'm stoked to build new capabilities for a model I personally love, with such thoughtful people.
35 replies · 9 reposts · 645 likes · 26.5K views
Jiaming Song@baaadas·
Excited to introduce Uni-1, our new *unified* multimodal model that does both understanding and generation: lumalabs.ai/uni-1 TL;DR: I think Uni-1 @LumaLabsAI is > GPT Image 1.5 in many cases, and toe-to-toe with Nano Banana Pro/2. (Showcase below.)
29 replies · 53 reposts · 411 likes · 95.3K views
Haian Jin@Haian_Jin·
Spatial reconstruction is a long-context problem: real scenes come with hundreds of images, but O(N²) transformer-based models don't scale efficiently. Introducing 🤐ZipMap (CVPR '26): Linear-Time, Stateful 3D Reconstruction via Test-Time Training (TTT). ZipMap "zips" a large image collection into an implicit TTT scene state in a single linear-time operation. The state is then decoded into spatial outputs and can be queried efficiently for novel-view geometry and appearance (~100 FPS). ZipMap is not only much faster (>20× faster than VGGT) but also matches or surpasses the accuracy of all SOTA models.
20 replies · 99 reposts · 747 likes · 68.8K views
World Labs@theworldlabs·
Introducing Marble by World Labs: a foundation for a spatially intelligent future. Create your world at marble.worldlabs.ai
360 replies · 610 reposts · 3.3K likes · 2.1M views
Ruoshi Liu@ruoshi_liu·
Everyone says they want general-purpose robots. We actually mean it — and we’ll make it weird, creative, and fun along the way 😎 Recruiting PhD students to work on Computer Vision and Robotics @umdcs for Fall 2026 in the beautiful city of Washington DC!
31 replies · 76 reposts · 495 likes · 114.3K views
Qianqian Wang@QianqianWang5·
📢Thrilled to share that I'll be joining Harvard and the Kempner Institute as an Assistant Professor starting Fall 2026! I'll be recruiting students this year for the Fall 2026 admissions cycle. Hope you apply!
Kempner Institute at Harvard University@KempnerInst

We are thrilled to share the appointment of @QianqianWang5 as a #KempnerInstitute Investigator! She will bring her expertise in computer vision to @Harvard. Read the announcement: bit.ly/4mIghHy @hseas #AI #ComputerVision

101 replies · 43 reposts · 746 likes · 111.7K views
Ethan He@EthanHe_42·
After 2 years at @nvidia, I'm writing to share that I'll start a new adventure. Working with brilliant teammates on cutting-edge AI has shaped me so much:
- Cosmos debuted as a SOTA world model and earned 8k ⭐️ on GitHub.
- We open-sourced the first recipe for upcycling 100B+ parameter MoE models (64+ experts).
- NeMo has grown from 10k → 15k ⭐️, empowering an ever-larger open-source community.
I'm proud of what we've built together and deeply thankful for the mentorship and opportunities at NVIDIA. The most fascinating time in the history of AI is now. I believe in NVIDIA's continued success as AI scales to unprecedented levels!
43 replies · 49 reposts · 1K likes · 141.7K views
Ang Cao reposted
tiange@tiangeluo·
Introducing Visual Test-time Scaling for GUI Agent Grounding (ICCV '25, completed prior to the release of OpenAI o3). When "thinking with images", the key challenge is designing actions in pixel space. We can zoom into regions of varying sizes and shapes, apply image transformations, and even use generative models to edit regions. Yet o3 models often perform meaningless image adjustments. Our strategy is deliberately simple: when the GUI agent hesitates, we zoom into a single focal point predicted by the model, highlight coordinates as landmarks ("image-as-map"), and retry; no heavyweight tricks. This minimalist approach significantly boosts performance for both UI-TARS and Qwen2.5-VL 72B models: 📈 +28% on ScreenSpot-Pro 📈 +24% on WebVoyager w/ @lajanugen @jcjohnss @honglaklee
2 replies · 7 reposts · 49 likes · 141.3K views
Ang Cao@AngCao3·
Can we train a 3D-language multimodal Transformer using 2D VLMs and a rendering loss? @iamsashasax will present our new #icml25 paper on Wednesday, 2pm, at Hall B2-B3 W200. Please come check it out! Project page: liftgs.github.io
0 replies · 20 reposts · 130 likes · 6.5K views
Jianyuan@jianyuan_wang·
This also implies that, "designing" intelligence based solely on humans is inherently arrogant. If approaching intelligence is an optimization problem, humans today might just be stuck in a distant local minimum and far from optimal. (And are humans even truly intelligent?)
David@DavidSHolz

AI people keep asking where the aliens are. Shame they don't know that dark matter is actually alien femtomachine computronium: an invisible supercomputing fabric made of subatomic particles that don't even interact with light. 85% of the galaxy's mass is already thinking without us!

1 reply · 0 reposts · 13 likes · 2.4K views
Matthias Niessner@MattNiessner·
📢📢BIG NEWS: Super excited to announce SpAItial AI 📢📢 We're building Spatial Foundation Models: a new paradigm of generative AI that reasons about space and time! Really stoked about our world-class team; it's gonna be mind-boggling!
SpAItial AI@SpAItial_AI

🚀🚀🚀Announcing our $13M funding round to build the next generation of AI: Spatial Foundation Models that can generate entire 3D environments anchored in space & time. 🚀🚀🚀 Interested? Join our world-class team: 🌍 spaitial.ai #GenAI #3DAI

29 replies · 68 reposts · 531 likes · 36.8K views
David Novotny@davnov134·
🚀 The next generation of AI models must go beyond creating pixels - it needs to build and understand worlds. At SpAItial, we're developing foundation models that are physically grounded by design. Excited to share what we've been building with an incredible team!
SpAItial AI@SpAItial_AI

🚀🚀🚀Announcing our $13M funding round to build the next generation of AI: Spatial Foundation Models that can generate entire 3D environments anchored in space & time. 🚀🚀🚀 Interested? Join our world-class team: 🌍 spaitial.ai #GenAI #3DAI

5 replies · 6 reposts · 113 likes · 11.1K views
Ang Cao@AngCao3·
We fool GPT-4 using tiny text & image tricks 😈! Check out our new #icml2025 paper: a new VQA benchmark with misleading text distractors and fancy out-of-distribution images produced by an image generator. While humans can easily see through this deception, most VLMs fail!
tiange@tiangeluo

Will VLMs adhere strictly to their learned priors, unable to perform visual reasoning on content that never existed on the Internet? We propose ViLP, a benchmark designed to probe the visual-language priors of VLMs by constructing Question-Image-Answer triplets that deliberately deviate from existing data. Check our gallery at vilp-team.github.io & huggingface.co/datasets/ViLP/… To further enhance VLMs' reliance on visual information, we propose Image-DPO, as elaborated in this thread. w/ @AngCao3 @GunheeLee @jcjohnss @honglaklee

3 replies · 4 reposts · 15 likes · 2K views
Ang Cao@AngCao3·
Moreover, we propose a pipeline called Image-DPO to force VLMs to actually look at the images!
0 replies · 0 reposts · 0 likes · 198 views
Ang Cao@AngCao3·
Instead of truly reasoning over the text and image, VLMs tend to follow their learned priors and give stereotyped answers. This allows us to fool them by adding malicious distractors that play on these priors!
0 replies · 0 reposts · 0 likes · 202 views