Howard Zhou

17 posts

@howardzzh

I'm a Senior Engineering Director at Google DeepMind, interested in Computer Vision, Machine Learning problems, and Computer Graphics.

Joined June 2011

80 Following · 74 Followers

Howard Zhou retweeted

André Araujo@andrefaraujo·2d

True multimodal AI needs to understand the world spatially 🎯 🚀 Excited to release #CVPR2026 TIPSv2 from @GoogleDeepMind, a foundational image-text encoder with spatial awareness, leading to strong overall results and massive gains on patch-text alignment. 🔥 1/N

English

716

77.4K

Howard Zhou@howardzzh·12 Mar

Text, image, video, audio, all embedded into one unified space. Congratulations to our team for this amazing accomplishment!

Google AI Studio@GoogleAIStudio

x.com/i/article/2031…

English

Howard Zhou retweeted

Google AI Developers@googleaidevs·5 Dec

Gemini 3 Pro is the frontier of multimodal AI, delivering SOTA performance across document, screen, spatial, and video understanding. Read our deep dive on how we’ve pushed our core capabilities to power hero use cases across:
+ Docs: "derender" complex docs into structured code (HTML/LaTeX)
+ Screen: build robust computer agents that automate complex tasks
+ Spatial: generate collision-free trajectories for robotics & XR
+ Video: analyze sports footage using high-FPS processing with "thinking" mode
See how these capabilities are transforming workflows in education, biomedical, and law/finance → goo.gle/3Mt3UlT

English

134

1.1K

328.8K

Howard Zhou@howardzzh·27 Mar

Please check out this cool work from @IanHuang3D and our team within @GoogleDeepMind Website: fireplace3d.github.io Paper: arxiv.org/abs/2503.04919

Ian Huang@IanHuang3D

🏡Building realistic 3D scenes just got smarter! Introducing our #CVPR2025 work, 🔥FirePlace, a framework that enables Multimodal LLMs to automatically generate realistic and geometrically valid placements for objects into complex 3D scenes. How does it work?🧵👇

English

151

Howard Zhou@howardzzh·27 Mar

Last Call: Learn GenAI and help us break the GUINNESS WORLD RECORDS™ for Largest Virtual AI Conference! Join Google & Kaggle's GenAI Intensive: No cost, live sessions, hands-on labs. Registration closes this Friday! #GenAI #GuinnessWorldRecords #Kaggle #GoogleAI

Kaggle@kaggle

🚀 World Record Alert! 🚀 Join the GenAI Intensive with Google and help us BREAK the @GWR title for Largest Virtual AI Conference! Registration closes on March 28th 11:59PM PT. Last chance to register: rsvp.withgoogle.com/events/google-…

English

100

Howard Zhou retweeted

Arena.ai@arena·25 Mar

BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆 Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn! Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌 More highlights in thread👇

Google DeepMind@GoogleDeepMind

Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks - meaning it can handle complex problems and give more accurate responses. Try it now → goo.gle/4c2HKjf

English

399

2.3K

467.3K

Howard Zhou retweeted

André Araujo@andrefaraujo·18 Mar

Multimodal AI encoders often lack spatial understanding… but not anymore! Our #ICLR2025 TIPS model (Text-Image Pretraining with Spatial awareness) from @GoogleDeepMind can help 💡🚀 Check out our strong & versatile image-text encoder 💪 Paper & code: arxiv.org/abs/2410.16512

English

322

35.4K

Howard Zhou@howardzzh·3 Nov

@nagaviperszy Are you planning to scare the bears away?

Chinese

风起梧桐轩 💔@nagaviperszy·3 Nov

My wife says this does more to reinforce safety awareness.

Chinese

2.2K

Howard Zhou retweeted

André Araujo@andrefaraujo·23 Oct

Want some TIPS? Well, then check out “Text-Image Pretraining with Spatial awareness” :) TIPS is a general-purpose image-text encoder, for off-the-shelf dense and image-level prediction. Finally image-text pretraining with spatially-aware representations! arxiv.org/abs/2410.16512

English

6.2K

Howard Zhou retweeted

AK@_akhaliq·6 Mar

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use

From content moderation to wildlife conservation, the number of applications that require models to recognize nuanced or subjective visual concepts is growing.

English

16.6K

Howard Zhou retweeted

Jason Baldridge@jasonbaldridge·22 Jun

We are excited to share our work on our Pathways Autoregressive Text-to-Image model, Parti! #Parti achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. parti.research.google

English

340

Howard Zhou retweeted

Frank Dellaert@fdellaert·21 Jun

Andrew Marmon and I rounded up all #CVPR2022 papers on NeRF/Neural Radiance Fields we could find in a new blog post here: dellaert.github.io/NeRF22/

English

129

493

Howard Zhou retweeted

Jeff Dean@JeffDean·5 May

New work from @GoogleResearch by @JHYUXM, @MrZiruiWang, @Spezzer, @LeggYeung, Mojtaba Seyedhosseini and Yonghui Wu: CoCa is a new way of combining image and text representations that achieves SOTA results on a large number of tasks of different kinds.

Zirui Wang@MrZiruiWang

CoCa: a new image-text foundation model subsuming single-encoder, dual-encoder and encoder-decoder. SOTA results on 19 unimodal/multimodal/alignment tasks including 86.3% zero-shot top-1 ImageNet, 90.6% with a frozen encoder, 91.0% when finetuned. Link: arxiv.org/abs/2205.01917

English

102

Howard Zhou retweeted

Frank Dellaert@fdellaert·11 Oct

In anticipation of the Intl. Conf. on Computer Vision (#ICCV2021) this week, I rounded up all papers that use Neural Radiance Fields (NeRFs) represented in the main #ICCV2021 conference here (1/N): dellaert.github.io/NeRF21

English

317

Howard Zhou@howardzzh·11 Jul

@docmilanfar Hell yeah!

English

Peyman Milanfar@docmilanfar·11 Jul

Whatever your allegiance it’s hard not to be happy for Messi 🇦🇷

English

Howard Zhou retweeted

Jon Barron@jon_barron·26 Feb

Training NeRFs per-scene is so 2020. Inspired by image based rendering, IBRNet does amortized inference for view synthesis by learning how to look at input images at render time. 15% drop in error, 80% fewer FLOPs than NeRF. Great work @QianqianWang5! ibrnet.github.io

English

442

Howard Zhou@howardzzh·22 Apr

Work from our team, yeah!

Google for Developers@googledevs

🙌 MediaPipe’s KNIFT is so money 🙌 Introducing KNIFT - a new local feature descriptor used for template matching, image retrieval, and other object recognition approaches. Check it out → goo.gle/3bx2TBZ

English
