Antoine Yang

160 posts

@AntoineYang2

Senior Research Scientist @GoogleDeepMind, Gemini video 💎. Prev: PhD @Inria & @ENS_ULM, MEng @Polytechnique.

London, England · Joined November 2019
482 Following · 1.5K Followers
Pinned Tweet
Antoine Yang @AntoineYang2
Thrilled to share our latest advances in video understanding 📽️: Gemini 2.5 Pro is a truly magical model to play with, excelling in traditional video analysis and unlocking new use cases I could not imagine a few months ago🪄 More in 🧵 and @Google blog: developers.googleblog.com/en/gemini-2-5-…
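As a quick illustration of the kind of video analysis described here, a minimal sketch using the Gemini API, assuming the google-genai Python SDK and a GEMINI_API_KEY in the environment; the file name and prompt are illustrative placeholders, not details from the post:

```python
# Minimal sketch: upload a video and ask Gemini 2.5 Pro about it.
# Assumes the google-genai Python SDK (pip install google-genai) and
# GEMINI_API_KEY set in the environment; file name and prompt are placeholders.
import time

from google import genai

client = genai.Client()

# Upload the clip via the Files API and wait until it has been processed.
video = client.files.upload(file="match_highlights.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Summarize the key moments in this clip with timestamps."],
)
print(response.text)
```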
Antoine Yang retweeted
Mohit Bansal @mohitban47
🚨 New #CVPR2026 collaboration with Google DeepMind --> Ego2Web bridges egocentric video perception and web execution, enabling agents that see the first-person real-world video of the user’s surroundings, and take actions on the web grounded in the egocentric video:
▪️ Introduces a task where agents must ground egocentric video (first-person view) into concrete web actions (requires visual grounding → entity extraction → planning → real website execution).
▪️ Covers realistic cross-domain tasks e.g., e-commerce (find/buy items you saw), media retrieval (find related videos), knowledge lookup (identify & query entities), maps/local (locate places from visual cues).
▪️ Proposes Ego2WebJudge to automatically evaluate whether web agent results are correctly grounded in the video context.
▪️ Reveals concrete failure modes across 6 strong agents (GPT-5.4, Claude, Gemini-based agents, etc.): weak visual grounding, brittle cross-modal reasoning, and planning breakdowns (only ~58% success rate).
Details 👇👇
Shoubin Yu@shoubin621

Introducing Ego2Web from Google DeepMind and UNC Chapel Hill, accepted to #CVPR2026. AI agents can browse the web. But can they act based on what you see? Existing benchmarks focus only on web interaction while ignoring the real world. Ego2Web bridges egocentric video perception and web execution, enabling agents that can see through first-person video, understand real-world context, and take actions on the web grounded in the egocentric video. This opens a path toward AI assistants that operate seamlessly across physical and digital environments. We hope Ego2Web serves as an important step for building more capable, perception-driven agents. 🧵👇

Antoine Yang @AntoineYang2
Check the Ego2Web benchmark if you're interested in agents interacting with the Web while being visually grounded in egocentric videos 📽️
Shoubin Yu@shoubin621

(same Ego2Web announcement quoted above)
Antoine Yang retweeted
Google Gemini @GeminiApp
We partnered with @ICC to show how Gemini 3 Pro can analyze video content. By uploading a segment of the Cricket World Cup, Gemini can seamlessly process visual and audio data to identify key players, explain techniques, and highlight crucial turning points. 🏏
Antoine Yang retweeted
Fei Xia @xf1280
🚀 Excited to share that #Gemini 3 Flash can do code execution on images to zoom, count, and annotate visual inputs! The model can choose when to write code to:
🔍 Zoom & Inspect: Detect when details are too small and zoom in.
🧮 Compute Visually: Run multi-step calculations using code (e.g., summing line items on a receipt).
✏️ Annotate: Draw arrows or bounding boxes to answer questions or show relationships between objects.
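A rough sketch of trying this through the Gemini API's code execution tool, assuming the google-genai Python SDK; the model id, file name, and prompt are illustrative placeholders rather than anything confirmed in the post:

```python
# Sketch: let the model write and run code over an image (e.g. sum receipt line items).
# Assumes the google-genai Python SDK and GEMINI_API_KEY; model id, file name,
# and prompt are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client()
receipt = client.files.upload(file="receipt.jpg")

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder: substitute whichever Flash model you have access to
    contents=[receipt, "Sum all line items on this receipt and show your working."],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves text, generated code, and code output parts.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```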
Antoine Yang retweeted
Wes Roth @WesRoth
Demis Hassabis says the most ignored marvel is AI’s ability to understand video, images, and audio together. Gemini can watch a movie scene and explain the symbolism behind a tiny gesture. This shows the model grasps concepts, not just pixels or words. Such deep cross-media reasoning is still under-appreciated outside AI circles.
Antoine Yang retweeted
Google AI Developers @googleaidevs
Gemini 3 Pro is the frontier of multimodal AI, delivering SOTA performance across document, screen, spatial, and video understanding. Read our deep dive on how we’ve pushed our core capabilities to power hero use cases across:
+ Docs: "derender" complex docs into structured code (HTML/LaTeX)
+ Screen: build robust computer agents that automate complex tasks
+ Spatial: generate collision-free trajectories for robotics & XR
+ Video: analyze sports footage using high-FPS processing with "thinking" mode
See how these capabilities are transforming workflows in education, biomedical, and law/finance → goo.gle/3Mt3UlT
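As a rough illustration of the "derender" use case, a sketch that sends a PDF to the API and asks for structured HTML back; it assumes the google-genai Python SDK, and the model id, file name, and prompt are placeholders rather than details from the post:

```python
# Sketch: "derender" a complex document into structured HTML.
# Assumes the google-genai Python SDK and GEMINI_API_KEY; file name, model id,
# and prompt are illustrative placeholders.
from google import genai

client = genai.Client()
doc = client.files.upload(file="quarterly_report.pdf")

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder id for a Gemini 3 Pro model
    contents=[
        doc,
        "Recreate this document as clean, semantic HTML, preserving tables, "
        "headings, and reading order.",
    ],
)
print(response.text)
```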
Antoine Yang retweeted
Chubby♨️ @kimmonismus
Google cooked so hard. Not gonna lie, this feels like the future is here. Now develop Google Glasses with enough battery power, a good chip, and a look like Ray-Bans, and you'll have an instant hit. 100%.
Antoine Yang retweeted
AshutoshShrivastava @ai_for_success
I will go first: Gemini
- Able to upload videos and work with them (very important for me)
- Gemini Live is awesome
- Access to NotebookLM
- Nano Banana Pro
- Love Gemini Deep Research
- 2TB Storage: not a priority, but it's a good addition
Antoine Yang retweeted
Google @Google
Introducing Nano Banana Pro (Gemini 3 Pro Image), our new state-of-the-art image generation and editing model from @GoogleDeepMind. It improves on the original model while adding new advanced capabilities, enhanced world knowledge and text rendering, allowing you to create and edit studio-quality, production-ready visuals.
Antoine Yang retweeted
koray kavukcuoglu @koraykv
2/4 It’s our most intelligent model yet:
🏆 Tops LMArena Leaderboard at 1501 Elo and WebDev with 1487 Elo
📝 New SOTA in reasoning on Humanity’s Last Exam (37.5% w/o tools) and GPQA Diamond (91.9%)
🧠 Scores an industry-leading 31.1% on ARC-AGI-2
🎨 Breakthrough scores in multimodality, with 81% on MMMU-Pro and 87.6% on Video MMMU
Antoine Yang retweeted
Google DeepMind @GoogleDeepMind
This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵
Antoine Yang retweeted
Ani Baddepudi @AniBaddepudi
Gemini's still the only frontier model that supports native video input (and is amazing at it!). Incredible amount of real-world utility, given how much of the world's information is increasingly in video.
Antoine Yang retweeted
Andi Marafioti @andimarafioti
The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos. Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking points—now the community can start fixing them.📈
Antoine Yang @AntoineYang2
RT @Google: The plot thickens. 🕵️ Use Gemini 2.5 Pro to turn random videos from your camera roll into a dramatic narrative. Try it yoursel…
Antoine Yang retweeted
Logan Kilpatrick @OfficialLoganK
We just shipped video FPS support in the Gemini API, so you can dynamically customize how many frames per second you want the model to see, unlocking lots of interesting new video use cases! 📹
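A minimal sketch of what using this might look like with the google-genai Python SDK, assuming a video_metadata/fps field on the video part; the fps value, file name, model id, and prompt are illustrative, not taken from the post:

```python
# Sketch: control how many frames per second the model samples from a video.
# Assumes the google-genai Python SDK and GEMINI_API_KEY; the fps value,
# file name, model id, and prompt are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client()
video = client.files.upload(file="tennis_rally.mp4")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(parts=[
        types.Part(
            file_data=types.FileData(file_uri=video.uri, mime_type=video.mime_type),
            # Higher fps for fast motion, lower fps (e.g. 0.5) for long, static footage.
            video_metadata=types.VideoMetadata(fps=5),
        ),
        types.Part(text="Describe the fastest exchange in this clip, with timestamps."),
    ]),
)
print(response.text)
```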
Antoine Yang @AntoineYang2
@ai_andrew This seems to be a parsing or instruction-following issue? We do not often look at test set results in long-context settings. For quite a few of these benchmarks, these are the best numbers we've measured, and this will remain stable now as it is GA 🙂
⟁ndrew V @AI_Andrew
Seriously, even better? Because there was a checkpoint earlier that seemed to be pretty spectacular and then it degraded, so is it back to as good as it used to be, or is it even better than ever? Not trying to be a troll or anything like that, just curious, because when the model changes it changes the JSON structured output and I have to figure out a new schema, or some other adjustment to accommodate. It was fantastic when we could have a context of 2 million for the longer videos or a collection of many short videos. I understand that time is in the past, but yeah, is it better than ever or as good as it was before?
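On the JSON point: one way to make output parsing less sensitive to model changes is to pin a response schema rather than relying on prompt-only instructions. A minimal sketch, assuming the google-genai Python SDK and a Pydantic model; the schema and field names below are made up for illustration, not taken from this conversation:

```python
# Sketch: constrain output to a fixed JSON schema so parsing survives model updates.
# Assumes the google-genai Python SDK and GEMINI_API_KEY; the schema is an
# illustrative placeholder, not a schema from this thread.
from pydantic import BaseModel

from google import genai
from google.genai import types

class Scene(BaseModel):
    start_time: str
    end_time: str
    description: str

class VideoAnalysis(BaseModel):
    title: str
    scenes: list[Scene]

client = genai.Client()
video = client.files.upload(file="holiday_clip.mp4")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Break this video into scenes."],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=VideoAnalysis,
    ),
)
analysis = response.parsed  # a VideoAnalysis instance
print(analysis.title, len(analysis.scenes))
```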
Antoine Yang @AntoineYang2
The newly generally available Gemini 2.5 Flash and Pro are even better at video understanding than the versions we shared in the blog a month ago; see more details in the tech report 😀
Google DeepMind@GoogleDeepMind

Hot Gemini updates off the press. 🚀 Anyone can now use 2.5 Flash and Pro to build and scale production-ready AI applications. 🙌 We’re also launching 2.5 Flash-Lite in preview: the fastest model in the 2.5 family to respond to requests, with the lowest cost too. 🧵
