Antoine Yang

160 posts

@AntoineYang2

Senior Research Scientist @GoogleDeepMind, Gemini video 💎. Prev: PhD @Inria & @ENS_ULM, MEng @Polytechnique.

London, England · Joined November 2019
482 Following · 1.5K Followers
Pinned Tweet
Antoine Yang @AntoineYang2
Thrilled to share our latest advances in video understanding 📽️: Gemini 2.5 Pro is a truly magical model to play with, excelling in traditional video analysis and unlocking new use cases I could not imagine a few months ago🪄 More in 🧵 and @Google blog: developers.googleblog.com/en/gemini-2-5-…
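As a quick illustration of the kind of video analysis described here, a minimal sketch using the Gemini API, assuming the google-genai Python SDK and a GEMINI_API_KEY in the environment; the file name and prompt are illustrative placeholders, not details from the post:

```python
# Minimal sketch: upload a video and ask Gemini 2.5 Pro about it.
# Assumes the google-genai Python SDK (pip install google-genai) and
# GEMINI_API_KEY set in the environment; file name and prompt are placeholders.
import time

from google import genai

client = genai.Client()

# Upload the clip via the Files API and wait until it has been processed.
video = client.files.upload(file="match_highlights.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Summarize the key moments in this clip with timestamps."],
)
print(response.text)
```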
Antoine Yang retweeted
Mohit Bansal @mohitban47
🚨 New #CVPR2026 collaboration with Google DeepMind --> Ego2Web bridges egocentric video perception and web execution, enabling agents that see the first-person real-world video of the user’s surroundings, and take actions on the web grounded in the egocentric video:
▪️ Introduces a task where agents must ground egocentric video (first-person view) into concrete web actions (requires visual grounding → entity extraction → planning → real website execution).
▪️ Covers realistic cross-domain tasks e.g., e-commerce (find/buy items you saw), media retrieval (find related videos), knowledge lookup (identify & query entities), maps/local (locate places from visual cues).
▪️ Proposes Ego2WebJudge to automatically evaluate whether web agent results are correctly grounded in the video context.
▪️ Reveals concrete failure modes across 6 strong agents (GPT-5.4, Claude, Gemini-based agents, etc.): weak visual grounding, brittle cross-modal reasoning, and planning breakdowns (only ~58% success rate).
Details 👇👇
Shoubin Yu@shoubin621

Introducing Ego2Web from Google DeepMind and UNC Chapel Hill, accepted to #CVPR2026. AI agents can browse the web. But can they act based on what you see? Existing benchmarks focus only on web interaction while ignoring the real world. Ego2Web bridges egocentric video perception and web execution, enabling agents that can see through first-person video, understand real-world context, and take actions on the web grounded in the egocentric video. This opens a path toward AI assistants that operate seamlessly across physical and digital environments. We hope Ego2Web serves as an important step for building more capable, perception-driven agents. 🧵👇

Antoine Yang @AntoineYang2
Check the Ego2Web benchmark if you're interested in agents interacting with the Web while being visually grounded in egocentric videos 📽️
Shoubin Yu@shoubin621

(same Ego2Web announcement quoted above)
Antoine Yang retweeted
Google Gemini @GeminiApp
We partnered with @ICC to show how Gemini 3 Pro can analyze video content. By uploading a segment of the Cricket World Cup, Gemini can seamlessly process visual and audio data to identify key players, explain techniques, and highlight crucial turning points. 🏏
Antoine Yang retweeted
Fei Xia @xf1280
🚀 Excited to share that #Gemini 3 Flash can do code execution on images to zoom, count, and annotate visual inputs! The model can choose when to write code to:
🔍 Zoom & Inspect: Detect when details are too small and zoom in.
🧮 Compute Visually: Run multi-step calculations using code (e.g., summing line items on a receipt).
✏️ Annotate: Draw arrows or bounding boxes to answer questions or show relationships between objects.
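A rough sketch of trying this through the Gemini API's code execution tool, assuming the google-genai Python SDK; the model id, file name, and prompt are illustrative placeholders rather than anything confirmed in the post:

```python
# Sketch: let the model write and run code over an image (e.g. sum receipt line items).
# Assumes the google-genai Python SDK and GEMINI_API_KEY; model id, file name,
# and prompt are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client()
receipt = client.files.upload(file="receipt.jpg")

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder: substitute whichever Flash model you have access to
    contents=[receipt, "Sum all line items on this receipt and show your working."],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves text, generated code, and code output parts.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```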
Antoine Yang retweeted
Wes Roth @WesRoth
Demis Hassabis says the most ignored marvel is AI’s ability to understand video, images, and audio together. Gemini can watch a movie scene and explain the symbolism behind a tiny gesture. This shows the model grasps concepts, not just pixels or words. Such deep cross-media reasoning is still under-appreciated outside AI circles.
Antoine Yang retweeted
Google AI Developers @googleaidevs
Gemini 3 Pro is the frontier of multimodal AI, delivering SOTA performance across document, screen, spatial, and video understanding. Read our deep dive on how we’ve pushed our core capabilities to power hero use cases across:
+ Docs: "derender" complex docs into structured code (HTML/LaTeX)
+ Screen: build robust computer agents that automate complex tasks
+ Spatial: generate collision-free trajectories for robotics & XR
+ Video: analyze sports footage using high-FPS processing with "thinking" mode
See how these capabilities are transforming workflows in education, biomedical, and law/finance → goo.gle/3Mt3UlT
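As a rough illustration of the "derender" use case, a sketch that sends a PDF to the API and asks for structured HTML back; it assumes the google-genai Python SDK, and the model id, file name, and prompt are placeholders rather than details from the post:

```python
# Sketch: "derender" a complex document into structured HTML.
# Assumes the google-genai Python SDK and GEMINI_API_KEY; file name, model id,
# and prompt are illustrative placeholders.
from google import genai

client = genai.Client()
doc = client.files.upload(file="quarterly_report.pdf")

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder id for a Gemini 3 Pro model
    contents=[
        doc,
        "Recreate this document as clean, semantic HTML, preserving tables, "
        "headings, and reading order.",
    ],
)
print(response.text)
```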
Antoine Yang retweeted
Chubby♨️ @kimmonismus
Google cooked so hard. Not gonna lie, this feels like the future is here. Now develop Google Glasses with enough battery power, a good chip, and a look like Ray-Bans, and you'll have an instant hit. 100%.
Antoine Yang retweeted
AshutoshShrivastava @ai_for_success
I will go first: Gemini
- Able to upload videos and work with them (very important for me)
- Gemini Live is awesome
- Access to NotebookLM
- Nano Banana Pro
- Love Gemini Deep Research
- 2TB Storage: not a priority, but it's a good addition
Antoine Yang retweeted
Google @Google
Introducing Nano Banana Pro (Gemini 3 Pro Image), our new state-of-the-art image generation and editing model from @GoogleDeepMind. It improves on the original model while adding new advanced capabilities, enhanced world knowledge and text rendering, allowing you to create and edit studio-quality, production-ready visuals.
Antoine Yang retweeted
koray kavukcuoglu @koraykv
2/4 It’s our most intelligent model yet:
🏆 Tops LMArena Leaderboard at 1501 Elo and WebDev with 1487 Elo
📝 New SOTA in reasoning on Humanity’s Last Exam (37.5% w/o tools) and GPQA Diamond (91.9%)
🧠 Scores an industry-leading 31.1% on ARC-AGI-2
🎨 Breakthrough scores in multimodality, with 81% on MMMU-Pro and 87.6% on Video MMMU
Antoine Yang retweeted
Google DeepMind @GoogleDeepMind
This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵
Antoine Yang retweeted
Ani Baddepudi @AniBaddepudi
Gemini's still the only frontier model that supports native video input (and is amazing at it!). Incredible amount of real-world utility, given how much of the world's information is increasingly in video.
Antoine Yang retweeted
Andi Marafioti @andimarafioti
The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos. Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking points—now the community can start fixing them.📈
Antoine Yang @AntoineYang2
RT @Google: The plot thickens. 🕵️ Use Gemini 2.5 Pro to turn random videos from your camera roll into a dramatic narrative. Try it yoursel…
Antoine Yang retweeted
Logan Kilpatrick @OfficialLoganK
We just shipped video FPS support in the Gemini API, so you can dynamically customize how many frames per second you want the model to see, unlocking lots of interesting new video use cases! 📹
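A minimal sketch of what using this might look like with the google-genai Python SDK, assuming a video_metadata/fps field on the video part; the fps value, file name, model id, and prompt are illustrative, not taken from the post:

```python
# Sketch: control how many frames per second the model samples from a video.
# Assumes the google-genai Python SDK and GEMINI_API_KEY; the fps value,
# file name, model id, and prompt are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client()
video = client.files.upload(file="tennis_rally.mp4")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(parts=[
        types.Part(
            file_data=types.FileData(file_uri=video.uri, mime_type=video.mime_type),
            # Higher fps for fast motion, lower fps (e.g. 0.5) for long, static footage.
            video_metadata=types.VideoMetadata(fps=5),
        ),
        types.Part(text="Describe the fastest exchange in this clip, with timestamps."),
    ]),
)
print(response.text)
```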
Antoine Yang @AntoineYang2
@ai_andrew This seems to be a parsing or instruction-following issue? We do not often look at test set results in long-context settings. For quite a few of these benchmarks, these are the best numbers we've measured, and this will remain stable now as it is GA 🙂
⟁ndrew V @AI_Andrew
Seriously, even better? Because there was a checkpoint earlier that seemed to be pretty spectacular and then it degraded, so is it back to as good as it used to be, or is it even better than ever? Not trying to be a troll or anything like that, just curious, because when the model changes it changes the JSON structured output and I have to figure out a new schema, or some other adjustment to accommodate. It was fantastic when we could have a context of 2 million for the longer videos or a collection of many short videos. I understand that time is in the past, but yeah, is it better than ever or as good as it was before?
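On the JSON point: one way to make output parsing less sensitive to model changes is to pin a response schema rather than relying on prompt-only instructions. A minimal sketch, assuming the google-genai Python SDK and a Pydantic model; the schema and field names below are made up for illustration, not taken from this conversation:

```python
# Sketch: constrain output to a fixed JSON schema so parsing survives model updates.
# Assumes the google-genai Python SDK and GEMINI_API_KEY; the schema is an
# illustrative placeholder, not a schema from this thread.
from pydantic import BaseModel

from google import genai
from google.genai import types

class Scene(BaseModel):
    start_time: str
    end_time: str
    description: str

class VideoAnalysis(BaseModel):
    title: str
    scenes: list[Scene]

client = genai.Client()
video = client.files.upload(file="holiday_clip.mp4")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Break this video into scenes."],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=VideoAnalysis,
    ),
)
analysis = response.parsed  # a VideoAnalysis instance
print(analysis.title, len(analysis.scenes))
```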
Antoine Yang @AntoineYang2
The newly generally available Gemini 2.5 Flash and Pro are even better at video understanding than the versions we shared in the blog a month ago; see more details in the tech report 😀
Google DeepMind@GoogleDeepMind

Hot Gemini updates off the press. 🚀 Anyone can now use 2.5 Flash and Pro to build and scale production-ready AI applications. 🙌 We’re also launching 2.5 Flash-Lite in preview: the fastest model in the 2.5 family to respond to requests, with the lowest cost too. 🧵
