Rohan Doshi

86 posts

Rohan Doshi

Rohan Doshi

@RohanLikesAI

gemini multimodal, product @ deepmind. view are my own

Sumali Nisan 2020
203 Sinusundan1.4K Mga Tagasunod
Naka-pin na Tweet
Rohan Doshi
Rohan Doshi@RohanLikesAI·
🚀 We just launched Gemini 3 Pro — the strongest multimodal understanding model ever built. I lead product for Gemini’s multimodal vision capabilities, and I want to share more about the massive wins we are seeing across document, screen, spatial, and video understanding. 🧵
English
8
6
46
9.4K
Rohan Doshi nag-retweet
swyx 🇬🇧 @aidotengineer
filesystem + code sandbox combo eats another modality. remember when o3 destroyed at geoguessr? gemini agentic vision will find location on any street photo you take faster than Liam Neeson can get back his daughter
swyx 🇬🇧 @aidotengineer tweet media
Google Gemini@GeminiApp

Agentic Vision is rolling out now in the Gemini app when you select “Thinking” from the model drop-down. Learn more about Agentic Vision in Gemini 3 Flash: goo.gle/45zo5FH

English
21
12
156
23.8K
Rohan Doshi
Rohan Doshi@RohanLikesAI·
Back at Harvard Business School last week speaking on frontier AI + agents 🤖 As a ’23 alum, it was energizing to be back - this time teaching from the other side of the classroom My AI agent workshop was completely packed, with 100+ students - signal on how much Harvard is leaning into AI The students’ agency, raw IQ, and curiosity left me wildly optimistic about the next wave of AI builders 🚀 Grateful to Profs. Jeffrey Bussgang & Allison Mnookin and the Launching Tech Ventures team for the invite 🙏🏼
Rohan Doshi tweet mediaRohan Doshi tweet mediaRohan Doshi tweet media
English
0
2
14
1.2K
Rohan Doshi
Rohan Doshi@RohanLikesAI·
@RejaullahmdMd Select Gemini 3 Flash as the model, turn on the code execution model, and upload an image!
English
1
0
2
66
Rohan Doshi
Rohan Doshi@RohanLikesAI·
🚀 Excited to officially launch 👁Agentic Vision via Gemini 3 Flash. Gemini can run code execution on image uploads to zoom, analyze, and annotate: 🔍 Zoom: 5-10% quality win across vision benchmarks 🧮 Analyze: do image math with code (e.g. calculate the tip for a receipt) ✏️ Annotate: Draw arrows or bounding boxes to answer questions Try via the Gemini API (AI Studio / Vertex) or via the Gemini App (rolling out to Thinking mode today). Learn more→ goo.gle/4bsKdFv Demo: goo.gle/3Z05KxK cc: @IoanaBica95 @anastasija56572 @jalayrac @bcaine @eisenjulian @weichengkuo @phillip_lippe @xf1280 @tulseedoshi @BiboXu @OfficialLoganK
Google AI Developers@googleaidevs

Try 👁 Agentic Vision with Gemini 3 Flash in @GoogleAIStudio or Vertex AI. This new capability enables the model to effectively use code and reasoning to improve performance for common vision tasks. See Agentic Vision in action: goo.gle/3Z05KxK

English
8
27
237
29.3K
Jake Baumann
Jake Baumann@Jake_Joseph·
@RohanLikesAI Thinking about using it as a fidelity/quality layer check for @AdMachineAI on our outputs to check vs. input product. If it doesn’t meet a threshold it automatically regenerates.
English
1
0
1
103
Shrestha Basu Mallick
Shrestha Basu Mallick@shresbm·
Agentic Vision in Gemini 3 Flash converts image understanding into an active, agentic investigation. By combining visual reasoning with code execution, the model formulates plans to zoom in, inspect and manipulate images step-by-step, grounding answers in visual evidence. Below see an example of image
English
1
1
10
305
Rohan Doshi
Rohan Doshi@RohanLikesAI·
@marcosmarf27 @OfficialLoganK can you help me better understand your entire pipeline better? what are "vision resources". Are you using another upstream system to do OCR? And are you feeding the OCR text and the PDF images into Gemini from there?
English
0
0
1
22
Rohan Doshi
Rohan Doshi@RohanLikesAI·
@deedydas @deedydas 👋 glad you’re a fan of the launch (I’m the Gemini multimodal vision PM) - feel free to DM if you have any feedback for the team on doc understanding
English
0
0
10
1K
Rohan Doshi nag-retweet
Deedy
Deedy@deedydas·
Gemini 3 Flash is insane at OCR. It parses this extremely hard to read handwritten letter by Richard Feynman perfectly. It can do ~300 of these for $1. What's crazy is Feynman addresses General Donald J. Kutyna as "Katyna" which Gemini gets. There is no "Meeting Katyna", the first part of the letter, in all of Google search!
Deedy tweet media
English
64
159
1.7K
177.1K
Matías
Matías@matidotlol·
@googleaidevs @OfficialLoganK where is the best place to report problems with Gemini API? for whatever reason, when using structured outputs, Gemini 3 Flash just ignores the vast majority of the PDF input. it works fine without structured outputs
English
1
0
1
166
Rohan Doshi
Rohan Doshi@RohanLikesAI·
@marcosmarf27 @OfficialLoganK Hey Marcos! I’m the Gemini Vision PM working on doc understanding. Would love to learn more and see if we can help. Will DM you.
English
3
0
7
941
Rohan Doshi
Rohan Doshi@RohanLikesAI·
⚡️ Gemini 3 Flash just got a major new capability: code execution for images. Gemini can decide when to write code to zoom, count, and annotate—unlocking the next phase of Agentic Vision. 🤖 Wildly fun to PM this 0→1. Stay tuned to hear more 👀
Fei Xia@xf1280

🚀Excited to share that #Gemini 3 Flash can do code execution on images to zoom, count, and annotate visual inputs! The model can choose when to write code to: 🔍 Zoom & Inspect: Detect when details are too small and zoom-in. 🧮 Compute Visually: Run multi-step calculations using code (e.g., summing line items on a receipt). ✏️ Annotate: Draw arrows or bounding boxes to answer questions or show relationships between objects.

English
5
10
99
8.8K
Google Gemini
Google Gemini@GeminiApp·
With Gemini 3 Flash, you can upload a short video to get a quick, easy-to-read analysis. Try Gemini 3 Flash in the app today.
English
60
111
867
3.6M