
I'm reading through Google's Gemini Embedding 2 release and here's what I think it actually opens up for anyone building AI systems.
Embedding models are what let AI systems search through data by meaning instead of keywords.
They turn your text/images/videos into numbers (vectors) that AI can compare.
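To make that concrete, here's a minimal sketch of what "comparing by meaning" usually boils down to: cosine similarity between vectors. The three-dimensional vectors below are made up for illustration (real embedding models output hundreds or thousands of dimensions), but the math is the same.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: closer to 1.0 means more similar meaning."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
dog_text  = [0.90, 0.10, 0.20]  # stand-in vector for the word "labradoodle"
dog_photo = [0.85, 0.15, 0.25]  # stand-in vector for a photo of a dog
invoice   = [0.10, 0.90, 0.30]  # stand-in vector for an unrelated invoice

print(cosine_similarity(dog_text, dog_photo))  # high: related meaning
print(cosine_similarity(dog_text, invoice))    # much lower: unrelated
```

No keyword overlaps here at all; the search works purely on how close the vectors are.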
Before this, if you were building an AI system (like a chatbot that answers questions using your company's documents, videos, and images), you needed a separate embedding model for each type of data.
You'd embed all your text with one model. Images with a different one. Videos with another. Audio with yet another.
Four different embedding spaces. Four different systems to maintain.
Gemini Embedding 2 collapses all of that into one model.
Which (if I'm understanding this correctly) means if you're building an AI assistant and someone asks about "labradoodle," the system can now pull from:
- Text documents mentioning labradoodles
- Photos of labradoodles
- Videos of them playing
- Audio of them barking
All from one unified embedding space.
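A sketch of what that unified index could look like in practice, assuming a single multimodal model has already embedded every asset into the same space (the vectors and file names below are invented for illustration):

```python
import numpy as np

# One index for everything: (modality, asset name, embedding vector).
# In a real system each vector would come from one embedding API call,
# regardless of whether the input was text, an image, video, or audio.
index = [
    ("text",  "labradoodle care guide", np.array([0.90, 0.10, 0.20])),
    ("image", "labradoodle.jpg",        np.array([0.85, 0.20, 0.20])),
    ("video", "dog_park.mp4",           np.array([0.80, 0.15, 0.30])),
    ("audio", "barking.wav",            np.array([0.75, 0.10, 0.35])),
    ("text",  "Q3 invoice summary",     np.array([0.10, 0.90, 0.30])),
]

def search(query_vec, k=3):
    """Rank every asset, whatever its modality, by cosine similarity."""
    def sim(v):
        return float(query_vec @ v /
                     (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return sorted(index, key=lambda item: sim(item[2]), reverse=True)[:k]

# Toy query vector standing in for the embedded query "labradoodle".
query = np.array([0.90, 0.10, 0.20])
for modality, name, _ in search(query):
    print(modality, name)  # text, image, and video results surface together
```

The point is that there's one `search` function and one index; nothing in the retrieval code knows or cares which modality an asset came from.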
My dog is sitting behind me as I'm writing this (hence the labradoodle reference). When I think about him, I don't separate "visual memory" from "audio memory" from "text description." I just think about him, all of it at once.
Gemini Embedding 2 treats text, images, video, and audio as different expressions of the same underlying meaning.
Which is how humans have always thought.
We just accepted for years that AI systems couldn't work that way. That you had to build separate infrastructure for each modality. Separate pipelines. Separate teams.
I'm reading this thinking, we don't have to do that anymore? 😅
I don't know where this goes, but it feels like we just removed a pretty fundamental limitation in how we build AI systems.
Google AI Developers@googleaidevs
Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI.