Google released Gemini Embedding 2 in Public Preview on March 10, 2026, as its first fully multimodal embedding model built on the Gemini architecture. It expands on the text-only foundation of its predecessor, gemini-embedding-001, by mapping text, images, video, audio, and documents into a single unified embedding space, with semantic understanding across more than 100 languages.
Unifying modalities in a single vector space is the core architectural advance. Unlike systems that maintain separate embedding models per modality and then attempt cross-modal alignment after the fact, Gemini Embedding 2 processes all five modalities natively, producing vectors that are directly comparable regardless of source medium. The model also understands interleaved input: a single request can pass multiple modalities at once, such as an image and its accompanying text caption, and the resulting embedding captures the relationships between them.
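Assuming the model is exposed through the same google-genai Python SDK used for gemini-embedding-001, an interleaved request might look like the sketch below. The model ID gemini-embedding-2 and the Part-based multimodal request shape are assumptions for illustration, not confirmed API.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("product-photo.jpg", "rb") as f:
    image_bytes = f.read()

# "gemini-embedding-2" is an assumed model ID, and passing interleaved
# Parts to embed_content mirrors generate_content by assumption; the
# actual request shape may differ.
result = client.models.embed_content(
    model="gemini-embedding-2",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text="Red trail-running shoe with sizing tag"),
    ],
)

# Assumed behavior: one embedding covering the combined image + caption.
vector = result.embeddings[0].values
print(len(vector))  # 3072 at the default dimensionality
```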
Modality-specific input constraints apply: text up to 8,192 tokens (four times the limit of gemini-embedding-001), up to six images per request in PNG or JPEG format, video up to 120 seconds in MP4 or MOV, audio processed natively without intermediate transcription, and PDFs up to six pages. The larger text context window is particularly relevant for long-document retrieval pipelines.
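Because all of these limits are fixed numbers, they can be checked client-side before a request is ever sent. The helper below is a hypothetical guard built only from the constraints listed above; it is not part of any SDK.

```python
from dataclasses import dataclass

# Limits taken from the announced constraints; the helper itself is
# a hypothetical client-side check, not part of the Gemini SDK.
MAX_TEXT_TOKENS = 8_192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
IMAGE_TYPES = {"image/png", "image/jpeg"}

@dataclass
class RequestStats:
    text_tokens: int = 0
    image_mime_types: tuple[str, ...] = ()
    video_seconds: float = 0.0
    pdf_pages: int = 0

def validate(stats: RequestStats) -> list[str]:
    """Return a list of constraint violations for one embedding request."""
    errors = []
    if stats.text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if len(stats.image_mime_types) > MAX_IMAGES:
        errors.append(f"more than {MAX_IMAGES} images")
    if any(m not in IMAGE_TYPES for m in stats.image_mime_types):
        errors.append("images must be PNG or JPEG")
    if stats.video_seconds > MAX_VIDEO_SECONDS:
        errors.append(f"video longer than {MAX_VIDEO_SECONDS} seconds")
    if stats.pdf_pages > MAX_PDF_PAGES:
        errors.append(f"PDF longer than {MAX_PDF_PAGES} pages")
    return errors

print(validate(RequestStats(text_tokens=9_000, pdf_pages=7)))
# ['text exceeds 8192 tokens', 'PDF longer than 6 pages']
```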
Like its predecessor, Gemini Embedding 2 uses Matryoshka Representation Learning (MRL), which allows the output dimensionality to be scaled down from the default of 3,072. Google recommends 3,072, 1,536, or 768 for the highest quality, preserving the same flexibility to balance retrieval accuracy against vector storage costs. The model is available through the Gemini API, Vertex AI, and ecosystem integrations including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search.
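Because MRL front-loads information into the leading dimensions, a stored 3,072-dimensional vector can be truncated and renormalized locally instead of being re-embedded at a smaller size. A minimal sketch, using placeholder random vectors in place of real API responses:

```python
import numpy as np

def truncate_mrl(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of an MRL embedding and
    renormalize to unit length so dot products remain cosine similarity."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Placeholder stand-ins for two full-size embeddings; real vectors
# would come from the embed_content response.
rng = np.random.default_rng(0)
a, b = rng.normal(size=3072), rng.normal(size=3072)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

for dims in (3072, 1536, 768):  # the recommended sizes
    sim = float(truncate_mrl(a, dims) @ truncate_mrl(b, dims))
    print(dims, round(sim, 4))
```

For gemini-embedding-001 the API also accepts an output_dimensionality setting in EmbedContentConfig, with Google's guidance being to renormalize truncated vectors before computing similarity; presumably the same pattern carries over here, but local truncation has the advantage of not requiring a second round trip for an already-embedded corpus.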