Text-multilingual-embedding-002 is Google's embedding model purpose-built for multilingual natural language processing (NLP) applications. Released alongside text-embedding-004 at Google Cloud Next '24, it uses the same Gecko architecture but targets broad cross-lingual coverage rather than maximum English-language benchmark performance. Its primary evaluation benchmark is MIRACL (Multilingual Information Retrieval Across a Continuum of Languages), which spans 18 languages; the model achieves an average score of 56.2%.
The practical value lies in vector space alignment across languages. Rather than running a separate monolingual model for each language in your corpus, text-multilingual-embedding-002 embeds content from every supported language into a single shared semantic space. A query submitted in one language can surface relevant documents written in any other supported language, with no translation step. For global products, international content platforms, or multilingual knowledge bases, this shared embedding space removes the need for language detection and per-language routing.
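As a minimal sketch of cross-lingual retrieval, the following uses the Vertex AI Python SDK to embed documents written in three languages and a query written in a fourth, then ranks the documents by cosine similarity in the shared space. The project ID, region, and sample texts are placeholders, not values from the source.

```python
# pip install google-cloud-aiplatform numpy
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Placeholder project and region; substitute your own.
vertexai.init(project="your-project-id", location="us-central1")
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

# Documents in three different languages, indexed with no translation step.
docs = [
    "The invoice is due within thirty days of receipt.",     # English
    "Die Rechnung ist innerhalb von dreißig Tagen fällig.",  # German
    "La factura vence a los treinta días de su recepción.",  # Spanish
]
doc_vecs = np.array([
    e.values
    for e in model.get_embeddings(
        [TextEmbeddingInput(d, task_type="RETRIEVAL_DOCUMENT") for d in docs]
    )
])

# A French query can still surface the English, German, or Spanish documents.
query = "Quand la facture doit-elle être payée ?"
q_vec = np.array(
    model.get_embeddings(
        [TextEmbeddingInput(query, task_type="RETRIEVAL_QUERY")]
    )[0].values
)

# Cosine similarity against every document, best match first.
scores = doc_vecs @ q_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
)
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {docs[i]}")
```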
Like its English-focused sibling, text-multilingual-embedding-002 supports dynamic embedding sizes through Matryoshka Representation Learning (MRL). You can request a smaller output dimensionality to reduce vector storage and compute costs, at a minor cost in quality. This flexibility matters for multilingual applications, where the corpus can be significantly larger than a monolingual equivalent.
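In the Vertex AI Python SDK this is exposed through the `output_dimensionality` parameter of `get_embeddings`. The sketch below requests 256 of the default 768 dimensions; the input text and project details are placeholders, and the renormalization step reflects the general caveat that truncated MRL vectors are not guaranteed to be unit-length.

```python
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

# Request 256 dimensions instead of the default 768,
# cutting vector storage by roughly two thirds.
[embedding] = model.get_embeddings(
    ["グローバル検索のための多言語埋め込み"],  # Japanese input text
    output_dimensionality=256,
)

vec = np.array(embedding.values)
# Truncated MRL vectors may not be unit-length; renormalize
# before using cosine or dot-product similarity.
vec = vec / np.linalg.norm(vec)
print(vec.shape)  # (256,)
```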