Text Multilingual Embedding 002
Text Multilingual Embedding 002 is an 18-language text embedding model achieving a 56.2% average score on the Massive Information Retrieval Across Languages (MIRACL) benchmark, designed for cross-lingual semantic search and retrieval across diverse language corpora.
import { embed } from 'ai';

const result = await embed({
  model: 'google/text-multilingual-embedding-002',
  value: 'Sunny day at the beach',
});

About Text Multilingual Embedding 002
Text-multilingual-embedding-002 is Google's embedding model purpose-built for multilingual natural language processing (NLP) applications. Released alongside text-embedding-005 at Google Cloud Next '24, it uses the same Gecko architecture but targets cross-lingual coverage rather than maximum English-language benchmark performance. Its primary evaluation benchmark is MIRACL, which covers 18 languages; the model achieves a 56.2% average score on it.
The practical value lies in vector space alignment across languages. Rather than running separate monolingual models for each language in your corpus, text-multilingual-embedding-002 embeds content from all 18 supported languages into a shared semantic space. A query submitted in one language can surface relevant documents written in any other supported language, without a translation step. For global products, international content platforms, or multilingual knowledge bases, this shared embedding space eliminates the complexity of language detection and routing.
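To make the shared-space idea concrete, here is a minimal sketch of cross-lingual ranking by cosine similarity. It assumes the query and document vectors were already produced by the same model (for instance with the embed call shown earlier). The `EmbeddedDoc` type, the `rankDocuments` helper, and the three-dimensional toy vectors are hypothetical and are not real model output.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical record shape: the language tag is carried for display only.
// Ranking never branches on it, because all vectors share one space.
interface EmbeddedDoc {
  id: string;
  lang: string;
  vector: number[];
}

// Score every document against the query and sort best-first.
function rankDocuments(
  queryVector: number[],
  docs: EmbeddedDoc[],
): Array<{ id: string; lang: string; score: number }> {
  return docs
    .map((d) => ({
      id: d.id,
      lang: d.lang,
      score: cosineSimilarity(queryVector, d.vector),
    }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors for illustration only (NOT real embeddings).
const toyDocs: EmbeddedDoc[] = [
  { id: 'invoice-de', lang: 'de', vector: [0.0, 0.2, 0.9] },
  { id: 'beach-es', lang: 'es', vector: [0.9, 0.1, 0.0] },
];
const toyQuery = [1.0, 0.0, 0.1]; // stands in for an English query vector

const ranked = rankDocuments(toyQuery, toyDocs);
console.log(ranked.map((r) => `${r.id} (${r.lang})`));
```

Note that the ranking code contains no language detection or per-language routing; with a shared embedding space, the language tag is metadata rather than a control-flow input.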
Like its English-only sibling, text-multilingual-embedding-002 supports dynamic embedding sizes through Matryoshka Representation Learning (MRL). You can choose smaller dimension outputs to reduce vector storage and compute costs, with a minor quality tradeoff. This flexibility matters for multilingual applications where the corpus may be significantly larger than a monolingual equivalent.
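The storage tradeoff can be sketched as follows. With MRL-trained embeddings, a common client-side approach is to keep the leading dimensions of a vector and re-normalize the result to unit length. The helpers `truncateEmbedding` and `storageBytes` are hypothetical, and the sizes used (768 full dimensions, 256 reduced, 4-byte floats) are assumptions for illustration; check the model documentation for the actual supported dimensions and for any server-side option that returns smaller vectors directly.

```typescript
// Keep the first `dimensions` components and rescale to unit length.
// Assumes the embedding was trained with MRL, so a leading prefix is
// still a meaningful (if slightly less precise) representation.
function truncateEmbedding(vector: number[], dimensions: number): number[] {
  if (
    !Number.isInteger(dimensions) ||
    dimensions <= 0 ||
    dimensions > vector.length
  ) {
    throw new RangeError(
      `dimensions must be an integer between 1 and ${vector.length}`,
    );
  }
  const prefix = vector.slice(0, dimensions);
  const norm = Math.sqrt(prefix.reduce((sum, x) => sum + x * x, 0));
  // An all-zero prefix cannot be normalized; return it unchanged.
  return norm === 0 ? prefix : prefix.map((x) => x / norm);
}

// Rough storage estimate, assuming 4-byte (float32) components
// and ignoring index overhead.
function storageBytes(vectorCount: number, dimensions: number): number {
  return vectorCount * dimensions * 4;
}

// Assumed sizes for illustration: one million vectors, 768 vs 256 dims.
const fullSizeBytes = storageBytes(1_000_000, 768);
const reducedSizeBytes = storageBytes(1_000_000, 256);
console.log(`full: ${fullSizeBytes} bytes, reduced: ${reducedSizeBytes} bytes`);
```

Query and document vectors should be truncated to the same size before comparison. The quality cost of a given size is best measured on your own retrieval data rather than assumed.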