
Text Multilingual Embedding 002

google/text-multilingual-embedding-002

Text Multilingual Embedding 002 is an 18-language text embedding model achieving a 56.2% average score on the MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) benchmark, designed for cross-lingual semantic search and retrieval across diverse language corpora.

index.ts
import { embed } from 'ai';

const result = await embed({
  model: 'google/text-multilingual-embedding-002',
  value: 'Sunny day at the beach',
});

What To Consider When Choosing a Provider

  • Zero Data Retention

    AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.

    Authentication

    AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

For multilingual retrieval applications, this model maps text from all supported languages into the same vector space. That enables cross-lingual queries: for example, a user querying in Japanese can retrieve documents written in Spanish without a query translation layer.
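A minimal sketch of how that retrieval step works: rank documents against a query by cosine similarity in the shared vector space. The short vectors below are hypothetical placeholders for illustration; real embeddings would come from the `embed` call shown above (or `embedMany` for batches).

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A Japanese query and documents in Spanish and German, all embedded
// into the same space (toy 4-dimensional vectors, not real output).
const queryEmbedding = [0.9, 0.1, 0.0, 0.1]; // 「ビーチで晴れた日」
const docs = [
  { text: 'Un día soleado en la playa', embedding: [0.85, 0.15, 0.05, 0.1] },
  { text: 'Quartalsbericht der Bank', embedding: [0.05, 0.9, 0.3, 0.0] },
];

// Score and sort: the semantically matching Spanish document ranks first,
// even though the query is Japanese.
const ranked = docs
  .map((d) => ({ ...d, score: cosineSimilarity(queryEmbedding, d.embedding) }))
  .sort((a, b) => b.score - a.score);
```

In production the same pattern typically runs against a vector database rather than an in-memory array, but the ranking logic is identical.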

When to Use Text Multilingual Embedding 002

Best For

  • Multilingual semantic search:

    Applications serving users who query in different languages than the indexed content

  • Cross-lingual document retrieval:

    Knowledge base search across international content corpora

  • Global customer support:

    Systems where user questions and knowledge base articles span multiple languages

  • Multilingual clustering and classification:

    Tasks that need consistent semantic representations across languages

  • International content platforms:

    E-commerce or media indexing product descriptions or articles in multiple languages

Consider Alternatives When

  • English-only corpus:

    Your corpus and users are exclusively English-language (consider google/text-embedding-005 for higher MTEB scores)

  • Unsupported language needed:

    You require a language not covered by the 18-language MIRACL benchmark; verify support in the Vertex AI documentation

  • Peak English retrieval quality:

    Multilingual support is not required and maximum English performance is the primary criterion

Conclusion

Text-multilingual-embedding-002 solves the core infrastructure challenge of multilingual retrieval: maintaining a single vector index that serves queries and documents across 18 languages without translation layers or per-language model management. For global applications where your user base and content corpus span multiple languages, it provides the embedding foundation that makes cross-lingual semantic search tractable.

FAQ

Which languages does Text Multilingual Embedding 002 support?

The model is evaluated on MIRACL, which covers 18 languages. Text Multilingual Embedding 002 scores 56.2% on average on this benchmark. Consult the Vertex AI documentation for the complete list of supported languages.

What is MIRACL, and how does it differ from MTEB?

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval benchmark covering 18 languages, used to evaluate cross-lingual information retrieval quality. MTEB is an English-language benchmark covering eight task categories. The two models in this family are each evaluated on the benchmark most relevant to their design target.

Can a query in one language retrieve documents written in another?

Yes. This is the key capability of a shared multilingual embedding space. Text from all supported languages is mapped into the same vector space, so a query in Japanese and a matching document in Arabic will have similar vector representations, enabling cross-lingual retrieval without query translation.

Does the model support multiple output dimension sizes?

Yes. Like text-embedding-005, it uses Matryoshka Representation Learning to support multiple output dimension sizes. Smaller dimensions reduce vector storage and compute costs with a minor quality tradeoff.
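The practical upshot of Matryoshka Representation Learning is that a full-size embedding can be shortened client-side: keep the leading dimensions and re-normalize to unit length. A sketch (the 8-dimensional vector is a toy stand-in for a full-size embedding):

```typescript
// Matryoshka-style truncation: keep the first k dimensions of an
// embedding, then re-normalize so cosine similarity stays meaningful.
function truncateEmbedding(v: number[], k: number): number[] {
  const head = v.slice(0, k);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  return head.map((x) => x / norm);
}

// Toy "full" embedding; a real one would have hundreds of dimensions.
const full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01];
const small = truncateEmbedding(full, 4); // 4-dim unit vector
```

Storing the truncated vectors halves (or better) index size; whether the quality tradeoff is acceptable is something to validate on your own retrieval evaluation set.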

When should I use this model instead of text-embedding-005?

Use text-multilingual-embedding-002 whenever your application must handle content or queries in multiple languages. Use text-embedding-005 for strictly English-language applications where maximum MTEB benchmark performance is the priority.

What does Text Multilingual Embedding 002 cost?

Check the pricing panel on this page for today's numbers. AI Gateway tracks rates across every provider that serves Text Multilingual Embedding 002.

Can I train a classifier on one language and apply it to others?

Yes. The shared vector space means that classifiers trained on labeled data in one language can classify documents in other supported languages, which is useful for content moderation, sentiment analysis, and topic categorization across multilingual corpora.
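One way this plays out in practice is nearest-centroid classification: compute per-label centroids from embeddings of labeled English examples, then assign documents in other languages to the closest centroid. The labels and vectors below are hypothetical toy values, not real model output:

```typescript
// Dot product of two equal-length vectors (equals cosine similarity
// when both inputs are unit-normalized).
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Per-label centroids, e.g. averaged from unit-normalized embeddings
// of labeled English training documents (toy 3-dim values).
const centroids: Record<string, number[]> = {
  sports: [0.9, 0.1, 0.0],
  finance: [0.1, 0.9, 0.1],
};

// Classify a (unit-normalized) embedding of, say, a French document
// by the label of its closest centroid.
function classify(embedding: number[]): string {
  let best = '';
  let bestScore = -Infinity;
  for (const [label, centroid] of Object.entries(centroids)) {
    const score = dot(embedding, centroid);
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}
```

Because the embedding space is shared, no per-language retraining is needed; the same centroids serve every supported input language.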

Do I need to detect the input language before embedding?

No. The model handles all 18 supported languages from a single endpoint. Language detection and routing are not required: submit text in any supported language and the model produces an embedding in the shared multilingual vector space.