Cohere Rerank 4 Fast
Cohere Rerank 4 Fast is a multilingual reranking model from Cohere tuned for low-latency, high-throughput retrieval over English and non-English documents and semi-structured JSON.
import { rerank } from 'ai';

const result = await rerank({
  model: 'cohere/rerank-v4-fast',
  query: 'What is the capital of France?',
  documents: [
    'Paris is the capital of France.',
    'Berlin is the capital of Germany.',
    'Madrid is the capital of Spain.',
  ],
});
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
About Cohere Rerank 4 Fast
Cohere Rerank 4 Fast is the fast tier of Cohere's Rerank 4 generation, released December 11, 2025 alongside rerank-v4-pro. It shares the v4 family's multilingual coverage and JSON support but is tuned for lower per-query latency and higher throughput.
Like its siblings, Cohere Rerank 4 Fast is a cross-encoder. It reads the query and each candidate document together through attention, scoring relevance directly rather than relying on independent vector similarity. That structure picks up signal on complex, multi-part, or ambiguous queries that bi-encoder embeddings flatten.
Cohere Rerank 4 Fast covers more than 100 languages, matching the multilingual coverage of Cohere's embed-multilingual family. Cross-lingual matching works within one model: a French query can match Japanese documents and vice versa. It handles long-form text, tables, code, and semi-structured JSON records, each within the same per-document context window of 32K tokens.
The fast variant fits at the front of high-traffic search systems and agentic retrieval steps where the reranker runs on every user turn. It pairs naturally with a multilingual embedding retriever for first-pass candidate selection and trades some quality versus rerank-v4-pro for response time under load.
See https://cohere.com/blog/rerank-4 for the request format, including top_n, return_documents, and per-document tokenization rules. Reranking is billed per search query rather than per token.
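When documents are not echoed back in the response, each result usually needs to be joined back to the input array. The sketch below assumes a result shape of `{ index, relevance_score }`; those field names and the `attachDocuments` helper are illustrative assumptions, not taken from this page, so check them against the linked reference before relying on them.

```typescript
// Assumed result shape: a position in the submitted documents array plus a score.
interface RerankResult {
  index: number;
  relevance_score: number;
}

interface RankedDocument {
  document: string;
  score: number;
  originalIndex: number;
}

// Join index-only results back to the submitted documents and keep at most
// topN entries, ordered from highest to lowest score.
function attachDocuments(
  documents: string[],
  results: RerankResult[],
  topN: number = results.length,
): RankedDocument[] {
  return results
    // Drop any result whose index does not point into the input array.
    .filter((r) => r.index >= 0 && r.index < documents.length)
    .sort((a, b) => b.relevance_score - a.relevance_score)
    .slice(0, topN)
    .map((r) => ({
      document: documents[r.index],
      score: r.relevance_score,
      originalIndex: r.index,
    }));
}
```

Keeping `originalIndex` on each entry makes it possible to look up metadata (source URL, chunk id) stored alongside the original candidates.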
What To Consider When Choosing a Provider
- Configuration: Rerankers run after an initial retrieval step. Pair Cohere Rerank 4 Fast with an embedding model, keyword search, or hybrid retriever so it has a candidate pool to score. The per-document context of 32K tokens covers query and document tokens together, which is wide enough to score long passages and chunked enterprise documents without truncation in most cases.
- Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
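Because the configuration note above says the 32K window is shared by the query and the document, a rough pre-flight check can flag candidates that are likely to be truncated. This is a minimal sketch: the 32,000-token figure and the four-characters-per-token ratio are assumptions for illustration, and the real tokenizer may count differently, especially for non-English text and code.

```typescript
// Assumptions: a 32,000-token shared budget and roughly 4 characters per token.
const CONTEXT_TOKENS = 32000;
const CHARS_PER_TOKEN = 4;

// Crude length-based token estimate; not the model's actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// True when the query and one document are estimated to fit in the shared window.
function fitsSharedContext(
  query: string,
  document: string,
  budget: number = CONTEXT_TOKENS,
): boolean {
  return estimateTokens(query) + estimateTokens(document) <= budget;
}

// Split candidates into those that likely fit and those that may be truncated.
function partitionByFit(
  query: string,
  documents: string[],
): { fits: string[]; oversized: string[] } {
  const fits: string[] = [];
  const oversized: string[] = [];
  for (const doc of documents) {
    (fitsSharedContext(query, doc) ? fits : oversized).push(doc);
  }
  return { fits, oversized };
}
```

Documents that land in `oversized` are candidates for chunking before the rerank call rather than after it.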
When to Use Cohere Rerank 4 Fast
Best For
- Multilingual reranking: One model covers 100+ languages for cross-lingual retrieval
- High-throughput search: Lower latency keeps reranker overhead small under load
- Agentic retrieval loops: Tool calls that retrieve then rerank on every turn stay within tight time budgets
- Mixed-format corpora: Long text, JSON, tables, and code share one reranking stage
- Cost-sensitive RAG: A lower per-query price than the Pro variant at scale
Consider Alternatives When
- Maximum quality: rerank-v4-pro targets state-of-the-art relevance on complex queries
- English-only corpora: rerank-v3.5 covers English RAG at a lower price point
- No second stage needed: First-pass retrieval alone meets the accuracy bar
- Image or multimodal retrieval: Use a multimodal retrieval model instead
Conclusion
Cohere Rerank 4 Fast is the right reranker when multilingual coverage and latency both matter, and a small quality gap versus the Pro variant is acceptable. Route it through AI Gateway with model id cohere/rerank-v4-fast for unified billing across the retrieval and generation stages of the same pipeline.
Frequently Asked Questions
How does Cohere Rerank 4 Fast differ from rerank-v4-pro?
Cohere Rerank 4 Fast is the latency-optimized variant in the Rerank 4 family, tuned for low-latency, high-throughput use cases. rerank-v4-pro targets the highest relevance quality on complex queries. Both share the same multilingual coverage and per-document context of 32K tokens.
Which languages does Cohere Rerank 4 Fast support?
More than 100 languages, with the same multilingual coverage as Cohere's embed-multilingual family. Cross-lingual queries work in one call, so a query in one language can match documents in another.
What document types can Cohere Rerank 4 Fast rerank?
Long-form text, semi-structured JSON, tables, code, and email-style records. The per-document context window is 32K tokens, shared between the query and the document.
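For semi-structured records, one option is to flatten each record into a text document with the most relevant fields first, so that they survive if the shared window is exceeded. The sketch below is one possible approach under that assumption; `JsonRecord` and `recordToDocument` are hypothetical names, and this page does not state whether the endpoint also accepts raw objects.

```typescript
// A flat record with primitive values; nested objects are out of scope for this sketch.
type JsonRecord = Record<string, string | number | boolean | null>;

// Render one record as "key: value" lines in the given field order.
// Fields missing from the record are skipped; fields not listed are omitted.
function recordToDocument(record: JsonRecord, fieldOrder: string[]): string {
  return fieldOrder
    .filter((key) => key in record)
    .map((key) => `${key}: ${String(record[key])}`)
    .join('\n');
}

// Convert a batch of records into the string documents passed to the reranker.
function recordsToDocuments(records: JsonRecord[], fieldOrder: string[]): string[] {
  return records.map((record) => recordToDocument(record, fieldOrder));
}
```

Listing fields explicitly also keeps internal identifiers and other noise out of the text the reranker scores.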
How is Cohere Rerank 4 Fast billed on AI Gateway?
Reranking is priced per search query rather than per token. See the pricing section on this page for the current per-query rate.
Do I still need an embedding model with Cohere Rerank 4 Fast?
Yes, for the first-pass retrieval step. Cohere Rerank 4 Fast scores a candidate set against the query; it does not retrieve from the full corpus. A common pattern is to retrieve 50 to 200 candidates with an embedding model, then rerank them down to a top-k of 5 to 20.
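The retrieve-then-rerank pattern above can be sketched as two ranking passes over a shrinking candidate set. Both scorers here are injected stand-ins: in a real pipeline the first pass would be an embedding or keyword search and the second an asynchronous rerank call. The names `Scorer`, `topBy`, and `retrieveThenRerank`, and the defaults of 100 candidates and a top-k of 10, are illustrative choices within the ranges mentioned above.

```typescript
// A scorer returns a relevance value for one query/document pair; higher is better.
type Scorer = (query: string, document: string) => number;

// Score every document, then keep the `limit` highest-scoring ones in order.
function topBy(
  query: string,
  documents: string[],
  score: Scorer,
  limit: number,
): string[] {
  return documents
    .map((document) => ({ document, value: score(query, document) }))
    .sort((a, b) => b.value - a.value)
    .slice(0, limit)
    .map((entry) => entry.document);
}

// Stage 1 narrows the corpus with a cheap scorer; stage 2 reorders only
// those candidates with the more expensive reranker.
function retrieveThenRerank(
  query: string,
  corpus: string[],
  firstPass: Scorer,
  reranker: Scorer,
  candidateCount: number = 100,
  topK: number = 10,
): string[] {
  const candidates = topBy(query, corpus, firstPass, candidateCount);
  return topBy(query, candidates, reranker, topK);
}
```

The reranker only ever sees `candidateCount` documents, which is why the first-pass retriever determines recall and the reranker determines final ordering.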
How does Cohere Rerank 4 Fast compare to rerank-v3.5?
rerank-v3.5 targets English and matches embed-multilingual-v3.0 coverage. Cohere Rerank 4 Fast is part of the Rerank 4 generation, explicitly multilingual across 100+ languages, and tuned for lower latency than rerank-v4-pro.
Does Cohere Rerank 4 Fast support Zero Data Retention?
Not currently. Zero Data Retention is offered on a per-provider basis and is not available for this model on AI Gateway. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.