Mercury 2

Mercury 2 is Inception's reasoning diffusion language model. It refines tokens in parallel with tunable reasoning depth, native tool use, and a context window of 128K tokens.

Tool Use, Reasoning

index.ts

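import { streamText } from 'ai'

// A 'creator/model' string is resolved through AI Gateway by default.
const result = streamText({
  model: 'inception/mercury-2',
  prompt: 'Why is the sky blue?'
})

// Print each text chunk as it arrives.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}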


Playground

Try out Mercury 2 by Inception. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider

Context	Latency	Input	Output	Cache Read	Cache Write	Release Date
128K	0.4s	$0.25/M	$0.75/M	$0.03/M	—	02/24/2026

Legal: Terms • Privacy

More models by Inception

Model

Context	Latency	Input	Output	Release Date
32K	0.3s	$0.25/M	$1.00/M	02/26/2025

About Mercury 2

Mercury 2 departs from the autoregressive strategy that defines most large language models (LLMs). Instead of producing one token at a time left to right, Mercury 2 operates on a diffusion principle: it starts with a rough draft of the full response and refines multiple tokens in parallel across a small number of steps. Because each step updates many positions at once, generation needs fewer sequential passes than token-by-token decoding, which is why Mercury 2 generates faster than autoregressive approaches. Live metrics on this page show current throughput.
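To make the contrast concrete, the toy sketch below counts sequential passes for the two strategies. It is an illustration only and not Mercury 2's actual algorithm: the function names are hypothetical, and the "model" simply copies from a known target instead of predicting tokens.

```typescript
// Toy illustration only. Real diffusion language models learn which positions
// to revise; this sketch copies from a known target to show the pass counts.
interface Trace {
  passes: number;   // sequential model passes used
  output: string[]; // final token sequence
}

// Autoregressive decoding: one sequential pass per generated token.
function autoregressive(target: string[]): Trace {
  const output: string[] = [];
  for (const token of target) {
    output.push(token);
  }
  return { passes: target.length, output };
}

// Parallel refinement: start from a placeholder draft of the full response and
// fix an interleaved subset of positions across the whole draft on each pass.
function parallelRefinement(target: string[], steps: number): Trace {
  if (!Number.isInteger(steps) || steps < 1) {
    throw new RangeError("steps must be a positive integer");
  }
  const output: string[] = target.map(() => "_");
  for (let step = 0; step < steps; step++) {
    for (let i = step; i < target.length; i += steps) {
      output[i] = target[i];
    }
  }
  return { passes: steps, output };
}

const sentence = "the sky is blue because air scatters short wavelengths of light more".split(" ");
console.log(autoregressive(sentence).passes);        // 12
console.log(parallelRefinement(sentence, 4).passes); // 4
```

In this toy, the autoregressive pass count grows with response length, while the refinement pass count is set by the chosen number of steps.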

Mercury 2 supports tunable reasoning depth. You adjust refinement steps up or down to trade latency for quality on each request. Native tool use and schema-aligned JSON output let you embed it in function-calling pipelines and structured extraction workflows without extra parsing layers.
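One way to think about the latency-for-quality trade is as a step budget. The sketch below is a hypothetical helper, not part of any Mercury 2 or AI Gateway API: this page does not name the request parameter that sets reasoning depth, and `msPerStep`, `minSteps`, and `maxSteps` are assumed inputs you would measure or choose yourself.

```typescript
// Hypothetical helper: computes how many refinement steps fit a latency budget.
// The per-step time and the bounds are assumptions, not published figures.
function refinementStepsForBudget(
  latencyBudgetMs: number,
  msPerStep: number,
  minSteps = 1,
  maxSteps = 64,
): number {
  if (msPerStep <= 0) {
    throw new RangeError("msPerStep must be positive");
  }
  // Whole steps that fit in the budget, clamped to the allowed range.
  const affordable = Math.floor(latencyBudgetMs / msPerStep);
  return Math.min(maxSteps, Math.max(minSteps, affordable));
}

console.log(refinementStepsForBudget(200, 25)); // 8
```

A tight budget falls back to `minSteps`, and a generous one is capped at `maxSteps`, so easy requests stay fast and hard ones get more refinement.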

With a context window of 128K tokens, OpenAI API compatibility, and listed pricing of $0.25 per million input tokens and $0.75 per million output tokens, Mercury 2 fits production-scale agentic workloads where inference runs dozens of times per task. Teams building multi-step coding assistants, retrieval-augmented generation (RAG) pipelines, or real-time voice interfaces gain headroom to run more refinement iterations within a fixed latency budget.
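For workloads that call the model dozens of times per task, a rough cost estimate follows directly from the per-million-token rates. The sketch below uses the $0.25 input and $0.75 output figures from this page; the function name, the call count, and the per-call token sizes are illustrative assumptions, and live rates may differ.

```typescript
// Rates in USD per million tokens, copied from this page; live rates can change.
interface Rates {
  inputPerMillion: number;
  outputPerMillion: number;
}

const MERCURY_2_RATES: Rates = { inputPerMillion: 0.25, outputPerMillion: 0.75 };

// Estimated cost in USD for a given number of input and output tokens.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  rates: Rates = MERCURY_2_RATES,
): number {
  const weighted =
    inputTokens * rates.inputPerMillion + outputTokens * rates.outputPerMillion;
  return weighted / 1_000_000;
}

// Hypothetical agent task: 30 calls, each with 2,000 input and 500 output tokens.
const calls = 30;
const taskCost = estimateCostUsd(calls * 2_000, calls * 500);
console.log(taskCost.toFixed(5)); // 0.02625
```

The estimate ignores cached-read pricing, so treat it as an upper bound when prompts share a cached prefix.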

What To Consider When Choosing a Provider

Configuration: Mercury 2's diffusion architecture generates tokens in parallel rather than sequentially. Latency differs from autoregressive models, so factor that into timeout and streaming configurations for latency-sensitive pipelines.
Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Mercury 2

Best For

Sequential agent loops: Chains of many inference calls need low per-step latency
Real-time voice backends: Response delay is perceptible to end users
High-throughput coding assistants: Many simultaneous requests processed concurrently
Fast structured RAG: Retrieval summarization returned as JSON output
Token cost optimization: Diffusion-based parallel token refinement reduces per-token inference cost compared to autoregressive models

Consider Alternatives When

Very long contexts: Combined input and output push against the 128K-token context window
Domain-specific benchmarks: Evaluation prioritizes specific benchmarks over raw throughput
Token-by-token streaming: Pipeline assumes autoregressive generation patterns
Multimodal input required: You need image or audio input alongside text reasoning

Conclusion

Mercury 2 brings a different execution model to production reasoning workloads. Diffusion-based parallel refinement keeps throughput high while preserving tool calling, structured output, and tunable reasoning depth. If inference latency or per-call cost limits how you scale your product, use Mercury 2 on Vercel AI Gateway. Open https://ai-sdk.dev/playground/inception:mercury-2 to try it interactively.

Frequently Asked Questions

What makes Mercury 2 architecturally different from other reasoning models?

It uses diffusion instead of autoregressive generation. Mercury 2 starts with a draft of the full response and refines all token positions simultaneously across iterative steps, rather than generating one token at a time left to right. That follows the same conceptual lineage as image and video diffusion models, applied to language.

How does tunable reasoning depth work in Mercury 2?

You adjust the number of diffusion refinement steps at inference time. Fewer steps yield faster responses; more steps let the model converge on higher-quality answers. You match compute to task difficulty on each request.

What throughput does Mercury 2 achieve compared to autoregressive reasoning models?

Mercury 2 generates faster than autoregressive approaches. Live throughput metrics appear on this page.

Is Mercury 2 compatible with OpenAI client libraries?

Yes. Mercury 2 exposes an OpenAI-compatible API. Through AI Gateway, call Mercury 2 with the AI SDK, Chat Completions API, Responses API, Messages API, or other API formats, from TypeScript or Python. Set the base URL to AI Gateway and the model identifier to `inception/mercury-2`; existing OpenAI SDK code routes through without further changes.

What context length does Mercury 2 support?

A context window of 128K tokens. That suits long document processing, extended conversation history, and multi-document retrieval tasks.

Does Mercury 2 support structured output for agent orchestration?

Yes. Mercury 2 includes native schema-aligned JSON output and tool use. You can plug it into function-calling orchestration frameworks without extra parsing middleware.

How is Mercury 2 priced?

This page lists the current rates. Multiple providers can serve Mercury 2, so AI Gateway surfaces live pricing rather than a single fixed figure.
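The OpenAI compatibility answer above can be sketched as a plain Chat Completions request. This is a minimal sketch under stated assumptions: `buildChatCompletionRequest` is a hypothetical helper, the base URL in the usage comment is a placeholder (copy the real AI Gateway endpoint from the documentation), and only the model identifier `inception/mercury-2` comes from this page.

```typescript
// Shape of the arguments you would pass to fetch(url, init).
interface ChatRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Hypothetical helper: builds an OpenAI-style Chat Completions request.
// baseUrl is an assumption; use the endpoint given in the AI Gateway docs.
function buildChatCompletionRequest(
  baseUrl: string,
  apiKey: string,
  prompt: string,
): ChatRequest {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "inception/mercury-2",
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (placeholder URL and key):
// const { url, init } = buildChatCompletionRequest("https://gateway.example/v1", key, "Why is the sky blue?");
// const response = await fetch(url, init);
```

Keeping request construction separate from the network call makes it easy to unit test the payload and to swap the base URL between environments.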
