Mercury 2 departs from the autoregressive strategy that defines most large language models (LLMs). Instead of producing one token at a time, left to right, Mercury 2 operates on a diffusion principle: it begins with a rough draft of the full response and refines many tokens in parallel across a small number of steps. Because each step updates many positions at once rather than advancing a single token per forward pass, Mercury 2 generates output faster than comparable autoregressive models.
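The loop below is a toy sketch of that parallel-refinement idea, not Mercury 2's actual sampler: a random scorer stands in for the model, and the most confident masked positions are committed each step, so the whole draft converges in a handful of passes instead of one pass per token.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def toy_denoise_step(draft):
    """Stand-in for a model forward pass: propose a token and a
    confidence score for every masked position at once."""
    return {
        i: (random.choice(VOCAB), random.random())
        for i, tok in enumerate(draft)
        if tok == MASK
    }

def diffusion_decode(length=8, steps=4):
    """Parallel refinement: each step commits the highest-confidence
    proposals, spending the remaining budget evenly so the draft is
    fully resolved by the final step."""
    draft = [MASK] * length
    for step in range(steps):
        proposals = toy_denoise_step(draft)
        if not proposals:
            break
        # Commit enough positions this step to finish within `steps`.
        budget = max(1, len(proposals) // (steps - step))
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for pos, (tok, _score) in ranked[:budget]:
            draft[pos] = tok
        print(f"step {step + 1}: {' '.join(draft)}")
    return draft

diffusion_decode()
```

An autoregressive decoder would need one forward pass per token here; the sketch resolves eight positions in four passes, which is the source of the speedup.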
Mercury 2 also supports tunable reasoning depth: you can raise or lower the number of refinement steps per request to trade latency against output quality. Native tool use and schema-aligned JSON output let you embed it in function-calling pipelines and structured-extraction workflows without extra parsing layers.
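Given the OpenAI API compatibility noted below, a structured-output request with a per-request depth setting might look like this sketch; the endpoint URL, model id, and the `refinement_steps` field are illustrative assumptions, not documented parameters.

```python
from openai import OpenAI

# Endpoint, model id, and refinement knob are assumptions for
# illustration; check the provider's docs for the real names.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="mercury-2",  # hypothetical model id
    messages=[{"role": "user", "content": "Extract the invoice fields as JSON."}],
    response_format={"type": "json_object"},  # schema-aligned JSON output
    extra_body={"refinement_steps": 8},  # hypothetical knob: more steps
                                         # trade latency for quality
)
print(response.choices[0].message.content)
```

Because the JSON arrives already schema-aligned, the result can feed a function-calling pipeline directly, with no regex or repair pass in between.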
With a 128K-token context window, OpenAI API compatibility, and pricing of $0.25 per million input tokens and $0.75 per million output tokens, Mercury 2 fits production-scale agentic workloads where inference runs dozens of times per task. Teams building multi-step coding assistants, retrieval-augmented generation (RAG) pipelines, or real-time voice interfaces gain headroom to run more refinement iterations within a fixed latency budget.
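A back-of-envelope check shows what those rates mean at agentic scale. Only the per-million prices come from the listing; the workload shape (calls per task, tokens per call) is an assumed example.

```python
# Listed prices: $0.25 per 1M input tokens, $0.75 per 1M output tokens.
INPUT_PER_M, OUTPUT_PER_M = 0.25, 0.75

# Assumed workload for illustration: 40 inference calls per task,
# each reading ~6K tokens of context and writing ~500 tokens.
calls, in_tokens, out_tokens = 40, 6_000, 500

cost = calls * (in_tokens * INPUT_PER_M + out_tokens * OUTPUT_PER_M) / 1_000_000
print(f"~${cost:.3f} per task")  # ~$0.075
```

Under those assumptions a forty-call agentic task lands around eight cents, which is the headroom that makes extra refinement iterations affordable inside a fixed budget.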