Claude Opus 4
Claude Opus 4 is a coding model from Anthropic with strong benchmark scores, including 72.5% on SWE-bench Verified and 43.2% on Terminal-bench. It sustains performance on multi-hour agentic tasks and supports hybrid extended thinking with tool use.
import { streamText } from 'ai'

const result = streamText({
  model: 'anthropic/claude-opus-4',
  prompt: 'Why is the sky blue?',
})

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
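A minimal sketch of key resolution on the application side (the `AI_GATEWAY_API_KEY` variable name is an assumption; check the gateway documentation, and note that in OIDC setups the platform supplies the token instead):

```typescript
// Sketch: resolve a gateway API key from the environment.
// AI_GATEWAY_API_KEY is an assumed variable name; with OIDC the
// platform injects credentials and no key lookup is needed.
function resolveApiKey(env: Record<string, string | undefined>): string {
  const key = env.AI_GATEWAY_API_KEY
  if (!key) {
    throw new Error('No API key found; set AI_GATEWAY_API_KEY or use OIDC')
  }
  return key
}

console.log(resolveApiKey({ AI_GATEWAY_API_KEY: 'sk-example' })) // prints "sk-example"
```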
Opus 4's higher per-token cost and long-running session profile make AI Gateway's cost tracking particularly useful. Observability from the first request helps prevent budget surprises on multi-hour jobs.
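As a back-of-envelope illustration of that cost tracking (the per-million-token rates below are placeholders, not Opus 4's actual prices; check the pricing panel for current numbers):

```typescript
// Rough spend estimate from token usage counts. The rates are
// placeholder values, NOT real Opus 4 pricing; see the pricing panel.
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPerMTok = 15, // placeholder $/1M input tokens
  outputPerMTok = 75, // placeholder $/1M output tokens
): number {
  return (
    (inputTokens / 1_000_000) * inputPerMTok +
    (outputTokens / 1_000_000) * outputPerMTok
  )
}

// A multi-hour agent run can burn hundreds of thousands of tokens:
console.log(estimateCostUSD(200_000, 50_000).toFixed(2)) // "6.75" at the placeholder rates
```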
When to Use Claude Opus 4
Best For
Long-horizon agentic tasks:
Requiring sustained focus across thousands of steps and multiple hours, validated by an independent seven-hour refactoring run
Complex codebase understanding and modification:
SWE-bench 72.5% and Terminal-bench 43.2%
Research and analysis workflows:
Benefiting from extended thinking with tool use, reasoning interleaved with web search or other external tools
Scientific discovery and R&D tasks:
Analytical depth and domain knowledge are the binding constraints
Production agent deployments:
The 65% reduction in shortcut-taking behavior matters for reliability
Consider Alternatives When
Per-token cost constraint:
Sonnet 4 delivers strong performance at significantly lower cost and matches or exceeds Opus 4 on SWE-bench
Critical response latency:
Sonnet variants are faster for interactive use
Shorter bounded tasks:
The capability differential over Sonnet shrinks when multi-hour sustained attention isn't needed
1M context window requirements:
The 1M-token context window reached Sonnet 4 only after launch and came to Opus models with Opus 4.6
Conclusion
Claude Opus 4 demonstrated sustained agentic performance at the Claude 4 generation's launch. It solves hard problems and maintains coherence and performance over hours. Teams building long-horizon coding agents, long-horizon research pipelines, or autonomous engineering workflows have concrete reference points in the benchmark data and early customer validation.
FAQ
What are Claude Opus 4's headline benchmark scores?
Opus 4 scored 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, both without extended thinking.
How long can Opus 4 sustain an agentic task?
Rakuten validated a seven-hour autonomous run on a demanding open-source refactoring task, with performance sustained throughout. Anthropic described the model as capable of working continuously for several hours.
What is extended thinking with tool use?
A beta capability introduced with the Claude 4 launch: the model alternates between extended reasoning and tool calls within a single session. For example, it can think about a problem, run a web search, reason about the results, search again, and synthesize across the chain.
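The alternating pattern described above can be sketched as a simple control-flow simulation (illustrative only; the real reasoning happens inside the model, and `search` here is a hypothetical tool):

```typescript
// Simulated think → search → think → search → synthesize chain.
// The model's actual extended thinking is server-side; this only
// illustrates the interleaved control flow.
type TraceEvent = { phase: 'reasoning' | 'tool'; text: string }

function interleave(
  queries: string[],
  search: (q: string) => string, // hypothetical web-search tool
): TraceEvent[] {
  const trace: TraceEvent[] = []
  for (const q of queries) {
    trace.push({ phase: 'reasoning', text: `deciding to search: ${q}` })
    trace.push({ phase: 'tool', text: search(q) })
  }
  trace.push({ phase: 'reasoning', text: 'synthesizing across all results' })
  return trace
}

const trace = interleave(
  ['rayleigh scattering', 'sky color at sunset'],
  (q) => `results for: ${q}`,
)
console.log(trace.length) // 2 events per query + 1 synthesis step = 5
```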
How do memory files work?
When you provide local file access, Opus 4 creates and maintains memory files to store key facts and context, which improves long-term coherence on extended tasks. Anthropic illustrated this with the model creating a navigation guide during autonomous Pokémon gameplay.
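A minimal sketch of what such a memory file might look like on the harness side (file name, location, and format are assumptions; in practice the model itself decides what to write when given file access):

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Sketch: an append-only memory file an agent harness could expose.
// MEMORY.md is an assumed name; the model chooses its own structure.
const dir = mkdtempSync(join(tmpdir(), 'opus-agent-'))
const memoryPath = join(dir, 'MEMORY.md')

function appendMemory(note: string): void {
  const prior = existsSync(memoryPath) ? readFileSync(memoryPath, 'utf8') : ''
  writeFileSync(memoryPath, `${prior}- ${note}\n`)
}

appendMemory('refactor target: payments module')
appendMemory('unit tests live under tests/unit')
console.log(readFileSync(memoryPath, 'utf8'))
```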
How much more reliable are Claude 4 models on agentic tasks?
Claude 4 models (Opus 4 and Sonnet 4) are 65% less likely than Sonnet 3.7 to use shortcuts or loopholes to complete agentic tasks. This is a reliability improvement for production deployments where you need the model to solve the actual problem rather than game the metric.
What does Claude Opus 4 cost?
Check the pricing panel on this page for today's numbers. AI Gateway tracks rates across every provider that serves Claude Opus 4.
Are extended thinking traces summarized?
Yes. A smaller model condenses lengthy thought processes into summaries. Anthropic noted this is needed only about 5% of the time, when thoughts are too long to display in full.