Claude Opus 4
Claude Opus 4 is Anthropic's coding-focused model. It scores 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, sustains performance on multi-hour agentic tasks, and supports hybrid extended thinking with tool use.
import { streamText } from 'ai'

const result = streamText({ model: 'anthropic/claude-opus-4', prompt: 'Why is the sky blue?' })
Playground
Try out Claude Opus 4 by Anthropic. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
About Claude Opus 4
Claude Opus 4 launched on May 22, 2025 alongside Claude Sonnet 4. Anthropic positioned it for demanding coding workloads and reported 72.5% on SWE-bench Verified and 43.2% on Terminal-bench. These scores were achieved without extended thinking, which indicates that Opus 4's baseline capability advanced meaningfully beyond previous models.
Sustained performance is what most clearly set Opus 4 apart from its predecessors. Rakuten validated the model on a demanding open-source refactor that ran independently for seven hours, maintaining focus and coherence over hundreds of individual steps. Cursor called it strong for coding and a leap forward in complex codebase understanding. Block reported it was the first model to boost code quality during editing and debugging in their agent (codename goose) while maintaining full reliability. Cognition noted Opus 4 handled critical actions that previous models had missed on complex challenges.
The Claude 4 launch introduced extended thinking with tool use in beta. Both Opus 4 and Sonnet 4 can alternate between reasoning and tool use like web search during a single extended thinking session. This enables research patterns where Claude searches, reasons about results, searches again based on that reasoning, and synthesizes across the full chain. Memory capabilities also improved substantially: when given local file access, Opus 4 creates and maintains memory files to store key information, enabling better long-term coherence on extended tasks.
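The search, reason, search again, synthesize pattern described above can be sketched as a small loop. This is a minimal illustration, not Anthropic's or the AI SDK's implementation: `Reasoner`, `SearchTool`, `runResearch`, and the toy stand-ins are all hypothetical names, and both the model and the tool are stubbed as synchronous functions so the sketch runs without network access. In a real deployment they would be asynchronous calls.

```typescript
// Hypothetical types: none of these names come from the AI SDK or Anthropic's API.
type Action =
  | { kind: 'search'; query: string } // ask a tool for more evidence
  | { kind: 'answer'; text: string }; // stop and synthesize

interface Turn {
  thought: string; // the reasoning that led to the action
  action: Action;
}

// Stand-ins for the model and the tool. Real versions would be async network calls.
type Reasoner = (question: string, notes: string[]) => Turn;
type SearchTool = (query: string) => string;

interface ResearchResult {
  answer: string;
  notes: string[]; // interleaved thoughts and tool results, in order
}

function runResearch(
  question: string,
  reason: Reasoner,
  search: SearchTool,
  maxSteps = 8,
): ResearchResult {
  const notes: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const { thought, action } = reason(question, notes);
    notes.push(`thought: ${thought}`);
    if (action.kind === 'answer') {
      return { answer: action.text, notes };
    }
    // Feed the tool result back so the next round of reasoning can use it.
    notes.push(`result: ${search(action.query)}`);
  }
  // Step budget exhausted: return what was gathered instead of looping forever.
  return { answer: '', notes };
}

// Toy stand-ins: search twice, then answer.
const toySearch: SearchTool = (query) => `stub result for "${query}"`;
const toyReasoner: Reasoner = (_question, notes) => {
  const results = notes.filter((n) => n.startsWith('result:')).length;
  if (results < 2) {
    return {
      thought: `need more evidence (have ${results})`,
      action: { kind: 'search', query: `query ${results + 1}` },
    };
  }
  return {
    thought: 'enough evidence',
    action: { kind: 'answer', text: `synthesized from ${results} results` },
  };
};

const demo = runResearch('Why is the sky blue?', toyReasoner, toySearch);
console.log(demo.answer);
```

The `maxSteps` cap is a design choice for the sketch: a loop that lets reasoning trigger further tool calls needs some bound so a run cannot continue indefinitely.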
The Claude 4 generation reduced shortcut-taking behavior by 65% compared to Sonnet 3.7 on agentic tasks particularly susceptible to that failure mode. This is an important reliability property for production agent deployments where gaming a metric rather than solving the underlying problem is a real risk.
What To Consider When Choosing a Provider
- Configuration: Opus 4's higher per-token cost and long-running session profile make AI Gateway's cost tracking particularly useful. Observability from the first request helps prevent budget surprises on multi-hour jobs.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
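To make the cost-tracking point concrete, here is a rough sketch of accumulating spend across the requests of a long-running job. Everything in it is an assumption for illustration: the `Usage` field names, the `BudgetTracker` class, and especially `PLACEHOLDER_RATES`, which are not real prices. Take actual rates from the pricing panel on this page.

```typescript
// Hypothetical shapes; map them from whatever usage data your client reports.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

interface Rates {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

// Placeholder numbers only. These are NOT Claude Opus 4's actual prices.
const PLACEHOLDER_RATES: Rates = { inputPerMTok: 10, outputPerMTok: 50 };

function costUSD(usage: Usage, rates: Rates): number {
  return (
    (usage.inputTokens * rates.inputPerMTok +
      usage.outputTokens * rates.outputPerMTok) /
    1_000_000
  );
}

class BudgetTracker {
  private spentUSD = 0;
  private readonly rates: Rates;
  private readonly limitUSD: number;

  constructor(rates: Rates, limitUSD: number) {
    this.rates = rates;
    this.limitUSD = limitUSD;
  }

  // Add one request's usage and return the running total in USD.
  record(usage: Usage): number {
    this.spentUSD += costUSD(usage, this.rates);
    return this.spentUSD;
  }

  get overBudget(): boolean {
    return this.spentUSD > this.limitUSD;
  }
}

const tracker = new BudgetTracker(PLACEHOLDER_RATES, 4);
const afterFirst = tracker.record({ inputTokens: 200_000, outputTokens: 10_000 });
const afterSecond = tracker.record({ inputTokens: 100_000, outputTokens: 20_000 });
console.log(afterFirst, afterSecond, tracker.overBudget);
```

A multi-hour agent could check `overBudget` between steps and pause or alert before the next request is sent.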
When to Use Claude Opus 4
Best For
- Long-horizon agentic tasks: Work that requires sustained focus across thousands of steps and multiple hours, validated by a seven-hour independent refactor run
- Complex codebase understanding and modification: Backed by 72.5% on SWE-bench Verified and 43.2% on Terminal-bench
- Research and analysis workflows: Work that benefits from extended thinking with tool use, where reasoning is interleaved with web search or other external tools
- Scientific discovery and R&D tasks: Analytical depth and domain knowledge are the binding constraints
- Production agent deployments: The 65% reduction in shortcut-taking behavior matters for reliability
Consider Alternatives When
- Per-token cost constraint: Sonnet 4 delivers strong performance at significantly lower cost and matched or exceeded Opus 4 on SWE-bench
- Critical response latency: Sonnet variants are faster for interactive use
- Shorter bounded tasks: The capability differential over Sonnet shrinks when multi-hour sustained attention isn't needed
- 1M context window: This capability came to Sonnet 4 later and to Opus models starting with 4.6
Conclusion
Claude Opus 4 demonstrated sustained agentic performance at the launch of the Claude 4 generation: it solves hard problems and maintains coherence over hours. Teams building long-horizon coding agents, research pipelines, or autonomous engineering workflows have concrete reference points in the benchmark data and early customer validation.
Frequently Asked Questions
What SWE-bench and Terminal-bench scores did Claude Opus 4 achieve?
Opus 4 scored 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, both without extended thinking.
How long can Claude Opus 4 run an agentic task without losing coherence?
Rakuten validated a seven-hour independent run on a demanding open-source refactoring task with sustained performance. Anthropic described the model as capable of working continuously for several hours.
What is extended thinking with tool use in Claude Opus 4?
It is a beta capability introduced with the Claude 4 launch. The model alternates between extended reasoning and tool calls within a single session. For example, it can think about a problem, run a web search, reason about the results, search again, and synthesize across the chain.
How did Claude Opus 4 improve memory capabilities?
When you provide local file access, Opus 4 creates and maintains memory files to store key facts and context. This enables better long-term coherence on extended tasks. Anthropic illustrated this with the model creating a navigation guide during autonomous Pokémon gameplay.
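As a rough illustration of the memory-file idea, the sketch below persists key facts to a local JSON file and reloads them later. It is not how Opus 4 itself manages memory: the file format, the `loadMemory` and `remember` helpers, and the example keys are all assumptions made for this example.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical memory format: a flat map of fact name to fact value.
type Memory = Record<string, string>;

// Read the memory file, or start empty if it does not exist yet.
function loadMemory(path: string): Memory {
  if (!existsSync(path)) return {};
  return JSON.parse(readFileSync(path, 'utf8')) as Memory;
}

// Merge one fact into the file and return the updated memory.
function remember(path: string, key: string, value: string): Memory {
  const memory: Memory = { ...loadMemory(path), [key]: value };
  writeFileSync(path, JSON.stringify(memory, null, 2));
  return memory;
}

// Demo in a throwaway temp directory so nothing in the working tree is touched.
const memoryPath = join(mkdtempSync(join(tmpdir(), 'agent-memory-')), 'memory.json');
remember(memoryPath, 'build_command', 'pnpm build');
remember(memoryPath, 'failing_test', 'parser.spec.ts');

// A later step (or a later session) can reload what was stored.
const restored = loadMemory(memoryPath);
console.log(restored);
```

Because the facts live on disk rather than in the conversation, they survive even when earlier context is no longer available to the agent.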
What was the shortcut-taking behavior reduction?
Claude 4 models (Opus 4 and Sonnet 4) are 65% less likely to use shortcuts or loopholes to complete agentic tasks compared to Sonnet 3.7. This is a reliability improvement for production deployments where you need the model to solve the actual problem rather than gaming the metric.
How does Opus 4 pricing compare to Sonnet 4?
Check the pricing panel on this page for today's numbers. AI Gateway tracks rates across every provider that serves Claude Opus 4.
Does Claude Opus 4 support thinking summaries?
Yes. A smaller model condenses lengthy thought processes into summaries. Anthropic noted this is only needed about 5% of the time, when thoughts are too long to display in full.