Claude Opus 4

Claude Opus 4 is a coding model from Anthropic. It scored 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, sustains performance on multi-hour agentic tasks, and supports hybrid extended thinking with tool use.

Capabilities: File Input, Reasoning, Tool Use, Vision (Image), Explicit Caching

index.ts

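import { streamText } from 'ai'

// Start a streaming completion from Claude Opus 4 through AI Gateway.
const result = streamText({
  model: 'anthropic/claude-opus-4',
  prompt: 'Why is the sky blue?'
})

// Print the text as it arrives.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}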


Playground

Try out Claude Opus 4 by Anthropic. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
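A provider preference can be expressed in code as well as in the UI. The sketch below builds a plain request object, assuming the gateway reads an ordered list of provider slugs from `providerOptions.gateway.order`; confirm the option name against the AI Gateway docs before relying on it. The `buildRequest` helper and the slugs `anthropic` and `bedrock` are illustrative placeholders, not values taken from this page.

```typescript
// Hypothetical helper: builds the options for one gateway call.
type GatewayRequest = {
  model: string
  prompt: string
  providerOptions: { gateway: { order: string[] } }
}

function buildRequest(prompt: string, providerOrder: string[]): GatewayRequest {
  if (providerOrder.length === 0) {
    throw new Error('providerOrder must list at least one provider slug')
  }
  return {
    model: 'anthropic/claude-opus-4',
    prompt,
    // Assumption: providers are tried in the order listed here.
    providerOptions: { gateway: { order: [...providerOrder] } }
  }
}

// Placeholder slugs; copy the real ones from the Providers table.
const request = buildRequest('Why is the sky blue?', ['anthropic', 'bedrock'])
console.log(JSON.stringify(request.providerOptions))
```

If the assumption holds, the resulting object can be passed to `streamText` in place of the inline options shown in index.ts.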

Provider	Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	ZDR	No Training	Release Date
Legal: Terms • Privacy	200K	1.6s	44tps	$15.00/M	$75.00/M	Read: $1.5/M, Write: $18.75/M	$10.00/K + input costs	—				08/05/2025
Legal: Terms • Privacy	200K	3.1s	18tps	$15.00/M	$75.00/M	Read: $1.5/M, Write: $18.75/M		—				08/05/2025
Legal: Terms • Privacy	200K	1.3s	44tps	$15.00/M	$75.00/M	Read: $1.5/M, Write: $18.75/M	$10.00/K + input costs	—				08/05/2025
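The rates in the provider rows above translate into a per-request estimate with simple arithmetic. The following is a rough sketch, not a billing tool: the function names and the `Usage` shape are hypothetical, web search charges are left out, and treating the Latency column as time to first token is an assumption.

```typescript
// Rates copied from the provider rows above, in USD per million tokens.
const RATES = {
  input: 15.0,
  output: 75.0,
  cacheRead: 1.5,
  cacheWrite: 18.75
}

type Usage = {
  inputTokens: number
  outputTokens: number
  cacheReadTokens?: number
  cacheWriteTokens?: number
}

// Convert a per-million rate into a per-token rate.
const perToken = (ratePerMillion: number): number => ratePerMillion / 1e6

// Estimated token cost of one request, excluding web search charges.
function estimateCostUsd(usage: Usage): number {
  return (
    usage.inputTokens * perToken(RATES.input) +
    usage.outputTokens * perToken(RATES.output) +
    (usage.cacheReadTokens ?? 0) * perToken(RATES.cacheRead) +
    (usage.cacheWriteTokens ?? 0) * perToken(RATES.cacheWrite)
  )
}

// Rough wall-clock estimate: initial latency plus streaming time.
function estimateSeconds(
  outputTokens: number,
  latencySeconds: number,
  tokensPerSecond: number
): number {
  return latencySeconds + outputTokens / tokensPerSecond
}

// 100,000 input tokens and 10,000 output tokens.
const cost = estimateCostUsd({ inputTokens: 100_000, outputTokens: 10_000 })
console.log(cost.toFixed(2)) // prints "2.25"
```

Calling `estimateSeconds` with each row's latency and throughput gives a quick way to compare providers for long generations, where throughput dominates the initial latency.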

More models by Anthropic

Model	Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	Providers	ZDR	No Training	Release Date
		0.8s	102tps	$5.00/M	$25.00/M	Read: $0.5/M, Write: $6.25/M	$10/K + input costs	—					04/16/2026
		0.7s	54tps	$3.00/M	$15.00/M	Read: $0.3/M, Write: $3.75/M	$10/K + input costs	—					02/17/2026
		0.7s	48tps	$5.00/M	$25.00/M	Read: $0.5/M, Write: $6.25/M	$10/K + input costs	—					02/05/2026
	200K	0.6s	111tps	$1.00/M	$5.00/M	Read: $0.1/M, Write: $1.25/M	$10.00/K + input costs	—					10/15/2025
		0.7s	58tps	$3.00/M	$15.00/M	Read: $0.3/M, Write: $3.75/M	$10.00/K + input costs	—					09/29/2025
	200K	0.6s	52tps	$5.00/M	$25.00/M	Read: $0.5/M, Write: $6.25/M	$10.00/K + input costs	—					11/24/2024

About Claude Opus 4

Claude Opus 4 launched on May 22, 2025 alongside Claude Sonnet 4. Anthropic positioned it for demanding coding workloads. It scored 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, both without extended thinking, which shows that Opus 4's baseline capability advanced meaningfully beyond previous models.

Sustained performance is what most clearly set Opus 4 apart from its predecessors. Rakuten validated the model with a demanding open-source refactor that ran independently for seven hours, maintaining focus and coherence over hundreds of individual steps. Cursor called it strong for coding and a leap forward in complex codebase understanding. Block reported it was the first model to boost code quality during editing and debugging in their agent (codename goose) while maintaining full reliability. Cognition noted Opus 4 handled critical actions that previous models had missed on complex challenges.

The Claude 4 launch introduced extended thinking with tool use in beta. Both Opus 4 and Sonnet 4 can alternate between reasoning and tool use, such as web search, during a single extended thinking session. This enables research patterns where Claude searches, reasons about the results, searches again based on that reasoning, and synthesizes across the full chain. Memory capabilities also improved substantially: when given local file access, Opus 4 creates and maintains memory files to store key information, which improves long-term coherence on extended tasks.
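The search, reason, search again, synthesize pattern described above can be pictured as a small control loop. This is a toy sketch only: it calls no real API, and every name in it (`Step`, `Reasoner`, `SearchTool`, `research`) is a hypothetical stand-in. The `notes` array plays the role of a memory file, and the stubs at the bottom replace the model and the search tool.

```typescript
// Hypothetical types; nothing here talks to a model or a real tool.
type Step =
  | { kind: 'search'; query: string }
  | { kind: 'answer'; text: string }

// Stand-in for the model's reasoning: picks the next step from the notes so far.
type Reasoner = (question: string, notes: string[]) => Step
// Stand-in for an external tool such as web search.
type SearchTool = (query: string) => string

function research(
  question: string,
  reason: Reasoner,
  search: SearchTool,
  maxSteps = 5
): { answer: string; notes: string[] } {
  const notes: string[] = [] // accumulated findings, like a memory file
  for (let i = 0; i < maxSteps; i++) {
    const step = reason(question, notes)
    if (step.kind === 'answer') {
      return { answer: step.text, notes }
    }
    // Run the tool and record the result so the next reasoning pass can use it.
    notes.push(`${step.query} -> ${search(step.query)}`)
  }
  return { answer: 'step budget exhausted', notes }
}

// Toy stubs: search twice, then synthesize.
const toyReason: Reasoner = (question, notes) =>
  notes.length < 2
    ? { kind: 'search', query: `${question} (pass ${notes.length + 1})` }
    : { kind: 'answer', text: `synthesized from ${notes.length} searches` }
const toySearch: SearchTool = (query) => `results for "${query}"`

const outcome = research('sky color', toyReason, toySearch)
console.log(outcome.answer) // prints "synthesized from 2 searches"
```

The step budget is a design choice for the sketch: it keeps the loop bounded when the reasoning stub never decides to answer.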

The Claude 4 generation reduced shortcut-taking behavior by 65% compared to Sonnet 3.7 on agentic tasks particularly susceptible to that failure mode. This is an important reliability property for production agent deployments where gaming a metric rather than solving the underlying problem is a real risk.

What To Consider When Choosing a Provider

Configuration: Opus 4's higher per-token cost and long-running session profile make AI Gateway's cost tracking particularly useful. Observability from the first request helps prevent budget surprises on multi-hour jobs.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Claude Opus 4

Best For

Long-horizon agentic tasks: Work that requires sustained focus across thousands of steps and multiple hours, validated by Rakuten's seven-hour independent refactor run
Complex codebase understanding and modification: 72.5% on SWE-bench Verified and 43.2% on Terminal-bench
Research and analysis workflows: Work that benefits from extended thinking with tool use, where reasoning is interleaved with web search or other external tools
Scientific discovery and R&D tasks: Cases where analytical depth and domain knowledge are the binding constraints
Production agent deployments: The 65% reduction in shortcut-taking behavior improves reliability

Consider Alternatives When

Per-token cost is a constraint: Sonnet 4 delivers strong performance at significantly lower cost and matched or exceeded Opus 4 on SWE-bench
Response latency is critical: Sonnet variants are faster for interactive use
Tasks are short and bounded: The capability differential over Sonnet shrinks when multi-hour sustained attention isn't needed
You need a 1M context window: This came to Sonnet 4 later and to Opus models with 4.6

Conclusion

Claude Opus 4 demonstrated sustained agentic performance at the launch of the Claude 4 generation. It solves hard problems while maintaining coherence over hours of work. Teams building long-horizon coding agents, research pipelines, or autonomous engineering workflows have concrete reference points in the benchmark data and early customer validation.

Frequently Asked Questions

What SWE-bench and Terminal-bench scores did Claude Opus 4 achieve?
Opus 4 scored 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, both without extended thinking.
How long can Claude Opus 4 run an agentic task without losing coherence?
Rakuten validated a seven-hour independent run on a demanding open-source refactoring task with sustained performance. Anthropic described the model as capable of working continuously for several hours.
What is extended thinking with tool use in Claude Opus 4?
A beta capability introduced with the Claude 4 launch. The model alternates between extended reasoning and tool calls within a single session. For example, it can think about a problem, run a web search, reason about the results, search again, and synthesize across the chain.
How did Claude Opus 4 improve memory capabilities?
When you provide local file access, Opus 4 creates and maintains memory files to store key facts and context. This enables better long-term coherence on extended tasks. Anthropic illustrated this with the model creating a navigation guide during autonomous Pokémon gameplay.
What was the shortcut-taking behavior reduction?
Claude 4 models (Opus 4 and Sonnet 4) are 65% less likely to use shortcuts or loopholes to complete agentic tasks compared to Sonnet 3.7. This is a reliability improvement for production deployments where you need the model to solve the actual problem rather than gaming the metric.
How does Opus 4 pricing compare to Sonnet 4?
Check the pricing panel on this page for today's numbers. AI Gateway tracks rates across every provider that serves Claude Opus 4.
Does Claude Opus 4 support thinking summaries?
Yes. A smaller model condenses lengthy thought processes into summaries. Anthropic noted this is only needed about 5% of the time, when thoughts are too long to display in full.
