Claude Opus 4 launched on May 22, 2025 alongside Claude Sonnet 4. Anthropic positioned it for demanding coding workloads, and the benchmark results backed that up: 72.5% on SWE-bench Verified and 43.2% on Terminal-bench. These scores were achieved without extended thinking, showing that Opus 4's baseline capability advanced meaningfully beyond previous models.
Sustained performance differentiated Opus 4 most distinctly from its predecessors. Rakuten validated the model with a demanding open-source refactor that ran independently for seven hours, maintaining focus and coherence over hundreds of individual steps. Cursor called it strong for coding and a leap forward in complex codebase understanding. Block reported it was the first model to boost code quality during editing and debugging in their agent (codename goose) while maintaining full reliability. Cognition noted Opus 4 handled critical actions that previous models had missed on complex challenges.
The Claude 4 launch introduced extended thinking with tool use in beta. Both Opus 4 and Sonnet 4 can alternate between reasoning and tool use like web search during a single extended thinking session. This enables research patterns where Claude searches, reasons about results, searches again based on that reasoning, and synthesizes across the full chain. Memory capabilities also improved substantially: when given local file access, Opus 4 creates and maintains memory files to store key information, enabling better long-term coherence on extended tasks.
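The search-reason-search-synthesize pattern, combined with a local memory file, can be sketched as a toy agent loop. This is a minimal illustration, not the Anthropic API: `search`, `run_agent`, the canned corpus, and the `memory.json` layout are all hypothetical stand-ins for the model's real tool calls and reasoning steps.

```python
import json
from pathlib import Path

def search(query):
    # Hypothetical stub for a web-search tool: returns canned results
    # so the loop structure is runnable for illustration.
    corpus = {
        "claude 4 launch": ["Opus 4 and Sonnet 4 announced"],
        "opus 4 benchmarks": ["72.5% SWE-bench Verified", "43.2% Terminal-bench"],
    }
    return corpus.get(query, [])

def run_agent(task, queries, memory_path="memory.json"):
    """Alternate tool use and 'reasoning', persisting key facts to a memory file.

    In the real system the model decides what to search next based on its
    own extended thinking; here the query sequence is supplied up front.
    """
    memory_file = Path(memory_path)
    memory = json.loads(memory_file.read_text()) if memory_file.exists() else {}
    findings = []
    for query in queries:
        results = search(query)       # tool-use step
        findings.extend(results)      # reasoning over results (stubbed)
        memory[query] = results       # store key information for later turns
        memory_file.write_text(json.dumps(memory, indent=2))
    # Final synthesis across the whole search-and-reason chain.
    return {"task": task, "synthesis": findings}
```

Persisting `memory.json` after every step is what gives the loop long-task coherence: a fresh session can reload the file and pick up where the last one stopped, mirroring how Opus 4 uses local file access for memory.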
The Claude 4 generation reduced shortcut-taking behavior by 65% compared to Sonnet 3.7 on agentic tasks particularly susceptible to that failure mode. This is an important reliability property for production agent deployments, where gaming a metric rather than solving the underlying problem is a real risk.