The upgraded Claude 3.5 Sonnet (October 2024) is the first publicly available model to offer computer use in public beta. It raises SWE-bench Verified from 33.4% to 49.0% and delivers across-the-board coding and tool use improvements at the same price as its predecessor.
```typescript
import { streamText } from 'ai'

const result = streamText({
  model: 'anthropic/claude-3.5-sonnet',
  prompt: 'Why is the sky blue?',
})
```

What To Consider When Choosing a Provider
- Configuration: Computer use tasks involving dozens or hundreds of steps generate substantial token volumes. AI Gateway's cost tracking lets you measure actual token consumption per session rather than estimating upfront.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model on direct gateway requests (BYOK requests are not covered). See the AI Gateway documentation for how to enable it.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
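As a concrete illustration of the cost-tracking point above, the sketch below sums per-step token usage across a long session. The `StepUsage` shape and field names are illustrative assumptions, not AI Gateway's actual usage schema:

```typescript
// Hypothetical sketch: aggregate token usage across the steps of a
// multi-step computer-use session. Field names are assumptions for
// illustration, not AI Gateway's reporting schema.
interface StepUsage {
  inputTokens: number
  outputTokens: number
}

function sessionTotals(steps: StepUsage[]): StepUsage {
  return steps.reduce(
    (acc, s) => ({
      inputTokens: acc.inputTokens + s.inputTokens,
      outputTokens: acc.outputTokens + s.outputTokens,
    }),
    { inputTokens: 0, outputTokens: 0 },
  )
}

// Even a short two-step session adds up:
const totals = sessionTotals([
  { inputTokens: 1200, outputTokens: 300 },
  { inputTokens: 1850, outputTokens: 410 },
])
// totals: { inputTokens: 3050, outputTokens: 710 }
```

Recording measured totals like this per session is what makes after-the-fact cost attribution possible for agentic workloads whose step counts vary widely.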
When to Use Claude 3.5 Sonnet
Best For
- Computer use and UI automation: The first model to offer this in public beta, suited for tasks requiring navigation of real software interfaces
- Complex software engineering tasks at scale: Its 49.0% SWE-bench Verified score placed it above every publicly available model at the time of release
- Multi-step agentic coding: Tool calls, test execution, and full-stack workflows
- DevSecOps and code review pipelines: GitLab reported stronger reasoning on multi-step software development processes
- Web-based workflow automation: Tasks where the model must navigate sites, fill forms, and extract data across sessions
Consider Alternatives When
- Highest-volume throughput: Claude 3.5 Haiku is faster and cheaper for tasks that don't require computer use or deep coding
- Deterministic automation: Anthropic explicitly described computer use as experimental and sometimes cumbersome
- Text-only generation: Tasks without agentic complexity don't benefit from computer use capabilities
- Extended thinking mode: This capability arrived with Claude 3.7 Sonnet; hard reasoning problems that benefit from it should target that model
Conclusion
Claude 3.5 Sonnet (October 2024) marked an inflection point: it posted strong real-world software engineering benchmark results at release and was the first to make computer interaction available to developers via an API. Teams building agentic pipelines that involve actual software interfaces, not just code generation, have a concrete reason to evaluate this version.
Frequently Asked Questions
What is computer use and how does it work in Claude 3.5 Sonnet?
Computer use is an API capability that lets Claude interact with computers as people do, perceiving screen state via screenshots, moving a cursor, clicking buttons, and typing. Developers integrate the API and pass instructions like "fill out this form using data from my spreadsheet," which Claude translates into individual computer commands.
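In practice, the October 2024 beta exposes this as a tool definition on the Messages API. The sketch below shows the tool shape from Anthropic's launch documentation; treat the exact type string, beta flag, and display dimensions as details to verify against current docs:

```typescript
// Computer-use tool definition as introduced in the October 2024 beta.
// Type string and beta flag follow Anthropic's launch docs; verify
// against current documentation before relying on them.
const computerTool = {
  type: 'computer_20241022',
  name: 'computer',
  display_width_px: 1280,
  display_height_px: 800,
} as const

// Sent with the Messages API (beta flag: computer-use-2024-10-22),
// this leads Claude to emit tool_use actions such as 'screenshot',
// 'left_click', and 'type'. Your own harness executes each action
// against a real or virtual display and replies with tool_result
// blocks, looping until the task completes.
```

The agent loop is the developer's responsibility: Claude decides which command to issue next, but executing it and capturing the resulting screenshot happens in your code.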
Was computer use production-ready?
No. Anthropic explicitly described it as experimental at the October 2024 launch: capable but at times cumbersome and error-prone. They released it early to gather developer feedback, expecting rapid improvement.
How much did SWE-bench Verified improve between the June and October 2024 Claude 3.5 Sonnet versions?
The October upgrade moved the score from 33.4% to 49.0%, which Anthropic stated was higher than all publicly available models at that time, including other high-performing reasoning models and specialized agentic coding systems.
Did the October 2024 upgrade change Claude 3.5 Sonnet's pricing?
No. Anthropic released the upgraded model at the same price as its predecessor. Input, output, and context window specs remained consistent.
What companies were using computer use?
Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company were building with computer use at launch. Replit used it to evaluate apps during construction. The Browser Company applied it to web-based workflow automation.
How does this version differ from the June 2024 Claude 3.5 Sonnet (claude-3.5-sonnet-20240620)?
The October upgrade added computer use capabilities and significantly improved coding and tool use benchmarks. The June version lacked computer use entirely and had lower SWE-bench scores. Both versions share the same model family name but are distinct checkpoints.
What tool use improvements came with this upgrade?
TAU-bench tool use scores improved from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the airline domain, reflecting gains in handling structured multi-step agentic interactions.