TL;DR: On May 23, 2025, Anthropic released Claude 4—Opus 4 and Sonnet 4. Opus 4 tops the SWE-bench leaderboard at 72.5%, can autonomously code for 7 hours straight, and is the first model deployed under ASL-3 safety standards. Sonnet 4 matches it at 72.7% on SWE-bench at 1/5 the cost. Together with Claude Code, Agent SDK, and MCP Connector, Anthropic has built a complete AI programming ecosystem.
Table of Contents
- Claude 4 Model Family Overview
- Core Technical Breakthroughs: Hybrid Reasoning and Autonomous Execution
- SWE-bench Showdown: Coding Benchmarks Compared
- Developer Toolchain: Claude Code, Agent SDK, and MCP Connector
- Hands-On: API Integration and Code Examples
- ASL-3 Safety Framework
- FAQ
- Summary
- Related Resources
Key Takeaways
- SWE-bench Champion: Opus 4 scores 72.5%, crushing GPT-4.1 (54.6%) and Gemini 2.5 Pro (63.2%). Sonnet 4 edges past it at 72.7% at one-fifth the cost.
- 7-Hour Autonomous Execution: Claude 4 can code, debug, test, and commit for hours without human supervision—a paradigm shift from "AI assistant" to "AI engineer."
- Hybrid Reasoning: Seamless switching between fast responses and deep Extended Thinking mode, generating tens of thousands of reasoning tokens while invoking tools in parallel.
- Complete Toolchain: Claude Code (terminal agent) + Agent SDK (custom agent framework) + MCP Connector (one-line MCP integration) form a full AI programming ecosystem.
- ASL-3 Safety: Opus 4 is Anthropic's first model deployed under ASL-3, the highest commercial AI safety standard to date.
💡 Tool Tip: Use the JSON Formatter to parse complex nested JSON responses from the Claude API, or browse the MCP Directory to discover MCP servers compatible with Claude.
Claude 4 Model Family Overview
The Claude 4 family is Anthropic's fourth-generation Large Language Model (LLM), consisting of two core models:
| Feature | Claude Opus 4 | Claude Sonnet 4 |
|---|---|---|
| Positioning | Flagship, complex long-running tasks | Cost-effective, everyday development |
| Release Date | 2025-05-23 | 2025-05-23 |
| Context Window | 200K tokens | 200K tokens |
| Max Output | 32K tokens | 64K tokens |
| API Pricing (Input/Output) | $15 / $75 per MTok | $3 / $15 per MTok |
| SWE-bench | 72.5% | 72.7% |
| Terminal-bench | 43.2% | 35.6% |
| GPQA Diamond | 79.6% | 77.2% |
| Safety Level | ASL-3 | ASL-2 |
| Extended Thinking | ✅ | ✅ |
| Parallel Tool Use | ✅ | ✅ |
Notably, Sonnet 4 actually edges out Opus 4 on SWE-bench (72.7% vs 72.5%)—a "small-beats-large" phenomenon. However, Opus 4 maintains a clear advantage on tasks requiring sustained deep reasoning (Terminal-bench: 43.2% vs 35.6%).
📝 Glossary Link: Transformer — Claude 4 is still built on the Transformer architecture, but with revolutionary improvements in attention mechanisms and reasoning strategies.
Core Technical Breakthroughs: Hybrid Reasoning and Autonomous Execution
Extended Thinking: A Revolution in AI Reasoning
Claude 4's most significant technical breakthrough is Hybrid Reasoning—the model dynamically switches between "fast response" and "deep thinking" modes based on task complexity.
Traditional LLMs operate like human "System 1" thinking—seeing a question and immediately generating an answer. Extended Thinking activates "System 2"—stopping to think thoroughly before acting.
Key features of Extended Thinking:
- Ultra-long reasoning chains: The model generates tens of thousands of internal reasoning tokens, enabling complex multi-step problem solving
- Mid-reasoning tool calls: Can invoke multiple tools in parallel (search, execute code, read files) during the thinking process—no need to "finish thinking" before "doing"
- Automatic summarization: ~5% of extremely long reasoning chains trigger a smaller model to logically summarize the chain, preventing context overflow
- Developer Mode: Full unsummarized reasoning chains available for debugging and analysis
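In practice, the reasoning and the final answer arrive as separate content blocks in the API response. A minimal sketch of splitting them apart, using plain dicts to stand in for the SDK's response objects (the block shapes mirror the documented `thinking`/`text` types; the helper name is ours):

```python
def split_reasoning(content_blocks):
    """Separate a response's reasoning from its final answer.

    Each block is a dict with a "type" key, mirroring the API's
    content-block shapes: "thinking" blocks carry the reasoning,
    "text" blocks carry the user-facing answer.
    """
    thinking = [b["thinking"] for b in content_blocks if b["type"] == "thinking"]
    answer = "".join(b["text"] for b in content_blocks if b["type"] == "text")
    return thinking, answer

# Simulated response content (no API call needed for the sketch)
blocks = [
    {"type": "thinking", "thinking": "Step 1: restate the problem..."},
    {"type": "text", "text": "Here is the optimized solution."},
]
reasoning, answer = split_reasoning(blocks)
print(len(reasoning))  # 1
print(answer)          # Here is the optimized solution.
```

The same pattern applies whether the chain is full-length (Developer Mode) or summarized: your parsing code only sees typed blocks.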
7-Hour Autonomous Execution: The Arrival of the AI Programmer
Opus 4's other milestone capability is sustained multi-hour autonomous task execution. In Anthropic's internal testing, Opus 4 coded autonomously for up to 7 hours, completing the full loop of writing code, running unit tests, fixing bugs, and committing to Git.
This isn't simple "API loop" automation. Claude 4 achieves true long-running autonomy through several mechanisms:
- Persistent memory files: When given file system access, Claude 4 proactively creates and updates "memory files" to persist critical context
- Goal-oriented planning: The model decomposes complex goals into subtasks, progressing incrementally while tracking overall progress
- Self-correction: When tests fail or compilation errors occur, it autonomously analyzes root causes and fixes them rather than blindly retrying
- 65% less reward hacking: Compared to Claude 3.7, Claude 4 is 65% less likely to take shortcuts (e.g., modifying test cases instead of fixing the actual code)
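The memory-file idea above can be sketched in a few lines. This is not Anthropic's implementation, only an illustration of the pattern: the agent persists its goal, subtasks, and progress to a file between turns so that state survives context-window limits and restarts (all names here are ours):

```python
import json
from pathlib import Path

class MemoryFile:
    """Minimal persistent memory for a long-running agent (illustrative)."""

    def __init__(self, path="agent_memory.json"):
        self.path = Path(path)

    def load(self):
        # Restore state from a previous run, or start fresh
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"goal": None, "subtasks": [], "done": []}

    def save(self, state):
        # Persist after every turn so progress survives restarts
        self.path.write_text(json.dumps(state, indent=2))

memory = MemoryFile("agent_memory.json")
state = memory.load()
state["goal"] = "Fix failing tests in repo"
state["subtasks"] = ["run tests", "locate bug", "patch", "re-run tests"]
state["done"].append("run tests")
memory.save(state)
print(memory.load()["done"])  # ['run tests']
```

A real agent would fold the loaded state back into its prompt each turn; the point is simply that the file, not the context window, is the source of truth.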
In one notable demo, Opus 4 was tasked with playing Pokémon Red. It autonomously created a "Navigation Guide" file to remember map layouts and objectives, running continuously for over 24 hours to complete game levels.
For developers, the implication is profound—your role is shifting from "engineer writing code line by line" to "project manager setting goals and acceptance criteria."
🔗 Further Reading: Want to understand AI reasoning mechanisms in depth? See Reasoning Models Deep Dive: OpenAI o1 & DeepSeek R1.
SWE-bench Showdown: Coding Benchmarks Compared
SWE-bench Verified is one of the most authoritative benchmarks for AI coding capability—it requires models to solve real GitHub repository issues, involving code comprehension, bug localization, patch writing, and test passing.
Cross-Model Comparison
| Model | SWE-bench | Terminal-bench | GPQA Diamond | MMLU | AIME 2024 | Pricing (Input/Output MTok) |
|---|---|---|---|---|---|---|
| Claude Opus 4 | 72.5% | 43.2% | 79.6% | 87.4% | 33.0%* | $15 / $75 |
| Claude Sonnet 4 | 72.7% | 35.6% | 77.2% | 85.6% | - | $3 / $15 |
| OpenAI o3 | 69.1% | - | 83.0% | - | 91.6% | $10 / $40 |
| OpenAI GPT-4.1 | 54.6% | 30.0% | 66.0% | 83.5% | - | $2 / $8 |
| Gemini 2.5 Pro | 63.2% | ~25% | ~83% | 85.8% | ~92% | $1.25 / $10 |
| DeepSeek R1 | 49.2% | - | 71.5% | 79.8% | 79.8% | $0.55 / $2.19 |
Note: Opus 4 with Extended Thinking can reach 75-90% on AIME.
Key observations:
Dominant in coding: The Claude 4 family leads decisively on SWE-bench and Terminal-bench. 72.5% means it can independently solve nearly three-quarters of real GitHub issues—unthinkable just a year ago.
Math is a weak spot: On math competition benchmarks like AIME 2024, Claude 4's default performance (33%) falls far behind Gemini 2.5 Pro (~92%) and o3 (91.6%). Extended Thinking narrows the gap significantly.
Best value: Sonnet 4 at $3/$15 achieves nearly identical coding performance to Opus 4 at one-fifth the price, and far outperforms the similarly priced GPT-4.1 on solve rate.
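The pricing gap is easy to quantify. A back-of-the-envelope sketch using the per-MTok rates from the comparison table (the token counts are hypothetical; here a typical agentic coding task is assumed to consume 50K input and 10K output tokens):

```python
# Per-million-token prices from the comparison table above
PRICES = {
    "opus-4":   {"input": 15.00, "output": 75.00},
    "sonnet-4": {"input": 3.00,  "output": 15.00},
    "gpt-4.1":  {"input": 2.00,  "output": 8.00},
}

def task_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the table's per-MTok rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical agentic task: 50K tokens in, 10K tokens out
for model in PRICES:
    print(f"{model}: ${task_cost(model, 50_000, 10_000):.2f}")
# opus-4: $1.50   sonnet-4: $0.30   gpt-4.1: $0.18
```

Raw per-request cost is only half the picture: a model with a higher solve rate wastes fewer paid retries on problems it cannot finish.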
Agentic Coding with Extended Thinking
With Extended Thinking and tool use enabled simultaneously, Claude 4's coding capability surges further:
| Model | SWE-bench (Agentic) |
|---|---|
| Claude Opus 4 + ET | 79.4% |
| Claude Sonnet 4 + ET | 80.2% |
| OpenAI Codex-1 | 72.1% |
| Gemini 2.5 Pro | 63.8% |
Sonnet 4 actually outperforms Opus 4 in the agentic setting (80.2% vs 79.4%)—likely because Sonnet's faster inference speed compounds into an advantage across multi-turn tool interactions.
🔗 Further Reading: Curious about the "small-beats-large" phenomenon? See Mixture of Experts (MoE) Architecture Explained.
Developer Toolchain: Claude Code, Agent SDK, and MCP Connector
Claude 4 isn't just a model upgrade—it marks Anthropic's transformation from "model provider" to "AI development platform." Three major developer tools were launched alongside Claude 4:
Claude Code: Terminal-Native Agent Programming
Claude Code is now GA (Generally Available). Unlike IDE-based plugins like GitHub Copilot, Claude Code runs directly in your terminal with full system access:
```bash
# Install Claude Code
npm install -g @anthropic-ai/claude-code

# Launch in your project directory
cd your-project
claude

# Claude Code autonomously:
# 1. Analyzes project structure and codebase
# 2. Understands your requirements
# 3. Writes code, runs tests
# 4. Fixes bugs, commits to Git
```
Core capabilities:
- Project understanding: Auto-scans codebase structure, dependencies, and coding conventions
- File operations: Read/write any file, create new modules
- Shell execution: Run build commands, test scripts, lint checks
- Git integration: Create branches, commit code, generate PR descriptions
- MCP integration: Connect to external data sources and tools via MCP protocol
Cursor's team commented: "Claude Opus 4 represents the state of the art in coding, achieving a leap in complex codebase comprehension."
Agent SDK: Build Custom AI Agents
The Claude Agent SDK is a framework built on top of Claude Code for rapidly creating custom AI Agents:
```python
from anthropic import Anthropic
from anthropic.agent import AgentLoop

# Initialize the Agent
client = Anthropic()
agent = AgentLoop(
    model="claude-opus-4-20250514",
    tools=[
        {"type": "computer_20250124", "display_width": 1024, "display_height": 768},
        {"type": "text_editor_20250124"},
        {"type": "bash_20250124"}
    ],
    system="You are a senior Python developer focused on code quality and test coverage."
)

# Run an agent task
result = agent.run(
    "Analyze this project's test coverage, identify uncovered critical paths, and write the missing unit tests."
)
print(result.output)
```
Key design principles:
- Tool composition: Built-in text editor, Bash terminal, and computer-use tools, plus custom MCP tool support
- Permission control: Fine-grained tool access via `allowedTools` and `disallowedTools`
- Observability: Complete execution logs and reasoning chains for debugging
- Human-in-the-loop: Pause at critical steps and request human confirmation
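The allow/deny semantics above can be sketched as a simple gate that an agent loop consults before executing any tool call. The parameter names mirror the SDK's `allowedTools`/`disallowedTools` options; the gate itself is our illustration (exact-match only, whereas the real SDK also supports glob-style patterns):

```python
def is_tool_permitted(tool_name, allowed_tools=None, disallowed_tools=None):
    """Deny-list wins over allow-list; no allow-list means allow everything."""
    if disallowed_tools and tool_name in disallowed_tools:
        return False
    if allowed_tools is not None:
        return tool_name in allowed_tools
    return True

# Agent may read, edit, and run tests, but never push to remotes
allowed = ["Read", "Edit", "Bash(pytest *)"]
disallowed = ["Bash(git push *)"]

print(is_tool_permitted("Read", allowed, disallowed))              # True
print(is_tool_permitted("Bash(git push *)", allowed, disallowed))  # False
print(is_tool_permitted("WebSearch", allowed, disallowed))         # False (not on allow-list)
```

Checking permissions before, rather than after, a tool call is what makes unsupervised multi-hour runs tolerable: the agent can explore freely inside a fence you drew in advance.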
MCP Connector: One-Line MCP Integration
MCP (Model Context Protocol) is Anthropic's open protocol for connecting AI models with external tools and data sources. Claude 4's MCP Connector simplifies this integration to its essence:
```python
import anthropic

client = anthropic.Anthropic()

# One-line configuration to connect a remote MCP server
# (the MCP connector is a beta feature, hence client.beta and the betas flag)
response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Query my GitHub repo's latest issues"}],
    # Add MCP servers directly in the API request
    mcp_servers=[
        {
            "type": "url",
            "url": "https://mcp.github.com/sse",
            "name": "github",
            "authorization_token": "github_pat_xxx"
        }
    ],
    betas=["mcp-client-2025-04-04"]
)
```
Previously, connecting to MCP servers required building your own MCP client, handling connection management and tool discovery. Now, the Anthropic API handles everything—just add a URL to your request, and you instantly access thousands of tools and data sources in the MCP ecosystem.
💡 Tool Tip: Browse the MCP Directory to discover available MCP servers, or use the AI Agent Directory to explore agents built with Claude.
🔗 Further Reading: For a deep dive into the MCP protocol architecture, see MCP Protocol Complete Guide.
Hands-On: API Integration and Code Examples
Python: Basic Call and Extended Thinking
```python
import anthropic

client = anthropic.Anthropic()

# Basic call
response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": "Analyze the performance bottleneck in this Python code and optimize it:\n\ndef find_duplicates(lst):\n    result = []\n    for i in range(len(lst)):\n        for j in range(i+1, len(lst)):\n            if lst[i] == lst[j] and lst[i] not in result:\n                result.append(lst[i])\n    return result"
        }
    ]
)
print(response.content[0].text)

# Enable Extended Thinking for deep reasoning
response_et = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Token budget for the thinking process
    },
    messages=[
        {
            "role": "user",
            "content": "Design a high-concurrency distributed task scheduler supporting priority queues, retry policies, dead letter handling, and horizontal scaling. Provide the full architecture and core code."
        }
    ]
)

# Parse thinking process and final answer
for block in response_et.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking[:200]}...")
    elif block.type == "text":
        print(f"Answer: {block.text}")
```
JavaScript/TypeScript: Streaming and Tool Use
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Streaming response — get output in real-time
async function streamResponse() {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: "Implement a type-safe EventBus in TypeScript with generic event types and wildcard listeners."
      }
    ]
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      process.stdout.write(event.delta.text);
    }
  }
}

// Tool use — let Claude call functions
async function toolUseExample() {
  const response = await client.messages.create({
    model: "claude-opus-4-20250514",
    max_tokens: 4096,
    tools: [
      {
        name: "execute_code",
        description: "Execute Python code in a sandbox and return the result",
        input_schema: {
          type: "object",
          properties: {
            code: { type: "string", description: "The Python code to execute" },
            timeout: { type: "number", description: "Timeout in seconds" }
          },
          required: ["code"]
        }
      }
    ],
    messages: [
      {
        role: "user",
        content: "Calculate the sum of the first 100 Fibonacci numbers and verify with code."
      }
    ]
  });

  console.log(JSON.stringify(response.content, null, 2));
}

streamResponse();
```
cURL: MCP Connector Integration
```bash
curl https://api.anthropic.com/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: mcp-client-2025-04-04" \
  -d '{
    "model": "claude-opus-4-20250514",
    "max_tokens": 1024,
    "mcp_servers": [
      {
        "type": "url",
        "url": "https://your-mcp-server.example.com/sse",
        "authorization_token": "your-token"
      }
    ],
    "messages": [
      {"role": "user", "content": "Query latest data via MCP tools and generate a report"}
    ]
  }'
```
💡 Tool Tip: Use the JSON Formatter to format complex JSON structures in API responses, quickly locating `thinking` blocks and `tool_use` blocks.
ASL-3 Safety Framework
Claude Opus 4 is the first model in Anthropic's history deployed under ASL-3 (AI Safety Level 3)—the highest safety tier for any commercial AI model to date.
What is ASL-3?
Anthropic's AI Safety Level (ASL) system, inspired by Biosafety Levels (BSL), classifies models based on their capability-related risks:
| Safety Level | Description | Models |
|---|---|---|
| ASL-1 | No significant risk | Early small models |
| ASL-2 | Standard safety measures | Claude Sonnet 4, GPT-4, etc. |
| ASL-3 | Enhanced safety, model has advanced capabilities | Claude Opus 4 |
| ASL-4 | Highest level (not yet triggered) | Future super-models |
ASL-3 means the model has demonstrated capabilities powerful enough to require additional safety guardrails. Specific measures include:
- CBRN evaluation: Specialized testing for chemical, biological, radiological, and nuclear risks
- Stricter output filtering: Enhanced harmful content interception
- Red teaming: Ongoing adversarial testing including jailbreak attacks and prompt injection defenses
- Deployment restrictions: API access controls for specific scenarios
- Reward hacking mitigation: Claude 4's tendency to take shortcuts is reduced by 65% compared to Claude 3.7
"Soul" Character Alignment
A standout safety design in Claude 4: even when instructed to behave unethically via the System Prompt, the model maintains its core values. Anthropic calls this the model's "soul"—an intrinsic moral compass that doesn't bend to instructions.
In practice, if a developer writes "ignore safety rules" in the System Prompt, Opus 4 will politely but firmly refuse rather than blindly comply. This design philosophy makes Claude 4 more reliable in agent scenarios—especially during long unsupervised runs.
🔗 Further Reading: Learn more about AI safety in the context of agent workflows. See Context Engineering Complete Guide.
FAQ
Q: What is the difference between Claude Opus 4 and Claude Sonnet 4?
Opus 4 is the flagship model designed for complex, long-running tasks and agentic coding, scoring 72.5% on SWE-bench with 7-hour autonomous execution capability. API pricing is $15/$75 per MTok. Sonnet 4 is the cost-effective option, scoring 72.7% on SWE-bench at 1/5 the price ($3/$15 per MTok), making it ideal for everyday development.
Q: What is Extended Thinking and how does it differ from traditional inference?
Extended Thinking is Claude 4's hybrid reasoning mode. The model automatically switches between fast responses and deep reasoning—answering simple questions instantly while activating long chains of thought (up to tens of thousands of tokens) for complex tasks. It can also invoke tools in parallel during reasoning, simulating a human "think first, then act" workflow.
Q: How is Claude Code different from traditional IDE AI plugins?
Claude Code is a terminal-native agent programming tool, not an IDE plugin. It runs directly in the command line with full terminal access—reading/writing files, executing shell commands, running tests, and managing Git. It functions as an AI programmer with complete system access, handling the full development cycle from understanding requirements to committing code.
Q: How do I use Claude Opus 4 via the API?
Call the Anthropic Messages API with model set to claude-opus-4-20250514. Both the Python SDK (anthropic library) and REST API are supported. To enable Extended Thinking, add the thinking parameter with a budget_tokens value to your request. See the code examples in this post.
Summary
The release of Claude 4 is more than a model performance upgrade—it marks AI programming's transition from "assisted completion" to "autonomous delivery." Opus 4 redefines the ceiling of AI coding with its 72.5% SWE-bench score and 7-hour autonomous execution; Sonnet 4 democratizes high-quality AI programming with nearly identical capability at 1/5 the price.
Meanwhile, the Claude Code + Agent SDK + MCP Connector toolchain enables developers to rapidly build complete workflows spanning terminal programming, custom agents, and external tool integration. The ASL-3 safety framework ensures these powerful capabilities operate within controlled boundaries.
For developers, now is the time to reassess your workflow. Claude 4 won't replace programmers, but programmers who master Claude 4 will outperform those who don't.
Related Resources
- JSON Formatter — Parse complex JSON responses from the Claude API
- MCP Directory — Discover MCP servers compatible with Claude
- AI Directory — Explore the latest AI development tools and models
- Agent Directory — Browse agents built on Claude
- Reasoning Models Deep Dive — Understand the reasoning architectures of o1 and DeepSeek R1
- MoE Architecture Explained — Learn about the Mixture of Experts architecture
- MCP Protocol Complete Guide — Deep dive into MCP protocol implementation
- Context Engineering Complete Guide — Master best practices for finite context windows