TL;DR: On May 23, 2025, Anthropic released Claude 4: Opus 4 and Sonnet 4. Opus 4 scores 72.5% on SWE-bench, can autonomously code for 7 hours straight, and is the first model deployed under ASL-3 safety standards. Sonnet 4 edges it out at 72.7% on SWE-bench at 1/5 the cost. Together with Claude Code, Agent SDK, and MCP Connector, Anthropic has built a complete AI programming ecosystem.

Key Takeaways

  • SWE-bench Leaders: Opus 4 scores 72.5%, crushing GPT-4.1 (54.6%) and Gemini 2.5 Pro (63.2%). Sonnet 4 edges even higher at 72.7% with 5x lower cost.
  • 7-Hour Autonomous Execution: Claude 4 can code, debug, test, and commit for hours without human supervision—a paradigm shift from "AI assistant" to "AI engineer."
  • Hybrid Reasoning: Seamless switching between fast responses and deep Extended Thinking mode, generating tens of thousands of reasoning tokens while invoking tools in parallel.
  • Complete Toolchain: Claude Code (terminal agent) + Agent SDK (custom agent framework) + MCP Connector (one-line MCP integration) form a full AI programming ecosystem.
  • ASL-3 Safety: Opus 4 is the first model Anthropic has deployed under ASL-3, its strictest safety standard applied to a production model to date.

💡 Tool Tip: Use the JSON Formatter to parse complex nested JSON responses from the Claude API, or browse the MCP Directory to discover MCP servers compatible with Claude.

Claude 4 Model Family Overview

The Claude 4 family is Anthropic's fourth-generation Large Language Model (LLM), consisting of two core models:

| Feature | Claude Opus 4 | Claude Sonnet 4 |
|---|---|---|
| Positioning | Flagship, complex long-running tasks | Cost-effective, everyday development |
| Release Date | 2025-05-23 | 2025-05-23 |
| Context Window | 200K tokens | 200K tokens |
| Max Output | 32K tokens | 16K tokens |
| API Pricing (Input / Output) | $15 / $75 per MTok | $3 / $15 per MTok |
| SWE-bench | 72.5% | 72.7% |
| Terminal-bench | 43.2% | 35.6% |
| GPQA Diamond | 79.6% | 77.2% |
| Safety Level | ASL-3 | ASL-2 |
| Extended Thinking | ✅ | ✅ |
| Parallel Tool Use | ✅ | ✅ |

Notably, Sonnet 4 actually edges out Opus 4 on SWE-bench (72.7% vs 72.5%)—a "small-beats-large" phenomenon. However, Opus 4 maintains a clear advantage on tasks requiring sustained deep reasoning (Terminal-bench: 43.2% vs 35.6%).

📝 Glossary Link: Transformer — Claude 4 is still built on the Transformer architecture, but with revolutionary improvements in attention mechanisms and reasoning strategies.

Core Technical Breakthroughs: Hybrid Reasoning and Autonomous Execution

Extended Thinking: A Revolution in AI Reasoning

Claude 4's most significant technical breakthrough is Hybrid Reasoning—the model dynamically switches between "fast response" and "deep thinking" modes based on task complexity.

Traditional LLMs operate like human "System 1" thinking—seeing a question and immediately generating an answer. Extended Thinking activates "System 2"—stopping to think thoroughly before acting.

mermaid
graph TD
    A["User Input"] --> B{"Assess Complexity"}
    B -->|"Simple Query"| C["Fast Response Mode"]
    B -->|"Complex Task"| D["Extended Thinking Mode"]
    D --> E["Generate Internal Reasoning Chain"]
    E --> F{"Need External Info?"}
    F -->|"Yes"| G["Parallel Tool Calls / Search"]
    G --> E
    F -->|"No"| H["Logical Summary Compression"]
    H --> I["Final Output"]
    C --> I

Key features of Extended Thinking:

  • Ultra-long reasoning chains: The model generates tens of thousands of internal reasoning tokens, enabling complex multi-step problem solving
  • Mid-reasoning tool calls: Can invoke multiple tools in parallel (search, execute code, read files) during the thinking process—no need to "finish thinking" before "doing"
  • Automatic summarization: ~5% of extremely long reasoning chains trigger a smaller model to logically summarize the chain, preventing context overflow
  • Developer Mode: Full unsummarized reasoning chains available for debugging and analysis
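
To make the mechanics concrete, the helper below assembles `messages.create` keyword arguments with Extended Thinking enabled. This is an illustrative sketch: the 1,024-token minimum budget and the budget-below-`max_tokens` constraint reflect my reading of Anthropic's documented limits, so verify them against the current API reference.

```python
# Sketch: building request kwargs for an Extended Thinking call.
# The validation limits are assumptions drawn from Anthropic's docs.

def thinking_request(prompt: str,
                     budget_tokens: int = 10_000,
                     max_tokens: int = 16_000) -> dict:
    """Return kwargs for client.messages.create with thinking enabled."""
    if budget_tokens < 1_024:
        raise ValueError("thinking budget must be at least 1,024 tokens")
    if budget_tokens >= max_tokens:
        raise ValueError("thinking budget must be below max_tokens")
    return {
        "model": "claude-opus-4-20250514",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

kwargs = thinking_request("Plan a migration from REST to gRPC.")
# Pass to the SDK as: client.messages.create(**kwargs)
```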

7-Hour Autonomous Execution: The Arrival of the AI Programmer

Opus 4's other milestone capability is sustained multi-hour autonomous task execution. In Anthropic's internal testing, Opus 4 coded autonomously for up to 7 hours, completing the full loop of writing code, running unit tests, fixing bugs, and committing to Git.

This isn't simple "API loop" automation. Claude 4 achieves true long-running autonomy through several mechanisms:

  1. Persistent memory files: When given file system access, Claude 4 proactively creates and updates "memory files" to persist critical context
  2. Goal-oriented planning: The model decomposes complex goals into subtasks, progressing incrementally while tracking overall progress
  3. Self-correction: When tests fail or compilation errors occur, it autonomously analyzes root causes and fixes them rather than blindly retrying
  4. 65% less reward hacking: Compared to Claude 3.7, Claude 4 is 65% less likely to take shortcuts (e.g., modifying test cases instead of fixing the actual code)
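
The persistent-memory pattern in point 1 can be sketched in a few lines. This is an illustrative sketch, not Anthropic's implementation: the model is simply given read/write access to a notes file, and the agent loop re-injects those notes into each turn's context (the file path, class, and method names here are all hypothetical).

```python
import json
from pathlib import Path

class MemoryFile:
    """Minimal sketch of an agent 'memory file': durable notes that
    survive across turns of a long-running session."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)

    def load(self) -> dict:
        # Start with an empty structure on first use
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"goals": [], "progress": [], "learned": []}

    def update(self, section: str, note: str) -> None:
        # Append a note and persist immediately, so a crash loses nothing
        memory = self.load()
        memory.setdefault(section, []).append(note)
        self.path.write_text(json.dumps(memory, indent=2))

# Each turn, the agent reloads its notes and prepends them to the prompt,
# so critical context survives even if the conversation window is truncated.
memory = MemoryFile("/tmp/agent_memory.json")
memory.update("progress", "Unit tests for parser module passing")
context_header = json.dumps(memory.load())
```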

In one notable demo, Opus 4 was tasked with playing Pokémon Red. It autonomously created a "Navigation Guide" file to remember map layouts and objectives, running continuously for over 24 hours to complete game levels.

For developers, the implication is profound—your role is shifting from "engineer writing code line by line" to "project manager setting goals and acceptance criteria."

🔗 Further Reading: Want to understand AI reasoning mechanisms in depth? See Reasoning Models Deep Dive: OpenAI o1 & DeepSeek R1.

SWE-bench Showdown: Coding Benchmarks Compared

SWE-bench Verified is one of the most authoritative benchmarks for AI coding capability—it requires models to solve real GitHub repository issues, involving code comprehension, bug localization, patch writing, and test passing.

Cross-Model Comparison

| Model | SWE-bench | Terminal-bench | GPQA Diamond | MMLU | AIME 2024 | Pricing (Input / Output per MTok) |
|---|---|---|---|---|---|---|
| Claude Opus 4 | 72.5% | 43.2% | 79.6% | 87.4% | 33.0%* | $15 / $75 |
| Claude Sonnet 4 | 72.7% | 35.6% | 77.2% | 85.6% | - | $3 / $15 |
| OpenAI o3 | 69.1% | - | 83.0% | - | 91.6% | $10 / $40 |
| OpenAI GPT-4.1 | 54.6% | 30.0% | 66.0% | 83.5% | - | $2 / $8 |
| Gemini 2.5 Pro | 63.2% | ~25% | ~83% | 85.8% | ~92% | $1.25 / $10 |
| DeepSeek R1 | 49.2% | - | 71.5% | 79.8% | 79.8% | $0.55 / $2.19 |

*Note: With Extended Thinking enabled, Opus 4 reaches 75-90% on AIME.

Key observations:

Dominant in coding: The Claude 4 family leads decisively on SWE-bench and Terminal-bench. 72.5% means it can independently solve nearly three-quarters of real GitHub issues—unthinkable just a year ago.

Math is a weak spot: On math competition benchmarks like AIME 2024, Claude 4's default performance (33%) falls far behind Gemini 2.5 Pro (~92%) and o3 (91.6%). Extended Thinking narrows the gap significantly.

Best value: At $3/$15, Sonnet 4 delivers nearly identical coding performance to Opus 4 at one-fifth the price, making it the strongest performance-per-dollar option in this table for coding workloads.
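
To make the one-fifth-the-price point concrete, here is a quick back-of-the-envelope calculator using the per-MTok prices from the table (the example token counts are arbitrary):

```python
# Per-MTok prices from the comparison table: (input, output) in USD
PRICES = {"claude-opus-4": (15.0, 75.0), "claude-sonnet-4": (3.0, 15.0)}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one run at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# A hypothetical agentic session: 2M input tokens, 400K output tokens
opus_cost = run_cost("claude-opus-4", 2_000_000, 400_000)      # $60.00
sonnet_cost = run_cost("claude-sonnet-4", 2_000_000, 400_000)  # $12.00
```

Because both input and output prices differ by the same factor, the total always comes out exactly 5x cheaper on Sonnet 4, regardless of the input/output mix.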

Agentic Coding with Extended Thinking

With Extended Thinking and tool use enabled simultaneously, Claude 4's coding capability surges further:

| Model | SWE-bench (Agentic) |
|---|---|
| Claude Opus 4 + Extended Thinking | 79.4% |
| Claude Sonnet 4 + Extended Thinking | 80.2% |
| OpenAI Codex-1 | 72.1% |
| Gemini 2.5 Pro | 63.8% |

Sonnet 4 actually outperforms Opus 4 in the agentic setting (80.2% vs 79.4%)—likely because Sonnet's faster inference speed compounds into an advantage across multi-turn tool interactions.

🔗 Further Reading: Curious about the "small-beats-large" phenomenon? See Mixture of Experts (MoE) Architecture Explained.

Developer Toolchain: Claude Code, Agent SDK, and MCP Connector

Claude 4 isn't just a model upgrade—it marks Anthropic's transformation from "model provider" to "AI development platform." Three major developer tools were launched alongside Claude 4:

Claude Code: Terminal-Native Agent Programming

Claude Code is now GA (Generally Available). Unlike IDE-based plugins like GitHub Copilot, Claude Code runs directly in your terminal with full system access:

bash
# Install Claude Code
npm install -g @anthropic-ai/claude-code

# Launch in your project directory
cd your-project
claude

# Claude Code autonomously:
# 1. Analyzes project structure and codebase
# 2. Understands your requirements
# 3. Writes code, runs tests
# 4. Fixes bugs, commits to Git

Core capabilities:

  • Project understanding: Auto-scans codebase structure, dependencies, and coding conventions
  • File operations: Read/write any file, create new modules
  • Shell execution: Run build commands, test scripts, lint checks
  • Git integration: Create branches, commit code, generate PR descriptions
  • MCP integration: Connect to external data sources and tools via MCP protocol

Cursor's team commented: "Claude Opus 4 represents the state of the art in coding, achieving a leap in complex codebase comprehension."

Agent SDK: Build Custom AI Agents

The Claude Agent SDK is a framework built on top of Claude Code for rapidly creating custom AI Agents:

python
from anthropic import Anthropic
from anthropic.agent import AgentLoop

# Initialize the Agent
client = Anthropic()
agent = AgentLoop(
    model="claude-opus-4-20250514",
    tools=[
        {"type": "computer_20250124", "display_width": 1024, "display_height": 768},
        {"type": "text_editor_20250124"},
        {"type": "bash_20250124"}
    ],
    system="You are a senior Python developer focused on code quality and test coverage."
)

# Run an agent task
result = agent.run(
    "Analyze this project's test coverage, identify uncovered critical paths, and write the missing unit tests."
)
print(result.output)

Key design principles:

  1. Tool composition: Built-in text editor, Bash terminal, and computer-use tools, plus custom MCP tool support
  2. Permission control: Fine-grained tool access via allowedTools and disallowedTools
  3. Observability: Complete execution logs and reasoning chains for debugging
  4. Human-in-the-loop: Pause at critical steps and request human confirmation

MCP Connector: One-Line MCP Integration

MCP (Model Context Protocol) is Anthropic's open protocol for connecting AI models with external tools and data sources. Claude 4's MCP Connector simplifies this integration to its essence:

python
import anthropic

client = anthropic.Anthropic()

# One-line configuration to connect a remote MCP server
response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Query my GitHub repo's latest issues"}],
    # MCP Connector is a beta feature; pass the beta header with the request
    extra_headers={"anthropic-beta": "mcp-client-2025-04-04"},
    # Add MCP servers directly in the API request
    mcp_servers=[
        {
            "type": "url",
            "url": "https://mcp.github.com/sse",
            "authorization_token": "github_pat_xxx"
        }
    ]
)

Previously, connecting to MCP servers required building your own MCP client, handling connection management and tool discovery. Now, the Anthropic API handles everything—just add a URL to your request, and you instantly access thousands of tools and data sources in the MCP ecosystem.
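
For multiple data sources, the `mcp_servers` list simply grows. The helper below assembles the request kwargs; the `mcp-client-2025-04-04` beta flag is my understanding of the required header at the time of writing, and the server URLs are placeholders, so verify both against the current docs.

```python
# Sketch: assembling messages.create kwargs that attach remote MCP servers.
# The beta header value is an assumption; check Anthropic's documentation.

def mcp_request(prompt: str, servers: list[dict],
                model: str = "claude-opus-4-20250514") -> dict:
    """Build request kwargs that route tool calls through MCP servers."""
    for server in servers:
        # Each remote server entry needs a type and a URL
        assert server.get("type") == "url" and "url" in server
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
        "mcp_servers": servers,
        "extra_headers": {"anthropic-beta": "mcp-client-2025-04-04"},
    }

kwargs = mcp_request(
    "Summarize open issues across both repos",
    servers=[
        {"type": "url", "url": "https://mcp.example.com/github/sse"},
        {"type": "url", "url": "https://mcp.example.com/jira/sse"},
    ],
)
```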

💡 Tool Tip: Browse the MCP Directory to discover available MCP servers, or use the AI Agent Directory to explore agents built with Claude.

🔗 Further Reading: For a deep dive into the MCP protocol architecture, see MCP Protocol Complete Guide.

Hands-On: API Integration and Code Examples

Python: Basic Call and Extended Thinking

python
import anthropic

client = anthropic.Anthropic()

# Basic call
response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": "Analyze the performance bottleneck in this Python code and optimize it:\n\ndef find_duplicates(lst):\n    result = []\n    for i in range(len(lst)):\n        for j in range(i+1, len(lst)):\n            if lst[i] == lst[j] and lst[i] not in result:\n                result.append(lst[i])\n    return result"
        }
    ]
)
print(response.content[0].text)

# Enable Extended Thinking for deep reasoning
response_et = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Token budget for the thinking process
    },
    messages=[
        {
            "role": "user",
            "content": "Design a high-concurrency distributed task scheduler supporting priority queues, retry policies, dead letter handling, and horizontal scaling. Provide the full architecture and core code."
        }
    ]
)

# Parse thinking process and final answer
for block in response_et.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking[:200]}...")
    elif block.type == "text":
        print(f"Answer: {block.text}")

JavaScript/TypeScript: Streaming and Tool Use

javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Streaming response — get output in real-time
async function streamResponse() {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: "Implement a type-safe EventBus in TypeScript with generic event types and wildcard listeners."
      }
    ]
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      process.stdout.write(event.delta.text);
    }
  }
}

// Tool use — let Claude call functions
async function toolUseExample() {
  const response = await client.messages.create({
    model: "claude-opus-4-20250514",
    max_tokens: 4096,
    tools: [
      {
        name: "execute_code",
        description: "Execute Python code in a sandbox and return the result",
        input_schema: {
          type: "object",
          properties: {
            code: { type: "string", description: "The Python code to execute" },
            timeout: { type: "number", description: "Timeout in seconds" }
          },
          required: ["code"]
        }
      }
    ],
    messages: [
      {
        role: "user",
        content: "Calculate the sum of the first 100 Fibonacci numbers and verify with code."
      }
    ]
  });

  console.log(JSON.stringify(response.content, null, 2));
}

streamResponse();

cURL: MCP Connector Integration

bash
curl https://api.anthropic.com/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: mcp-client-2025-04-04" \
  -d '{
    "model": "claude-opus-4-20250514",
    "max_tokens": 1024,
    "mcp_servers": [
      {
        "type": "url",
        "url": "https://your-mcp-server.example.com/sse",
        "authorization_token": "your-token"
      }
    ],
    "messages": [
      {"role": "user", "content": "Query latest data via MCP tools and generate a report"}
    ]
  }'

💡 Tool Tip: Use the JSON Formatter to format complex JSON structures in API responses, quickly locating thinking blocks and tool_use blocks.

ASL-3 Safety Framework

Claude Opus 4 is the first model in Anthropic's history deployed under ASL-3 (AI Safety Level 3), the company's strictest safety tier applied to a production model to date.

What is ASL-3?

Anthropic's AI Safety Level (ASL) system, inspired by Biosafety Levels (BSL), classifies models based on their capability-related risks:

| Safety Level | Description | Models |
|---|---|---|
| ASL-1 | No significant risk | Early small models |
| ASL-2 | Standard safety measures | Claude Sonnet 4, GPT-4, etc. |
| ASL-3 | Enhanced safety; model has advanced capabilities | Claude Opus 4 |
| ASL-4 | Highest level (not yet triggered) | Future frontier models |

ASL-3 means the model has demonstrated capabilities powerful enough to require additional safety guardrails. Specific measures include:

  • CBRN evaluation: Specialized testing for chemical, biological, radiological, and nuclear risks
  • Stricter output filtering: Enhanced harmful content interception
  • Red teaming: Ongoing adversarial testing including jailbreak attacks and prompt injection defenses
  • Deployment restrictions: API access controls for specific scenarios
  • Reward hacking mitigation: Claude 4's tendency to take shortcuts is reduced by 65% compared to Claude 3.7

"Soul" Character Alignment

A standout safety design in Claude 4: even when instructed to behave unethically via the System Prompt, the model maintains its core values. Anthropic calls this the model's "soul"—an intrinsic moral compass that doesn't bend to instructions.

In practice, if a developer writes "ignore safety rules" in the System Prompt, Opus 4 will politely but firmly refuse rather than blindly comply. This design philosophy makes Claude 4 more reliable in agent scenarios—especially during long unsupervised runs.

🔗 Further Reading: Learn more about AI safety in the context of agent workflows. See Context Engineering Complete Guide.

FAQ

Q: What is the difference between Claude Opus 4 and Claude Sonnet 4?

Opus 4 is the flagship model designed for complex, long-running tasks and agentic coding, scoring 72.5% on SWE-bench with 7-hour autonomous execution capability. API pricing is $15/$75 per MTok. Sonnet 4 is the cost-effective option, scoring 72.7% on SWE-bench at 1/5 the price ($3/$15 per MTok), making it ideal for everyday development.

Q: What is Extended Thinking and how does it differ from traditional inference?

Extended Thinking is Claude 4's hybrid reasoning mode. The model automatically switches between fast responses and deep reasoning—answering simple questions instantly while activating long chains of thought (up to tens of thousands of tokens) for complex tasks. It can also invoke tools in parallel during reasoning, simulating a human "think first, then act" workflow.

Q: How is Claude Code different from traditional IDE AI plugins?

Claude Code is a terminal-native agent programming tool, not an IDE plugin. It runs directly in the command line with full terminal access—reading/writing files, executing shell commands, running tests, and managing Git. It functions as an AI programmer with complete system access, handling the full development cycle from understanding requirements to committing code.

Q: How do I use Claude Opus 4 via the API?

Call the Anthropic Messages API with model set to claude-opus-4-20250514. Both the Python SDK (anthropic library) and REST API are supported. To enable Extended Thinking, add the thinking parameter with a budget_tokens value to your request. See the code examples in this post.

Summary

The release of Claude 4 is more than a model performance upgrade—it marks AI programming's transition from "assisted completion" to "autonomous delivery." Opus 4 redefines the ceiling of AI coding with its 72.5% SWE-bench score and 7-hour autonomous execution; Sonnet 4 democratizes high-quality AI programming with nearly identical capability at 1/5 the price.

Meanwhile, the Claude Code + Agent SDK + MCP Connector toolchain enables developers to rapidly build complete workflows spanning terminal programming, custom agents, and external tool integration. The ASL-3 safety framework ensures these powerful capabilities operate within controlled boundaries.

For developers, now is the time to reassess your workflow. Claude 4 won't replace programmers, but programmers who master Claude 4 will outperform those who don't.