TL;DR
GPT-5.5 (codename "Spud"), released April 23, 2026, is OpenAI's first fully retrained base model since GPT-4.5. It introduces a natively omnimodal architecture that processes text, images, audio, and video in a single unified model rather than a pipeline of specialists. It is built on a Sparse Mixture-of-Experts (MoE) design whose dynamic routing activates only 8–15% of expert modules per token, and it was co-designed for NVIDIA GB200/GB300 NVL72 rack-scale hardware. GPT-5.5 offers a 1.05M-token context window and scores 85.0% on ARC-AGI-2 and 93.6% on GPQA Diamond. Its three-layer agentic architecture (Planner → Executor → Reflector) with Dynamic Inference Pathways makes it OpenAI's flagship model for autonomous AI agents, though it still trails Claude Opus 4 on pure coding benchmarks.
Table of Contents
- TL;DR
- Key Takeaways
- GPT-5 Family Overview
- Architecture: Sparse MoE with Dynamic Activation
- Natively Omnimodal: One Model, All Modalities
- Context Window: 1 Million Tokens in Production
- Agentic Architecture: Planner-Executor-Reflector
- Dynamic Inference Pathways and Reasoning Effort
- Benchmark Analysis
- Hardware Co-Design: NVIDIA GB200/GB300 NVL72
- API Integration: Python and JavaScript Examples
- Pricing and Economics
- GPT-5.5 vs Claude Opus 4: Head-to-Head
- FAQ
- Summary
- Related Resources
Key Takeaways
- First Full Retrain Since GPT-4.5: All GPT-5.1 through 5.4 models were post-training iterations. GPT-5.5 is a ground-up retrained base model with fundamentally new architecture.
- Sparse MoE at Scale: Only 8–15% of expert modules activate per inference token, enabling massive total parameter counts with manageable compute costs.
- True Omnimodal: Text, image, audio, and video are processed in a single unified architecture—no separate encoder pipelines stitched together.
- 1.05M Token Context: Among the longest production context windows available, with MRCR v2 accuracy at 1M tokens jumping from 36.6% (GPT-5.4) to 74.0%.
- Agentic-First Design: A three-layer Planner → Executor → Reflector architecture with real-time reasoning visibility through Dynamic Inference Pathways.
- Price/Performance Trade-off: At $5/$30 per MTok (input/output), it's 2x GPT-5.4's cost—justified by substantially better quality across all benchmarks.
GPT-5 Family Overview
The GPT-5 family represents OpenAI's most differentiated model lineup. Rather than a single model, OpenAI ships five tiers designed for distinct use cases.
| Model | Target Use Case | Context | Key Strength |
|---|---|---|---|
| GPT-5 Nano | Edge / mobile | 32K | Latency < 50ms, on-device |
| GPT-5 Mini | Cost-sensitive apps | 128K | 90% quality at 10% cost |
| GPT-5 (Standard) | General-purpose | 256K | Balanced performance |
| GPT-5.5 | Agent flagship | 1.05M | Agentic reasoning, omnimodal |
| GPT-5 Ultra | Research / compute-intensive | 1.05M | Maximum quality, no cost limit |
GPT-5.5 occupies the "agent flagship" position—it is specifically optimized for multi-step autonomous workflows where a model needs to plan, execute, observe, and self-correct over extended interactions. Its knowledge cutoff is December 1, 2025.
Architecture: Sparse MoE with Dynamic Activation
GPT-5.5's core innovation is its Sparse Mixture-of-Experts architecture with dynamic activation routing. Unlike dense Transformer models where every parameter participates in every forward pass, GPT-5.5 activates only 8–15% of its expert modules for each inference token.
How Dynamic Activation Works
The router network makes a per-token decision about which expert modules to activate. This differs from earlier MoE implementations (like Mixtral's fixed top-2 routing) in three ways:
- Variable expert count: The number of active experts varies between 8% and 15% depending on input complexity—simple tokens route to fewer experts, ambiguous tokens activate more.
- Cross-modal routing: The same routing mechanism works across text, image, audio, and video tokens, allowing experts to specialize by modality or cross-modal reasoning.
- Load balancing via auxiliary loss: An auxiliary loss term added during training prevents expert collapse (where all tokens route to the same few experts).
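The article does not say how the router decides how many experts to activate. As one hypothetical way to get a variable count, the sketch below adds experts in order of router probability until they cover a target share of the probability mass, clamped to the 8–15% band. The `mass` threshold, the clamping scheme, and all function names are assumptions made for illustration, not OpenAI's implementation.

```python
import math

def softmax(logits):
    """Convert raw router scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, mass=0.5, min_frac=0.08, max_frac=0.15):
    """Pick a variable number of experts for one token (hypothetical scheme).

    Experts are taken in order of router probability until their cumulative
    probability reaches `mass`, but never fewer than min_frac or more than
    max_frac of all experts.
    """
    probs = softmax(router_logits)
    n = len(probs)
    k_min = max(1, round(min_frac * n))
    k_max = max(k_min, round(max_frac * n))
    ranked = sorted(range(n), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for idx in ranked:
        if len(chosen) >= k_max:
            break
        if len(chosen) >= k_min and covered >= mass:
            break
        chosen.append(idx)
        covered += probs[idx]
    return chosen

# A token the router is confident about stops at the floor;
# a token with a flat distribution runs up to the ceiling.
confident = [10.0] + [0.0] * 99
ambiguous = [0.0] * 100
print(len(route_token(confident)), len(route_token(ambiguous)))  # → 8 15
```

With 100 experts, the confident token uses the minimum of 8 experts and the ambiguous one uses the maximum of 15, which mirrors the "simple tokens route to fewer experts" behavior described above.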
| Property | GPT-4 (Dense) | Mixtral 8x22B | GPT-5.5 (Dynamic MoE) |
|---|---|---|---|
| Active params per token | 100% | ~28% (2 of 8 experts) | 8–15% (dynamic) |
| Routing strategy | N/A | Fixed top-2 | Learned dynamic |
| Cross-modal routing | No | No | Yes |
| Expert specialization | N/A | Layer-level | Token-level |
This architecture means GPT-5.5's total parameter count is massive, but the actual compute per inference step remains tractable—a critical factor for serving 1M token contexts at acceptable latencies.
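To make the load-balancing idea concrete, here is a minimal sketch of one widely used auxiliary loss, the Switch Transformer-style formulation: the number of experts times the sum, over experts, of the fraction of tokens sent to that expert multiplied by its mean router probability. The article does not state which loss GPT-5.5 uses, so treat this as a stand-in; the function name and the top-1 assignment format are assumptions.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss for top-1 routing.

    router_probs: one probability list (length num_experts) per token.
    assignments:  the expert index each token was dispatched to.
    Returns 1.0 for perfectly uniform routing and grows toward num_experts
    as routing collapses onto a single expert.
    """
    tokens = len(assignments)
    frac_tokens = [0.0] * num_experts  # share of tokens dispatched to each expert
    mean_prob = [0.0] * num_experts    # average router probability per expert
    for probs, expert in zip(router_probs, assignments):
        frac_tokens[expert] += 1.0 / tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / tokens
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)
print(balanced, collapsed)  # → 1.0 4.0
```

Minimizing this term alongside the main training loss pushes the router toward spreading tokens evenly, which is the behavior that prevents expert collapse.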
Natively Omnimodal: One Model, All Modalities
GPT-5.5 processes text, images, audio, and video within a single unified architecture. This is architecturally distinct from pipeline approaches where separate encoders feed into a Large Language Model backbone.
Why Unified Matters
In pipeline architectures, cross-modal reasoning is limited by the information bottleneck between encoder outputs and the language model. A vision encoder compresses an image into a fixed representation before the language model sees it.
In GPT-5.5's unified design:
- Image patches, audio frames, and video segments are tokenized into the same embedding space as text
- Expert modules can specialize in cross-modal patterns (e.g., "audio that contradicts what's shown on screen")
- Attention operates across all modalities simultaneously—no information is lost at interface boundaries
- The model can generate outputs in any modality without separate decoder heads
This is why GPT-5.5 achieves significant improvements on tasks requiring tight cross-modal reasoning, such as video understanding with temporal audio alignment.
Context Window: 1 Million Tokens in Production
GPT-5.5 delivers a production context window of approximately 1.05 million tokens via API, with 400K tokens available in OpenAI's Codex environment.
| Specification | Value |
|---|---|
| Max total context | ~1.05M tokens |
| Max input tokens | 922K |
| Max output tokens | 128K |
| Codex context | 400K |
| MRCR v2 at 1M (accuracy) | 74.0% |
| MRCR v2 at 1M (GPT-5.4) | 36.6% |
The jump from 36.6% to 74.0% on MRCR v2 (Multi-Round Coreference Resolution) at 1M tokens is the most dramatic improvement in the GPT-5.5 release. The model now retrieves and reasons over information placed anywhere in a million-token context roughly three times out of four, where GPT-5.4 succeeded barely one time in three.
Practical Implications
With 922K input tokens, you can fit:
- An entire medium-sized codebase (~50,000 lines of code)
- A full technical book (300+ pages)
- Months of conversation history for persistent agents
- Complete API documentation sets for complex integrations
The 128K output limit means GPT-5.5 can generate entire file sets, comprehensive reports, or complete refactoring patches in a single response.
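Because input and output have separate caps, it can help to check a request against both before sending it. The sketch below uses the 922K and 128K limits from the table above; the function name is hypothetical, and token counts are assumed to come from whatever tokenizer you already use.

```python
MAX_INPUT_TOKENS = 922_000   # from the specification table
MAX_OUTPUT_TOKENS = 128_000  # from the specification table

def check_budget(input_tokens, output_tokens):
    """Return a list of problems; an empty list means the request fits."""
    problems = []
    if input_tokens > MAX_INPUT_TOKENS:
        over = input_tokens - MAX_INPUT_TOKENS
        problems.append(f"input exceeds limit by {over} tokens")
    if output_tokens > MAX_OUTPUT_TOKENS:
        over = output_tokens - MAX_OUTPUT_TOKENS
        problems.append(f"output exceeds limit by {over} tokens")
    return problems

print(check_budget(500_000, 50_000))  # → []
print(check_budget(950_000, 10_000))  # → ['input exceeds limit by 28000 tokens']
```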
Agentic Architecture: Planner-Executor-Reflector
GPT-5.5 is designed from the ground up for agentic workflows. Its internal architecture implements a three-layer reasoning loop.
The Three Layers
1. Planner: Decomposes a high-level task into an ordered sequence of subtasks. The planner has access to the full context and can reason about dependencies, resource constraints, and potential failure modes.
2. Executor: Carries out each subtask by generating code, calling tools, or producing intermediate outputs. The executor operates with focused attention on the current subtask while maintaining awareness of the overall plan.
3. Reflector: Evaluates executor outputs against expected outcomes. It detects errors, identifies when a subtask needs retry with a different approach, and determines when to escalate back to the planner for re-planning.
This architecture enables GPT-5.5 to handle complex multi-step workflows—like debugging a distributed system or refactoring a large codebase—without human intervention at each step.
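The three layers run inside the model, but the same control flow is easy to picture as an external loop. The sketch below is only an illustration of the pattern: `plan`, `execute`, and `reflect` are hypothetical callables you would supply, and where the article describes escalation back to the planner, this simplified version just raises an error once retries are exhausted.

```python
def run_agent(task, plan, execute, reflect, max_retries=2):
    """Drive a plan -> execute -> reflect loop.

    plan(task)                -> ordered list of subtasks
    execute(subtask, attempt) -> result of one attempt
    reflect(subtask, result)  -> True if the result is acceptable
    """
    results = []
    for subtask in plan(task):
        for attempt in range(max_retries + 1):
            result = execute(subtask, attempt)
            if reflect(subtask, result):
                results.append(result)
                break  # subtask accepted, move on to the next one
        else:
            # No attempt passed reflection. A fuller version would re-plan here.
            raise RuntimeError(f"subtask failed after retries: {subtask}")
    return results

# Toy run: subtask "b" only passes reflection on its second attempt.
outcome = run_agent(
    "demo",
    plan=lambda task: ["a", "b"],
    execute=lambda subtask, attempt: f"{subtask}-{attempt}",
    reflect=lambda subtask, result: subtask == "a" or result.endswith("1"),
)
print(outcome)  # → ['a-0', 'b-1']
```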
Dynamic Inference Pathways and Reasoning Effort
GPT-5.5 introduces Dynamic Inference Pathways with real-time reasoning visibility and configurable reasoning effort levels.
Five Reasoning Effort Levels
| Level | Behavior | Use Case | Relative Cost |
|---|---|---|---|
| none | Direct response, no chain-of-thought | Lookups, classification | 0.3x |
| low | Brief internal reasoning | Simple Q&A, translation | 0.6x |
| medium (default) | Standard reasoning chain | General tasks | 1.0x |
| high | Extended reasoning with self-verification | Complex analysis, coding | 2.5x |
| xhigh | Maximum compute, parallel reasoning paths | Research, novel problems | 5.0x+ |
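The relative-cost column lends itself to a quick back-of-the-envelope estimate of how a traffic mix affects spend. The sketch below uses the multipliers listed for each level; it assumes requests are otherwise similar in size and treats the 5.0x+ figure for xhigh as a lower bound. The function name is hypothetical.

```python
# Relative cost per request, taken from the effort-level table (xhigh is a floor).
EFFORT_MULTIPLIER = {
    "none": 0.3,
    "low": 0.6,
    "medium": 1.0,
    "high": 2.5,
    "xhigh": 5.0,
}

def relative_spend(request_counts):
    """Total spend in units of one medium-effort request."""
    return sum(EFFORT_MULTIPLIER[level] * count
               for level, count in request_counts.items())

mixed = relative_spend({"none": 10, "high": 2})  # route lookups to "none"
all_high = relative_spend({"high": 12})          # run everything at "high"
print(f"{mixed:.1f} vs {all_high:.1f}")  # → 8.0 vs 30.0
```

In this toy mix, sending the ten simple lookups through `none` instead of `high` cuts the estimated spend from 30 units to 8.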
GPT-5.5 Pro goes further with parallel test-time compute—running multiple reasoning paths simultaneously and selecting the best outcome. This is analogous to best-of-N sampling but applied to structured reasoning chains rather than raw token generation.
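For readers unfamiliar with best-of-N, the core idea fits in a few lines: produce several candidates independently, score each, and keep the winner. The sketch below is a generic illustration using a thread pool, not a description of how GPT-5.5 Pro is implemented; `generate` and `score` are hypothetical callables you would supply.

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(generate, score, n=4):
    """Run generate(seed) for n seeds in parallel and keep the best-scoring result."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, range(n)))
    return max(candidates, key=score)

# Toy example: each "reasoning path" is just a number, and higher is better.
best = best_of_n(generate=lambda seed: (seed * seed) % 7, score=lambda x: x, n=5)
print(best)  # → 4
```

The trade-off is the one the cost table implies: n candidates cost roughly n times the compute of a single pass, in exchange for a better chance that at least one path is correct.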
Real-Time Reasoning Visibility
Unlike previous "black box" models, GPT-5.5 exposes its reasoning process through the API. You can observe the model's planning steps, tool calls, and self-correction in real time—critical for debugging agentic workflows and building trust in autonomous systems.
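from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream with reasoning visibility.
# `include_reasoning` and `reasoning_content` are the names used in this
# article; check them against the API reference for your SDK version.
stream = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are a senior software architect."},
        {"role": "user", "content": "Refactor this microservice to use event sourcing."},
    ],
    reasoning_effort="high",
    stream=True,
    stream_options={"include_reasoning": True},
)

for chunk in stream:
    if not chunk.choices:
        continue  # some chunks carry no choices (for example, usage-only chunks)
    delta = chunk.choices[0].delta
    # getattr guards against SDK versions whose delta lacks this field
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"[THINKING] {reasoning}")
    elif delta.content:
        print(delta.content, end="")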
Benchmark Analysis
GPT-5.5 sets new records on reasoning and scientific benchmarks while showing competitive but not dominant performance on pure coding tasks.
| Benchmark | GPT-5.4 | GPT-5.5 | Claude Opus 4 | Best in Class |
|---|---|---|---|---|
| ARC-AGI-2 | 71.2% | 85.0% | 79.3% | GPT-5.5 |
| GPQA Diamond | 87.4% | 93.6% | 89.1% | GPT-5.5 |
| Terminal-Bench 2.0 | 68.5% | 82.7% | 78.2% | GPT-5.5 |
| MRCR v2 (1M tokens) | 36.6% | 74.0% | N/A (200K max) | GPT-5.5 |
| SWE-bench Pro | 58.1% | 61.8% | 64.3% | Claude Opus 4 |
| HumanEval+ | 94.2% | 96.1% | 97.0% | Claude Opus 4 |
Key Observations
- Reasoning dominance: GPT-5.5 crushes the competition on ARC-AGI-2 (abstract reasoning) and GPQA Diamond (PhD-level science questions).
- Long-context breakthrough: The 2x improvement on MRCR v2 at 1M tokens is unprecedented—this is a different class of capability.
- Coding gap persists: Claude Opus 4 maintains its lead on SWE-bench Pro (64.3% vs 61.8%), particularly on real-world multi-file code modifications.
- Terminal mastery: Terminal-Bench 2.0 at 82.7% shows GPT-5.5's strength in shell-based problem solving—critical for agentic DevOps workflows.
Hardware Co-Design: NVIDIA GB200/GB300 NVL72
GPT-5.5 was co-designed with NVIDIA's latest GB200 and GB300 NVL72 rack-scale systems. This is not just "running on NVIDIA GPUs"—the model architecture was specifically optimized for these systems' communication topology.
Why Co-Design Matters
- Expert placement: MoE experts are distributed across GPUs to minimize cross-node communication. Frequently co-activated experts are placed on the same NVLink domain.
- KV-cache distribution: At 1M token contexts, the KV-cache is too large for a single GPU. It's sharded across the NVL72 rack with optimized attention patterns.
- Dynamic routing efficiency: The per-token routing decision, and the transfer of activations to the selected experts, must be fast enough not to bottleneck inference. NVLink 5's 1.8 TB/s of per-GPU bandwidth enables this.
- Inference parallelism: GPT-5.5 Pro's parallel test-time compute runs multiple reasoning paths simultaneously across different GPU subsets within the same rack.
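The KV-cache point is easy to check with arithmetic. The cache stores one key and one value vector per layer, per KV head, per token position, so its size is 2 × layers × KV heads × head dimension × sequence length × bytes per value. OpenAI has not published GPT-5.5's dimensions, so the layer count, head count, and head size below are purely hypothetical placeholders chosen to show the order of magnitude.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV-cache for one sequence, in bytes.

    The leading factor of 2 covers storing both a key and a value vector
    at every position; bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape (NOT published figures), full 1.05M-token context.
size = kv_cache_bytes(seq_len=1_050_000, layers=96, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # → 412.9 GB
```

Even with these modest made-up dimensions, a single full-length sequence needs hundreds of gigabytes of cache, which is consistent with the article's point that the cache has to be sharded across the rack.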
API Integration: Python and JavaScript Examples
Python: Basic Completion with Reasoning Effort
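from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {
            "role": "system",
            "content": "You are an expert data engineer.",
        },
        {
            "role": "user",
            # Chat Completions messages have no "attachments" field, so the
            # uploaded file is passed as a content part instead.
            # "file-abc123" is a placeholder file ID.
            "content": [
                {
                    "type": "text",
                    "text": "Analyze this 500K-line JSON log file and identify the root cause of the latency spike at 14:32 UTC.",
                },
                {"type": "file", "file": {"file_id": "file-abc123"}},
            ],
        },
    ],
    reasoning_effort="high",
    max_completion_tokens=16384,
)

print(response.choices[0].message.content)
# Reasoning tokens are reported under completion_tokens_details
details = response.usage.completion_tokens_details
print(f"Reasoning tokens used: {details.reasoning_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")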
Python: Agentic Workflow with Tool Use
from openai import OpenAI

client = OpenAI()

# Two tools the agent may call: a shell runner and a file reader
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_terminal_command",
            "description": "Execute a shell command and return output",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string", "description": "The shell command to run"}
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read file contents from the codebase",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path relative to repo root"}
                },
                "required": ["path"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are an autonomous DevOps agent. Diagnose and fix issues."},
        {"role": "user", "content": "The CI pipeline is failing on the integration tests. Investigate and fix."},
    ],
    tools=tools,
    reasoning_effort="high",
    tool_choice="auto",
)

# Print the tool calls the model requested. A complete agent would execute
# each call, send the result back, and repeat until no more calls are made.
for choice in response.choices:
    if choice.message.tool_calls:
        for tool_call in choice.message.tool_calls:
            print(f"Agent action: {tool_call.function.name}({tool_call.function.arguments})")
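The example above stops at printing the requested tool calls. To close the loop, each call has to be executed locally and its output returned to the model as a message with role "tool" and the matching tool_call_id. The sketch below covers only the local half: a small dispatcher that maps a tool name and its JSON arguments to a Python function. The `registry` argument and the handler shown are hypothetical stand-ins, and a real `run_terminal_command` handler would need sandboxing before running model-generated commands.

```python
import json

def dispatch_tool_call(name, arguments_json, registry):
    """Execute one requested tool call and return a JSON string for the reply.

    name:           function name from tool_call.function.name
    arguments_json: raw JSON string from tool_call.function.arguments
    registry:       dict mapping tool names to Python callables
    """
    handler = registry.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json)
        return json.dumps({"result": handler(**kwargs)})
    except (TypeError, ValueError) as exc:
        # ValueError covers malformed JSON; TypeError covers wrong argument names
        return json.dumps({"error": str(exc)})

# Stub handler for illustration; a real one would read from the repository.
registry = {"read_file": lambda path: f"contents of {path}"}

print(dispatch_tool_call("read_file", '{"path": "ci.yml"}', registry))
# → {"result": "contents of ci.yml"}
```

Returning errors as data rather than raising lets the model see what went wrong and try a different call, which fits the reflect-and-retry behavior described earlier.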
JavaScript: Streaming with Reasoning Visibility
import OpenAI from 'openai';

const openai = new OpenAI();

async function analyzeWithReasoning(prompt) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-5.5',
    messages: [
      { role: 'system', content: 'You are a security researcher.' },
      { role: 'user', content: prompt }
    ],
    reasoning_effort: 'high',
    stream: true,
    stream_options: { include_reasoning: true }
  });

  let reasoning = '';
  let response = '';

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta;
    if (delta?.reasoning_content) {
      // Print reasoning in gray so it is visually distinct from the answer
      reasoning += delta.reasoning_content;
      process.stdout.write(`\x1b[90m${delta.reasoning_content}\x1b[0m`);
    } else if (delta?.content) {
      response += delta.content;
      process.stdout.write(delta.content);
    }
  }

  return { reasoning, response };
}

// Usage: Analyze a large codebase for vulnerabilities
const result = await analyzeWithReasoning(
  'Review this authentication module for security vulnerabilities: ...'
);
console.log(`\nReasoning length: ${result.reasoning.length} chars`);
JavaScript: Multi-Modal Input
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// Read the diagram and encode it as a base64 data URL
const imageBuffer = fs.readFileSync('architecture-diagram.png');
const base64Image = imageBuffer.toString('base64');

const response = await openai.chat.completions.create({
  model: 'gpt-5.5',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this system architecture diagram. Identify single points of failure and suggest improvements.'
        },
        {
          type: 'image_url',
          image_url: {
            url: `data:image/png;base64,${base64Image}`
          }
        }
      ]
    }
  ],
  reasoning_effort: 'medium',
  max_completion_tokens: 4096
});

console.log(response.choices[0].message.content);
Pricing and Economics
GPT-5.5's pricing reflects its positioning as a premium flagship model.
| Model | Input (per MTok) | Output (per MTok) | Cached Input | Context |
|---|---|---|---|---|
| GPT-5 Nano | $0.10 | $0.40 | $0.05 | 32K |
| GPT-5 Mini | $0.50 | $2.00 | $0.25 | 128K |
| GPT-5 (Standard) | $2.00 | $10.00 | $1.00 | 256K |
| GPT-5.5 | $5.00 | $30.00 | $2.50 | 1.05M |
| GPT-5 Ultra | $15.00 | $75.00 | $7.50 | 1.05M |
Cost Analysis
At $5/$30 per MTok, GPT-5.5 is exactly 2x the cost of GPT-5.4. For a typical agentic workflow:
- Simple query (1K in, 500 out): $0.005 + $0.015 = $0.02
- Code review (50K in, 5K out): $0.25 + $0.15 = $0.40
- Full codebase analysis (500K in, 50K out): $2.50 + $1.50 = $4.00
- Max context session (922K in, 128K out): $4.61 + $3.84 = $8.45
The economics favor GPT-5.5 for complex tasks where quality matters. For high-volume, simpler tasks, GPT-5 Mini is the rational choice: it is 10x cheaper on input and 15x cheaper on output.
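The figures in the cost analysis can be reproduced with a few lines of arithmetic, which also makes it easy to price your own workloads. The sketch below uses the $5 and $30 per-million-token rates quoted above and ignores cached-input discounts; the function name is hypothetical.

```python
INPUT_USD_PER_MTOK = 5.00    # GPT-5.5 input price from the pricing table
OUTPUT_USD_PER_MTOK = 30.00  # GPT-5.5 output price from the pricing table

def request_cost(input_tokens, output_tokens):
    """Cost in USD of one request at list price, with no cached-input discount."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000

scenarios = [
    ("Simple query", 1_000, 500),
    ("Code review", 50_000, 5_000),
    ("Full codebase analysis", 500_000, 50_000),
    ("Max context session", 922_000, 128_000),
]
for label, tokens_in, tokens_out in scenarios:
    print(f"{label}: ${request_cost(tokens_in, tokens_out):.2f}")
# → Simple query: $0.02
# → Code review: $0.40
# → Full codebase analysis: $4.00
# → Max context session: $8.45
```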
GPT-5.5 vs Claude Opus 4: Head-to-Head
The two frontier models serve different niches despite competing directly.
| Dimension | GPT-5.5 | Claude Opus 4 |
|---|---|---|
| Architecture | Sparse MoE (8-15% active) | Dense Transformer |
| Context | 1.05M tokens | 200K tokens |
| Modalities | Text, image, audio, video | Text, image, code |
| Reasoning | Dynamic Inference Pathways | Extended Thinking |
| Coding (SWE-bench Pro) | 61.8% | 64.3% |
| Science (GPQA Diamond) | 93.6% | 89.1% |
| Abstract Reasoning (ARC-AGI-2) | 85.0% | 79.3% |
| Long Context (MRCR v2 1M) | 74.0% | N/A (200K max) |
| Autonomous Duration | Continuous (Planner loop) | 7 hours max |
| Pricing (in/out per MTok) | $5 / $30 | $15 / $75 |
| Safety Framework | Internal review | Deployed under ASL-3 safeguards |
When to Choose GPT-5.5
- Tasks requiring >200K context (large codebases, long document analysis)
- Multi-modal workflows combining text with audio/video
- Scientific reasoning and abstract problem-solving
- Cost-sensitive production deployments (3x cheaper than Claude Opus 4 on input, 2.5x cheaper on output)
- Real-time reasoning visibility requirements
When to Choose Claude Opus 4
- Pure coding tasks and multi-file refactoring
- Tasks where the safeguards of Anthropic's ASL-3 deployment standard matter
- Workflows needing extended autonomous execution with proven reliability
- Enterprise environments where Anthropic's safety-first approach aligns with compliance needs
FAQ
Q: What is GPT-5.5 and how does it differ from GPT-5?
GPT-5.5 (codename "Spud") is the agent-flagship model in the GPT-5 family, released April 23, 2026. It is the first fully retrained base model since GPT-4.5—all interim versions (5.1–5.4) were post-training iterations on the same base. GPT-5.5 features native omnimodal processing, Sparse MoE with 8–15% expert activation, and a 1.05M token context window.
Q: What hardware powers GPT-5.5 inference?
GPT-5.5 was co-designed with NVIDIA GB200/GB300 NVL72 rack-scale systems. These systems enable the model's dynamic expert routing and massive context windows while keeping inference latency manageable at scale.
Q: How does GPT-5.5 pricing compare to GPT-5.4?
GPT-5.5 API pricing is $5 per million input tokens and $30 per million output tokens—exactly double GPT-5.4's rates. OpenAI justifies this through the model's substantially higher quality, longer context, and new reasoning capabilities.
Q: Can GPT-5.5 outperform Claude Opus 4 on coding tasks?
Not yet. On SWE-bench Pro, Claude Opus 4 scores 64.3% while GPT-5.5 trails at 61.8%. However, GPT-5.5 excels in multi-step agentic workflows, long-context retrieval (74% on MRCR v2 at 1M tokens), and scientific reasoning (GPQA Diamond 93.6%).
Q: What are reasoning effort levels in GPT-5.5?
GPT-5.5 supports five reasoning effort levels—none, low, medium (default), high, and xhigh. Higher levels allocate more test-time compute for complex tasks, while lower levels provide faster, cheaper responses for simple queries. GPT-5.5 Pro uses parallel test-time compute for maximum performance.
Summary
GPT-5.5 represents a genuine architectural leap for OpenAI—the first fully retrained base model in over a year. Its Sparse MoE with dynamic activation, natively omnimodal design, million-token context, and three-layer agentic architecture position it as the most capable model for autonomous AI agent workflows. While it trails Claude Opus 4 on pure coding benchmarks, its advantages in reasoning, multimodal processing, long-context retrieval, and cost-effectiveness make it the rational default choice for most production AI applications in 2026.
The model's co-design with NVIDIA GB300 NVL72 hardware signals a future where model architecture and inference infrastructure are inseparable—expect this trend to accelerate as we approach even larger context windows and more complex reasoning workloads.
For developers building with GPT-5.5 today, the key decision points are:
- Use reasoning effort levels to optimize cost; don't run xhigh for everything
- Leverage the 922K input context for tasks that previously required RAG
- Build around the Planner → Executor → Reflector pattern for autonomous workflows
- Use text-diff tools to compare outputs across reasoning levels
- Validate JSON responses with a JSON formatter when building structured extraction pipelines
Related Resources
QubitTool Blog Posts
- LLM Landscape 2026: Differentiated Strategies of the Five Major Camps — Understand where GPT-5.5 fits in the broader AI ecosystem
- Transformer Architecture Complete Guide — Deep dive into the foundation architecture that GPT-5.5 builds upon
QubitTool Glossary
- Large Language Model (LLM) — Core concept behind GPT-5.5's text generation capabilities
- Transformer — The base architecture that Sparse MoE extends
QubitTool Developer Tools
- JSON Formatter — Format and validate GPT-5.5 API responses and structured outputs
- Text Diff — Compare model outputs across reasoning effort levels or model versions