TL;DR: Transformers have dominated AI for nearly eight years, but their O(n²) attention mechanism is hitting a wall on long sequences. Mamba and State Space Models (SSM) deliver comparable language modeling quality at O(n) linear complexity—5x+ throughput gains and orders-of-magnitude memory savings. From IBM Granite 4.0's 9:1 hybrid architecture to NVIDIA Nemotron 3's million-token context, Mamba is moving from research papers into production.
Table of Contents
- Why We Need to Go Beyond Transformers
- Mathematical Foundations of State Space Models
- From S4 to Mamba: The Evolution of SSMs
- Mamba's Core Innovation: Selective State Spaces
- Mamba-2 and Mamba-3: Continued Evolution
- Hands-On: Running Inference with Mamba
- Transformer + Mamba Hybrid Architectures
- Benchmarks: Comprehensive Performance Comparison
- Limitations and Future Outlook
- FAQ
- Related Resources
Key Takeaways
- Linear Complexity: Mamba reduces sequence modeling from Transformer's O(n²) to O(n), making million-token contexts practical.
- Selective Mechanism: Input-dependent SSM parameters give Mamba attention-like content awareness while maintaining linear efficiency.
- Hardware-Friendly: Mamba's core operations (scan and matmul) align perfectly with modern GPU memory hierarchies.
- Hybrid Trend: Production models like IBM Granite 4.0 (9:1 Mamba-Transformer) and NVIDIA Nemotron 3 validate the hybrid approach.
- Rapid Evolution: Mamba-2's SSD framework unifies SSM and attention theory; Mamba-3 pushes inference efficiency further.
💡 Tool Tip: Use the JSON Formatter to parse complex inference results from Mamba model APIs, or browse the AI Directory to discover the latest SSM-based models and tools.
Why We Need to Go Beyond Transformers
The Transformer architecture has dominated AI since the 2017 "Attention Is All You Need" paper. However, its core self-attention mechanism has a fundamental efficiency problem.
The Cost of Quadratic Complexity
Self-attention computes pairwise relationships between every token in a sequence. For a sequence of length n:
- Compute: Generates an n × n attention matrix → O(n²) complexity
- Memory: The attention matrix itself requires O(n²) GPU memory
- KV Cache: Inference requires caching all historical Keys and Values, growing linearly with sequence length
What does this mean in practice?
| Sequence Length | Attention Compute | Relative Cost |
|---|---|---|
| 1K tokens | 1M operations | 1x (baseline) |
| 4K tokens | 16M operations | 16x |
| 32K tokens | 1,024M operations | 1,024x |
| 128K tokens | 16,384M operations | 16,384x |
| 1M tokens | 1,000,000M operations | 1,000,000x |
Scaling from 1K to 1M tokens inflates compute by one million times. Even with optimizations like FlashAttention, pure Transformers face severe challenges processing long documents, codebases, or extended conversations.
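The scaling table above can be reproduced with a few lines of arithmetic; the point is simply that pairwise attention scores grow as n²:

```python
# Attention computes an n x n score matrix, so compute grows quadratically.
def attention_ops(n: int) -> int:
    """Pairwise score computations for a length-n sequence (n^2)."""
    return n * n

baseline = attention_ops(1_000)
for n in [1_000, 4_000, 32_000, 128_000, 1_000_000]:
    print(f"{n:>9} tokens: {attention_ops(n) / baseline:>12,.0f}x baseline")
```

An O(n) model, by contrast, would scale these costs by only 4x, 32x, 128x, and 1,000x respectively.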
What Would the Ideal Sequence Model Look Like?
Researchers have been searching for an architecture that can deliver:
- Linear complexity: O(n) compute scaling with sequence length
- Content awareness: Dynamic behavior based on input content
- Long-range dependencies: Effective modeling across millions of tokens
- Hardware efficiency: Full utilization of GPU parallelism and memory bandwidth
State Space Models (SSM) provide an elegant mathematical framework for exactly these requirements.
Mathematical Foundations of State Space Models
What Is a State Space Model?
State Space Models aren't a new AI invention—they originate from control theory and signal processing, with decades of history. The core idea is to compress a sequence's history into a "hidden state," then generate outputs based on that state.
The continuous-time SSM is described by two equations:
State update: h'(t) = A · h(t) + B · x(t)
Output: y(t) = C · h(t) + D · x(t)
Where:
- x(t): input signal
- h(t): hidden state (compressed history)
- y(t): output signal
- A: state transition matrix (how state evolves)
- B: input matrix (how input affects state)
- C: output matrix (how state maps to output)
- D: skip connection (usually omitted)
Think of an SSM as a "filter with memory":
- A is the "forget gate": controls how fast past information decays
- B is the "input gate": controls how much current input is written to memory
- C is the "read gate": controls what information is extracted from memory
Like RNNs, SSMs maintain a fixed-size state at each step, guaranteeing constant memory and linear compute during inference. Unlike RNNs, SSMs can also be computed as convolutions for parallel training.
Discretization and Dual Computation Modes
Digital sequences (like token sequences) are discrete. SSMs use a "discretization step Δ" to convert continuous equations into discrete recurrences:
Discretized form:
h[k] = Ā · h[k-1] + B̄ · x[k]
y[k] = C · h[k]
Where:
Ā = exp(A · Δ)
B̄ = A⁻¹ · (Ā - I) · B
SSMs naturally support two computation modes:
Recurrent mode (for inference):
Step-by-step: h[k] = Ā · h[k-1] + B̄ · x[k]
Complexity: O(n), constant time per step
✅ Ideal for autoregressive generation
Convolutional mode (for training):
Unroll SSM into global convolution kernel
K = (C·B̄, C·Ā·B̄, C·Ā²·B̄, ...)
Accelerate with FFT
✅ Ideal for GPU-parallel training
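A toy numerical check (scalar state and illustrative values, not the real S4 parameterization) confirms that the recurrent and convolutional modes produce identical outputs:

```python
import numpy as np

# The recurrence h[k] = a*h[k-1] + b*x[k], y[k] = c*h[k] equals a causal
# convolution with kernel K = (c*b, c*a*b, c*a^2*b, ...).
rng = np.random.default_rng(0)
a, b, c = 0.9, 0.5, 1.2          # discretized SSM parameters (A_bar, B_bar, C)
x = rng.standard_normal(16)      # input sequence

# Recurrent mode: one state update per step (O(n) inference)
h, y_rec = 0.0, []
for xk in x:
    h = a * h + b * xk
    y_rec.append(c * h)
y_rec = np.array(y_rec)

# Convolutional mode: unroll into a kernel, apply in parallel (training)
K = np.array([c * a**k * b for k in range(len(x))])
y_conv = np.array([K[:k + 1][::-1] @ x[:k + 1] for k in range(len(x))])

assert np.allclose(y_rec, y_conv)
```

In practice the convolution is applied with an FFT rather than the explicit loop shown here, but the equivalence is the same.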
📝 Glossary: Deep Learning — SSMs represent an important branch of deep learning for sequence modeling, increasingly complementing Transformers.
From S4 to Mamba: The Evolution of SSMs
HiPPO: Mathematical Foundation for Memory (2020)
It all started with a question: how can models efficiently remember very long input sequences?
Albert Gu et al. proposed the HiPPO (High-order Polynomial Projection Operator) framework. HiPPO showed that special matrix initialization allows SSMs to optimally compress an input sequence's history into a finite-dimensional state using Legendre polynomial coefficients.
S4: Structured State Spaces (2021)
S4 (Structured State Space Sequence model) was SSM's first major breakthrough:
- HiPPO initialization: Using HiPPO-LegS matrix for the A matrix, solving long-range dependency learning
- Structured parameterization: Decomposing A into DPLR (diagonal plus low-rank) form for efficient computation
- Dual-mode computation: FFT convolution for parallel training, recurrence for constant-memory inference
S4 achieved breakthrough results on the Long Range Arena benchmark, first proving SSMs could compete with Transformers on long-sequence tasks.
Evolution Timeline
HiPPO (2020) → S4 (2021) → Mamba (Dec 2023) → Mamba-2 (2024) → Mamba-3 (2025)
Mamba's Core Innovation: Selective State Spaces
The Fatal Flaw of Traditional SSMs
While S4 excelled at long-sequence tasks, it still lagged behind Transformers at language modeling. The root cause: traditional SSM parameters (A, B, C) are fixed and don't vary with input content.
This means traditional SSMs treat all inputs equally—whether the current token carries critical information or is irrelevant noise, the state update is identical. Like a tape recorder that can't distinguish important dialogue from background noise.
Transformer's self-attention is powerful precisely because it's content-aware—dynamically deciding what to focus on based on input content.
Mamba's Solution: Input-Dependent Parameters
In December 2023, Albert Gu and Tri Dao published the Mamba paper, introducing Selective State Space Models. The core idea is elegantly simple:
Make SSM parameters B, C, and Δ input-dependent through learned linear projections.
```python
# Traditional SSM (fixed parameters)
class TraditionalSSM:
    def __init__(self, d_model, d_state):
        self.A = nn.Parameter(...)      # Fixed
        self.B = nn.Parameter(...)      # Fixed
        self.C = nn.Parameter(...)      # Fixed
        self.delta = nn.Parameter(...)  # Fixed

    def forward(self, x):
        # Same A, B, C, delta for all inputs
        ...

# Mamba's Selective SSM (input-dependent parameters)
class SelectiveSSM:
    def __init__(self, d_model, d_state):
        self.A = nn.Parameter(...)                # A remains fixed (structured)
        self.s_B = nn.Linear(d_model, d_state)    # B generated from input
        self.s_C = nn.Linear(d_model, d_state)    # C generated from input
        self.s_delta = nn.Linear(d_model, 1)      # Δ generated from input

    def forward(self, x):
        B = self.s_B(x)                    # B(x): input decides how to write state
        C = self.s_C(x)                    # C(x): input decides how to read state
        delta = softplus(self.s_delta(x))  # Δ(x): input decides time step
        # Parameters differ at every position!
        ...
```
Intuition Behind Selection
The selectivity of Δ is particularly critical—it controls "temporal resolution":
- Large Δ: Model strongly writes current input to state, forgets more history → focus on current token
- Small Δ: Model nearly ignores current input, preserves state → skip current token, retain history
This gives Mamba the ability to "remember what matters" and "skip what doesn't"—achieving attention-like content selection at O(n) complexity.
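A scalar toy model (assuming A = −1 and zero-order-hold discretization, purely illustrative) makes the Δ gating concrete:

```python
import numpy as np

# Ā = exp(A·Δ) shrinks toward 0 as Δ grows, so a large Δ overwrites the
# state with the current input, while a small Δ preserves history.
A = -1.0

def step(h, x, delta, B=1.0):
    a_bar = np.exp(A * delta)        # forget factor
    b_bar = (a_bar - 1.0) / A * B    # ZOH input scaling
    return a_bar * h + b_bar * x

h = 5.0  # some accumulated history
print(step(h, x=1.0, delta=10.0))   # large Δ: ~1.0 (history forgotten, input written)
print(step(h, x=1.0, delta=1e-3))   # small Δ: ~5.0 (history preserved, input ignored)
```

In Mamba the projections learn to produce a large Δ(x) for informative tokens and a small Δ(x) for tokens worth skipping.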
Hardware-Aware Algorithm
With input-dependent parameters, SSMs can no longer use FFT convolution for training (the kernel is no longer fixed). Mamba introduces a hardware-aware parallel scan algorithm:
- Avoid materializing large states: Don't store full (batch, length, d_model, d_state) tensors in GPU HBM
- Kernel fusion: Fuse discretization, scan, and output projection into a single CUDA kernel
- SRAM utilization: Perform core computation in GPU's fast on-chip SRAM
- Recomputation: Recompute intermediate states during backprop, trading compute for memory
Inspired by FlashAttention, this design makes Mamba faster on real hardware than theoretical predictions suggest.
Mamba-2 and Mamba-3: Continued Evolution
Mamba-2: Structured State Space Duality (SSD)
In 2024, Gu and Dao published Mamba-2, introducing the Structured State Space Duality (SSD) framework.
The core insight: SSMs and attention are mathematically the same thing viewed from different perspectives.
When the A matrix is constrained to scalar times identity (A = a·I), the selective SSM output is equivalent to a special form of "masked attention":
SSM view (recurrence):
h[k] = a · h[k-1] + B[k] · x[k]
y[k] = C[k] · h[k]
Equivalent attention view:
y = (M ⊙ (Q · K^T)) · V
Where M is a semiseparable matrix (lower-triangular mask with exponential decay),
Q and K correspond to projections of C and B, and V corresponds to the input x
Practical implications:
- Training speedup: Core computation becomes matrix multiplication → direct Tensor Core utilization → 2-8x faster
- Architectural unification: SSM and attention layers can be freely mixed under one theoretical framework
- Algorithmic flexibility: Dynamically choose recurrence (short sequences) or matmul (long sequences)
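The duality can be verified numerically in a toy setting (scalar decay a, single channel, illustrative values):

```python
import numpy as np

# The recurrence h[k] = a*h[k-1] + B[k]*x[k], y[k] = C[k]·h[k] equals
# y = (M ⊙ (C @ B^T)) @ x with M[k,j] = a^(k-j) for j <= k, else 0.
rng = np.random.default_rng(1)
n, d = 8, 4
a = 0.8
B = rng.standard_normal((n, d))   # input-dependent B[k] (plays the role of K)
C = rng.standard_normal((n, d))   # input-dependent C[k] (plays the role of Q)
x = rng.standard_normal(n)

# Recurrent (SSM) view
h = np.zeros(d)
y_rec = np.zeros(n)
for k in range(n):
    h = a * h + B[k] * x[k]
    y_rec[k] = C[k] @ h

# Attention view: decaying lower-triangular mask over Q·K^T scores
idx = np.arange(n)
M = np.where(idx[:, None] >= idx[None, :],
             a ** (idx[:, None] - idx[None, :]), 0.0)
y_attn = (M * (C @ B.T)) @ x

assert np.allclose(y_rec, y_attn)
```

The attention-view computation is pure matrix multiplication, which is why Mamba-2 can route it through Tensor Cores for the reported training speedups.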
Mamba-3: Inference-First Design (2025)
Mamba-3, developed by CMU, Princeton, and Together AI, addresses a key insight: linear theoretical complexity doesn't automatically translate to faster inference on real hardware.
The issue was that Mamba-1/2's large state expansions (d_state=128+) created memory bandwidth bottlenecks during inference. Mamba-3 solves this through:
- Exponential-Trapezoidal Discretization: A second-order accurate method replacing the first-order approach, maintaining modeling capacity at smaller state dimensions
- Sparse State Expansion: Different attention heads use different state sizes, allocating larger states to more important heads
- Inference-First Paradigm: Optimizing for actual decode latency rather than training FLOPs
Result: Mamba-3 matches Mamba-2's perplexity with half the state size, dramatically reducing inference memory consumption.
Hands-On: Running Inference with Mamba
Setup
```bash
# Install Mamba core library (quote version specifiers so the shell
# doesn't treat ">" as a redirect)
pip install "mamba-ssm>=2.2.0"

# Dependencies (requires CUDA)
pip install "causal-conv1d>=1.4.0"
pip install "torch>=2.1.0"

# Or use pretrained Mamba models via transformers
pip install "transformers>=4.39.0"
```
Loading Mamba with Hugging Face Transformers
```python
from transformers import MambaForCausalLM, AutoTokenizer
import torch

# Load pretrained Mamba model
model_name = "state-spaces/mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MambaForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Text generation
prompt = "The key advantage of state space models over transformers is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Using the Native Mamba Library
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from transformers import AutoTokenizer
import torch

# Load native Mamba model (faster inference)
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-2.8b",
    dtype=torch.float16,
    device="cuda"
)

# Native Mamba checkpoints pair with the GPT-NeoX tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Autoregressive generation (demonstrating SSM recurrence)
input_ids = tokenizer.encode("Explain state space models:", return_tensors="pt")
input_ids = input_ids.to("cuda")

output_ids = model.generate(
    input_ids=input_ids,
    max_length=300,
    temperature=0.7,
    top_k=50,
    cg=True  # Enable CUDA Graph optimization
)
print(tokenizer.decode(output_ids[0]))
```
Benchmarking Inference Memory
```python
import torch
import time

def benchmark_inference(model, tokenizer, seq_lengths, device="cuda"):
    """Compare inference efficiency across sequence lengths"""
    results = []
    for seq_len in seq_lengths:
        input_ids = torch.randint(0, 32000, (1, seq_len)).to(device)

        # Warmup
        with torch.no_grad():
            _ = model(input_ids)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

        # Timed run
        start = time.perf_counter()
        with torch.no_grad():
            for _ in range(10):
                _ = model(input_ids)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10

        peak_mem = torch.cuda.max_memory_allocated() / (1024 ** 3)
        results.append({
            "seq_len": seq_len,
            "latency_ms": elapsed * 1000,
            "peak_memory_gb": peak_mem
        })
        print(f"  Length {seq_len:>6d}: {elapsed*1000:.1f}ms, {peak_mem:.2f}GB")
    return results

# Assumes `mamba_model` and `tokenizer` were loaded as in the sections above
seq_lengths = [512, 1024, 2048, 4096, 8192, 16384]
print("Mamba inference benchmark:")
mamba_results = benchmark_inference(mamba_model, tokenizer, seq_lengths)
```
Transformer + Mamba Hybrid Architectures
Why Go Hybrid?
Despite Mamba's efficiency advantages, pure SSM architectures still trail Transformers on certain tasks:
| Capability | Transformer | Mamba | Hybrid |
|---|---|---|---|
| Long sequences | ❌ O(n²) bottleneck | ✅ O(n) linear | ✅ |
| In-context learning | ✅ Precise recall | ⚠️ Approximate | ✅ |
| Few-shot prompting | ✅ Strong | ⚠️ Weaker | ✅ |
| Inference memory | ❌ Large KV Cache | ✅ Constant state | ✅ ~80% less |
| Training ecosystem | ✅ Mature | ⚠️ Emerging | ✅ |
The core reason: Transformer's attention can precisely recall information from any position via KV Cache, while SSMs compress history into fixed-size states with inevitable information loss.
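A back-of-envelope calculation shows why the KV Cache dominates. With illustrative dimensions for a 3B-class Transformer (32 layers, hidden size 2560, fp16; assumed numbers, not any specific model), cache size grows linearly with context and quickly exceeds GPU memory:

```python
# KV Cache stores K and V activations of shape (seq_len, hidden) per layer.
def kv_cache_gb(seq_len, n_layers=32, hidden=2560, bytes_per=2):
    return 2 * n_layers * seq_len * hidden * bytes_per / 1024**3

for n in [4_096, 65_536, 1_048_576]:
    print(f"{n:>9} tokens: {kv_cache_gb(n):6.1f} GB of KV cache")
```

Under these assumptions the cache runs about 1.25 GB at 4K tokens, 20 GB at 64K, and 320 GB at 1M tokens, while a Mamba layer's state stays fixed regardless of context length.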
IBM Granite 4.0: The 9:1 Hybrid
IBM's Granite 4.0 series (released 2025) is a flagship example of hybrid architecture:
90% Mamba-2 layers for long-range context (linear complexity), 10% Transformer layers for fine-grained local parsing (attention's precise recall).
Granite 4.0-H Architecture (H-Small example):

```
Layer  0: [Mamba-2]     ─┐
Layer  1: [Mamba-2]      │
Layer  2: [Mamba-2]      │  9 Mamba-2 layers:
Layer  3: [Mamba-2]      │  efficient global context
  ...                    │
Layer  8: [Mamba-2]     ─┘
Layer  9: [Transformer] ←   1 Transformer layer:
                            fine-grained local parsing
Layer 10: [Mamba-2]     ─┐
Layer 11: [Mamba-2]      │
  ...                    │  repeat the 9:1 pattern
Layer 18: [Mamba-2]     ─┘
Layer 19: [Transformer] ←   ...
  ...
```
Real-world results:
- 512K context runs on a single GPU (8GB VRAM)
- ~80% less inference memory vs. same-size pure Transformer
- Comparable accuracy on LLM inference benchmarks
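For illustration, the 9:1 interleaving can be expressed as a simple layer schedule (a hypothetical sketch of the pattern, not IBM's actual configuration code):

```python
# One attention layer after every nine Mamba-2 layers.
def hybrid_schedule(n_layers: int, ratio: int = 9):
    return ["attention" if (i + 1) % (ratio + 1) == 0 else "mamba2"
            for i in range(n_layers)]

layers = hybrid_schedule(40)
assert layers.count("attention") == 4   # 1 in every 10 layers
assert layers[0] == "mamba2" and layers[9] == "attention"
```

Because only the sparse attention layers carry a growing KV Cache, total inference memory stays close to the pure-Mamba curve while retaining some exact-recall capacity.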
NVIDIA Nemotron 3: Million-Token MoE Hybrid
NVIDIA's Nemotron 3 takes it further by fusing three technologies:
- Mamba layers: Long-range dependencies, million-token context
- Transformer layers: Precise local attention and in-context learning
- MoE layers: Sparse activation for capacity scaling
This three-way combination represents the frontier of large model design—no single architecture "rules all," but rather different architectures serve different purposes in concert.
Benchmarks: Comprehensive Performance Comparison
Inference Throughput
Based on published benchmarks and paper results:
Inference throughput (tokens/sec, batch_size=1, A100 GPU):

| Model | 1K tokens | 4K tokens | 16K tokens | 64K tokens |
|---|---|---|---|---|
| Transformer-3B | 2,100 | 1,800 | 950 | OOM |
| Mamba-3B | 2,300 | 2,250 | 2,200 | 2,100 |
| Hybrid (9:1) | 2,200 | 2,100 | 1,950 | 1,850 |

Note: the Transformer OOMs at 64K due to KV Cache growth, while Mamba throughput stays nearly constant across all lengths ✅
Memory Usage
Inference GPU memory (GB, fp16, batch_size=1):

| Sequence Length | Transformer-3B | Mamba-3B | Savings |
|---|---|---|---|
| 1K tokens | 6.2 | 5.8 | ~6% |
| 4K tokens | 7.1 | 5.8 | ~18% |
| 16K tokens | 11.4 | 5.9 | ~48% |
| 64K tokens | 28.6 | 5.9 | ~79% |
| 128K tokens | OOM | 6.0 | — (Transformer OOM) |
| 512K tokens | OOM | 6.1 | — (Transformer OOM) |

Key finding: Mamba memory barely grows with sequence length ✅
Language Modeling Quality
On standard language modeling benchmarks (lower perplexity is better):
| Model | Params | Pile (ppl) | LAMBADA (acc) | HellaSwag (acc) |
|---|---|---|---|---|
| Transformer | 2.8B | 7.82 | 65.6% | 60.2% |
| Mamba | 2.8B | 7.33 | 66.2% | 60.4% |
| Mamba-2 | 2.8B | 7.29 | 66.8% | 61.1% |
At equal parameter counts, Mamba matches or slightly exceeds Transformer quality while delivering dramatically better inference efficiency.
Limitations and Future Outlook
Current Limitations
1. Precise Recall Weakness
SSMs compress all history into a fixed-size state, unable to precisely retrieve arbitrary historical positions like Transformer's KV Cache. This is particularly noticeable in "needle-in-a-haystack" tasks.
2. Ecosystem Maturity
Transformers have nearly a decade of engineering investment—FlashAttention, vLLM, TensorRT-LLM, and other inference optimization toolchains are highly mature. Mamba's ecosystem is growing fast but still has gaps.
3. Hardware Alignment
Modern GPUs (especially NVIDIA Tensor Cores) are primarily optimized for matrix multiplications, while Mamba-1's core operation is scan—less hardware-efficient. Mamba-2/3 are progressively addressing this.
4. Scale Validation
The largest public pure-Mamba models are around 3B parameters. Whether pure SSM architectures maintain advantages at tens or hundreds of billions of parameters requires further validation.
Future Outlook
- Hybrid architectures become mainstream: As IBM Granite 4.0 and NVIDIA Nemotron 3 demonstrate, future large models will likely be carefully orchestrated mixtures of different architecture types
- Hardware co-design: As Mamba adoption grows, GPU and AI chip vendors may add native hardware support for scan operations
- Ultra-long context as default: Mamba makes million-token contexts economically viable, unlocking new applications—full codebase understanding, ultra-long document analysis, persistent conversation memory
- Multimodal expansion: Works like Mixture-of-Mamba are already exploring SSMs for multimodal (text + image + video) pretraining
📝 Glossary: Neural Network — Whether Transformer or Mamba, both are members of the deep neural network family, representing different approaches to the fundamental problem of sequence modeling.
FAQ
Will Mamba completely replace Transformers?
Not in the near term. The more likely trend is hybrid architectures—Mamba for long-range context, Transformers for precise local interactions requiring exact recall. The two architectures complement rather than replace each other.
What use cases is Mamba best suited for?
Mamba excels in:
- Ultra-long document processing (100K+ tokens)
- Streaming sequence modeling (real-time audio, video, sensor data)
- Edge deployment (memory-constrained environments)
- High-throughput inference services (cost-sensitive API services)
Should developers start using Mamba now?
If you're a model consumer (calling APIs), Mamba's impact is already transparent—hybrid models like IBM Granite 4.0 use Mamba under the hood. If you're a model developer or researcher, now is the ideal time to learn and experiment with SSM architectures. Both mamba-ssm and Hugging Face Transformers provide accessible entry points.
Related Resources
Internal Links
- 📖 Transformer Architecture Deep Dive — Understanding the baseline Mamba aims to surpass
- 📖 Attention Mechanism Explained — The inspiration for Mamba's selective mechanism
- 📖 MoE Architecture Explained — A complementary scaling approach to Mamba
- 📖 LLM Inference Optimization — Understanding KV Cache and inference bottlenecks
- 🔧 AI Directory — Discover the latest AI models and tools
- 🔧 JSON Formatter — Parse model API JSON responses
- 📝 Transformer | Attention Mechanism | LLM | Inference | Neural Network | Deep Learning
External References
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Original Mamba paper
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Mamba-2 paper
- Mamba GitHub Repository — Official implementation
- IBM Granite 4.0 Technical Blog — Hybrid architecture in production