TL;DR: Transformers have dominated AI for nearly eight years, but their O(n²) attention mechanism is hitting a wall on long sequences. Mamba and State Space Models (SSM) deliver comparable language modeling quality at O(n) linear complexity—5x+ throughput gains and orders-of-magnitude memory savings. From IBM Granite 4.0's 9:1 hybrid architecture to NVIDIA Nemotron 3's million-token context, Mamba is moving from research papers into production.

Key Takeaways

  • Linear Complexity: Mamba reduces sequence modeling from Transformer's O(n²) to O(n), making million-token contexts practical.
  • Selective Mechanism: Input-dependent SSM parameters give Mamba attention-like content awareness while maintaining linear efficiency.
  • Hardware-Friendly: Mamba's core operations (scan and matmul) align perfectly with modern GPU memory hierarchies.
  • Hybrid Trend: Production models like IBM Granite 4.0 (9:1 Mamba-Transformer) and NVIDIA Nemotron 3 validate the hybrid approach.
  • Rapid Evolution: Mamba-2's SSD framework unifies SSM and attention theory; Mamba-3 pushes inference efficiency further.

Why We Need to Go Beyond Transformers

The Transformer architecture has dominated AI since the 2017 paper "Attention Is All You Need." However, its core self-attention mechanism has a fundamental efficiency problem.

The Cost of Quadratic Complexity

Self-attention computes pairwise relationships between every token in a sequence. For a sequence of length n:

  • Compute: Generates an n × n attention matrix → O(n²) complexity
  • Memory: The attention matrix itself requires O(n²) GPU memory
  • KV Cache: Inference requires caching all historical Keys and Values, growing linearly with sequence length

What does this mean in practice?

code
Sequence Length    Attention Compute    Relative Cost
1K tokens          1M operations        1x (baseline)
4K tokens          16M operations       16x
32K tokens         1,024M operations    1,024x
128K tokens        16,384M operations   16,384x
1M tokens          1,000,000M ops       1,000,000x

Scaling from 1K to 1M tokens inflates compute by one million times. Even with optimizations like FlashAttention, pure Transformers face severe challenges processing long documents, codebases, or extended conversations.
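The multipliers in the table fall straight out of n² (the table rounds the last row); a quick script reproduces them:

```python
# Relative attention cost vs. a 1K-token baseline: cost(n) = n^2
for n in (1_024, 4_096, 32_768, 131_072, 1_048_576):
    ops = n * n  # pairwise attention scores
    print(f"{n:>9} tokens: {ops:>16,d} ops  ({ops // 1_024**2:>9,d}x baseline)")
```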

What Would the Ideal Sequence Model Look Like?

Researchers have been searching for an architecture that can deliver:

  1. Linear complexity: O(n) compute scaling with sequence length
  2. Content awareness: Dynamic behavior based on input content
  3. Long-range dependencies: Effective modeling across millions of tokens
  4. Hardware efficiency: Full utilization of GPU parallelism and memory bandwidth

State Space Models (SSM) provide an elegant mathematical framework for exactly these requirements.

Mathematical Foundations of State Space Models

What Is a State Space Model?

State Space Models aren't a new AI invention—they originate from control theory and signal processing, with decades of history. The core idea is to compress a sequence's history into a "hidden state," then generate outputs based on that state.

The continuous-time SSM is described by two equations:

code
State update:   h'(t) = A · h(t) + B · x(t)
Output:         y(t)  = C · h(t) + D · x(t)

Where:
- x(t): input signal
- h(t): hidden state (compressed history)
- y(t): output signal
- A: state transition matrix (how state evolves)
- B: input matrix (how input affects state)
- C: output matrix (how state maps to output)
- D: skip connection (usually omitted)

Think of an SSM as a "filter with memory":

  • A is the "forget gate": controls how fast past information decays
  • B is the "input gate": controls how much current input is written to memory
  • C is the "read gate": controls what information is extracted from memory

Like RNNs, SSMs maintain a fixed-size state at each step, guaranteeing constant memory and linear compute during inference. Unlike RNNs, SSMs can also be computed as convolutions for parallel training.

Discretization and Dual Computation Modes

Digital sequences (like token sequences) are discrete. SSMs use a "discretization step Δ" to convert continuous equations into discrete recurrences:

code
Discretized form:
  h[k] = Ā · h[k-1] + B̄ · x[k]
  y[k] = C · h[k]

Where:
  Ā = exp(Δ · A)
  B̄ = (Ā - I) · A⁻¹ · B        (zero-order hold)

SSMs naturally support two computation modes:

code
Recurrent mode (for inference):
  Step-by-step: h[k] = Ā · h[k-1] + B̄ · x[k]
  Complexity: O(n), constant time per step
  ✅ Ideal for autoregressive generation

Convolutional mode (for training):
  Unroll SSM into global convolution kernel
  K = (C·B̄, C·Ā·B̄, C·Ā²·B̄, ...)
  Accelerate with FFT
  ✅ Ideal for GPU-parallel training
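
A quick way to convince yourself of this duality: for a fixed (non-selective) SSM, the step-by-step recurrence and the unrolled convolution produce identical outputs. A minimal NumPy sketch with illustrative dimensions:

```python
# Verify that a fixed (non-selective) SSM's recurrent and convolutional
# modes agree. Diagonal Abar, scalar input channel; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d_state = 32, 4                      # sequence length, state size

Abar = rng.uniform(0.5, 0.95, d_state)  # discretized diagonal A (stable)
Bbar = rng.normal(size=d_state)
C = rng.normal(size=d_state)
x = rng.normal(size=n)

# Recurrent mode: h[k] = Abar * h[k-1] + Bbar * x[k], y[k] = C · h[k]
h = np.zeros(d_state)
y_rec = np.empty(n)
for k in range(n):
    h = Abar * h + Bbar * x[k]
    y_rec[k] = C @ h

# Convolutional mode: y = K * x with kernel K[j] = C · Abar^j · Bbar
K = np.array([C @ (Abar ** j * Bbar) for j in range(n)])
y_conv = np.array([sum(K[j] * x[k - j] for j in range(k + 1)) for k in range(n)])

print(np.allclose(y_rec, y_conv))  # True
```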

📝 Glossary: Deep Learning — SSMs represent an important branch of deep learning for sequence modeling, increasingly complementing Transformers.

From S4 to Mamba: The Evolution of SSMs

HiPPO: Mathematical Foundation for Memory (2020)

It all started with a question: how can models efficiently remember very long input sequences?

Albert Gu et al. proposed the HiPPO (High-order Polynomial Projection Operators) framework. HiPPO showed that with a special initialization of the state matrix, an SSM can optimally compress an input sequence's history into a finite-dimensional state of Legendre polynomial coefficients.

S4: Structured State Spaces (2021)

S4 (Structured State Space for Sequence Modeling) was SSM's first major breakthrough:

  1. HiPPO initialization: Using HiPPO-LegS matrix for the A matrix, solving long-range dependency learning
  2. Structured parameterization: Decomposing A into DPLR (diagonal plus low-rank) form for efficient computation
  3. Dual-mode computation: FFT convolution for parallel training, recurrence for constant-memory inference

S4 achieved breakthrough results on the Long Range Arena benchmark, demonstrating for the first time that SSMs could compete with Transformers on long-sequence tasks.

Evolution Timeline

mermaid
graph TD
  A["HiPPO 2020 - Mathematical foundation for long-range memory"] --> B["S4 2021 - Structured State Spaces, DPLR + FFT convolution"]
  B --> C["S4D 2022 - Diagonal simplification"]
  B --> D["S5 2022 - MIMO + parallel scan"]
  C --> E["Mamba 2023 - Selective SSM, Hardware-aware algorithm"]
  D --> E
  E --> F["Mamba-2 2024 - SSD framework, SSM-Attention duality"]
  F --> G["Mamba-3 2025 - Inference-first design, Exp-trapezoidal discretization"]
  E --> H["Hybrid architectures - Granite 4.0 / Nemotron 3"]

Mamba's Core Innovation: Selective State Spaces

The Fatal Flaw of Traditional SSMs

While S4 excelled at long-sequence tasks, it still lagged behind Transformers at language modeling. The root cause: traditional SSM parameters (A, B, C) are fixed and don't vary with input content.

This means traditional SSMs treat all inputs equally—whether the current token carries critical information or is irrelevant noise, the state update is identical. Like a tape recorder that can't distinguish important dialogue from background noise.

Transformer's self-attention is powerful precisely because it's content-aware—dynamically deciding what to focus on based on input content.

Mamba's Solution: Input-Dependent Parameters

In December 2023, Albert Gu and Tri Dao published the Mamba paper, introducing Selective State Space Models. The core idea is elegantly simple:

Make SSM parameters B, C, and Δ input-dependent through learned linear projections.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Traditional SSM (fixed parameters)
class TraditionalSSM(nn.Module):
    def __init__(self, d_model, d_state):
        super().__init__()
        self.A = nn.Parameter(...)      # Fixed
        self.B = nn.Parameter(...)      # Fixed
        self.C = nn.Parameter(...)      # Fixed
        self.delta = nn.Parameter(...)  # Fixed

    def forward(self, x):
        # Same A, B, C, delta for all inputs
        ...

# Mamba's Selective SSM (input-dependent parameters)
class SelectiveSSM(nn.Module):
    def __init__(self, d_model, d_state):
        super().__init__()
        self.A = nn.Parameter(...)  # A remains fixed (structured)
        self.s_B = nn.Linear(d_model, d_state)  # B generated from input
        self.s_C = nn.Linear(d_model, d_state)  # C generated from input
        self.s_delta = nn.Linear(d_model, 1)    # Δ generated from input
        # (real Mamba produces Δ per channel via a low-rank projection)

    def forward(self, x):
        B = self.s_B(x)                      # B(x): input decides how to write state
        C = self.s_C(x)                      # C(x): input decides how to read state
        delta = F.softplus(self.s_delta(x))  # Δ(x): input decides time step
        # Parameters differ at every position!
        ...

Intuition Behind Selection

The selectivity of Δ is particularly critical—it controls "temporal resolution":

  • Large Δ: Model strongly writes current input to state, forgets more history → focus on current token
  • Small Δ: Model nearly ignores current input, preserves state → skip current token, retain history

This gives Mamba the ability to "remember what matters" and "skip what doesn't"—achieving attention-like content selection at O(n) complexity.
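
This gating behavior is easy to see with a toy scalar SSM (an illustration, not Mamba's actual kernel): with a < 0, the discretized Ā = exp(Δ·a) shrinks toward 0 as Δ grows, while the zero-order-hold input weight grows toward 1.

```python
# Toy scalar SSM showing Δ as a gate between "keep memory" and "write input".
import math

a = -1.0  # stable continuous-time state transition (scalar)

for delta in [0.01, 0.1, 1.0, 10.0]:
    abar = math.exp(delta * a)   # fraction of old state that is kept
    bbar = (abar - 1.0) / a      # ZOH input weight: (Abar - 1) / a
    print(f"Δ={delta:>5}: keep {abar:.3f} of state, write {bbar:.3f} of input")

# Small Δ → keep ≈ 1, write ≈ 0  (skip the token, preserve history)
# Large Δ → keep ≈ 0, write ≈ 1  (overwrite memory with the token)
```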

Hardware-Aware Algorithm

With input-dependent parameters, SSMs can no longer use FFT convolution for training (the kernel is no longer fixed). Mamba introduces a hardware-aware parallel scan algorithm:

  1. Avoid materializing large states: Don't store full (batch, length, d_model, d_state) tensors in GPU HBM
  2. Kernel fusion: Fuse discretization, scan, and output projection into a single CUDA kernel
  3. SRAM utilization: Perform core computation in GPU's fast on-chip SRAM
  4. Recomputation: Recompute intermediate states during backprop, trading compute for memory

Inspired by FlashAttention, this design makes Mamba faster on real hardware than theoretical predictions suggest.
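
The reason the selective recurrence parallelizes at all is that per-step updates h → a·h + b compose associatively, so a prefix scan applies. Below is a readable doubling-scan sketch of that idea, not Mamba's fused CUDA kernel:

```python
# Parallel-scan sketch for the recurrence h[k] = a[k]*h[k-1] + b[k].
# Each step is a pair (a, b); composing "apply e1 then e2" stays a pair.
import numpy as np

def combine(e1, e2):
    """Compose two updates: (a1, b1) then (a2, b2) -> (a2*a1, a2*b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def scan(elems):
    """Inclusive prefix scan over (a, b) pairs via recursive doubling."""
    elems = list(elems)
    step = 1
    while step < len(elems):
        # Iterate downward so each read sees pre-step values
        for i in range(len(elems) - 1, step - 1, -1):
            elems[i] = combine(elems[i - step], elems[i])
        step *= 2
    return elems

rng = np.random.default_rng(1)
n = 16
a = rng.uniform(0.5, 0.9, n)   # input-dependent Abar[k]
b = rng.normal(size=n)         # input-dependent Bbar[k] * x[k]

# Sequential reference
h, ref = 0.0, []
for k in range(n):
    h = a[k] * h + b[k]
    ref.append(h)

par = [bk for _, bk in scan(zip(a, b))]
print(np.allclose(ref, par))  # True
```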

Mamba-2 and Mamba-3: Continued Evolution

Mamba-2: Structured State Space Duality (SSD)

In 2024, Gu and Dao published Mamba-2, introducing the Structured State Space Duality (SSD) framework.

The core insight: SSMs and attention are mathematically the same thing viewed from different perspectives.

When the A matrix is constrained to scalar times identity (A = a·I), the selective SSM output is equivalent to a special form of "masked attention":

code
SSM view (recurrence):
  h[k] = a · h[k-1] + B[k] · x[k]
  y[k] = C[k] · h[k]

Equivalent attention view:
  y = M ⊙ (Q · K^T) · V

Where M is a semiseparable matrix (lower-triangular mask + exponential decay)
Q and K correspond to projections of C and B
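
The claimed equivalence is easy to verify numerically for the scalar-decay case. A small NumPy sketch (dimensions illustrative):

```python
# SSD duality check for A = a·I: the selective recurrence equals masked
# "attention" y = (M ⊙ (C B^T)) x with M[k,j] = a^(k-j) for j ≤ k, else 0.
import numpy as np

rng = np.random.default_rng(2)
n, d_state = 8, 4
a = 0.9                               # scalar decay (A = a·I)
B = rng.normal(size=(n, d_state))     # input-dependent B[k] ("keys")
C = rng.normal(size=(n, d_state))     # input-dependent C[k] ("queries")
x = rng.normal(size=n)

# SSM view: run the recurrence
h = np.zeros(d_state)
y_ssm = np.empty(n)
for k in range(n):
    h = a * h + B[k] * x[k]
    y_ssm[k] = C[k] @ h

# Attention view: semiseparable decay mask times the C·B^T score matrix
k_idx, j_idx = np.indices((n, n))
M = np.where(j_idx <= k_idx, a ** (k_idx - j_idx), 0.0)
y_attn = (M * (C @ B.T)) @ x

print(np.allclose(y_ssm, y_attn))  # True
```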

Practical implications:

  1. Training speedup: Core computation becomes matrix multiplication → direct Tensor Core utilization → 2-8x faster
  2. Architectural unification: SSM and attention layers can be freely mixed under one theoretical framework
  3. Algorithmic flexibility: Dynamically choose recurrence (short sequences) or matmul (long sequences)

Mamba-3: Inference-First Design (2025)

Mamba-3, developed by CMU, Princeton, and Together AI, addresses a key insight: linear theoretical complexity doesn't automatically translate to faster inference on real hardware.

The issue was that Mamba-1/2's large state expansions (d_state=128+) created memory bandwidth bottlenecks during inference. Mamba-3 solves this through:

  1. Exponential-Trapezoidal Discretization: A second-order accurate method replacing the first-order approach, maintaining modeling capacity at smaller state dimensions
  2. Sparse State Expansion: Different attention heads use different state sizes, allocating larger states to more important heads
  3. Inference-First Paradigm: Optimizing for actual decode latency rather than training FLOPs

Result: Mamba-3 matches Mamba-2's perplexity with half the state size, dramatically reducing inference memory consumption.

Hands-On: Running Inference with Mamba

Setup

bash
# Install Mamba core library (quote specs so the shell doesn't parse ">=")
pip install "mamba-ssm>=2.2.0"

# Dependencies (requires CUDA)
pip install "causal-conv1d>=1.4.0"
pip install "torch>=2.1.0"

# Or use pretrained Mamba models via transformers
pip install "transformers>=4.39.0"

Loading Mamba with Hugging Face Transformers

python
from transformers import MambaForCausalLM, AutoTokenizer
import torch

# Load pretrained Mamba model
model_name = "state-spaces/mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MambaForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Text generation
prompt = "The key advantage of state space models over transformers is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Using the Native Mamba Library

python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from transformers import AutoTokenizer
import torch

# state-spaces Mamba checkpoints use the GPT-NeoX tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Load native Mamba model (faster inference)
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-2.8b",
    dtype=torch.float16,
    device="cuda"
)

# Autoregressive generation (demonstrating SSM recurrence)
input_ids = tokenizer.encode("Explain state space models:", return_tensors="pt")
input_ids = input_ids.to("cuda")

output_ids = model.generate(
    input_ids=input_ids,
    max_length=300,
    temperature=0.7,
    top_k=50,
    cg=True  # Enable CUDA Graph optimization
)

print(tokenizer.decode(output_ids[0]))

Benchmarking Inference Memory

python
import torch
import time

def benchmark_inference(model, tokenizer, seq_lengths, device="cuda"):
    """Compare inference efficiency across sequence lengths"""
    results = []

    for seq_len in seq_lengths:
        input_ids = torch.randint(0, 32000, (1, seq_len)).to(device)

        # Warmup
        with torch.no_grad():
            _ = model(input_ids)

        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

        # Timed run
        start = time.perf_counter()
        with torch.no_grad():
            for _ in range(10):
                _ = model(input_ids)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10

        peak_mem = torch.cuda.max_memory_allocated() / (1024 ** 3)

        results.append({
            "seq_len": seq_len,
            "latency_ms": elapsed * 1000,
            "peak_memory_gb": peak_mem
        })
        print(f"  Length {seq_len:>6d}: {elapsed*1000:.1f}ms, {peak_mem:.2f}GB")

    return results

seq_lengths = [512, 1024, 2048, 4096, 8192, 16384]
print("Mamba inference benchmark:")
# model and tokenizer loaded in the sections above
mamba_results = benchmark_inference(model, tokenizer, seq_lengths)

Transformer + Mamba Hybrid Architectures

Why Go Hybrid?

Despite Mamba's efficiency advantages, pure SSM architectures still trail Transformers on certain tasks:

code
Capability            Transformer           Mamba               Hybrid
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Long sequences        ❌ O(n²) bottleneck    ✅ O(n) linear
In-context learning   ✅ Precise recall      ⚠️ Approximate
Few-shot prompting    ✅ Strong              ⚠️ Weaker
Inference memory      ❌ Large KV Cache      ✅ Constant state    ✅ ~80% less
Training ecosystem    ✅ Mature              ⚠️ Emerging
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The core reason: Transformer's attention can precisely recall information from any position via KV Cache, while SSMs compress history into fixed-size states with inevitable information loss.
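
A back-of-envelope sketch makes the gap concrete. The dimensions below are hypothetical (roughly 3B-class), not any specific model's config:

```python
# Rough memory comparison: KV cache grows with context, SSM state does not.
# All layer/head/dimension values here are illustrative assumptions.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each [seq_len, n_heads, head_dim]
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers=32, d_inner=4096, d_state=16, dtype_bytes=2):
    # one fixed [d_inner, d_state] state per layer, independent of seq_len
    return n_layers * d_inner * d_state * dtype_bytes

for tokens in (4_096, 131_072, 1_048_576):
    kv_gb = kv_cache_bytes(tokens) / 2**30
    ssm_gb = ssm_state_bytes() / 2**30
    print(f"{tokens:>9} tokens: KV cache {kv_gb:7.1f} GB, SSM state {ssm_gb:.3f} GB")
```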

IBM Granite 4.0: The 9:1 Hybrid

IBM's Granite 4.0 series (released 2025) is a flagship example of hybrid architecture:

90% Mamba-2 layers for long-range context (linear complexity), 10% Transformer layers for fine-grained local parsing (attention's precise recall).

code
Granite 4.0-H Architecture (H-Small example):

Layer 0:  [Mamba-2]  ─┐
Layer 1:  [Mamba-2]   │
Layer 2:  [Mamba-2]   │ 9 Mamba-2 layers
Layer 3:  [Mamba-2]   │ Efficient global context
...                    │
Layer 8:  [Mamba-2]  ─┘
Layer 9:  [Transformer] ← 1 Transformer layer
Layer 10: [Mamba-2]  ─┐   Fine-grained local parsing
Layer 11: [Mamba-2]   │
...                    │ Repeat 9:1 pattern
Layer 18: [Mamba-2]  ─┘
Layer 19: [Transformer] ← ...
...
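
The repeating pattern above can be generated in a few lines (layer counts here are illustrative, not Granite 4.0's actual configuration):

```python
# Build a 9:1 Mamba-2/Transformer layer layout: every 10th layer is attention.
def hybrid_layout(n_layers, ratio=9):
    """Return layer types where each (ratio+1)-th layer is a Transformer."""
    return ["transformer" if (i + 1) % (ratio + 1) == 0 else "mamba2"
            for i in range(n_layers)]

layout = hybrid_layout(20)
print(layout.count("mamba2"), layout.count("transformer"))  # 18 2
```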

Real-world results:

  • 512K context runs on a single GPU (8GB VRAM)
  • ~80% less inference memory vs. same-size pure Transformer
  • Comparable accuracy on LLM inference benchmarks

NVIDIA Nemotron 3: Million-Token MoE Hybrid

NVIDIA's Nemotron 3 takes it further by fusing three technologies:

  1. Mamba layers: Long-range dependencies, million-token context
  2. Transformer layers: Precise local attention and in-context learning
  3. MoE layers: Sparse activation for capacity scaling

This three-way combination represents the frontier of large model design—no single architecture "rules all," but rather different architectures serve different purposes in concert.

Benchmarks: Comprehensive Performance Comparison

Inference Throughput

Based on published benchmarks and paper results:

code
Inference throughput (tokens/sec, batch_size=1, A100 GPU):

Model               1K tokens   4K tokens   16K tokens  64K tokens
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Transformer-3B     2,100       1,800       950         OOM
Mamba-3B           2,300       2,250       2,200       2,100
Hybrid (9:1)       2,200       2,100       1,950       1,850
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Note: Transformer OOMs at 64K due to KV Cache
      Mamba throughput stays nearly constant across all lengths ✅

Memory Usage

code
Inference GPU memory (GB, fp16, batch_size=1):

Sequence Length      Transformer-3B    Mamba-3B    Savings
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1K tokens            6.2               5.8         ~6%
4K tokens            7.1               5.8         ~18%
16K tokens           11.4              5.9         ~48%
64K tokens           28.6              5.9         ~79%
128K tokens          OOM               6.0         ∞
512K tokens          OOM               6.1         ∞
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key finding: Mamba memory barely grows with sequence length ✅

Language Modeling Quality

On standard language modeling benchmarks (lower perplexity is better):

code
Model         Params   Pile (ppl ↓)   LAMBADA (acc)   HellaSwag (acc)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Transformer   2.8B     7.82           65.6%           60.2%
Mamba         2.8B     7.33           66.2%           60.4%
Mamba-2       2.8B     7.29           66.8%           61.1%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

At equal parameter counts, Mamba matches or slightly exceeds Transformer quality while delivering dramatically better inference efficiency.

Limitations and Future Outlook

Current Limitations

1. Precise Recall Weakness

SSMs compress all history into a fixed-size state, unable to precisely retrieve arbitrary historical positions like Transformer's KV Cache. This is particularly noticeable in "needle-in-a-haystack" tasks.

2. Ecosystem Maturity

Transformers have nearly a decade of engineering investment—FlashAttention, vLLM, TensorRT-LLM, and other inference optimization toolchains are highly mature. Mamba's ecosystem is growing fast but still has gaps.

3. Hardware Alignment

Modern GPUs (especially NVIDIA Tensor Cores) are primarily optimized for matrix multiplications, while Mamba-1's core operation is scan—less hardware-efficient. Mamba-2/3 are progressively addressing this.

4. Scale Validation

The largest public pure-Mamba models are around 3B parameters. Whether pure SSM architectures maintain advantages at tens or hundreds of billions of parameters requires further validation.

Future Outlook

  1. Hybrid architectures become mainstream: As IBM Granite 4.0 and NVIDIA Nemotron 3 demonstrate, future large models will likely be carefully orchestrated mixtures of different architecture types
  2. Hardware co-design: As Mamba adoption grows, GPU and AI chip vendors may add native hardware support for scan operations
  3. Ultra-long context as default: Mamba makes million-token contexts economically viable, unlocking new applications—full codebase understanding, ultra-long document analysis, persistent conversation memory
  4. Multimodal expansion: Works like Mixture-of-Mamba are already exploring SSMs for multimodal (text + image + video) pretraining

📝 Glossary: Neural Network — Whether Transformer or Mamba, both are members of the deep neural network family, representing different approaches to the fundamental problem of sequence modeling.

FAQ

Will Mamba completely replace Transformers?

Not in the near term. The more likely trend is hybrid architectures—Mamba for long-range context, Transformers for precise local interactions requiring exact recall. The two architectures complement rather than replace each other.

What use cases is Mamba best suited for?

Mamba excels in:

  • Ultra-long document processing (100K+ tokens)
  • Streaming sequence modeling (real-time audio, video, sensor data)
  • Edge deployment (memory-constrained environments)
  • High-throughput inference services (cost-sensitive API services)

Should developers start using Mamba now?

If you're a model consumer (calling APIs), Mamba's impact is already transparent—hybrid models like IBM Granite 4.0 use Mamba under the hood. If you're a model developer or researcher, now is the ideal time to learn and experiment with SSM architectures. Both mamba-ssm and Hugging Face Transformers provide accessible entry points.

External References