What is a state space model and how is it different from a transformer?

This is covered in the "Build Intuition for State Space Models and Mamba" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why is Mamba called 'selective' and what changed from S4?

This is covered in the "Build Intuition for State Space Models and Mamba" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Does Mamba beat transformers on long-context retrieval like needle-in-a-haystack?

This is covered in the "Build Intuition for State Space Models and Mamba" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

What do A, B, C, and Δ actually do in an SSM?

This is covered in the "Build Intuition for State Space Models and Mamba" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why did Mamba need a parallel scan to run efficiently on GPUs?

This is covered in the "Build Intuition for State Space Models and Mamba" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

🐍 Build Intuition for State Space Models and Mamba

Stop reading 'Mamba is linear-time attention' as marketing and start seeing the SSM as a controllable filter — A forgets, B absorbs, C reads out, Δ sets the clock. By the end you can predict whether Mamba or a transformer wins on a 1M-token retrieval task and justify it from the architecture.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: The SSM Bet Against Attention

See why attention scales badly and what a fixed state buys

4 drops

Attention's quadratic cost is a tax, not a feature
6 min
Self-attention compares every token to every other token — that's where the O(n^2) comes from, and nothing about language requires it.
An SSM is a controllable filter, not a new neural net
6 min
State space models come from control theory: they describe how a hidden state evolves over time under input, and how that state produces output. They predate modern deep learning by roughly half a century.
A fixed hidden state is the whole gamble
6 min
An SSM compresses the entire past into a fixed-size vector. The model can only remember what fits in h, so the architecture's success or failure rides on how well h is used.
Discretization is just picking a step size
7 min
SSMs are defined as continuous-time equations. To run them on a sequence, you discretize — and the step size Δ becomes a learnable parameter that controls how fast time passes for each input.
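
The discretization step described above can be sketched in a few lines. This is a minimal illustration, assuming a diagonal A with negative entries and the zero-order-hold rule, which is one common choice; the path does not say which rule it uses. The names `discretize_zoh`, `a`, `b`, and `delta` are hypothetical and chosen only for this example.

```python
import numpy as np

def discretize_zoh(a, b, delta):
    """Zero-order-hold discretization for a diagonal continuous-time SSM.

    Continuous:  h'(t) = a * h(t) + b * x(t)   (elementwise, a < 0)
    Discrete:    h[k]  = a_bar * h[k-1] + b_bar * x[k]
    """
    a_bar = np.exp(delta * a)        # fraction of the old state that survives one step
    b_bar = (a_bar - 1.0) / a * b    # how strongly the new input is written into the state
    return a_bar, b_bar

# Assumed decay rates for illustration: one fast channel, one slow channel.
a = np.array([-1.0, -0.1])
b = np.array([1.0, 1.0])

for delta in (0.01, 1.0, 10.0):
    a_bar, b_bar = discretize_zoh(a, b, delta)
    print(f"delta={delta:>5}: a_bar={a_bar.round(3)}, b_bar={b_bar.round(3)}")
```

With a small Δ, `a_bar` stays close to 1 and `b_bar` close to 0, so the state barely changes and the input is mostly ignored. With a large Δ, `a_bar` shrinks toward 0, so the old state is largely forgotten and the current input dominates.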

Phase 2: Running an SSM Step by Hand

Run one SSM step by hand and feel A, B, C work

5 drops

A is the forgetting knob
7 min
The A matrix decides how much of the previous state survives the next step. Its eigenvalues directly set the decay rate of memory.
B decides what gets absorbed
7 min
B is the gate that lets new input into the state. It decides which input features matter and which directions of the hidden state they update.
C is the readout — and what it ignores matters
7 min
C decides which parts of the hidden state become the output. The model can hold information in h that C never reads out, but C cannot read out anything h is not currently storing.
Walk one step on a 5-token sequence
8 min
Three matrices, one state vector, five tokens — that's enough to feel the full SSM update on a piece of paper.
An SSM is a convolution in disguise
7 min
If A, B, C are fixed, you can unroll the recurrence and discover the output is a convolution of the input with a learned filter. Same math, different shape.
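
The hand-run step and the convolution claim above can be checked together. The sketch below assumes an already-discretized diagonal A, a scalar input per token, and a three-dimensional state; all sizes and values are made up for illustration. It runs the recurrence over five tokens, then rebuilds the same outputs from the unrolled kernel K_k = C A^k B.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3                                      # hidden-state size (assumed for illustration)
A = np.diag([0.9, 0.5, 0.1])               # discrete-time A: per-channel survival factor
B = rng.normal(size=(N, 1))                # writes the scalar input into the state
C = rng.normal(size=(1, N))                # reads a scalar output from the state
x = np.array([1.0, 0.0, 2.0, -1.0, 0.5])   # five scalar "tokens"

# Recurrent view: h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = np.zeros((N, 1))
y_rec = []
for x_t in x:
    h = A @ h + B * x_t                    # forget via A, absorb via B
    y_rec.append((C @ h).item())           # read out via C
y_rec = np.array(y_rec)

# Convolutional view: y = K * x with kernel K_k = C A^k B
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(len(x))])
y_conv = np.convolve(x, K)[: len(x)]       # keep only the causal part

print(np.allclose(y_rec, y_conv))
```

The two views agree only because A, B, and C are the same at every step. Once any of them depends on the token, there is no single kernel K to convolve with.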

Phase 3: From S4 to Selective Mamba

Watch selectivity turn S4 into Mamba's language model

4 drops

S4 conquered audio but stumbled on language
8 min
Fixed-parameter SSMs suit signals where every sample matters about equally. Language isn't like that, and S4's content-blind filtering left performance on the table.
Selectivity turns three numbers into language
8 min
Mamba makes B, C, and Δ functions of x_t. That single change — three small linear projections — is what unlocked language-quality performance.
GPUs hated selectivity until the parallel scan
8 min
Making B, C, and Δ input-dependent breaks the convolutional fast path that made S4 GPU-friendly. The parallel scan is the algorithm that restored speed without sacrificing the model's expressiveness.
Mamba is excellent at flow, weak at lookup
8 min
An SSM's hidden state is a summary, not a transcript. Mamba can match or beat similarly sized transformers on language-modeling perplexity, but it tends to lose on tasks that require pulling a specific token back from far away.
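
The selectivity and scan ideas in this phase can be sketched as follows. This is a toy NumPy version under stated assumptions, not Mamba's fused GPU kernel: inputs are scalars, A is diagonal and fixed, Δ is passed through a softplus, and the input term uses the simplified rule Δ·B·x_t. The function names and weights (`selective_params`, `w_B`, `w_C`, `w_delta`) are hypothetical. The scan is a simple Hillis-Steele variant, chosen for readability.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_params(x, A, w_B, w_C, w_delta):
    """Make delta, B, and C depend on each token x_t; A itself stays fixed."""
    delta = softplus(w_delta * x)[:, None]   # (T, 1) per-token step size
    B = x[:, None] * w_B[None, :]            # (T, N) per-token input gate
    C = x[:, None] * w_C[None, :]            # (T, N) per-token readout
    a_bar = np.exp(delta * A[None, :])       # (T, N) per-token survival factor
    u = delta * B * x[:, None]               # (T, N) simplified B_bar * x_t
    return a_bar, u, C

def scan_sequential(a_bar, u):
    """Reference loop: h_t = a_bar_t * h_{t-1} + u_t with h_{-1} = 0."""
    h = np.zeros_like(u[0])
    out = np.empty_like(u)
    for t in range(len(u)):
        h = a_bar[t] * h + u[t]
        out[t] = h
    return out

def scan_parallel(a_bar, u):
    """Same recurrence via an inclusive scan in about log2(T) vectorized passes.

    Each element (a, u) stands for the map h -> a * h + u. Composing a left
    segment with a right one gives (a_r * a_l, a_r * u_l + u_r), which is
    associative, so segments can be merged in any grouping.
    """
    a, u = a_bar.copy(), u.copy()
    shift = 1
    while shift < len(u):
        # Merge position t with position t - shift; every t is independent.
        u[shift:] = a[shift:] * u[:-shift] + u[shift:]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return u

rng = np.random.default_rng(0)
T, N = 16, 4
x = rng.normal(size=T)
A = -np.arange(1, N + 1, dtype=float)        # assumed negative diagonal of A
w_B, w_C = rng.normal(size=N), rng.normal(size=N)

a_bar, u, C = selective_params(x, A, w_B, w_C, w_delta=0.5)
h_seq = scan_sequential(a_bar, u)
h_par = scan_parallel(a_bar, u)
y = (C * h_par).sum(axis=1)                  # y_t = C_t . h_t

print(np.allclose(h_seq, h_par))
```

Whatever the sequence length T, the state `h` holds only N numbers per channel. That fixed size is what keeps generation cheap, and it is also why a single distant token can be overwritten before it is ever needed.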

Phase 4: Predicting the 1M-Token Retrieval Winner

Predict the 1M-token retrieval winner from the architecture

1 drop

Predict the 1M-token needle-in-a-haystack winner
20 min

