🐍Build Intuition for State Space Models and Mamba
Stop reading 'Mamba is linear-time attention' as marketing and start seeing the SSM as a controllable filter — A forgets, B absorbs, C reads out, Δ sets the clock. By the end you can predict whether Mamba or a transformer wins on a 1M-token retrieval task and justify it from the architecture.
Phase 1: The SSM Bet Against Attention
See why attention scales badly and what a fixed state buys
Attention's quadratic cost is a tax, not a feature
6 min
Self-attention compares every token to every other token. That is where the O(n^2) cost comes from, and nothing about language requires it.
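To put rough numbers on the quadratic cost, here is a minimal sketch that counts pairwise scores for full self-attention against state updates for a recurrence. The function names are illustrative, and the counts deliberately ignore heads, layers, and model width.

```python
def attention_score_count(n_tokens: int) -> int:
    """Entries in the full n x n attention score matrix (one layer, one head)."""
    return n_tokens * n_tokens


def recurrent_step_count(n_tokens: int) -> int:
    """State updates a recurrence performs: one per token."""
    return n_tokens


for n in (1_000, 1_000_000):
    scores = attention_score_count(n)
    steps = recurrent_step_count(n)
    print(f"n={n:,}: {scores:,} scores vs {steps:,} steps (ratio {scores // steps:,})")
```

The ratio between the two counts is n itself, so the gap widens linearly as the context grows.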
An SSM is a controllable filter, not a new neural net
6 min
State space models come from control theory. They describe how a hidden state evolves over time under input, and how that state produces output. They predate modern deep learning by roughly half a century.
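For reference, the usual continuous-time form can be written as below. The state is called h to match the rest of this path. Many presentations add a feedthrough term D x(t) to the output, which is left out here as a simplifying assumption.

```latex
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t) \\
y(t)  &= C\,h(t)
\end{aligned}
```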
A fixed hidden state is the whole gamble
6 min
An SSM compresses the entire past into a fixed-size vector. The model can only remember what fits in h, so the architecture's success or failure rides on how well h is used.
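One way to feel the gamble is to compare what each architecture has to keep around while generating. The sketch below is back-of-envelope arithmetic under stated assumptions: one layer, a key and a value vector cached per token for attention, and one state of size d_state per channel for the SSM. The sizes are hypothetical and chosen only for illustration.

```python
def kv_cache_floats(n_tokens: int, d_model: int) -> int:
    """Keys plus values cached for one attention layer: grows with n_tokens."""
    return 2 * n_tokens * d_model


def ssm_state_floats(d_model: int, d_state: int) -> int:
    """One recurrent state per channel for one SSM layer: independent of n_tokens."""
    return d_model * d_state


D_MODEL, D_STATE = 1024, 16  # hypothetical sizes, not taken from any real model
for n in (1_000, 1_000_000):
    print(f"n={n:,}: KV cache {kv_cache_floats(n, D_MODEL):,} floats, "
          f"SSM state {ssm_state_floats(D_MODEL, D_STATE):,} floats")
```

The SSM number never changes with n. That is the efficiency win, and also the reason anything that does not fit in h is gone.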
Discretization is just picking a step size
7 min
SSMs are defined as continuous-time equations. To run them on a sequence, you discretize, and the step size Δ becomes a learnable parameter that controls how fast time passes. In Mamba, Δ is computed separately for each input.
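A minimal sketch of discretization for a one-dimensional state, assuming the zero-order-hold rule (the input is held constant over each step of length Δ). The helper name `discretize_zoh` and the values of a, b, and Δ are illustrative.

```python
import math


def discretize_zoh(a: float, b: float, delta: float) -> tuple[float, float]:
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t), scalar case, a != 0.

    Returns (a_bar, b_bar) such that h_t = a_bar * h_{t-1} + b_bar * x_t.
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


a, b = -1.0, 1.0  # a < 0 gives a stable, decaying state
for delta in (0.01, 0.1, 1.0):
    a_bar, b_bar = discretize_zoh(a, b, delta)
    print(f"delta={delta}: a_bar={a_bar:.4f}, b_bar={b_bar:.4f}")
```

A small Δ keeps a_bar near 1 and b_bar near 0, so the state barely moves. A large Δ pushes a_bar toward 0, so the old state is mostly replaced by the current input.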
Phase 2: Running an SSM Step by Hand
Run one SSM step by hand and feel A, B, C work
A is the forgetting knob
7 min
The A matrix decides how much of the previous state survives the next step. Its eigenvalues directly set the decay rate of memory.
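For a diagonal discrete-time A, the eigenvalues are just the diagonal entries, so each state dimension decays on its own schedule. The sketch below assumes a single dimension with eigenvalue a_bar between 0 and 1 and no new input. The helper names are illustrative.

```python
import math


def surviving_fraction(a_bar: float, steps: int) -> float:
    """Fraction of the original state left after `steps` updates with zero input."""
    return a_bar ** steps


def half_life(a_bar: float) -> float:
    """Number of steps until the state has decayed to half (requires 0 < a_bar < 1)."""
    return math.log(0.5) / math.log(a_bar)


for a_bar in (0.5, 0.9, 0.99):
    print(f"a_bar={a_bar}: half-life {half_life(a_bar):.1f} steps, "
          f"{surviving_fraction(a_bar, 100):.2e} left after 100 steps")
```

Moving the eigenvalue from 0.9 to 0.99 stretches the half-life by roughly a factor of ten, which is why values close to 1 matter for long-range memory.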
B decides what gets absorbed
7 min
B is the gate that lets new input into the state. It decides which input features matter and which directions of the hidden state they update.
C is the readout — and what it ignores matters
7 min
C decides which parts of the hidden state become the output. The model can hold information in h that C never reads out, but C cannot read out anything that h is not currently storing.
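A small sketch of a readout that ignores part of the state, assuming a two-dimensional state with a diagonal A and a scalar input. All values are made up for illustration. The second state dimension absorbs the input and holds on to it, but C gives it zero weight, so it never shows up in y.

```python
def step(h, x, a_bar, b_bar):
    """One diagonal SSM update: each state dimension decays, then absorbs input."""
    return [a * hi + b * x for a, hi, b in zip(a_bar, h, b_bar)]


def readout(h, c):
    """Output is a weighted sum of the state dimensions."""
    return sum(ci * hi for ci, hi in zip(c, h))


a_bar = [0.5, 0.9]  # dimension 0 forgets quickly, dimension 1 slowly
b_bar = [1.0, 1.0]  # both dimensions absorb the input
c = [1.0, 0.0]      # the readout only looks at dimension 0

h = [0.0, 0.0]
for x in [1.0, 0.0, 0.0]:
    h = step(h, x, a_bar, b_bar)
    print(f"x={x}: h={h}, y={readout(h, c)}")
```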
Walk one step on a 5-token sequence
8 min
Three matrices, one state vector, and five tokens are enough to feel the full SSM update on a piece of paper.
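If you want to check your pencil-and-paper numbers, the sketch below runs the same kind of update in code. It assumes a two-dimensional state, a diagonal A, and five scalar "tokens". The parameter values and the function name `run_ssm` are invented for this example.

```python
def run_ssm(xs, a_bar, b_bar, c):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t with diagonal A; return (states, outputs)."""
    h = [0.0] * len(a_bar)
    states, outputs = [], []
    for x in xs:
        # A forgets, B absorbs.
        h = [a * hi + b * x for a, hi, b in zip(a_bar, h, b_bar)]
        # C reads out.
        y = sum(ci * hi for ci, hi in zip(c, h))
        states.append(list(h))
        outputs.append(y)
    return states, outputs


xs = [1.0, 0.0, 2.0, 0.0, 0.0]  # five scalar "tokens"
states, ys = run_ssm(xs, a_bar=[0.5, 0.9], b_bar=[1.0, 0.5], c=[1.0, 1.0])
for t, (x, h, y) in enumerate(zip(xs, states, ys)):
    print(f"t={t} x={x} h={[round(v, 4) for v in h]} y={round(y, 4)}")
```

Watch how the first dimension drops off quickly after each nonzero token while the second lingers.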
An SSM is a convolution in disguise
7 min
If A, B, C are fixed, you can unroll the recurrence and discover that the output is a convolution of the input with a learned filter. Same math, different shape.
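The equivalence is easy to check numerically. The sketch below assumes a scalar state with fixed discrete parameters, builds the filter K_k = c * a_bar^k * b_bar by unrolling the recurrence, and compares the two ways of computing the output. Function names and values are illustrative.

```python
def ssm_recurrent(xs, a_bar, b_bar, c):
    """Step-by-step view: h_t = a_bar*h_{t-1} + b_bar*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


def ssm_kernel(length, a_bar, b_bar, c):
    """Unrolled view: the filter tap for an input that arrived k steps ago."""
    return [c * (a_bar ** k) * b_bar for k in range(length)]


def causal_conv(xs, kernel):
    """y_t = sum over k of kernel[k] * x_{t-k}, using only past and present inputs."""
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]


xs = [1.0, 0.0, 2.0, -1.0, 0.5]
a_bar, b_bar, c = 0.8, 0.5, 2.0
y_rec = ssm_recurrent(xs, a_bar, b_bar, c)
y_conv = causal_conv(xs, ssm_kernel(len(xs), a_bar, b_bar, c))
print(all(abs(r - v) < 1e-12 for r, v in zip(y_rec, y_conv)))
```

The kernel depends only on the parameters, not on the input, which is exactly the property that stops holding once the parameters vary per token.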
Phase 3: From S4 to Selective Mamba
Watch selectivity turn S4 into Mamba's language model
S4 conquered audio but stumbled on language
8 min
Fixed-parameter SSMs suit signals where every sample matters roughly equally. Language isn't like that, and S4's content-blind filtering left performance on the table.
Selectivity turns three numbers into language
8 min
Mamba makes B, C, and Δ functions of x_t. That single change, three small linear projections, is what unlocked language-quality performance.
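A toy sketch of selectivity for a single scalar channel, not Mamba's actual implementation. It assumes hand-set scalar weights (w_delta, w_b, w_c, all hypothetical) in place of learned projections, a softplus to keep Δ positive, a fixed negative a, and the simple approximation Δ_t * B_t for the discretized input term.

```python
import math


def softplus(z: float) -> float:
    """Smooth, always-positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, w_delta=1.0, bias_delta=0.0, w_b=1.0, w_c=1.0):
    """Scalar SSM whose delta, B, and C are recomputed from every input x_t."""
    h, ys, deltas = 0.0, [], []
    for x in xs:
        delta = softplus(w_delta * x + bias_delta)  # input-dependent step size
        b_t = w_b * x                               # input-dependent B
        c_t = w_c * x                               # input-dependent C
        a_bar = math.exp(delta * a)                 # a is fixed; a_bar varies via delta
        h = a_bar * h + delta * b_t * x
        ys.append(c_t * h)
        deltas.append(delta)
    return ys, deltas


ys, deltas = selective_scan([0.0, 3.0, 0.0, -3.0])
for x, d, y in zip([0.0, 3.0, 0.0, -3.0], deltas, ys):
    print(f"x={x}: delta={d:.3f}, a_bar={math.exp(-d):.3f}, y={y:.3f}")
```

With these toy weights, a large positive input produces a large Δ, which shrinks a_bar and overwrites more of the old state. An input near -3 produces a tiny Δ, so the state passes through almost untouched.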
GPUs hated selectivity until the parallel scan
8 min
Making B, C, and Δ input-dependent (and with them the discretized A, even though A itself stays fixed) breaks the convolutional fast path that made S4 GPU-friendly. The parallel scan is the algorithm that restored speed without sacrificing the model's expressiveness.
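The key observation behind the scan is that each step h -> a_t*h + b_t is an affine map, and composing affine maps is associative. The sketch below is a plain-Python illustration for a scalar state, using a simple doubling (Hillis-Steele style) scan. It runs sequentially here, and Mamba's real kernel is a fused GPU implementation that this does not attempt to reproduce. The names are illustrative.

```python
def combine(first, second):
    """Compose two steps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_states(a_bars, bxs):
    """Reference: the ordinary left-to-right recurrence starting from h = 0."""
    h, out = 0.0, []
    for a, bx in zip(a_bars, bxs):
        h = a * h + bx
        out.append(h)
    return out


def scan_states(a_bars, bxs):
    """Inclusive scan by doubling: about log2(n) passes instead of n dependent steps."""
    elems = list(zip(a_bars, bxs))
    n, offset = len(elems), 1
    while offset < n:
        # Every position in this pass depends only on the previous pass,
        # so on parallel hardware all positions could be computed at once.
        elems = [
            elems[i] if i < offset else combine(elems[i - offset], elems[i])
            for i in range(n)
        ]
        offset *= 2
    # With h starting at 0, the state after step i is the b-part of the prefix.
    return [b for _, b in elems]


a_bars = [0.5, 0.9, 0.1, 0.7, 0.3]  # per-token decay (input-dependent in Mamba)
bxs = [1.0, 2.0, -1.0, 0.5, 4.0]    # per-token absorbed input
print(sequential_states(a_bars, bxs))
print(scan_states(a_bars, bxs))
```

Both functions produce the same states up to floating-point rounding. The difference is that the scan never needs a_t to be the same at every step, which is what the convolutional trick required.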
Mamba is excellent at flow, weak at lookup
8 min
An SSM's hidden state is a summary, not a transcript. Mamba can match or beat similarly sized transformers on language-modeling perplexity, but it tends to lose on tasks that require pulling a specific token back from far away.
Phase 4: Predicting the 1M-Token Retrieval Winner
Predict the 1M-token retrieval winner from the architecture
Predict the 1M-token needle-in-a-haystack winner
20 min
Frequently asked questions
- What is a state space model and how is it different from a transformer?
- Why is Mamba called 'selective' and what changed from S4?
- Does Mamba beat transformers on long-context retrieval like needle-in-a-haystack?
- What do the A, B, C, and Delta matrices actually do in an SSM?
- Why did Mamba need a parallel scan to run efficiently on GPUs?
These questions are covered in the “Build Intuition for State Space Models and Mamba” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.