🐍 Build Intuition for State Space Models and Mamba

Stop reading 'Mamba is linear-time attention' as marketing and start seeing the SSM as a controllable filter — A forgets, B absorbs, C reads out, Δ sets the clock. By the end you can predict whether Mamba or a transformer wins on a 1M-token retrieval task and justify it from the architecture.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: The SSM Bet Against Attention

See why attention scales badly and what a fixed state buys

4 drops
  1. Attention's quadratic cost is a tax, not a feature

    6 min

    Self-attention compares every token to every other token — that's where the O(n^2) comes from, and nothing about language requires it.

  2. An SSM is a controllable filter, not a new neural net

    6 min

    State space models come from control theory: they describe how a hidden state evolves over time under input, and how that state produces output. The formulation dates back to around 1960, decades before modern deep learning.

  3. A fixed hidden state is the whole gamble

    6 min

    An SSM compresses the entire past into a fixed-size vector. The model can only remember what fits in h, so the architecture's success or failure rides on how well h is used.

  4. Discretization is just picking a step size

    7 min

    SSMs are defined as continuous-time equations. To run them on a sequence, you discretize — and the step size Δ becomes a learnable parameter that controls how fast time passes for each input.
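The discretization step in drop 4 can be sketched in a few lines. This is a minimal illustration, assuming a scalar state and the zero-order-hold rule (one common choice; the path does not say which rule it uses). The function name `discretize_zoh` and the values of `a`, `b`, and `delta` are hypothetical.

```python
import math


def discretize_zoh(a: float, b: float, delta: float) -> tuple[float, float]:
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t).

    Returns (a_bar, b_bar) for the discrete update
    h_t = a_bar * h_{t-1} + b_bar * x_t. Assumes a != 0.
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


# Hypothetical continuous-time parameters: a < 0 means the state decays.
a, b = -1.0, 1.0

for delta in (0.01, 0.1, 1.0):
    a_bar, b_bar = discretize_zoh(a, b, delta)
    # Small delta: a_bar near 1 (keep the past), b_bar near 0 (ignore the input).
    # Large delta: a_bar shrinks (forget), b_bar grows (absorb the input).
    print(f"delta={delta:<5} a_bar={a_bar:.4f} b_bar={b_bar:.4f}")
```

Under these assumptions, a small step size makes a token barely touch the state, while a large one lets the token overwrite most of it. That is the sense in which Δ "sets the clock".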

Phase 2: Running an SSM Step by Hand

Run one SSM step by hand and feel A, B, C work

5 drops
  1. A is the forgetting knob

    7 min

    The A matrix decides how much of the previous state survives the next step. Its eigenvalues directly set the decay rate of memory.

  2. B decides what gets absorbed

    7 min

    B is the gate that lets new input into the state. It decides which input features matter and which directions of the hidden state they update.

  3. C is the readout — and what it ignores matters

    7 min

    C decides which parts of the hidden state become the output. The model can hold information in h that never reads out: any direction of h that C maps to zero is invisible at the output, even though it still shapes future states.

  4. Walk one step on a 5-token sequence

    8 min

    Three matrices, one state vector, five tokens — that's enough to feel the full SSM update on a piece of paper.

  5. An SSM is a convolution in disguise

    7 min

    If A, B, C are fixed, you can unroll the recurrence and discover the output is a convolution of the input with a learned filter. Same math, different shape.
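The hand walk-through in drop 4 and the convolution view in drop 5 can be checked against each other. The sketch below assumes a diagonal A (so each state dimension decays independently), already-discretized parameters, and the update h_t = Ā h_{t-1} + B̄ x_t, y_t = C h_t. All numeric values and function names are hypothetical, chosen only so the arithmetic is easy to follow on paper.

```python
def ssm_recurrent(a_bar, b_bar, c, xs):
    """Run the SSM one token at a time, carrying a fixed-size state h."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        # A forgets (scales the old state), B absorbs (adds the new input).
        h = [a * h_i + b * x for a, h_i, b in zip(a_bar, h, b_bar)]
        # C reads out a weighted sum of the state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def ssm_kernel(a_bar, b_bar, c, length):
    """Unrolled filter: kernel[k] = sum_i c_i * a_bar_i**k * b_bar_i."""
    return [
        sum(c_i * (a ** k) * b for c_i, a, b in zip(c, a_bar, b_bar))
        for k in range(length)
    ]


def causal_conv(kernel, xs):
    """y_t = sum over k <= t of kernel[k] * x_{t-k}."""
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


# Two state dimensions: one slow-forgetting, one fast-forgetting.
a_bar = [0.9, 0.5]
b_bar = [1.0, 1.0]
c = [1.0, -1.0]
xs = [1.0, 0.0, 0.0, 2.0, 0.0]  # a 5-token sequence

y_rec = ssm_recurrent(a_bar, b_bar, c, xs)
y_conv = causal_conv(ssm_kernel(a_bar, b_bar, c, len(xs)), xs)
print(y_rec)
print(y_conv)
```

Both lists agree up to floating-point rounding. The equivalence depends on `a_bar`, `b_bar`, and `c` being the same at every position, which is exactly the assumption that selectivity later removes.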

Phase 3: From S4 to Selective Mamba

Watch selectivity turn S4 into Mamba's language model

4 drops
  1. S4 conquered audio but stumbled on language

    8 min

    Fixed-parameter SSMs suit signals where every sample matters roughly equally, such as audio. Language isn't like that: some tokens matter far more than others, and S4's content-blind filtering left performance on the table.

  2. Selectivity turns three numbers into language

    8 min

    Mamba makes B, C, and Δ functions of x_t. That single change — three small linear projections — is what unlocked language-quality performance.

  3. GPUs hated selectivity until the parallel scan

    8 min

    Making B, C, and Δ input-dependent breaks the convolutional fast path that made S4 GPU-friendly, because the filter is no longer the same at every position. The parallel scan is the algorithm that restored speed without sacrificing the model's expressiveness.

  4. Mamba is excellent at flow, weak at lookup

    8 min

    An SSM's hidden state is a summary, not a transcript. Mamba can match or beat similarly sized transformers on language modeling perplexity, but it tends to lose on tasks that require pulling a specific token back from far away.
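Drops 2 and 3 of this phase can be illustrated together. The sketch below is an assumption-heavy toy, not Mamba's actual kernel: it uses a scalar state, hypothetical scalar weights `w_delta` and `w_b` in place of the learned linear projections, a softplus to keep Δ positive, and the simplified rule B̄ = Δ·B. It shows that once the per-token coefficients depend on x_t, the recurrence can still be computed with an associative combine, which is the property a parallel scan relies on.

```python
import math


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def selective_coeffs(xs, a=-1.0, w_delta=1.0, w_b=1.0):
    """Per-token (a_bar_t, u_t) such that h_t = a_bar_t * h_{t-1} + u_t."""
    coeffs = []
    for x in xs:
        delta = softplus(w_delta * x)   # input-dependent step size
        a_bar = math.exp(delta * a)     # how much of the old state survives
        u = delta * (w_b * x) * x       # input-dependent B applied to x
        coeffs.append((a_bar, u))
    return coeffs


def combine(left, right):
    """Compose two affine updates h -> a*h + u: apply left, then right."""
    a1, u1 = left
    a2, u2 = right
    return (a1 * a2, a2 * u1 + u2)


def sequential_states(coeffs):
    h, out = 0.0, []
    for a_bar, u in coeffs:
        h = a_bar * h + u
        out.append(h)
    return out


def scan(coeffs):
    """Inclusive prefix scan by divide and conquer.

    The two recursive calls are independent of each other, which is
    what lets hardware process them in parallel.
    """
    if len(coeffs) <= 1:
        return list(coeffs)
    mid = len(coeffs) // 2
    left = scan(coeffs[:mid])
    right = scan(coeffs[mid:])
    carry = left[-1]
    return left + [combine(carry, r) for r in right]


xs = [0.5, -1.0, 2.0, 0.0, 1.5, -0.3, 0.8]
coeffs = selective_coeffs(xs)
h_seq = sequential_states(coeffs)
h_scan = [u for _, u in scan(coeffs)]  # with h_0 = 0, the state is the u part
print(h_seq)
print(h_scan)
```

The two lists match up to rounding. A real implementation also has to manage GPU memory traffic and vector-valued states, which this sketch ignores.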

Phase 4: Predicting the 1M-Token Retrieval Winner

Predict the 1M-token retrieval winner from the architecture

1 drop
  1. Predict the 1M-token needle-in-a-haystack winner

    20 min

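One input to that prediction is how much each architecture has to remember at inference time. The arithmetic below is a rough sketch under loudly hypothetical dimensions (48 layers, model width 2048, SSM inner width 4096, state size 16, 2 bytes per value) that do not correspond to any specific released model. It also assumes full multi-head attention with no cache compression and ignores any small convolution state an SSM layer may keep.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Attention keeps a key and a value vector per token per layer."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """An SSM keeps one fixed-size state per layer, regardless of length."""
    return n_layers * d_inner * d_state * bytes_per_value


# Hypothetical model dimensions, for illustration only.
N_LAYERS, D_MODEL, D_INNER, D_STATE = 48, 2048, 4096, 16

for n_tokens in (1_000, 1_000_000):
    kv = kv_cache_bytes(n_tokens, N_LAYERS, D_MODEL)
    ssm = ssm_state_bytes(N_LAYERS, D_INNER, D_STATE)
    print(f"n={n_tokens:>9,}  KV cache={kv / 1e9:.3f} GB  SSM state={ssm / 1e6:.3f} MB")
```

The cache that grows with every token is the transcript that makes exact lookup possible; the fixed state is the summary that makes long sequences cheap. Which one wins a retrieval task depends on whether the needle survives in that summary.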

Frequently asked questions

What is a state space model and how is it different from a transformer?
Why is Mamba called 'selective' and what changed from S4?
Does Mamba beat transformers on long-context retrieval like needle-in-a-haystack?
What do A, B, C, and Δ actually do in an SSM?
Why did Mamba need a parallel scan to run efficiently on GPUs?

Each of these questions is covered in the “Build Intuition for State Space Models and Mamba” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.