🧮 Understand Gradient Checkpointing

Stop guessing why gradient checkpointing tanks your throughput by 30% — learn to read the activation tape, pick the right granularity, and predict the compute overhead before you launch a single training run.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Activations Eat Your Memory

See why activations — not weights — eat your training memory

4 drops
  1. Weights aren't what fills your GPU — activations are

    6 min

  2. Forward writes the tape; backward reads it in reverse

    6 min

  3. Forget on purpose, recompute on demand

    6 min

  4. Every N layers is a dial, not a switch

    7 min

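A rough back-of-the-envelope sketch of the Phase 1 claim, under loud assumptions: a stack of square linear layers of width `d_model`, exactly one saved output tensor per layer, and 2-byte (fp16) storage. The function name and every number below are hypothetical, chosen only to show that activation memory scales with batch size and sequence length while weight memory does not.

```python
def weight_and_activation_bytes(n_layers, d_model, batch, seq_len, bytes_per_elem=2):
    """Toy estimate (assumption: one d x d weight and one saved output per layer)."""
    # Weights: one d_model x d_model matrix per layer, independent of batch size.
    weights = n_layers * d_model * d_model * bytes_per_elem
    # Saved activations: one (batch, seq_len, d_model) tensor per layer.
    activations = n_layers * batch * seq_len * d_model * bytes_per_elem
    return weights, activations


w, a = weight_and_activation_bytes(n_layers=32, d_model=4096, batch=8, seq_len=4096)
print(f"weights: {w / 2**30:.1f} GiB, activations: {a / 2**30:.1f} GiB")
# prints: weights: 1.0 GiB, activations: 8.0 GiB
```

In this toy the ratio is simply `batch * seq_len / d_model`. Real layers usually save more than one tensor each, and the sketch ignores gradients and optimizer state, so treat it as a scaling argument rather than a memory budget.
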

Phase 2: Hand-Trace a Checkpointed Net

Hand-trace a 4-layer net and count peak activations

5 drops
  1. Count peak activations on a 4-layer net before any tricks

    7 min

  2. Cut the saved set in half, do forward twice for half of it

    7 min

  3. Count recompute by op, not by layer count

    7 min

  4. Peak memory hits once — average matters for throughput

    6 min

  5. Trace once for full, every-2, and every-layer — same net

    7 min

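The hand-trace in this phase can be sketched as a small counter. This is a toy model, not any framework's behavior. It assumes a linear chain where each layer produces one activation, the network input is always available and not counted, the last activation of every segment of `every` layers is kept, and the others are rebuilt one segment at a time during backward. `trace_chain` and its parameters are illustrative names.

```python
def trace_chain(n_layers, every):
    """Return (peak_live_activations, recomputed_layers) for a toy layer chain."""
    if every < 1:
        raise ValueError("every must be >= 1")
    # Split the chain into consecutive segments of at most `every` layers.
    seg_sizes = [min(every, n_layers - s) for s in range(0, n_layers, every)]
    # End of forward: one kept activation per segment.
    peak = len(seg_sizes)
    recomputed = 0
    # Backward visits segments last to first; kept activations of later
    # segments have already been consumed and freed.
    for j in reversed(range(len(seg_sizes))):
        rebuilt = seg_sizes[j] - 1           # activations re-run inside this segment
        peak = max(peak, (j + 1) + rebuilt)  # kept so far + rebuilt for this segment
        recomputed += rebuilt
    return peak, recomputed


for every in (1, 2, 4):
    print(every, trace_chain(4, every))
# prints:
# 1 (4, 0)
# 2 (3, 2)
# 4 (4, 3)
```

In this toy, `every=1` keeps everything. On 4 layers, `every=2` peaks at 3 live activations for 2 recomputed layers, while `every=4` rebuilds 3 layers and still peaks at 4, which is one way to see why segment size behaves like a dial with a sweet spot rather than a switch.
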

Phase 3: Choosing the Right Policy

Pick full, selective, or offload — match policy to bottleneck

4 drops
  1. Your team enabled full checkpointing and lost 30% throughput

    7 min

  2. Your 70B model needs more than checkpointing can give

    7 min

  3. FlashAttention already recomputes attention — checkpointing twice is wasted work

    7 min

  4. Pick the cheap-FLOPs, high-memory ops to checkpoint

    8 min

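A quick arithmetic sketch for predicting the throughput hit before launching a run. It leans on a common rule of thumb that is assumed here rather than taken from this page: backward costs about twice the FLOPs of forward, so a plain step is 3 forward-units and recomputation adds some fraction of one more. `throughput_loss` and `recompute_fraction` are hypothetical names, and real runs also depend on kernel efficiency and communication overlap.

```python
def throughput_loss(recompute_fraction, backward_to_forward=2.0):
    """Fraction of throughput lost when part of the forward pass is re-run."""
    if not 0.0 <= recompute_fraction <= 1.0:
        raise ValueError("recompute_fraction must be between 0 and 1")
    base = 1.0 + backward_to_forward            # forward + backward, in forward-units
    with_recompute = base + recompute_fraction  # plus the re-run share of forward
    return 1.0 - base / with_recompute


for frac in (0.25, 0.5, 1.0):
    print(f"recompute {frac:.0%} of forward -> {throughput_loss(frac):.1%} lower throughput")
# prints:
# recompute 25% of forward -> 7.7% lower throughput
# recompute 50% of forward -> 14.3% lower throughput
# recompute 100% of forward -> 25.0% lower throughput
```

Under these assumptions, re-running the whole forward pass costs about 25% of throughput, in the same range as the 20-30% figures quoted on this page. Policies that re-run only cheap ops land lower.
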

Phase 4: Prescribe a Real Checkpoint Policy

Diagnose a real OOM and prescribe a checkpoint policy

1 drop
  1. Take a real OOM and write the policy

    18 min

Frequently asked questions

The “Understand Gradient Checkpointing” learning path covers each of the questions below, through daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.

What is gradient checkpointing in plain terms?
How much memory does gradient checkpointing actually save?
Why does gradient checkpointing slow down training by 20-30%?
When should I use selective checkpointing instead of full?
How is gradient checkpointing different from activation offloading?