🧊 Understand ZeRO and Its Three Stages

Pencil-and-paper your way through ZeRO stages 1, 2, and 3 — sharding optimizer state, then gradients, then params — until you can pick a stage for a 13B model on 8 A100s and justify it from memory math, not vibes.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Where GPU Memory Actually Goes

See what eats GPU memory before FLOPs do

4 drops
  1. Your GPU runs out of memory long before it runs out of FLOPs

    7 min

  2. Plain data-parallel replicates everything — even the optimizer

    6 min

  3. ZeRO has three stages because there are three things to shard

    7 min

  4. Memory savings aren't free — every stage adds communication

    7 min

Phase 2: Sharding Tensors One Stage at a Time

Shard one tensor at a time, counting comms

5 drops
  1. Shard one optimizer across four GPUs with a pencil

    7 min

  2. Adding gradients to the sharded list costs you nothing

    7 min

  3. Sharding parameters means every layer needs an all-gather

    8 min

  4. ZeRO doesn't touch activations — that's a separate fight

    7 min

  5. Comms time = bytes ÷ bandwidth — and you can predict it

    8 min

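As a rough illustration of the bytes ÷ bandwidth estimate in this phase, the sketch below assumes the per-GPU communication volumes reported in the ZeRO paper (about 2Ψ elements per step for plain data-parallel and for stages 1 and 2, about 3Ψ for stage 3, where Ψ is the parameter count), fp16 communication, and a hypothetical effective bandwidth of 300 GB/s. The function name and the bandwidth figure are assumptions for illustration, not measurements, and overlap with compute is ignored.

```python
def comm_seconds(num_params, stage, bytes_per_elem=2, bandwidth_bytes_per_s=300e9):
    """Estimate per-step communication time as bytes / bandwidth.

    Assumes 2 * num_params elements moved per GPU for stages 0-2 and
    3 * num_params for stage 3 (volumes taken from the ZeRO paper).
    bandwidth_bytes_per_s is a hypothetical effective figure.
    """
    volume_elems = (3 if stage == 3 else 2) * num_params
    return volume_elems * bytes_per_elem / bandwidth_bytes_per_s


if __name__ == "__main__":
    for stage in (2, 3):
        print(f"stage {stage}: {comm_seconds(13e9, stage):.3f} s")
    # stage 2: 0.173 s
    # stage 3: 0.260 s
```

Swapping in a measured bandwidth for your own interconnect changes the absolute numbers, but under these assumptions the stage 3 to stage 2 ratio stays at 1.5.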

Phase 3: ZeRO Across Frameworks and Tiers

Map ZeRO onto FSDP and ZeRO-Infinity

4 drops
  1. Your config says FullyShard but the doc points to ZeRO-3 — which is it?

    7 min

  2. Your CPU RAM and NVMe are just slower GPUs

    7 min

  3. ZeRO inside a node, pipeline across — and other Frankenstein configs

    8 min

  4. The decision tree fits on one napkin

    8 min

Phase 4: Pick a Stage and Defend It

Pick a stage for a 13B model on 8 A100s

1 drop
  1. Pick a stage for 13B on 8 A100s and write the memo

    8 min

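The memory math behind this final drop can be sketched in a few lines. The example assumes mixed-precision Adam as accounted for in the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 weights, momentum, variance). It counts model states only; activations, buffers, and fragmentation are left out. The function name is hypothetical.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB (1e9 bytes) for ZeRO stages 0-3.

    Assumes mixed-precision Adam: 2 + 2 + 12 bytes per parameter.
    Stage 0 is plain data-parallel with everything replicated.
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus     # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also shards parameters
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        print(f"stage {stage}: {model_state_gb(13e9, 8, stage):.2f} GB")
    # stage 0: 208.00 GB
    # stage 1: 71.50 GB
    # stage 2: 48.75 GB
    # stage 3: 26.00 GB
```

Under these assumptions, stage 2 already exceeds a 40 GB A100 before activations are counted, while on an 80 GB card it leaves some headroom. The memo still has to weigh that headroom against activation memory and the extra communication of stage 3.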

Frequently asked questions

What are the three stages of ZeRO?
What's the difference between ZeRO-1, ZeRO-2, and ZeRO-3?
How does ZeRO compare to FSDP in PyTorch?
Why does ZeRO-3 cost more communication than ZeRO-2?
When should I use ZeRO-Infinity instead of ZeRO-3?

Each of these questions is covered in the “Understand ZeRO and Its Three Stages” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.