What are the three stages of ZeRO?

This is covered in the "Understand ZeRO and Its Three Stages" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

What's the difference between ZeRO-1, ZeRO-2, and ZeRO-3?

This is covered in the "Understand ZeRO and Its Three Stages" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

How does ZeRO compare to FSDP in PyTorch?

This is covered in the "Understand ZeRO and Its Three Stages" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why does ZeRO-3 cost more communication than ZeRO-2?

This is covered in the "Understand ZeRO and Its Three Stages" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

When should I use ZeRO-Infinity instead of ZeRO-3?

This is covered in the "Understand ZeRO and Its Three Stages" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.


🧊 Understand ZeRO and Its Three Stages

Pencil-and-paper your way through ZeRO stages 1, 2, and 3 — sharding optimizer state, then gradients, then params — until you can pick a stage for a 13B model on 8 A100s and justify it from memory math, not vibes.
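As a rough illustration of that memory math, here is a minimal sketch. It assumes mixed-precision Adam at 2 bytes/param for fp16 weights, 2 bytes/param for fp16 gradients, and 12 bytes/param for optimizer state (fp32 master weights, momentum, variance). The function name `zero_model_state_gb` is hypothetical, and the estimate covers model states only, not activations, buffers, or fragmentation.

```python
def zero_model_state_gb(params_billion, num_gpus, stage):
    """Estimate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumption: mixed-precision Adam, i.e. 2 + 2 + 12 bytes per parameter.
    Activations and temporary buffers are NOT included.
    """
    params_bytes = 2.0   # fp16 weights
    grads_bytes = 2.0    # fp16 gradients
    optim_bytes = 12.0   # fp32 master weights + momentum + variance

    if stage >= 1:       # stage 1 shards optimizer state
        optim_bytes /= num_gpus
    if stage >= 2:       # stage 2 also shards gradients
        grads_bytes /= num_gpus
    if stage >= 3:       # stage 3 also shards parameters
        params_bytes /= num_gpus

    # billions of params * bytes per param = GB
    return params_billion * (params_bytes + grads_bytes + optim_bytes)


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_gb(13, 8, stage)
        print(f"stage {stage}: {gb:.2f} GB per GPU")
```

Under these assumptions, a 13B model on 8 GPUs goes from 208 GB per GPU with plain data parallelism to 71.5 GB, 48.75 GB, and 26 GB for stages 1, 2, and 3. Whether a given stage fits still depends on the GPU's memory size and on activation memory, which this sketch leaves out.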

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Where GPU Memory Actually Goes

See what eats GPU memory before FLOPs do

4 drops

Your GPU runs out of memory long before it runs out of FLOPs
7 min
Plain data-parallel replicates everything — even the optimizer
6 min
ZeRO has three stages because there are three things to shard
7 min
Memory savings aren't free — every stage adds communication
7 min

Phase 2: Sharding Tensors One Stage at a Time

Shard one tensor at a time, counting comms

5 drops

Shard one optimizer across four GPUs with a pencil
7 min
Adding gradients to the sharded list costs you nothing
7 min
Sharding parameters means every layer needs an all-gather
8 min
ZeRO doesn't touch activations — that's a separate fight
7 min
Comms time = bytes ÷ bandwidth — and you can predict it
8 min
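The bytes ÷ bandwidth estimate can be sketched in a few lines. This is a back-of-the-envelope model, not a benchmark. It assumes fp16 traffic (2 bytes per element) and a per-GPU volume of roughly 2× the parameter count per step for plain data parallelism and stages 1 and 2, versus roughly 3× for stage 3, which adds a parameter all-gather. The function name `comm_seconds` and the 100 GB/s bandwidth figure are hypothetical placeholders. Latency, the (N-1)/N ring factor, and overlap with compute are ignored.

```python
def comm_seconds(params_billion, stage, bandwidth_gb_per_s, bytes_per_elem=2):
    """Rough per-step communication time: bytes moved / bandwidth.

    Assumptions: volume is 2x the parameter count for stages 0-2
    (reduce-scatter + all-gather) and 3x for stage 3 (one extra
    parameter all-gather). No latency term, no compute overlap.
    """
    volume_multiplier = 3 if stage == 3 else 2
    # billions of elements * bytes per element = GB moved per GPU per step
    gb_moved = params_billion * volume_multiplier * bytes_per_elem
    return gb_moved / bandwidth_gb_per_s


if __name__ == "__main__":
    bandwidth = 100.0  # GB/s, hypothetical effective bandwidth
    for stage in (1, 2, 3):
        t = comm_seconds(13, stage, bandwidth)
        print(f"stage {stage}: ~{t:.2f} s of communication per step")
```

With these placeholder numbers, stages 1 and 2 come out the same and stage 3 costs about 1.5× as much. Real step times depend heavily on the interconnect and on how much communication overlaps with compute.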

Phase 3: ZeRO Across Frameworks and Tiers

Map ZeRO onto FSDP and ZeRO-Infinity

4 drops

Your config says FullyShard but the doc points to ZeRO-3 — which is it?
7 min
Your CPU RAM and NVMe are just slower GPUs
7 min
ZeRO inside a node, pipeline across — and other Frankenstein configs
8 min
The decision tree fits on one napkin
8 min

Phase 4: Pick a Stage and Defend It

Pick a stage for a 13B model on 8 A100s

1 drop

Pick a stage for 13B on 8 A100s and write the memo
8 min


Related paths

🐍 Python Decorators Introduction

🦀 Rust Lifetimes Explained

☸️ Kubernetes Core Concepts

📈 Big O Intuition