
🧮 Understand Data, Tensor, and Pipeline Parallelism

Walk one toy 4-layer model through every parallelism axis (DP, TP, PP) until the geometry sticks. By drop 14 you can pick a (DP, TP, PP) tuple for a 70B model on 64 GPUs and defend it with a cost model.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: The Three Axes You Can Split

Meet the three axes you can split

4 drops
  1. Three different things people call 'parallel'

    6 min


  2. Same model, different examples, sum the gradients

    6 min


  3. Cut the matmul, not the batch

    7 min


  4. Stack the layers across machines

    7 min


Phase 2: Tracing Four GPUs, Three Ways

Trace forward and backward on four GPUs

5 drops
  1. One forward pass under pure data parallelism

    7 min


  2. Activations crossing the fabric every layer

    7 min


  3. Microbatches filling the pipeline

    7 min


  4. Stack DP, TP, and PP on the same eight GPUs

    8 min


  5. Counting bytes per GPU under each regime

    7 min


Phase 3: Cost Models and When OOM Hits

Choose the right axis when OOM hits

4 drops
  1. Your 13B training run just crashed at step 4,000

    7 min


  2. Your throughput dropped 40% after moving to a bigger cluster

    7 min


  3. Your pipeline has a 35% bubble and you can't increase M

    8 min


  4. Choose (DP, TP, PP) for a 30B model on 32 GPUs

    8 min


Phase 4: Sizing a 70B Cluster

Defend a (DP, TP, PP) tuple for 70B

1 drop
  1. Defend a (DP, TP, PP) tuple for a 70B model on 64 GPUs

    8 min


Frequently asked questions

What is the difference between data, tensor, and pipeline parallelism?
When should you use tensor parallelism instead of data parallelism?
Why does pipeline parallelism need microbatching?
How do you choose a (DP, TP, PP) configuration for a large model?
Why is tensor parallelism usually kept within a single node?
Each of these questions is covered in the “Understand Data, Tensor, and Pipeline Parallelism” learning path, which builds from fundamentals to hands-on application in short daily micro-lessons of 5–8 minutes each.