🧮 Understand Data, Tensor, and Pipeline Parallelism
Walk one toy 4-layer model through every parallelism axis — DP, TP, PP — until the geometry sticks. By drop 14 you can pick a (DP, TP, PP) tuple for a 70B model on 64 GPUs and defend it from a cost model.
Phase 1: The Three Axes You Can Split
Meet the three axes you can split
Three different things people call 'parallel' (6 min)
Same model, different examples, sum the gradients (6 min)
Cut the matmul, not the batch (7 min)
Stack the layers across machines (7 min)
Phase 2: Tracing Four GPUs, Three Ways
Trace forward and backward on four GPUs
One forward pass under pure data parallelism (7 min)
Activations crossing the fabric every layer (7 min)
Microbatches filling the pipeline (7 min)
Stack DP, TP, and PP on the same eight GPUs (8 min)
Counting bytes per GPU under each regime (7 min)
Phase 3: Cost Models and When OOM Hits
Choose the right axis when OOM hits
Your 13B training run just crashed at step 4,000 (7 min)
Your throughput dropped 40% after moving to a bigger cluster (7 min)
Your pipeline has a 35% bubble and you can't increase M (8 min)
Choose (DP, TP, PP) for a 30B model on 32 GPUs (8 min)
Phase 4: Sizing a 70B Cluster
Defend a (DP, TP, PP) tuple for 70B
Defend a (DP, TP, PP) tuple for a 70B model on 64 GPUs (8 min)
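To make the final exercise concrete, here is a minimal sketch of the search space behind "pick a (DP, TP, PP) tuple for 64 GPUs". It only enumerates tuples whose degrees multiply to the GPU count; it does not score them. The function name `candidate_tuples` and the `gpus_per_node=8` default are assumptions for illustration, not part of the path. The cap on TP reflects the common practice of keeping tensor parallelism inside one node.

```python
from itertools import product


def candidate_tuples(num_gpus, gpus_per_node=8):
    """List every (dp, tp, pp) with dp * tp * pp == num_gpus.

    Assumption: tp is capped at gpus_per_node so that tensor-parallel
    traffic stays on the intra-node interconnect.
    """
    # Only divisors of num_gpus can be valid parallelism degrees.
    degrees = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    tuples = []
    for tp, pp in product(degrees, degrees):
        if tp > gpus_per_node:
            continue  # would stretch TP across nodes
        if num_gpus % (tp * pp) != 0:
            continue  # no integer dp completes the tuple
        dp = num_gpus // (tp * pp)
        tuples.append((dp, tp, pp))
    return sorted(tuples)


tuples_64 = candidate_tuples(64)
for dp, tp, pp in tuples_64:
    print(f"DP={dp:2d}  TP={tp}  PP={pp:2d}")
```

A cost model would then rank these candidates by per-GPU memory, communication volume, and pipeline bubble; the enumeration only shows how small the set of legal choices is.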
Frequently asked questions
- What is the difference between data, tensor, and pipeline parallelism?
- This is covered in the “Understand Data, Tensor, and Pipeline Parallelism” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- When should you use tensor parallelism instead of data parallelism?
- This is covered in the “Understand Data, Tensor, and Pipeline Parallelism” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- Why does pipeline parallelism need microbatching?
- This is covered in the “Understand Data, Tensor, and Pipeline Parallelism” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How do you choose a (DP, TP, PP) configuration for a large model?
- This is covered in the “Understand Data, Tensor, and Pipeline Parallelism” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- Why is tensor parallelism limited to a single node?
- This is covered in the “Understand Data, Tensor, and Pipeline Parallelism” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.