🧭 Understand MoE Routing and Load Balancing
Open the MoE router black box piece by piece (softmax gate, top-k, auxiliary loss, capacity factor, token dropping) until you can predict how a capacity factor of 1.0 versus 1.25 changes wasted compute and dropped tokens, then verify the prediction with an ablation.
Phase 1: From dense FFN to conditional routing
Why dense FFNs waste compute and what conditional routing replaces them with
- A dense FFN runs every neuron on every token, even when it shouldn't (6 min)
- The router is a single linear layer plus softmax (6 min)
- Without a load-balancing loss, the router picks two experts forever (7 min)
- Capacity factor decides how many tokens each expert can refuse (7 min)
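
Phase 1's starting point, that a dense FFN spends compute on every token while conditional routing activates only a few experts, can be made concrete with parameter arithmetic. The following is a minimal sketch, assuming a two-matrix FFN without biases and hypothetical sizes (`d_model=1024`, `d_ff=4096`, 8 experts, top-2); none of these numbers come from the path itself.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one FFN: up-projection plus down-projection, biases ignored."""
    return 2 * d_model * d_ff


def moe_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) weights for one MoE layer.

    Assumes every expert is a full-size FFN and the router is a single
    d_model x num_experts linear layer (illustrative assumption).
    """
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * expert + router
    active = top_k * expert + router  # only the chosen experts run per token
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
print(total)             # 67117056 weights stored
print(active)            # 16785408 weights touched per token
print(active / total)    # roughly a quarter of the layer runs for each token
```

Stored capacity grows with the number of experts, while per-token compute grows only with `top_k`, which is the trade the rest of the path examines.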
Phase 2: Trace a token through the router
Trace one token through softmax gate, top-k, and auxiliary loss
- From hidden state to gate probabilities in three lines of code (6 min)
- Switch routing sends each token to exactly one expert (6 min)
- The auxiliary loss is one dot product, computed per batch (7 min)
- All-to-all is where routing becomes a distributed-systems problem (8 min)
- Dropped tokens skip the MoE layer entirely; only the residual survives (7 min)
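
The first three steps of Phase 2 (gate probabilities, top-k selection, auxiliary loss) can be sketched in plain Python. This is an illustrative sketch, not the path's own code: it assumes the Switch Transformer form of the auxiliary loss, `alpha * N * sum(f_i * P_i)`, where `f_i` is the fraction of tokens whose top-1 choice is expert `i` and `P_i` is the mean gate probability for expert `i`. The router weights, the toy tokens, and `alpha=0.01` are hypothetical values.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(hidden, router_weight, top_k=1):
    """hidden: d floats; router_weight: num_experts rows of d floats."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weight]
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return probs, order[:top_k]


def aux_loss(all_probs, all_chosen, num_experts, alpha=0.01):
    n = len(all_probs)
    # f[i]: fraction of tokens whose top-1 expert is i
    f = [sum(1 for c in all_chosen if c[0] == i) / n for i in range(num_experts)]
    # P[i]: mean gate probability assigned to expert i
    P = [sum(p[i] for p in all_probs) / n for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Hypothetical router: 4 experts, hidden size 2.
W = [[1, 0], [0, 1], [-1, 0], [0, -1]]
balanced_tokens = [[2, 0], [0, 2], [-2, 0], [0, -2]]  # one token per expert
collapsed_tokens = [[2, 0]] * 4                       # every token picks expert 0


def batch_loss(tokens):
    routed = [route(t, W) for t in tokens]
    return aux_loss([p for p, _ in routed], [c for _, c in routed], len(W))


balanced = batch_loss(balanced_tokens)
collapsed = batch_loss(collapsed_tokens)
print(balanced, collapsed)  # the collapsed batch has the larger loss
```

With perfectly uniform routing the loss equals `alpha`; any concentration of tokens on a few experts pushes it above that floor, which is the pressure that keeps the router from collapsing.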
Phase 3: How real MoEs trade off routing choices
How Switch, Mixtral, and DeepSeek-MoE pick different points on the same axis
- A teammate proposes 'just use top-1 like Switch' to halve training cost (7 min)
- Mixtral has 8 experts; DeepSeek-MoE has 64: why pick a number? (8 min)
- Training loss is fine, but the utilization plot shows two hot experts (8 min)
- Your inference batch is small: does the router behave the same? (8 min)
Phase 4: Predict and verify capacity-factor effects
Predict capacity-factor effects on an imbalanced batch, then ablate
- Predict cf=1.0 vs cf=1.25 on an imbalanced batch, then ablate to verify (12 min)
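
The Phase 4 prediction can be rehearsed on a toy batch before running a real ablation. This is a minimal sketch under stated assumptions: top-1 routing, per-expert capacity computed as `ceil(capacity_factor * tokens / num_experts)` (implementations differ on rounding), and a hypothetical 16-token batch skewed toward expert 0. The helper `capacity_report` is invented for illustration.

```python
import math
from collections import Counter


def capacity_report(assignments, num_experts, capacity_factor):
    """Return (capacity, dropped_tokens, empty_slots) for top-1 assignments."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    load = Counter(assignments)  # missing experts count as zero load
    dropped = sum(max(0, load[e] - capacity) for e in range(num_experts))
    empty = sum(max(0, capacity - load[e]) for e in range(num_experts))
    return capacity, dropped, empty


# Hypothetical imbalanced batch: 16 tokens over 4 experts, loads 7/5/3/1.
assignments = [0] * 7 + [1] * 5 + [2] * 3 + [3] * 1

print(capacity_report(assignments, 4, 1.0))   # (4, 4, 4)
print(capacity_report(assignments, 4, 1.25))  # (5, 2, 6)
```

On this batch, raising the capacity factor from 1.0 to 1.25 halves the dropped tokens (4 to 2) while the padded, unused slots grow from 4 to 6: fewer tokens fall back to the residual path, at the price of more wasted compute.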
Frequently asked questions
- What is the router in a mixture-of-experts model actually doing?
- Why does MoE training need an auxiliary load-balancing loss?
- What does 'capacity factor 1.25' mean and why does raising it matter?
- How is top-1 (Switch) routing different from top-2 (Mixtral) routing?
- Why do MoE models drop tokens, and when is that acceptable?

Each of these questions is covered in the “Understand MoE Routing and Load Balancing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.