⚡Understand FlashAttention and Tiling
Stop treating FlashAttention as a mystery flag — understand the tiling, online softmax, and HBM-vs-SRAM tradeoff that turn the same attention math into 2-4× speedups. By the end you can estimate FA's win for any sequence length on graph paper, before touching CUDA.
Phase 1: Why attention bleeds memory before it bleeds math
See why attention is memory-bound, not compute-bound
Attention isn't slow — its memory traffic is
6 min
SRAM is the scratchpad nobody used
6 min
The N×N matrix that never had to exist
7 min
Same math, completely different schedule
7 min
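The Phase 1 lessons above turn on one piece of arithmetic: the N×N score matrix grows quadratically, while Q, K, V, and the output grow only linearly. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the 2-byte element size (fp16 or bf16) and head dimension of 64 are assumptions chosen for illustration.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to hold one head's N x N score matrix (assumes 2-byte elements)."""
    return seq_len * seq_len * bytes_per_element


def qkvo_bytes(seq_len, head_dim, bytes_per_element=2):
    """Bytes for one head's Q, K, V and output O, each of shape N x d."""
    return 4 * seq_len * head_dim * bytes_per_element


MIB = 1024 ** 2
HEAD_DIM = 64  # illustrative assumption

for n in (1024, 4096, 16384):
    scores = score_matrix_bytes(n)
    tensors = qkvo_bytes(n, HEAD_DIM)
    print(f"N={n}: scores {scores / MIB:.1f} MiB, "
          f"Q/K/V/O {tensors / MIB:.1f} MiB, ratio {scores / tensors:.0f}x")
```

Under these assumptions the ratio works out to N / (4d), so the score matrix dominates as soon as the sequence is a few times longer than the head dimension.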
Phase 2: Tiling and online softmax on graph paper
Run online softmax by hand across tiled blocks
Cut the matrix until it fits in your scratchpad
7 min
The running-max trick that streams softmax exactly
8 min
Walk one Q tile through every K tile on paper
8 min
The backward pass that never stored the matrix
7 min
Causal masking that doesn't waste tiles
6 min
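The running-max trick from the lessons above can be checked in a few lines before doing it by hand. The sketch below walks a single query row through K/V tiles and compares the result with ordinary softmax attention. It is a pure-Python illustration with hypothetical function names, not the FlashAttention kernel: it ignores the Q tiling, the backward pass, and causal masking.

```python
import math


def attention_row_reference(q, keys, values):
    """One query row the standard way: compute every score, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_row_tiled(q, keys, values, block_size):
    """Same row, visiting K/V one tile at a time with online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_tile = keys[start:start + block_size]
        v_tile = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            for j in range(dim):
                acc[j] += p * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up inputs: 5 keys/values with head dimension 2.
q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.7], [3.0, -0.4], [0.2, 0.2]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [-2.0, 3.0], [0.5, 0.5]]

reference = attention_row_reference(q, keys, values)
for block_size in (1, 2, 3, 5):
    tiled = attention_row_tiled(q, keys, values, block_size)
    gap = max(abs(a - b) for a, b in zip(tiled, reference))
    print(f"block_size={block_size}: max abs difference {gap:.2e}")
```

The only state carried between tiles is the running max, the running sum, and the accumulator, which is why the full row of scores never needs to be stored.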
Phase 3: FA1 vs FA2 vs FA3 (same math, better schedule)
Compare FA1, FA2, and FA3 as same-math reschedules
FA2 flipped the loop and doubled the speedup
7 min
FA3 stops waiting for memory and starts overlapping it
8 min
Your team turned on FA. Which version is running?
7 min
PyTorch SDPA, xFormers, and the FA family tree
7 min
Phase 4: Estimate FA speedup from a roofline you draw
Draw the roofline. Predict FA's win for your sequence.
8 min
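A roofline estimate of the kind this phase describes can be sketched as follows. Everything here is a rough model under stated assumptions: the peak throughput and bandwidth are round numbers for a hypothetical accelerator, the "materialized" traffic model assumes the score and probability matrices are each written to HBM once and read back once, the "fused" model is an idealized lower bound that reads Q, K, V once and writes the output once, and softmax FLOPs are ignored. The function names are hypothetical.

```python
def attention_flops(seq_len, head_dim):
    """FLOPs for Q K^T and P V: two N x N x d matmuls, 2 FLOPs per multiply-add."""
    return 4 * seq_len * seq_len * head_dim


def materialized_bytes(seq_len, head_dim, bytes_per_element=2):
    """ASSUMED traffic when S and P round-trip through HBM (4 N^2 elements)
    on top of reading Q, K, V and writing O (4 N d elements)."""
    return (4 * seq_len * seq_len + 4 * seq_len * head_dim) * bytes_per_element


def fused_bytes(seq_len, head_dim, bytes_per_element=2):
    """IDEALIZED traffic when nothing N x N ever reaches HBM (4 N d elements)."""
    return 4 * seq_len * head_dim * bytes_per_element


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline: performance is capped by compute or by intensity * bandwidth."""
    return min(peak_flops, intensity * bandwidth)


PEAK_FLOPS = 300e12  # hypothetical accelerator, FLOP/s
BANDWIDTH = 1.5e12   # hypothetical HBM bandwidth, bytes/s
HEAD_DIM = 64        # illustrative assumption

for n in (1024, 4096, 16384):
    flops = attention_flops(n, HEAD_DIM)
    slow = attainable_flops(flops / materialized_bytes(n, HEAD_DIM),
                            PEAK_FLOPS, BANDWIDTH)
    fast = attainable_flops(flops / fused_bytes(n, HEAD_DIM),
                            PEAK_FLOPS, BANDWIDTH)
    print(f"N={n}: roofline upper bound on speedup {fast / slow:.1f}x")
```

Treat the printed ratio as an optimistic ceiling rather than a prediction. Real kernels re-read K and V tiles, spend time on softmax, and rarely reach peak throughput, so measured gains sit below it.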
Frequently asked questions
- Is FlashAttention an approximation of regular attention?
- Why is standard attention memory-bound instead of compute-bound?
- What is online softmax and why does it stay exact across tiles?
- How does FlashAttention-2 differ from FlashAttention-1?
- What changes in FlashAttention-3 on Hopper GPUs?

Each of these questions is covered in the “Understand FlashAttention and Tiling” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.