🧮 Choose a Quantization Format: GPTQ vs AWQ vs EXL2 vs GGUF
Stop picking quantization formats from Reddit threads. You'll separate algorithm, file format, and runtime kernel into three clean decisions — then justify any pick for Ollama, vLLM, or a single 4090.
Phase 1: The Three-Layer Stack Hidden Inside Every Quant
Untangle algorithm, file format, and runtime kernel
A quant format is three decisions, not one (7 min)
4-bit weights changed local LLM economics (6 min)
GPTQ rounds carefully. AWQ rounds the important weights. (7 min)
GGUF is a container, not an algorithm (7 min)
Phase 2: Running the Same Model Three Different Ways
Benchmark GGUF, AWQ, and EXL2 on identical prompts
Three runtimes, one model: set the experiment (6 min)
Run Llama-3-8B in GGUF Q4_K_M with llama.cpp (8 min)
Run the same model AWQ-quantized on vLLM (8 min)
Run the same model EXL2 on ExLlamaV2 (8 min)
Three rows in your table: what they actually mean (7 min)
Phase 3: Why Each Format Wins Its Battle
Trace why each format wins different battles
The Mac developer who thought AWQ would help (7 min)
The H100 fleet that should not run GGUF (8 min)
The 4090 user who needs Llama-70B to fit (8 min)
The model card that says AWQ but ships Marlin-ready bytes (8 min)
Phase 4: Recommending Stacks You Can Defend
Recommend and justify formats for real deployments
Build your three-deployment recommendation memo (8 min)
Frequently asked questions
- What's the actual difference between GPTQ, AWQ, EXL2, and GGUF?
- Which quantization format is fastest on an RTX 4090?
- Can GGUF run on GPU or is it CPU-only?
- Why is Marlin-AWQ so much faster than naive AWQ?
- Should I use Q4_K_M or Q5_K_M for local Llama-3-8B?
Each of these questions is covered in the “Choose a Quantization Format: GPTQ vs AWQ vs EXL2 vs GGUF” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.