🟢 Stats: A/B Testing Pitfalls (Interviewers Love These)
Peeking (the #1 pitfall): checking results daily and stopping the test as soon as you see significance. A test designed for α=0.05 can have a 20-30% actual false positive rate with repeated peeking.
Fix: Pre-commit to a sample size and duration, or use sequential testing methods (e.g., always-valid p-values).
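To make the peeking problem concrete, here is a minimal simulation (not from the source; all numbers illustrative). It runs A/A tests with no true effect and compares a fixed-horizon analysis against one that re-tests after every day of data and stops at the first "significant" p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_sims=1000, n_per_day=200, n_days=20, alpha=0.05, peek=True):
    """Simulate A/A tests (no true effect) and count how often we 'detect' one."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=(n_days, n_per_day))
        b = rng.normal(size=(n_days, n_per_day))
        if peek:
            # Peeking: test the accumulated data after every day, stop at the first p < alpha.
            for day in range(1, n_days + 1):
                _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
                if p < alpha:
                    hits += 1
                    break
        else:
            # Fixed horizon: a single test at the pre-committed end of the experiment.
            _, p = stats.ttest_ind(a.ravel(), b.ravel())
            if p < alpha:
                hits += 1
    return hits / n_sims

print("peeking:      ", false_positive_rate(peek=True))   # well above the nominal 0.05
print("fixed horizon:", false_positive_rate(peek=False))  # ≈ 0.05, as designed
```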
Multiple Testing: testing 20 metrics at α=0.05? Expect roughly one false positive by pure chance (20 × 0.05 = 1).
Fix: Bonferroni correction (use α/k for k tests), or designate ONE primary metric beforehand.
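As a quick sketch (hypothetical metric names and p-values), Bonferroni simply tightens the per-test threshold to α/k:

```python
# Bonferroni: with k tests, require p < alpha / k (equivalently, compare k * p to alpha).
alpha = 0.05
p_values = {"ctr": 0.012, "revenue": 0.040, "retention": 0.300, "session_length": 0.002}  # hypothetical

k = len(p_values)
for metric, p in p_values.items():
    significant = p < alpha / k
    print(f"{metric}: p={p:.3f}, significant after Bonferroni (threshold {alpha / k:.4f})? {significant}")
```

statsmodels' `multipletests(..., method="bonferroni")` does the same adjustment if you prefer a library call.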
Novelty Effect: users engage more with something just because it's new, not because it's better.
Fix: Run long enough for novelty to wear off (typically 2-4 weeks).
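One way to spot a novelty effect, sketched here with made-up weekly numbers and a hypothetical `conv_rate` column: if the treatment-vs-control lift keeps shrinking as the test runs, the early weeks were likely inflated by novelty.

```python
import pandas as pd

# Hypothetical weekly results; 'week' is the week of the test the measurement comes from.
df = pd.DataFrame({
    "group":     ["treatment", "control"] * 4,
    "week":      [1, 1, 2, 2, 3, 3, 4, 4],
    "conv_rate": [0.14, 0.10, 0.12, 0.10, 0.11, 0.10, 0.105, 0.10],
})

# If the lift decays week over week, suspect novelty rather than a durable improvement.
lift = (df.pivot(index="week", columns="group", values="conv_rate")
          .assign(lift=lambda t: t["treatment"] - t["control"]))
print(lift)
```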
Simpson's Paradox: aggregate results can show the OPPOSITE of segment-level results.
Example: Treatment looks worse overall, but better in EVERY demographic — because the treatment group had proportionally more users from a harder-to-convert segment.
Fix: Always check segment-level results.
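Here is a toy example with made-up numbers that reproduces the reversal described above: treatment wins in both segments but loses in the pooled totals, purely because the treatment group is over-exposed to the harder-to-convert segment.

```python
import pandas as pd

# Toy numbers (made up) that produce the reversal.
data = pd.DataFrame({
    "segment":     ["easy", "easy", "hard", "hard"],
    "group":       ["control", "treatment", "control", "treatment"],
    "users":       [800, 200, 200, 800],
    "conversions": [80, 24, 4, 24],
})

# Segment-level rates: treatment wins in BOTH segments (12% vs 10%, 3% vs 2%).
by_segment = data.assign(rate=data["conversions"] / data["users"])
print(by_segment)

# Aggregate rates: treatment loses overall (~4.8% vs ~8.4%), because most of its
# traffic came from the hard-to-convert segment.
overall = data.groupby("group")[["users", "conversions"]].sum()
overall["rate"] = overall["conversions"] / overall["users"]
print(overall)
```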
Network Effects: on social platforms, treated and control users interact, so the treatment's effect spills over into the control group and contaminates the comparison.
Fix: Cluster randomization (by geography, social graph, or time).
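A minimal sketch of geography-based cluster randomization (hypothetical user table and city names): assign whole cities rather than individual users, so connected users see the same variant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical user table with a cluster key (city). Everyone in the same city
# gets the same assignment, so treated and control users don't mix within a cluster.
users = pd.DataFrame({
    "user_id": range(8),
    "city":    ["NYC", "NYC", "SF", "SF", "LA", "LA", "CHI", "CHI"],
})

cities = users["city"].unique()
assignment = dict(zip(cities, rng.permutation(["treatment", "control"] * (len(cities) // 2))))
users["group"] = users["city"].map(assignment)
print(users)

# Note: analyze at the cluster level (or use cluster-robust standard errors),
# since users within a cluster are correlated and the effective sample size is smaller.
```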