
🟢 Stats: A/B Testing Pitfalls (Interviewers Love These)

Peeking — The #1 Pitfall

Checking results daily and stopping the moment you see significance. A test designed for α=0.05 can have a 20-30% actual false positive rate with repeated peeking.

Fix: Pre-commit to a sample size and duration. Or use sequential testing methods (e.g., always-valid p-values).
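A quick Monte Carlo sketch of why peeking hurts, under illustrative assumptions (14-day horizon, 200 users per arm per day, no true effect): running a t-test every day and stopping at the first p < 0.05 yields a false positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_sims=2000, days=14, users_per_day=200, alpha=0.05):
    """Simulate an A/A test (no real effect), peeking with a t-test each day
    and stopping at the first p < alpha. Returns the fraction of sims that 'win'."""
    false_positives = 0
    for _ in range(n_sims):
        a, b = np.array([]), np.array([])
        for _ in range(days):
            a = np.concatenate([a, rng.normal(0, 1, users_per_day)])
            b = np.concatenate([b, rng.normal(0, 1, users_per_day)])
            _, p = stats.ttest_ind(a, b)
            if p < alpha:            # peek and stop at the first "significant" day
                false_positives += 1
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # well above the nominal 0.05
```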

Multiple Testing

Testing 20 metrics at α=0.05? Expect 1 false positive by pure chance.

Fix: Bonferroni correction (use α/k for k tests), or designate ONE primary metric beforehand.
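The arithmetic behind both numbers, as a short sketch: under the null, k independent tests produce k·α expected false positives, and the chance of at least one is 1 − (1 − α)^k; Bonferroni simply tests each metric at α/k.

```python
k, alpha = 20, 0.05

# Expected number of false positives under the null
print(k * alpha)                      # 1.0, matching "expect 1 false positive"

# Probability of at least one false positive across k independent tests
print(1 - (1 - alpha) ** k)           # ~0.64

# Bonferroni: test each metric at alpha / k to keep the family-wise rate <= alpha
print(alpha / k)                      # 0.0025
```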

Novelty Effect

Users engage more with something just because it's new, not because it's better.

Fix: Run long enough for novelty to wear off (typically 2-4 weeks).
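One way to spot it, sketched below with synthetic data (the column names, decay schedule, and sample sizes are all illustrative assumptions): break the lift out by week; if the treatment effect shrinks toward zero as the test runs, the early win was probably novelty.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic data: treatment lift starts high in week 1 and decays to ~0 (pure novelty).
weeks = np.repeat([1, 2, 3, 4], 5000)
group = rng.choice(["control", "treatment"], size=len(weeks))
base_rate = 0.10
novelty_lift = {1: 0.04, 2: 0.02, 3: 0.005, 4: 0.0}   # illustrative decay
p = base_rate + np.where(group == "treatment", [novelty_lift[w] for w in weeks], 0.0)
converted = rng.random(len(weeks)) < p

df = pd.DataFrame({"week": weeks, "group": group, "converted": converted})

# Lift by week: if it shrinks toward zero over time, the early win was likely novelty.
by_week = df.pivot_table(index="week", columns="group", values="converted", aggfunc="mean")
print((by_week["treatment"] - by_week["control"]).round(3))
```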

Simpson's Paradox

Aggregate results can show the OPPOSITE of segment-level results.

Example: Treatment looks worse overall, but better in EVERY demographic — because the treatment group had proportionally more users from a harder-to-convert segment.

Fix: Always check segment-level results.
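The reversal is easiest to see with concrete (made-up) counts, as in the sketch below: treatment wins within every segment yet loses in aggregate, because its mix skews toward the hard-to-convert segment.

```python
import pandas as pd

# Illustrative counts: treatment wins in BOTH segments but loses in aggregate,
# because the treatment arm contains far more hard-to-convert users.
data = pd.DataFrame([
    {"group": "control",   "segment": "easy", "users": 800, "conversions": 400},  # 50%
    {"group": "control",   "segment": "hard", "users": 200, "conversions": 20},   # 10%
    {"group": "treatment", "segment": "easy", "users": 200, "conversions": 110},  # 55%
    {"group": "treatment", "segment": "hard", "users": 800, "conversions": 120},  # 15%
])

by_segment = data.groupby(["segment", "group"]).sum(numeric_only=True)
by_segment["rate"] = by_segment["conversions"] / by_segment["users"]
print(by_segment["rate"])       # treatment beats control within each segment

overall = data.groupby("group").sum(numeric_only=True)
print(overall["conversions"] / overall["users"])  # ...but loses overall (0.23 vs 0.42)
```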

Network Effects

On social platforms, treated and control users interact, contaminating results.

Fix: Cluster randomization (by geography, social graph, or time).
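A minimal sketch of geo-based cluster randomization (city names, the engagement metric, and cluster counts are placeholders): assign whole cities to an arm, then analyze one aggregated metric per cluster rather than per user.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Randomize whole cities (clusters) so treated and control users can't interact.
cities = [f"city_{i}" for i in range(40)]
shuffled = rng.permutation(cities)
treated = set(shuffled[:20])                        # half of the clusters get treatment

# Analyze at the cluster level: one aggregated metric per city, then compare arms.
# (A user-level test would overstate the effective sample size.)
city_metric = pd.DataFrame({
    "city": cities,
    "group": ["treatment" if c in treated else "control" for c in cities],
    "engagement": rng.normal(1.0, 0.1, len(cities)),   # placeholder per-city metric
})
t = city_metric.loc[city_metric.group == "treatment", "engagement"]
c = city_metric.loc[city_metric.group == "control", "engagement"]
print(stats.ttest_ind(t, c, equal_var=False))
```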

Practice Questions

Q: You run an A/B test for 3 days, see p=0.02, and your PM wants to ship immediately. What do you say?