A/B Testing
A/B testing provides a statistical framework for measuring the effect of product changes. This section covers the statistical principles underlying A/B test design and analysis.
Statistical Framework
A/B testing is an application of hypothesis testing.
Null hypothesis (H0): No difference exists between control and treatment groups. Observed variation is due to random chance.
Alternative hypothesis (H1): A true difference exists between groups.
The objective is to collect sufficient evidence to reject the null hypothesis with controlled error rates.
Error Types and Trade-offs
|  | H0 Actually True | H0 Actually False |
|---|---|---|
| Reject H0 | Type I error (false positive) | Correct decision |
| Fail to reject H0 | Correct decision | Type II error (missed effect) |

| Parameter | Definition | Typical Value |
|---|---|---|
| alpha (significance level) | Probability of false positive | 0.05 (5%) |
| beta | Probability of false negative | 0.20 (20%) |
| Power (1-beta) | Probability of detecting true effect | 0.80 (80%) |
Trade-off: At a fixed sample size, decreasing alpha reduces false positives but increases the probability of missing true effects; the two error rates cannot both be minimized simultaneously.
Statistical Formulas
Comparing Conversion Rates (Proportions)
The z-statistic is calculated as the difference between two proportions (p1 - p2) divided by the standard error. The standard error is the square root of p*(1-p)*(1/n1 + 1/n2), where p is the pooled conversion rate calculated as total conversions divided by total sample size.
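A minimal sketch of this calculation in Python; the conversion counts and group sizes are hypothetical and only illustrate the pooled standard error described above:

```python
import math
from scipy.stats import norm

# Hypothetical results: conversions and sample sizes for control (1) and treatment (2)
x1, n1 = 300, 10_000   # control: 3.0% conversion
x2, n2 = 340, 10_000   # treatment: 3.4% conversion

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # pooled conversion rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p2 - p1) / se                                     # z-statistic
p_value = 2 * (1 - norm.cdf(abs(z)))                   # two-sided p-value

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The same test is also available pre-packaged as statsmodels.stats.proportion.proportions_ztest.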
Comparing Averages (Continuous Metrics)
The t-statistic is calculated as the difference between two means divided by the standard error. The standard error is the square root of the sum of each group's variance divided by its sample size.
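A corresponding sketch for a continuous metric, using simulated per-user values (the metric, means, and sample sizes are assumptions for illustration) and the unequal-variance (Welch) form described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-user values of a continuous metric (e.g., minutes per session)
control = rng.normal(loc=5.0, scale=2.0, size=4_000)
treatment = rng.normal(loc=5.1, scale=2.0, size=4_000)

# Manual t-statistic: mean difference over sqrt(var1/n1 + var2/n2)
se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
t_manual = (treatment.mean() - control.mean()) / se

# Cross-check with SciPy (equal_var=False gives the Welch test)
t_scipy, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"t = {t_manual:.3f} (scipy: {t_scipy:.3f}), p-value = {p_value:.4f}")
```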
Sample Size Calculation
Sample size depends on four factors:
| Factor | Effect on Required Sample Size |
|---|---|
| Minimum Detectable Effect (MDE) | Smaller effects require larger samples |
| Metric variance | Higher variance requires larger samples |
| Significance level (alpha) | Lower alpha requires larger samples |
| Power (1-beta) | Higher power requires larger samples |
Calculation Example
Using statistical power analysis (available in libraries like statsmodels), you provide the effect size (e.g., 5% relative lift), desired power (e.g., 80%), significance level (e.g., 5%), and traffic split ratio (e.g., 1.0 for equal split). The function then calculates the required sample size per group.
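A sketch of this calculation with statsmodels, using the reference scenario below (3% baseline, 5% relative lift, 80% power, 5% two-sided significance); the variable names and rounding are illustrative:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03                      # baseline conversion rate
lift = 0.05                          # minimum detectable effect, relative
treatment_rate = baseline * (1 + lift)

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(treatment_rate, baseline)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    power=0.80,        # 1 - beta
    alpha=0.05,        # significance level (two-sided)
    ratio=1.0,         # equal traffic split
)
print(f"Required sample size per group: {n_per_group:,.0f}")  # roughly 208,000
```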
Reference Values
To detect a 5% relative lift on a 3% baseline conversion rate with 80% power and 5% significance (two-sided), approximately 208,000 users per group are required.
Common Interview Topics
Sample Size Determination
Factors to address: baseline metric value, minimum detectable effect, required power, significance level. Explain trade-offs between sample size and detectable effect size.
Statistical vs Practical Significance
A statistically significant result (p < 0.05) does not indicate practical importance. A 0.01% lift may be statistically significant with sufficient sample size but may not justify implementation costs. Evaluation should consider effect magnitude, implementation effort, and opportunity cost.
Multiple Comparisons
Testing 20 metrics produces an expected 1 false positive at alpha = 0.05. Solutions (a brief sketch follows the list):
- Bonferroni correction: alpha/n
- False Discovery Rate control
- Pre-specification of primary metric
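The first two corrections can be applied with statsmodels; in this sketch the raw p-values are made up for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from testing several secondary metrics
p_values = [0.001, 0.02, 0.04, 0.06, 0.30, 0.45, 0.51, 0.78]

# Bonferroni: effectively compares each p-value against alpha / n
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate rather than the family-wise error rate
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:", reject_bonf)
print("FDR (BH) rejections:  ", reject_fdr)
```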
Early Stopping
Repeated significance testing inflates the false positive rate. If the power calculation calls for a two-week test, results should not be evaluated before the two weeks have elapsed unless a sequential testing method designed for continuous monitoring (e.g., alpha spending functions) is used.
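The inflation is easy to demonstrate by simulation: under the null hypothesis (an A/A test with no true difference), checking a running z-test at every interim look and stopping at the first "significant" result rejects far more often than 5%. A rough sketch; all parameters are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments = 2_000       # simulated A/A tests (no true effect)
n_checks = 20               # number of interim looks
users_per_check = 500       # users added to each group between looks
false_positives = 0

for _ in range(n_experiments):
    c = t = c_n = t_n = 0
    significant = False
    for _ in range(n_checks):
        # Both groups draw from the same 5% conversion rate, so H0 is true
        c += rng.binomial(users_per_check, 0.05)
        t += rng.binomial(users_per_check, 0.05)
        c_n += users_per_check
        t_n += users_per_check
        p_pool = (c + t) / (c_n + t_n)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / c_n + 1 / t_n))
        if se > 0 and abs(t / t_n - c / c_n) / se > norm.ppf(0.975):
            significant = True   # stop at the first nominally significant look
            break
    false_positives += significant

print(f"False positive rate with peeking: {false_positives / n_experiments:.1%}")  # well above 5%
```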
Ratio Metrics
Metrics such as revenue per user present challenges because both the numerator and the denominator vary. Standard t-tests are not appropriate; use the delta method or bootstrap methods instead.
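One common workaround is a user-level bootstrap: resample users (the randomization unit) with replacement and recompute the ratio each time. A rough sketch with made-up per-user data and a revenue-per-session ratio; the delta method would instead propagate the variances and covariance of numerator and denominator analytically:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users = 5_000

def simulate_group(mean_rev):
    # Hypothetical per-user data: revenue (numerator) and sessions (denominator)
    revenue = rng.exponential(scale=mean_rev, size=n_users)
    sessions = rng.poisson(lam=3, size=n_users) + 1
    return revenue, sessions

rev_c, ses_c = simulate_group(2.0)   # control
rev_t, ses_t = simulate_group(2.1)   # treatment

def ratio(rev, ses):
    return rev.sum() / ses.sum()     # ratio metric, e.g., revenue per session

# Bootstrap the difference in the ratio metric, resampling users with replacement
diffs = []
for _ in range(2_000):
    ic = rng.integers(0, n_users, size=n_users)
    it = rng.integers(0, n_users, size=n_users)
    diffs.append(ratio(rev_t[it], ses_t[it]) - ratio(rev_c[ic], ses_c[ic]))

low, high = np.percentile(diffs, [2.5, 97.5])
observed = ratio(rev_t, ses_t) - ratio(rev_c, ses_c)
print(f"Observed difference: {observed:.4f}, 95% CI: ({low:.4f}, {high:.4f})")
```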
Common Errors
| Error | Consequence |
|---|---|
| Repeated result checking | Inflated false positive rate |
| Excessive variants (A/B/C/D/E) | Reduced statistical power per comparison |
| Ignoring practical significance | Resources spent on negligible improvements |
| No segment analysis | Missing heterogeneous effects (positive for some users, negative for others) |
| Early stopping based on positive results | Unreliable effect estimates; early users may not be representative |