A/B Testing

A/B testing provides a statistical framework for measuring the effect of product changes. This section covers the statistical principles underlying A/B test design and analysis.

Statistical Framework

A/B testing is an application of hypothesis testing.

Null hypothesis (H0): No difference exists between control and treatment groups. Observed variation is due to random chance.

Alternative hypothesis (H1): A true difference exists between groups.

The objective is to collect sufficient evidence to reject the null hypothesis with controlled error rates.

Error Types and Trade-offs

Decision           | H0 Actually True              | H0 Actually False
Reject H0          | Type I error (false positive) | Correct decision
Fail to reject H0  | Correct decision              | Type II error (missed effect)

Parameter                  | Definition                             | Typical Value
alpha (significance level) | Probability of a false positive        | 0.05 (5%)
beta                       | Probability of a false negative        | 0.20 (20%)
Power (1 - beta)           | Probability of detecting a true effect | 0.80 (80%)

Trade-off: Decreasing alpha reduces false positives but increases the probability of missing true effects (higher beta). At a fixed sample size, both error rates cannot be minimized simultaneously; reducing both requires collecting more data.

Statistical Formulas

Comparing Conversion Rates (Proportions)

The z-statistic for comparing two conversion rates is

z = (p1 - p2) / sqrt(p * (1 - p) * (1/n1 + 1/n2))

where p1 and p2 are the group conversion rates, n1 and n2 are the group sample sizes, and p is the pooled conversion rate: total conversions divided by total sample size.
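
As a minimal sketch, this test can be computed directly with numpy and scipy; the function name and the conversion counts below are illustrative, not from any particular library.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Pooled two-proportion z-test matching the formula above."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    # Pooled rate: total conversions divided by total sample size
    p_pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return z, p_value

# Illustrative counts: 3.3% vs 3.0% conversion on 10,000 users each
z, p = two_proportion_ztest(330, 10_000, 300, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```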

Comparing Averages (Continuous Metrics)

The t-statistic for comparing two means is

t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)

where s1^2 and s2^2 are the sample variances. This unpooled standard error corresponds to Welch's t-test, which does not assume equal variances between groups.
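
In practice this is available in scipy, where equal_var=False selects the Welch form above. The gamma-distributed revenue samples below are synthetic placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic revenue-like samples: skewed, with unequal variances
control = rng.gamma(shape=2.0, scale=10.0, size=5_000)
treatment = rng.gamma(shape=2.0, scale=10.5, size=5_000)

# equal_var=False selects the Welch form with the unpooled standard error
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```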

Sample Size Calculation

Sample size depends on four factors:

Factor                          | Effect on Required Sample Size
Minimum Detectable Effect (MDE) | Smaller effects require larger samples
Metric variance                 | Higher variance requires larger samples
Significance level (alpha)      | Lower alpha requires larger samples
Power (1 - beta)                | Higher power requires larger samples

Calculation Example

Statistical power analysis (available in libraries such as statsmodels) takes the effect size (e.g., a 5% relative lift), the desired power (e.g., 80%), the significance level (e.g., 5%), and the traffic split ratio (e.g., 1.0 for an equal split), and returns the required sample size per group.
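
A sketch using statsmodels' power API (proportion_effectsize for a Cohen's h effect size and NormalIndPower.solve_power); the baseline and lift values match the reference example in the next subsection.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03   # 3% baseline conversion rate
mde = 0.05        # 5% relative minimum detectable effect

# Cohen's h effect size between the lifted rate and the baseline
effect_size = proportion_effectsize(baseline * (1 + mde), baseline)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level
    power=0.80,              # 1 - beta
    ratio=1.0,               # equal traffic split
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:,.0f}")
```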

Reference Values

To detect a 5% relative lift on a 3% baseline conversion rate (3.00% vs 3.15%) with 80% power and 5% significance, approximately 208,000 users per group are required.

Common Interview Topics

Sample Size Determination

Factors to address: baseline metric value, minimum detectable effect, required power, significance level. Explain trade-offs between sample size and detectable effect size.

Statistical vs Practical Significance

A statistically significant result (p < 0.05) does not indicate practical importance. A 0.01% lift may be statistically significant with sufficient sample size but may not justify implementation costs. Evaluation should consider effect magnitude, implementation effort, and opportunity cost.
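
To make this concrete, the same statsmodels power calculation shows how large a sample would make a tiny lift detectable; reading the 0.01% as an absolute change on a 3% baseline is an assumption, and the point is the order of magnitude rather than the exact figure.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Sample size to detect a 0.01 percentage point lift on a 3% baseline
effect_size = proportion_effectsize(0.0301, 0.03)
n = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"{n:,.0f} users per group")  # on the order of tens of millions
```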

Multiple Comparisons

Testing 20 metrics at alpha = 0.05 produces one expected false positive (20 × 0.05 = 1). Solutions (see the sketch after this list):

  • Bonferroni correction: alpha/n
  • False Discovery Rate control
  • Pre-specification of primary metric
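
A sketch applying two of these corrections with statsmodels' multipletests; the p-values below are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 20 metrics in a single experiment
p_values = [0.004, 0.011, 0.029, 0.041, 0.049, 0.063, 0.090, 0.12,
            0.15, 0.19, 0.23, 0.28, 0.33, 0.41, 0.48,
            0.55, 0.64, 0.72, 0.85, 0.97]

for method in ("bonferroni", "fdr_bh"):  # Bonferroni and Benjamini-Hochberg FDR
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} of {len(p_values)} metrics remain significant")
```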

Early Stopping

Repeated significance testing inflates false positive rates. If a two-week test duration was calculated, results should not be evaluated before two weeks unless using sequential testing methods designed for continuous monitoring (e.g., alpha spending functions).
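
The inflation is straightforward to demonstrate by simulation: run A/A tests (no true effect) and peek repeatedly, stopping at the first significant result. The peek schedule and conversion rate below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_users, n_peeks = 2_000, 10_000, 10
z_crit = 1.96  # nominal alpha = 0.05, two-sided
peek_points = np.linspace(n_users // n_peeks, n_users, n_peeks, dtype=int)

false_positives = 0
for _ in range(n_sims):
    # A/A test: both groups share the same true conversion rate
    a = rng.random(n_users) < 0.05
    b = rng.random(n_users) < 0.05
    for n in peek_points:
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > z_crit:
            false_positives += 1
            break  # analyst stops at the first "significant" peek

print(f"False positive rate with {n_peeks} peeks: {false_positives / n_sims:.3f}")
```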

Ratio Metrics

Ratio metrics whose denominator is not the randomization unit, such as revenue per session or click-through rate (clicks per pageview) in a user-randomized experiment, present challenges because both numerator and denominator vary across users and per-user observations are correlated. Standard t-tests on the raw ratio understate the variance. Use the delta method or bootstrap resampling instead.
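
A sketch of the delta-method variance for a user-randomized click-through metric; the helper function and the simulated per-user data are illustrative assumptions.

```python
import numpy as np

def delta_method_ratio_var(num, den):
    """Delta-method variance of sum(num) / sum(den).

    num, den: per-user totals (e.g. clicks and pageviews) where the
    user is the randomization unit.
    """
    n = len(num)
    mu_d = den.mean()
    ratio = num.mean() / mu_d
    var_n, var_d = num.var(ddof=1), den.var(ddof=1)
    cov = np.cov(num, den, ddof=1)[0, 1]
    # First-order Taylor expansion of mean(num) / mean(den)
    return (var_n - 2 * ratio * cov + ratio**2 * var_d) / (mu_d**2 * n)

rng = np.random.default_rng(1)
views_a = rng.poisson(5, size=10_000) + 1  # pageviews per user
views_b = rng.poisson(5, size=10_000) + 1
clicks_a = rng.binomial(views_a, 0.100)    # per-user clicks
clicks_b = rng.binomial(views_b, 0.105)

ctr_a = clicks_a.sum() / views_a.sum()
ctr_b = clicks_b.sum() / views_b.sum()
se = np.sqrt(delta_method_ratio_var(clicks_a, views_a)
             + delta_method_ratio_var(clicks_b, views_b))
print(f"CTR delta = {ctr_b - ctr_a:.4f}, z = {(ctr_b - ctr_a) / se:.2f}")
```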

Common Errors

Error                              | Consequence
Repeated result checking           | Inflated false positive rate
Excessive variants (A/B/C/D/E)     | Reduced statistical power per comparison
Ignoring practical significance    | Resources spent on negligible improvements
No segment analysis                | Missed heterogeneous effects (positive for some users, negative for others)
Early stopping on positive results | Unreliable effect estimates; early users may not be representative