Design an A/B Testing Platform
Concepts tested: Randomization, statistical power, sample size calculation, multiple testing correction, sequential testing, novelty effects, feature flags
Problem Statement
Design an experimentation platform that allows teams to run A/B tests at scale. This question tests statistical knowledge combined with systems thinking.
Clarification Questions
| Question | Design Impact |
|---|---|
| Scale (users, concurrent experiments) | Architecture complexity |
| Experiment types (A/B, multivariate, bandit) | Algorithm requirements |
| Metric types (clicks, revenue, time) | Data pipeline design |
| User access (self-service vs controlled) | Review and validation workflow |
Experimentation Lifecycle
System Architecture
Core Components
1. Experiment Assignment
Assignment must satisfy three requirements:
| Requirement | Description |
|---|---|
| Deterministic | Same user receives same variant consistently |
| Uniform | Traffic splits match configured percentages |
| Independent | Assignment in experiment A does not affect assignment in experiment B |
Assignment Algorithm
Hash-based assignment rationale: Re-randomizing on every request would give the same user different treatments on different page loads, causing an inconsistent user experience and violating the statistical assumption that each user receives exactly one treatment for the duration of the experiment. Hashing a (user ID, experiment ID) pair yields an assignment that is deterministic, uniform, and independent across experiments.
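A minimal sketch of deterministic, salted hashing (Python; the function name assign_variant and the choice of SHA-256 are illustrative assumptions, not a prescribed implementation):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: dict) -> str:
    """Deterministically map a user to a variant.

    `variants` maps variant name -> traffic fraction (fractions sum to 1.0).
    Salting the hash with the experiment ID keeps assignments independent
    across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    # Use the first 8 bytes of the digest as a uniform value in [0, 1).
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, fraction in variants.items():
        cumulative += fraction
        if bucket < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge

# The same user always lands in the same variant for this experiment.
print(assign_variant("user-42", "checkout-redesign", {"control": 0.5, "treatment": 0.5}))
```

Because the hash is salted with the experiment ID, a user's bucket in one experiment tells you nothing about their bucket in another, which satisfies the independence requirement above.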
2. Sample Size Calculation
Calculating the required sample size before launch prevents stopping an experiment before it has enough data to detect the minimum effect of interest.
Formula:
n = (2 * (Z_alpha + Z_beta)^2 * sigma^2) / delta^2
| Variable | Description | Typical Value |
|---|---|---|
| n | Sample size per group | Calculated |
| Z_alpha | Z-score for confidence (two-sided) | 1.96 (95% confidence) |
| Z_beta | Z-score for power | 0.84 (80% power) |
| sigma^2 | Variance of the metric (sigma = standard deviation) | From historical data |
| delta | Minimum detectable effect (MDE), in absolute units | Business requirement |
Example Calculation:
- Baseline conversion rate: 3%
- Variance: 0.03 * 0.97 = 0.0291 (standard deviation ≈ 0.17)
- MDE: 10% relative lift (0.3% absolute)
- Result: n = 2 * (1.96 + 0.84)^2 * 0.0291 / 0.003^2 ≈ 51,000 per group
The sketch below reproduces this calculation.
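A small sketch of the calculation (assumes scipy is available; sample_size_per_group is an illustrative helper name):

```python
import math
from scipy.stats import norm

def sample_size_per_group(baseline_rate: float, relative_mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided test on a proportion."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05 (two-sided)
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = baseline_rate * (1 - baseline_rate)
    delta = baseline_rate * relative_mde
    n = 2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2
    return math.ceil(n)

# Reproduces the worked example: 3% baseline, 10% relative MDE.
print(sample_size_per_group(0.03, 0.10))  # -> 50757, i.e. roughly 51,000 per group
```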
Reference Sample Sizes:
| Metric Type | Typical MDE | Typical Sample Size |
|---|---|---|
| Click-through rate | 2-5% relative | 50K-200K per group |
| Conversion rate | 3-10% relative | 20K-100K per group |
| Revenue per user | 5-10% relative | 100K-500K per group |
3. Statistical Analysis
| Metric | Formula | Interpretation |
|---|---|---|
| Lift | (Treatment - Control) / Control | Percentage improvement |
| p-value | t-test or chi-squared test | Probability of a result at least this extreme if there is no true difference |
| Confidence interval | Point estimate +/- margin of error | Range of likely true effect |
| Power | 1 - beta | Probability of detecting real effect |
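A sketch of how these quantities might be computed for a conversion-rate metric, using a two-proportion z-test (assumes scipy; analyze_proportions is an illustrative name and the input counts are made up):

```python
import math
from scipy.stats import norm

def analyze_proportions(conv_c: int, n_c: int, conv_t: int, n_t: int, alpha: float = 0.05):
    """Lift, two-sided p-value, and confidence interval for a conversion-rate experiment."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    lift = (p_t - p_c) / p_c

    # Two-proportion z-test with a pooled standard error for the p-value.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pooled
    p_value = 2 * (1 - norm.cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the absolute difference.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = norm.ppf(1 - alpha / 2) * se
    ci = (p_t - p_c - margin, p_t - p_c + margin)
    return lift, p_value, ci

print(analyze_proportions(conv_c=1500, n_c=50000, conv_t=1660, n_t=50000))
```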
Decision Framework:
| | p < 0.05 (Significant) | p >= 0.05 (Not Significant) |
|---|---|---|
| Lift > 0 | Ship | Inconclusive, need more data |
| Lift < 0 | Do not ship | Inconclusive |
Pre-shipping verification:
- Guardrail metrics unchanged
- Segment analysis (effect consistent across segments)
- Novelty effect assessment (early vs late users)
Common Issues
Multiple Testing Problem
With m independent metrics each tested at alpha = 0.05, the probability of at least one false positive is 1 - (1 - 0.05)^m.
| Number of Metrics Tested | Probability of at Least One False Positive (alpha = 0.05) |
|---|---|
| 1 | 5% |
| 10 | 40% |
| 20 | 64% |
Solutions:
| Method | Approach |
|---|---|
| Bonferroni correction | alpha/n (conservative) |
| Benjamini-Hochberg | Control false discovery rate |
| Pre-registration | Single primary metric for decision |
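A sketch of both corrections (benjamini_hochberg is an illustrative helper name; the p-values are made-up examples):

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject decision per p-value, controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is under its BH threshold (rank/m * alpha).
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject

p_values = [0.003, 0.020, 0.045, 0.300]
print(benjamini_hochberg(p_values))                    # FDR-controlled decisions
print([p <= 0.05 / len(p_values) for p in p_values])   # Bonferroni: alpha / n per test
```

On this example Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg also rejects the second, which illustrates why BH is preferred when many secondary metrics are monitored.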
Novelty and Primacy Effects
Users initially engage differently with new features due to novelty or resistance to change.
| Effect | Pattern | Mitigation |
|---|---|---|
| Novelty | Initial spike, then decline | Run 2-3 weeks minimum |
| Primacy | Initial dip, then recovery | Compare new vs returning users |
Simpson's Paradox
Aggregate results can move in the opposite direction from every individual segment when the segment mix differs between the control and treatment arms.
Example:
| Segment | Control | Treatment | Lift |
|---|---|---|---|
| Mobile (80% of control traffic) | 2% | 2.1% | +5% |
| Desktop (20% of control traffic) | 8% | 8.5% | +6% |
| Overall | 3.2% | 3.1% | -3% |
Explanation: Treatment shifts traffic from desktop to mobile (mobile grows from 80% of control users to roughly 84% of treatment users), and because mobile converts at a much lower rate, the overall average drops even though both segments improve. The arithmetic below makes the mix shift explicit.
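A few lines of arithmetic reproduce the table; the ~84%/16% treatment mix is inferred from the overall numbers above, not a measured value:

```python
# Per-segment conversion rates from the table above.
rates = {"mobile": {"control": 0.020, "treatment": 0.021},
         "desktop": {"control": 0.080, "treatment": 0.085}}

# Segment mix differs between arms: treatment pushes more traffic to mobile.
# The 0.844 / 0.156 split is inferred to match the overall 3.1% figure.
mix = {"control": {"mobile": 0.80, "desktop": 0.20},
       "treatment": {"mobile": 0.844, "desktop": 0.156}}

for arm in ("control", "treatment"):
    overall = sum(mix[arm][seg] * rates[seg][arm] for seg in rates)
    print(arm, round(overall, 4))
# control   0.032  -> 3.2%
# treatment 0.031  -> ~3.1%, lower overall even though both segments improved
```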
Verification steps:
- Segment breakdowns by platform, geography, user tenure
- Sample Ratio Mismatch (SRM) check
- User behavior shift analysis
Sample Ratio Mismatch (SRM)
If a configured 50/50 split shows, say, 52/48 on a large sample (a deviation far too large to be explained by chance), something in assignment, logging, or analysis is flawed.
| Cause | Detection | Resolution |
|---|---|---|
| Assignment bug | Chi-squared test | Fix code, restart experiment |
| Bot traffic | User-agent analysis | Filter bots |
| Redirect issues | Check redirect implementation | Fix redirects |
| Analysis bug | Verify SQL queries | Fix query |
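A sketch of the SRM check using a chi-squared goodness-of-fit test (assumes scipy; srm_check is an illustrative name, and the 0.001 p-value threshold is a common convention rather than a universal standard):

```python
from scipy.stats import chisquare

def srm_check(observed_counts: list[int], expected_ratios: list[float],
              threshold: float = 0.001):
    """Flag a sample ratio mismatch if observed counts deviate from the configured split."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value < threshold, p_value

# A 50/50 experiment that logged 50,950 vs 49,050 users: flagged as an SRM.
print(srm_check([50950, 49050], [0.5, 0.5]))
```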
Guardrail Metrics
| Category | Example Metrics |
|---|---|
| Business | Revenue, transactions |
| Engagement | DAU, session length |
| Quality | Error rates, latency |
| Trust | Customer support tickets |
Requirement: Even if the primary metric shows a positive result, check guardrails before shipping.
Advanced Topics
Sequential Testing
Sequential testing allows continuous monitoring with controlled error rates.
| Approach | Description |
|---|---|
| Alpha spending | Pre-allocate significance level across analysis points |
| Always-valid p-values | Adjusts for multiple looks |
| Confidence sequences | Maintains valid confidence intervals throughout |
Benefit: The experiment can stop early when the effect is large, saving time.
Requirement: Peeking at repeated t-tests inflates the false-positive rate; use dedicated sequential methods instead.
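A deliberately simplified sketch of alpha spending that splits the error budget equally across planned looks; production systems typically use O'Brien-Fleming or Pocock boundaries or always-valid p-values instead (assumes scipy; the function name is illustrative):

```python
from scipy.stats import norm

def equal_alpha_spending_boundaries(total_alpha: float, num_looks: int) -> list[float]:
    """Conservative schedule: spend an equal slice of alpha at each planned interim look.

    This only illustrates the idea of pre-allocating the error budget across
    analysis points; it is more conservative than standard group-sequential designs.
    """
    per_look_alpha = total_alpha / num_looks
    z_boundary = norm.ppf(1 - per_look_alpha / 2)  # two-sided boundary per look
    return [z_boundary] * num_looks

# Four planned looks with an overall two-sided alpha of 0.05:
# stop early only if |z| exceeds ~2.50 at a look, versus 1.96 for a single final look.
print(equal_alpha_spending_boundaries(0.05, 4))
```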
Multi-Armed Bandits
Adaptive allocation shifts traffic to best-performing variant.
| Approach | Use Case | Trade-off |
|---|---|---|
| A/B test | Learning effect size | A fixed share of users sees the inferior variant for the full experiment |
| Thompson Sampling | Maximizing reward | Harder to measure precise effect size |
| Epsilon-greedy | Simple implementation | Less efficient allocation |
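A sketch of Thompson Sampling for binary conversions using Beta posteriors (the conversion counts are made up; thompson_sampling is an illustrative name):

```python
import random

def thompson_sampling(successes: list[int], failures: list[int]) -> int:
    """Pick the next variant to serve by sampling from each arm's Beta posterior."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return samples.index(max(samples))

# Two variants with 30/1000 and 42/1000 conversions observed so far.
successes, failures = [30, 42], [970, 958]
counts = [0, 0]
for _ in range(10_000):
    counts[thompson_sampling(successes, failures)] += 1
print(counts)  # most simulated traffic flows to the better-performing arm
```

The allocation adapts as evidence accumulates, which maximizes reward during the experiment but makes the final effect-size estimate noisier than a fixed-split A/B test.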
Summary
| Component | Recommended Approach | Rationale |
|---|---|---|
| Assignment | Deterministic hash | Reproducible, consistent |
| Sample size | Pre-calculated, enforced | Statistical validity |
| Analysis | Pre-registered primary metric | Prevents p-hacking |
| Duration | Minimum 2 weeks | Captures novelty effects |
| Decision | Include guardrail check | Prevents unintended harm |