Design an A/B Testing Platform
Concepts tested: Randomization, statistical power, sample size calculation, multiple testing correction, sequential testing, novelty effects, feature flags
Problem Statement
Design an experimentation platform that allows teams to run A/B tests at scale. This question tests statistical knowledge combined with systems thinking.
Clarification Questions
| Question | Design Impact |
|---|---|
| Scale (users, concurrent experiments) | Architecture complexity |
| Experiment types (A/B, multivariate, bandit) | Algorithm requirements |
| Metric types (clicks, revenue, time) | Data pipeline design |
| User access (self-service vs controlled) | Review and validation workflow |
Experimentation Lifecycle
System Architecture
Core Components
1. Experiment Assignment
Assignment must satisfy three requirements:
| Requirement | Description |
|---|---|
| Deterministic | Same user receives same variant consistently |
| Uniform | Traffic splits match configured percentages |
| Independent | Assignment in experiment A does not affect assignment in experiment B |
Assignment Algorithm
Hash-based assignment rationale: Re-randomizing on every request would give the same user different treatments on different page loads, causing an inconsistent user experience and violating the statistical assumption that each user receives exactly one treatment for the duration of the experiment. Hashing a (user ID, experiment ID) pair yields an assignment that is deterministic, uniform, and independent across experiments.
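A minimal sketch of deterministic, salted hashing (Python; the function name assign_variant and the choice of SHA-256 are illustrative assumptions, not a prescribed implementation):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: dict) -> str:
    """Deterministically map a user to a variant.

    `variants` maps variant name -> traffic fraction (fractions sum to 1.0).
    Salting the hash with the experiment ID keeps assignments independent
    across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    # Use the first 8 bytes of the digest as a uniform value in [0, 1).
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, fraction in variants.items():
        cumulative += fraction
        if bucket < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge

# The same user always lands in the same variant for this experiment.
print(assign_variant("user-42", "checkout-redesign", {"control": 0.5, "treatment": 0.5}))
```

Because the hash is salted with the experiment ID, a user's bucket in one experiment tells you nothing about their bucket in another, which satisfies the independence requirement above.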
2. Sample Size Calculation
Calculating the required sample size before launch prevents stopping an experiment before it has enough data to detect the minimum effect of interest.
Formula:
n = (2 * (Z_alpha + Z_beta)^2 * sigma^2) / delta^2
| Variable | Description | Typical Value |
|---|---|---|
| n | Sample size per group | Calculated |
| Z_alpha | Z-score for confidence (two-sided) | 1.96 (95% confidence) |
| Z_beta | Z-score for power | 0.84 (80% power) |
| sigma^2 | Variance of the metric (sigma = standard deviation) | From historical data |
| delta | Minimum detectable effect (MDE), in absolute units | Business requirement |
Example Calculation:
- Baseline conversion rate: 3%
- Variance: 0.03 * 0.97 = 0.0291 (standard deviation ≈ 0.17)
- MDE: 10% relative lift (0.3% absolute)
- Result: n = 2 * (1.96 + 0.84)^2 * 0.0291 / 0.003^2 ≈ 51,000 per group
The sketch below reproduces this calculation.
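A small sketch of the calculation (assumes scipy is available; sample_size_per_group is an illustrative helper name):

```python
import math
from scipy.stats import norm

def sample_size_per_group(baseline_rate: float, relative_mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided test on a proportion."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05 (two-sided)
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = baseline_rate * (1 - baseline_rate)
    delta = baseline_rate * relative_mde
    n = 2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2
    return math.ceil(n)

# Reproduces the worked example: 3% baseline, 10% relative MDE.
print(sample_size_per_group(0.03, 0.10))  # -> 50757, i.e. roughly 51,000 per group
```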
Reference Sample Sizes:
| Metric Type | Typical MDE | Typical Sample Size |
|---|---|---|
| Click-through rate | 2-5% relative | 50K-200K per group |
| Conversion rate | 3-10% relative | 20K-100K per group |
| Revenue per user | 5-10% relative | 100K-500K per group |
3. Statistical Analysis
| Metric | Formula | Interpretation |
|---|---|---|
| Lift | (Treatment - Control) / Control | Percentage improvement |
| p-value | t-test or chi-squared test | Probability of a result at least this extreme if there is no true difference |
| Confidence interval | Point estimate +/- margin of error | Range of likely true effect |
| Power | 1 - beta | Probability of detecting real effect |
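A sketch of how these quantities might be computed for a conversion-rate metric, using a two-proportion z-test (assumes scipy; analyze_proportions is an illustrative name and the input counts are made up):

```python
import math
from scipy.stats import norm

def analyze_proportions(conv_c: int, n_c: int, conv_t: int, n_t: int, alpha: float = 0.05):
    """Lift, two-sided p-value, and confidence interval for a conversion-rate experiment."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    lift = (p_t - p_c) / p_c

    # Two-proportion z-test with a pooled standard error for the p-value.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pooled
    p_value = 2 * (1 - norm.cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the absolute difference.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = norm.ppf(1 - alpha / 2) * se
    ci = (p_t - p_c - margin, p_t - p_c + margin)
    return lift, p_value, ci

print(analyze_proportions(conv_c=1500, n_c=50000, conv_t=1660, n_t=50000))
```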
Decision Framework:
| | p < 0.05 (Significant) | p >= 0.05 (Not Significant) |
|---|---|---|
| Lift > 0 | Ship | Inconclusive, need more data |
| Lift < 0 | Do not ship | Inconclusive |
Pre-shipping verification:
- Guardrail metrics unchanged
- Segment analysis (effect consistent across segments)
- Novelty effect assessment (early vs late users)
Common Issues
Multiple Testing Problem
With m independent metrics each tested at alpha = 0.05, the probability of at least one false positive is 1 - (1 - 0.05)^m.
| Number of Metrics Tested | Probability of at Least One False Positive (alpha = 0.05) |
|---|---|
| 1 | 5% |
| 10 | 40% |
| 20 | 64% |
Solutions:
| Method | Approach |
|---|---|
| Bonferroni correction | alpha/n (conservative) |
| Benjamini-Hochberg | Control false discovery rate |
| Pre-registration | Single primary metric for decision |
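A sketch of both corrections (benjamini_hochberg is an illustrative helper name; the p-values are made-up examples):

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject decision per p-value, controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is under its BH threshold (rank/m * alpha).
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject

p_values = [0.003, 0.020, 0.045, 0.300]
print(benjamini_hochberg(p_values))                    # FDR-controlled decisions
print([p <= 0.05 / len(p_values) for p in p_values])   # Bonferroni: alpha / n per test
```

On this example Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg also rejects the second, which illustrates why BH is preferred when many secondary metrics are monitored.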
Novelty and Primacy Effects
Users initially engage differently with new features due to novelty or resistance to change.
| Effect | Pattern | Mitigation |
|---|---|---|
| Novelty | Initial spike, then decline | Run 2-3 weeks minimum |
| Primacy | Initial dip, then recovery | Compare new vs returning users |
Simpson's Paradox
Aggregate results can move in the opposite direction from every individual segment when the segment mix differs between the control and treatment arms.
Example:
| Segment | Control | Treatment | Lift |
|---|---|---|---|
| Mobile (80% of control traffic) | 2% | 2.1% | +5% |
| Desktop (20% of control traffic) | 8% | 8.5% | +6% |
| Overall | 3.2% | 3.1% | -3% |
Explanation: Treatment shifts traffic from desktop to mobile (mobile grows from 80% of control users to roughly 84% of treatment users), and because mobile converts at a much lower rate, the overall average drops even though both segments improve. The arithmetic below makes the mix shift explicit.
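A few lines of arithmetic reproduce the table; the ~84%/16% treatment mix is inferred from the overall numbers above, not a measured value:

```python
# Per-segment conversion rates from the table above.
rates = {"mobile": {"control": 0.020, "treatment": 0.021},
         "desktop": {"control": 0.080, "treatment": 0.085}}

# Segment mix differs between arms: treatment pushes more traffic to mobile.
# The 0.844 / 0.156 split is inferred to match the overall 3.1% figure.
mix = {"control": {"mobile": 0.80, "desktop": 0.20},
       "treatment": {"mobile": 0.844, "desktop": 0.156}}

for arm in ("control", "treatment"):
    overall = sum(mix[arm][seg] * rates[seg][arm] for seg in rates)
    print(arm, round(overall, 4))
# control   0.032  -> 3.2%
# treatment 0.031  -> ~3.1%, lower overall even though both segments improved
```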
Verification steps:
- Segment breakdowns by platform, geography, user tenure
- Sample Ratio Mismatch (SRM) check
- User behavior shift analysis
Sample Ratio Mismatch (SRM)
If a configured 50/50 split shows, say, 52/48 on a large sample (a deviation far too large to be explained by chance), something in assignment, logging, or analysis is flawed.
| Cause | Detection | Resolution |
|---|---|---|
| Assignment bug | Chi-squared test | Fix code, restart experiment |
| Bot traffic | User-agent analysis | Filter bots |
| Redirect issues | Check redirect implementation | Fix redirects |
| Analysis bug | Verify SQL queries | Fix query |
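A sketch of the SRM check using a chi-squared goodness-of-fit test (assumes scipy; srm_check is an illustrative name, and the 0.001 p-value threshold is a common convention rather than a universal standard):

```python
from scipy.stats import chisquare

def srm_check(observed_counts: list[int], expected_ratios: list[float],
              threshold: float = 0.001):
    """Flag a sample ratio mismatch if observed counts deviate from the configured split."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value < threshold, p_value

# A 50/50 experiment that logged 50,950 vs 49,050 users: flagged as an SRM.
print(srm_check([50950, 49050], [0.5, 0.5]))
```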
Guardrail Metrics
| Category | Example Metrics |
|---|---|
| Business | Revenue, transactions |
| Engagement | DAU, session length |
| Quality | Error rates, latency |
| Trust | Customer support tickets |
Requirement: Even if the primary metric shows a positive result, check guardrails before shipping.
Advanced Topics
Sequential Testing
Sequential testing allows continuous monitoring with controlled error rates.
| Approach | Description |
|---|---|
| Alpha spending | Pre-allocate significance level across analysis points |
| Always-valid p-values | Adjusts for multiple looks |
| Confidence sequences | Maintains valid confidence intervals throughout |
Benefit: The experiment can stop early when the effect is large, saving time.
Requirement: Peeking at repeated t-tests inflates the false-positive rate; use dedicated sequential methods instead.
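A deliberately simplified sketch of alpha spending that splits the error budget equally across planned looks; production systems typically use O'Brien-Fleming or Pocock boundaries or always-valid p-values instead (assumes scipy; the function name is illustrative):

```python
from scipy.stats import norm

def equal_alpha_spending_boundaries(total_alpha: float, num_looks: int) -> list[float]:
    """Conservative schedule: spend an equal slice of alpha at each planned interim look.

    This only illustrates the idea of pre-allocating the error budget across
    analysis points; it is more conservative than standard group-sequential designs.
    """
    per_look_alpha = total_alpha / num_looks
    z_boundary = norm.ppf(1 - per_look_alpha / 2)  # two-sided boundary per look
    return [z_boundary] * num_looks

# Four planned looks with an overall two-sided alpha of 0.05:
# stop early only if |z| exceeds ~2.50 at a look, versus 1.96 for a single final look.
print(equal_alpha_spending_boundaries(0.05, 4))
```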
Multi-Armed Bandits
Adaptive allocation shifts traffic to best-performing variant.
| Approach | Use Case | Trade-off |
|---|---|---|
| A/B test | Learning effect size | A fixed share of users sees the inferior variant for the full experiment |
| Thompson Sampling | Maximizing reward | Harder to measure precise effect size |
| Epsilon-greedy | Simple implementation | Less efficient allocation |
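A sketch of Thompson Sampling for binary conversions using Beta posteriors (the conversion counts are made up; thompson_sampling is an illustrative name):

```python
import random

def thompson_sampling(successes: list[int], failures: list[int]) -> int:
    """Pick the next variant to serve by sampling from each arm's Beta posterior."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return samples.index(max(samples))

# Two variants with 30/1000 and 42/1000 conversions observed so far.
successes, failures = [30, 42], [970, 958]
counts = [0, 0]
for _ in range(10_000):
    counts[thompson_sampling(successes, failures)] += 1
print(counts)  # most simulated traffic flows to the better-performing arm
```

The allocation adapts as evidence accumulates, which maximizes reward during the experiment but makes the final effect-size estimate noisier than a fixed-split A/B test.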
Summary
| Component | Recommended Approach | Rationale |
|---|---|---|
| Assignment | Deterministic hash | Reproducible, consistent |
| Sample size | Pre-calculated, enforced | Statistical validity |
| Analysis | Pre-registered primary metric | Prevents p-hacking |
| Duration | Minimum 2 weeks | Captures novelty effects |
| Decision | Include guardrail check | Prevents unintended harm |