
Design an A/B Testing Platform

Concepts tested: Randomization, statistical power, sample size calculation, multiple testing correction, sequential testing, novelty effects, feature flags

Problem Statement

Design an experimentation platform that allows teams to run A/B tests at scale. This question tests statistical knowledge combined with systems thinking.

Clarification Questions

| Question | Design Impact |
|---|---|
| Scale (users, concurrent experiments) | Architecture complexity |
| Experiment types (A/B, multivariate, bandit) | Algorithm requirements |
| Metric types (clicks, revenue, time) | Data pipeline design |
| User access (self-service vs controlled) | Review and validation workflow |

Experimentation Lifecycle


System Architecture


Core Components

1. Experiment Assignment

Assignment must satisfy three requirements:

| Requirement | Description |
|---|---|
| Deterministic | Same user receives same variant consistently |
| Uniform | Traffic splits match configured percentages |
| Independent | Assignment in experiment A does not affect assignment in experiment B |

Assignment Algorithm


Hash-based assignment rationale: Random assignment would produce different treatments on each page load, causing inconsistent user experience and violating the statistical assumption that each user receives one treatment throughout the experiment.
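
A minimal sketch of this assignment scheme in Python (the function name, hashing choice, and weights format are illustrative, not any specific library's API):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, weights: dict[str, float]) -> str:
    """Deterministically map a user to a variant for one experiment.

    Hashing user_id together with experiment_id keeps a user's assignment stable
    within an experiment while keeping assignments independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15  # uniform value in [0, 1)

    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variant  # guard against floating-point rounding at the boundary

# Same user and experiment -> same variant on every call.
print(assign_variant("user-42", "checkout-redesign", {"control": 0.5, "treatment": 0.5}))
```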

2. Sample Size Calculation

Sample size calculation prevents premature experiment termination.

Formula:

n = (2 * (Z_alpha + Z_beta)^2 * sigma^2) / delta^2

| Variable | Description | Typical Value |
|---|---|---|
| n | Sample size per group | Calculated |
| Z_alpha | Z-score for confidence | 1.96 (95% confidence, two-sided) |
| Z_beta | Z-score for power | 0.84 (80% power) |
| sigma | Standard deviation of the metric | From historical data |
| delta | Minimum detectable effect (MDE), in absolute terms | Business requirement |

Example Calculation:

  • Baseline conversion rate: 3%
  • Standard deviation: sqrt(0.03 * 0.97) = 0.17
  • MDE: 10% relative lift (0.3% absolute)
  • Result: n ≈ 50,000 per group (see the sketch below)
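
A small sketch of the calculation using Python's standard library (function and parameter names are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline_rate: float, relative_mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a binary metric, using the formula above."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95% confidence (two-sided)
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    sigma_sq = baseline_rate * (1 - baseline_rate)  # variance of a Bernoulli metric
    delta = baseline_rate * relative_mde            # absolute minimum detectable effect
    n = 2 * (z_alpha + z_beta) ** 2 * sigma_sq / delta ** 2
    return math.ceil(n)

# Example from above: 3% baseline, 10% relative MDE -> roughly 50K per group.
print(sample_size_per_group(0.03, 0.10))
```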

Reference Sample Sizes:

| Metric Type | Typical MDE | Typical Sample Size |
|---|---|---|
| Click-through rate | 2-5% relative | 50K-200K per group |
| Conversion rate | 3-10% relative | 20K-100K per group |
| Revenue per user | 5-10% relative | 100K-500K per group |

3. Statistical Analysis

| Metric | Formula | Interpretation |
|---|---|---|
| Lift | (Treatment - Control) / Control | Percentage improvement |
| p-value | t-test or chi-squared | Probability of a result at least this extreme if there is no true effect |
| Confidence interval | Point estimate +/- margin of error | Range of likely true effect |
| Power | 1 - beta | Probability of detecting a real effect |

Decision Framework:

| | p < 0.05 (Significant) | p >= 0.05 (Not Significant) |
|---|---|---|
| Lift > 0 | Ship | Inconclusive, need more data |
| Lift < 0 | Do not ship | Inconclusive |
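
For a binary metric, the lift, p-value, and confidence interval above can be computed with a two-proportion z-test. A rough sketch (function name and inputs are illustrative; the normal approximation assumes reasonably large groups):

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    return 0.5 * (1 + erf(x / sqrt(2)))

def analyze_binary_metric(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Lift, two-sided p-value (two-proportion z-test), and 95% CI on the absolute difference."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    lift = (p_t - p_c) / p_c

    # Pooled standard error for the hypothesis test.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - normal_cdf(abs(p_t - p_c) / se_pooled))

    # Unpooled standard error for the confidence interval.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci_95 = (p_t - p_c - 1.96 * se, p_t - p_c + 1.96 * se)
    return lift, p_value, ci_95

# 3.0% vs 3.32% conversion with 50K users per group.
print(analyze_binary_metric(conv_c=1500, n_c=50_000, conv_t=1660, n_t=50_000))
```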

Pre-shipping verification:

  • Guardrail metrics unchanged
  • Segment analysis (effect consistent across segments)
  • Novelty effect assessment (early vs late users)

Common Issues

Multiple Testing Problem

| Number of Metrics Tested | Probability of at Least One False Positive (alpha = 0.05) |
|---|---|
| 1 | 5% |
| 10 | 40% |
| 20 | 64% |

Solutions:

| Method | Approach |
|---|---|
| Bonferroni correction | Test each metric at alpha/n (conservative) |
| Benjamini-Hochberg | Control false discovery rate |
| Pre-registration | Single primary metric for decision |
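
A sketch of Bonferroni and Benjamini-Hochberg applied to a list of p-values (function name and example values are illustrative):

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which p-values remain significant while controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-indexed) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

p_values = [0.001, 0.012, 0.014, 0.041, 0.20]
print(benjamini_hochberg(p_values))                  # [True, True, True, False, False]
print([p < 0.05 / len(p_values) for p in p_values])  # Bonferroni: only the first survives
```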

Novelty and Primacy Effects

Users initially engage differently with new features due to novelty or resistance to change.

| Effect | Pattern | Mitigation |
|---|---|---|
| Novelty | Initial spike, then decline | Run 2-3 weeks minimum |
| Primacy | Initial dip, then recovery | Compare new vs returning users |

Simpson's Paradox

Aggregate results can contradict the results seen in every individual segment.

Example:

| Segment | Control | Treatment | Lift |
|---|---|---|---|
| Mobile (80% of users) | 2% | 2.1% | +5% |
| Desktop (20% of users) | 8% | 8.5% | +6% |
| Overall | 3.2% | 3.1% | -3% |

Explanation: Treatment shifts users from desktop to mobile, lowering overall average despite improving both segments.
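
The same arithmetic as a small sketch; the 84/16 treatment mix is an assumption chosen so the overall numbers match the table:

```python
# Control mix: 80% mobile, 20% desktop (as in the table above).
control_overall = 0.80 * 0.020 + 0.20 * 0.080    # = 0.032  -> 3.2%

# Treatment improves both segments, but the mix shifts toward mobile.
# An ~84/16 treatment mix (assumed here) reproduces the 3.1% overall figure.
treatment_overall = 0.84 * 0.021 + 0.16 * 0.085  # ~= 0.031 -> 3.1%

print(f"control={control_overall:.3f}, treatment={treatment_overall:.3f}")
```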

Verification steps:

  • Segment breakdowns by platform, geography, user tenure
  • Sample Ratio Mismatch (SRM) check
  • User behavior shift analysis

Sample Ratio Mismatch (SRM)

If a configured 50/50 split shows a statistically significant imbalance such as 52/48, the randomization or event logging is flawed.

| Cause | Detection | Resolution |
|---|---|---|
| Assignment bug | Chi-squared test | Fix code, restart experiment |
| Bot traffic | User-agent analysis | Filter bots |
| Redirect issues | Check redirect implementation | Fix redirects |
| Analysis bug | Verify SQL queries | Fix query |
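
A sketch of the chi-squared detection step (assumes SciPy is available; the 0.001 alert threshold is a common but arbitrary convention):

```python
from scipy.stats import chisquare  # assumes SciPy is available

def check_srm(observed: list[int], expected_ratios: list[float], threshold: float = 0.001):
    """Chi-squared goodness-of-fit test for sample ratio mismatch."""
    total = sum(observed)
    expected = [ratio * total for ratio in expected_ratios]
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    # A strict threshold avoids false SRM alarms across many experiments.
    return p_value, p_value < threshold

# A 50/50 experiment that collected 52,000 vs 48,000 users: flagged as SRM.
print(check_srm([52_000, 48_000], [0.5, 0.5]))
```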

Guardrail Metrics

| Category | Example Metrics |
|---|---|
| Business | Revenue, transactions |
| Engagement | DAU, session length |
| Quality | Error rates, latency |
| Trust | Customer support tickets |

Requirement: Even if the primary metric shows a positive result, check guardrails before shipping.

Advanced Topics

Sequential Testing

Sequential testing allows continuous monitoring with controlled error rates.

| Approach | Description |
|---|---|
| Alpha spending | Pre-allocate significance level across analysis points |
| Always-valid p-values | Adjust for multiple looks |
| Confidence sequences | Maintain valid confidence intervals throughout |

Benefit: Can stop early if effect is large, saving time.

Requirement: Specialized statistical methods, not repeated t-tests at every look.
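
A deliberately simplified sketch of alpha spending; this even split is conservative, and real platforms typically use proper spending functions (O'Brien-Fleming, Pocock) or always-valid p-values instead:

```python
def alpha_schedule(total_alpha: float = 0.05, looks: int = 5) -> list[float]:
    """Split the overall significance level evenly across planned analysis points.

    This even (Bonferroni-style) split controls the overall false-positive rate
    but is more conservative than standard spending functions.
    """
    return [total_alpha / looks] * looks

def sequential_decision(p_values_at_looks: list[float], schedule: list[float]) -> str:
    # Stop at the first interim analysis whose p-value is below that look's allotted alpha.
    for look, (p, alpha_i) in enumerate(zip(p_values_at_looks, schedule), start=1):
        if p < alpha_i:
            return f"stop early at look {look}"
    return "no early stop: run to the planned sample size"

print(sequential_decision([0.20, 0.03, 0.004], alpha_schedule()))  # stops at look 3
```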

Multi-Armed Bandits

Adaptive allocation shifts traffic to best-performing variant.

| Approach | Use Case | Trade-off |
|---|---|---|
| A/B test | Learning effect size | Half of traffic stays on the inferior variant for the full experiment |
| Thompson Sampling | Maximizing reward | Harder to measure precise effect size |
| Epsilon-greedy | Simple implementation | Less efficient allocation |
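
A minimal sketch of Thompson Sampling for binary rewards (names and counts are illustrative):

```python
import random

def thompson_pick(successes: list[int], failures: list[int]) -> int:
    """Sample each arm's conversion rate from its Beta posterior; serve the best draw."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

# Two variants with 3.0% and 4.5% observed conversion; in a real bandit the
# success/failure counts are updated after every observed outcome.
successes, failures = [30, 45], [970, 955]
picks = [0, 0]
for _ in range(10_000):
    picks[thompson_pick(successes, failures)] += 1
print(picks)  # the better-performing arm receives most of the traffic
```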

Summary

| Component | Recommended Approach | Rationale |
|---|---|---|
| Assignment | Deterministic hash | Reproducible, consistent |
| Sample size | Pre-calculated, enforced | Statistical validity |
| Analysis | Pre-registered primary metric | Prevents p-hacking |
| Duration | Minimum 2 weeks | Captures novelty effects |
| Decision | Include guardrail check | Prevents unintended harm |