Experimentation for Data Scientists

Experiment design determines the validity of results. Statistical analysis cannot correct for design flaws.

Pre-Experiment Requirements

Required Specifications

Specification | Description
Hypothesis | Specific, falsifiable prediction (e.g., "Social proof will increase conversion by at least 5%")
Primary metric | Single metric that determines success
Randomization unit | Entity receiving random assignment (user, session, page view)
Duration | Calculated from power analysis

Power Analysis

Required Inputs

  1. Minimum detectable effect (smallest meaningful improvement)
  2. Significance level (alpha, typically 0.05)
  3. Statistical power (1-beta, typically 0.80)
  4. Metric variance (from historical data)

Sample Size Relationship

Sample size per group scales with the metric variance and with the inverse square of the minimum detectable effect; stricter significance levels and higher power requirements increase it further.

Smaller detectable effects therefore require disproportionately larger samples: detecting a 1% lift takes roughly 100x as much data as detecting a 10% lift from the same baseline.
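
For a conversion-rate metric, the required sample size per group can be computed with statsmodels. A minimal sketch, assuming a 10% baseline conversion rate and a 5% relative minimum detectable effect (both numbers are illustrative, not from this document):

```python
# Sample size per group for a two-proportion test, using statsmodels.
# Baseline rate and minimum detectable effect are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                  # historical conversion rate (assumed)
mde_relative = 0.05              # smallest lift worth detecting: +5% relative
target = baseline * (1 + mde_relative)

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(target, baseline)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                  # significance level
    power=0.80,                  # 1 - beta
    ratio=1.0,                   # equal-sized control and treatment groups
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:,.0f}")
```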

Randomization

Randomization Unit Selection

Unit | Characteristics | Considerations
User | Consistent experience; most common choice | Slower sample accumulation
Session | More data points; suitable for logged-out users | Same user may see different treatments
Page view | Maximum sample size | Inconsistent user experience

Trade-off: Smaller units provide more samples but may expose the same user to multiple treatments, which creates an inconsistent experience and contaminates the comparison in most experiments.
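
In practice the randomization unit determines which ID gets hashed into a variant bucket. A minimal sketch of deterministic assignment; the experiment name and ID format are hypothetical:

```python
# Deterministic assignment: hashing the randomization unit's ID keeps the
# assignment stable for that unit. Switching the key from a user ID to a
# session ID changes the unit of randomization. Names are illustrative.
import hashlib

def assign_variant(unit_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Map a unit ID (user, session, or page view) to a variant."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# User-level randomization: the same user always sees the same variant.
print(assign_variant("user_12345", "social_proof_test"))
```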

Stratified Randomization

When treatment effects may vary by segment, stratify randomization to ensure balanced representation:

  • Device type
  • New vs returning users
  • Geography
  • Acquisition channel

Stratification reduces variance and enables detection of segment-specific effects.
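
A minimal sketch of stratified assignment with pandas, shuffling within each stratum so control and treatment stay balanced on the stratification variables; the column names are hypothetical:

```python
# Stratified randomization sketch: split each stratum roughly in half so both
# groups are balanced on device type and user status. Column names are assumed.
import numpy as np
import pandas as pd

def stratified_assign(users: pd.DataFrame, strata_cols, seed=42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    users = users.copy()
    users["variant"] = ""
    for _, idx in users.groupby(strata_cols).groups.items():
        idx = list(idx)
        rng.shuffle(idx)                      # random order within the stratum
        half = len(idx) // 2
        users.loc[idx[:half], "variant"] = "control"
        users.loc[idx[half:], "variant"] = "treatment"
    return users

# Example with hypothetical columns.
df = pd.DataFrame({
    "user_id": range(8),
    "device": ["mobile", "desktop"] * 4,
    "user_type": ["new", "new", "returning", "returning"] * 2,
})
print(stratified_assign(df, ["device", "user_type"]))
```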

Common Design Errors

Repeated Analysis (Peeking)

Checking results daily and stopping when significance is observed inflates false positive rates beyond the nominal alpha level.

Solutions:

  • Commit to fixed duration determined by power analysis
  • Use sequential testing methods designed for continuous monitoring
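
A small simulation, with made-up traffic numbers, illustrates the peeking problem: under a true null effect, stopping at the first significant daily check pushes the false positive rate well above the nominal 5%.

```python
# Simulation sketch: daily peeking under a true null effect. All parameters
# (simulations, days, users per day) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, days, users_per_day, alpha = 1000, 14, 200, 0.05
false_positives = 0

for _ in range(n_sims):
    control = np.array([])
    treatment = np.array([])
    for _ in range(days):
        control = np.append(control, rng.normal(0, 1, users_per_day))
        treatment = np.append(treatment, rng.normal(0, 1, users_per_day))  # no true effect
        _, p = stats.ttest_ind(control, treatment)
        if p < alpha:          # "peek" and stop at the first significant result
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")
```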

Excessive Variants

Each variant needs its own fully powered arm, so testing 5 variants against a control requires approximately 5x the sample size of a simple A/B test to maintain statistical power for each comparison; correcting for the multiple comparisons raises the requirement further (see the sketch below).
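
A sketch of the arithmetic, reusing the statsmodels power calculation with assumed baseline and lift: each arm needs its own powered sample, and splitting alpha across the pairwise comparisons against control (Bonferroni) raises the per-arm requirement further.

```python
# Sketch: total sample grows with the number of variants, both because each
# arm needs its own sample and because alpha is split across comparisons.
# The baseline rate and lift below are assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.105, 0.10)    # assumed baseline and lift

for n_variants in (1, 2, 4):                        # variants compared against one control
    adj_alpha = 0.05 / n_variants                   # Bonferroni across pairwise tests
    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect_size, alpha=adj_alpha, power=0.80
    )
    total = n_per_arm * (n_variants + 1)            # all variant arms plus control
    print(f"{n_variants} variant(s): {total:,.0f} total users")
```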

Novelty and Primacy Effects

Effect | Description | Typical duration
Novelty | Increased engagement with a new feature simply because it is new | 2-3 weeks to diminish
Primacy | Resistance to change; initial negative reaction | Typically 1-2 weeks

Experiments should run at least 2-3 weeks to allow these effects to stabilize.

Network Effects

In applications with user interaction (social networks, marketplaces), treatment users may influence control users, violating the independence assumption.

Solution: Cluster randomization (randomize friend groups, geographic regions, or other natural clusters).
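
A minimal sketch of cluster-level assignment, hashing a cluster ID instead of a user ID so connected users share a treatment; the cluster and experiment names are hypothetical:

```python
# Cluster randomization sketch: assign whole clusters (e.g., geographic regions)
# to a variant so users who interact with each other see the same treatment.
import hashlib

def assign_cluster(cluster_id: str, experiment: str) -> str:
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

def variant_for_user(user_cluster: dict, user_id: str, experiment: str) -> str:
    # Every user inherits the assignment of their cluster.
    return assign_cluster(user_cluster[user_id], experiment)

clusters = {"user_1": "region_east", "user_2": "region_east", "user_3": "region_west"}
for uid in clusters:
    print(uid, variant_for_user(clusters, uid, "feed_ranking_test"))
```

When analyzing a cluster-randomized experiment, the cluster rather than the individual user is the effective unit, so variance estimates should account for within-cluster correlation.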

Multiple Testing

Testing 20 metrics at alpha = 0.05 produces approximately one false positive on average even when no true effects exist.

Solutions:

  • Pre-register primary metric
  • Apply Bonferroni correction (alpha / number of tests)
  • Use False Discovery Rate methods
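
A sketch applying the two corrections listed above to a set of illustrative p-values with statsmodels:

```python
# Sketch: Bonferroni and Benjamini-Hochberg (FDR) corrections applied to a set
# of p-values. The p-values below are made up for illustration.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.030, 0.045, 0.20, 0.51]

for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [f"{p:.3f}" for p in p_adjusted], reject.tolist())
```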

Result Analysis

Analysis Checklist

  1. Randomization balance: Verify control and treatment groups have similar baseline characteristics
  2. Point estimate: Measured lift or difference
  3. Confidence interval: Range of plausible effect sizes
  4. Statistical significance: p-value relative to alpha threshold (items 2-4 are computed in the sketch after this list)
  5. Guardrail metrics: Verify no negative impact on critical metrics
  6. Segment analysis: Check for heterogeneous effects by user type, platform, geography
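
A minimal sketch of items 2-4 for a conversion metric, computing the lift, a 95% confidence interval, and a p-value with statsmodels; the counts are illustrative:

```python
# Sketch: point estimate, confidence interval, and p-value for the difference
# in conversion rates between treatment and control. Counts are assumptions.
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

conversions = [1250, 1180]      # treatment, control (assumed)
sample_sizes = [10000, 10000]   # users per group (assumed)

lift = conversions[0] / sample_sizes[0] - conversions[1] / sample_sizes[1]
stat, p_value = proportions_ztest(conversions, sample_sizes)
ci_low, ci_high = confint_proportions_2indep(
    conversions[0], sample_sizes[0], conversions[1], sample_sizes[1]
)

print(f"Lift: {lift:+.2%}, 95% CI: [{ci_low:+.2%}, {ci_high:+.2%}], p = {p_value:.3f}")
```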

Result Interpretation

Observation | Interpretation | Recommended Action
Significant positive lift | Effect likely real | Consider shipping; monitor post-launch
Significant negative lift | Negative effect detected | Do not ship; investigate cause
Not significant, narrow CI | Effect is likely small or zero | Conclude no meaningful effect
Not significant, wide CI | Insufficient statistical power | Extend duration or accept uncertainty

The final case is common: "After 3 weeks, the effect could be +2% or -2%." Low-traffic experiments may not have sufficient power to detect small effects.
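
When traffic is fixed, the power calculation can be inverted to find the smallest standardized effect the experiment could plausibly detect, which helps distinguish "no effect" from "not enough power". A sketch with an assumed sample size:

```python
# Sketch: solve for the detectable effect size given a fixed sample, to judge
# whether "not significant, wide CI" simply reflects low power. Numbers assumed.
from statsmodels.stats.power import NormalIndPower

n_per_group = 5000              # traffic actually available (assumed)
mde_effect = NormalIndPower().solve_power(
    effect_size=None, nobs1=n_per_group, alpha=0.05, power=0.80
)
print(f"Smallest detectable standardized effect at 80% power: {mde_effect:.3f}")
```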