Experimentation for Data Scientists
Experiment design determines the validity of results. Statistical analysis cannot correct for design flaws.
Pre-Experiment Requirements
Required Specifications
| Specification | Description |
|---|---|
| Hypothesis | Specific, falsifiable prediction (e.g., "Social proof will increase conversion by at least 5%") |
| Primary metric | Single metric that determines success |
| Randomization unit | Entity receiving random assignment (user, session, page view) |
| Duration | Calculated from power analysis |
Power Analysis
Required Inputs
- Minimum detectable effect (smallest meaningful improvement)
- Significance level (alpha, typically 0.05)
- Statistical power (1-beta, typically 0.80)
- Metric variance (from historical data)
Sample Size Relationship
Sample size is a function of effect size, significance level, power, and metric variance.
Required sample size grows roughly with the inverse square of the effect size: a 1% lift requires on the order of 100x the data that a 10% lift does, not merely 10x.
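As a rough illustration, here is a minimal sketch of the standard two-sample normal-approximation formula, assuming a conversion-rate (proportion) metric; the function name, baseline rate, and lift values are illustrative, not taken from the text above.

```python
from scipy.stats import norm

def sample_size_per_group(p_baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample test of proportions."""
    p1 = p_baseline
    p2 = p_baseline * (1 + mde_relative)  # treatment rate under the MDE
    z_alpha = norm.ppf(1 - alpha / 2)     # two-sided significance threshold
    z_beta = norm.ppf(power)              # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(variance * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2) + 1

# A 1% relative lift needs roughly 100x the sample of a 10% lift
print(sample_size_per_group(0.05, 0.10))  # 10% lift: ~31k per group
print(sample_size_per_group(0.05, 0.01))  # 1% lift: ~3M per group
```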
Randomization
Randomization Unit Selection
| Unit | Characteristics | Considerations |
|---|---|---|
| User | Consistent experience, most common | Slower sample accumulation |
| Session | More data points, suitable for logged-out users | Same user may see different treatments |
| Page view | Maximum sample size | Inconsistent user experience |
Trade-off: Smaller units provide more samples but may expose the same user to multiple treatments, which dilutes the measured effect and makes results harder to interpret for most experiment types.
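Unit-level assignment is often implemented with deterministic hashing, so the same unit always lands in the same group without storing state. A sketch, assuming user IDs as the randomization unit; the function name and salt are illustrative.

```python
import hashlib

def assign_variant(unit_id: str, experiment_salt: str, n_variants: int = 2) -> int:
    """Deterministically map a randomization unit to a variant bucket."""
    key = f"{experiment_salt}:{unit_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10000
    return bucket * n_variants // 10000  # 0 = control, 1..k-1 = treatments

# The same user always receives the same assignment within an experiment
assert assign_variant("user_42", "checkout_cta_v1") == assign_variant("user_42", "checkout_cta_v1")
```

Salting the hash with a per-experiment identifier keeps assignments independent across concurrent experiments.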
Stratified Randomization
When treatment effects may vary by segment, stratify randomization to ensure balanced representation:
- Device type
- New vs returning users
- Geography
- Acquisition channel
Stratification reduces variance and enables detection of segment-specific effects.
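A sketch of stratified assignment using pandas, assuming a user table with a `device_type` column; the column names and data are illustrative. Within each stratum, users are shuffled and split evenly between groups.

```python
import numpy as np
import pandas as pd

def stratified_assign(users: pd.DataFrame, stratum_col: str, seed: int = 7) -> pd.DataFrame:
    """Randomize to treatment/control separately within each stratum."""
    rng = np.random.default_rng(seed)
    out = users.copy()
    out["group"] = "control"
    for _, idx in out.groupby(stratum_col).groups.items():
        idx = rng.permutation(idx)      # shuffle within the stratum
        treated = idx[: len(idx) // 2]  # first half to treatment
        out.loc[treated, "group"] = "treatment"
    return out

users = pd.DataFrame({"user_id": range(8),
                      "device_type": ["ios", "android"] * 4})
print(stratified_assign(users, "device_type")["group"].value_counts())
```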
Common Design Errors
Repeated Analysis (Peeking)
Checking results daily and stopping when significance is observed inflates false positive rates beyond the nominal alpha level.
Solutions:
- Commit to fixed duration determined by power analysis
- Use sequential testing methods designed for continuous monitoring
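A small simulation illustrating the inflation, assuming an A/A setup (no true effect) with daily significance checks; the traffic numbers are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_days, users_per_day, n_sims, alpha = 14, 200, 2000, 0.05
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=n_days * users_per_day)
    b = rng.normal(size=n_days * users_per_day)  # identical distribution: any "win" is noise
    for day in range(1, n_days + 1):
        n = day * users_per_day
        if ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1  # stop at the first significant peek
            break
print(false_positives / n_sims)  # well above the nominal 0.05
```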
Excessive Variants
Each variant arm needs the full per-arm sample size from the power analysis, so testing 5 variants requires roughly 5 arms' worth of traffic; correcting for the multiple comparisons against control raises the per-arm requirement further.
Novelty and Primacy Effects
| Effect | Description | Typical duration |
|---|---|---|
| Novelty | Temporary engagement boost driven by curiosity about a new feature | 2-3 weeks to diminish |
| Primacy | Initial resistance to change among users accustomed to the old experience | 1-2 weeks |
Experiments should run at least 2-3 weeks to allow these effects to stabilize.
Network Effects
In applications with user interaction (social networks, marketplaces), treatment users may influence control users, violating the independence assumption.
Solution: Cluster randomization (randomize friend groups, geographic regions, or other natural clusters).
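A sketch of cluster randomization, assuming a mapping from users to geographic regions; the region names and experiment salt are illustrative. Assignment happens at the cluster level, and every user inherits their cluster's group.

```python
import hashlib

def assign_cluster(cluster_id: str, experiment_salt: str) -> str:
    """Assign an entire cluster (e.g., a region) to one group."""
    key = f"{experiment_salt}:{cluster_id}".encode()
    return "treatment" if int(hashlib.sha256(key).hexdigest(), 16) % 2 else "control"

user_region = {"user_1": "us-west", "user_2": "us-west", "user_3": "eu-central"}
# Users in the same region always share a group, so in-cluster
# interactions stay within one arm of the experiment.
groups = {u: assign_cluster(r, "feed_ranking_v2") for u, r in user_region.items()}
print(groups)
```

Analysis must then account for the clustering (e.g., cluster-robust standard errors), since the effective sample size is closer to the number of clusters than the number of users.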
Multiple Testing
Testing 20 metrics at alpha = 0.05 produces approximately 1 false positive on average (20 × 0.05 = 1), even when no true effects exist.
Solutions:
- Pre-register primary metric
- Apply Bonferroni correction (alpha / number of tests)
- Use False Discovery Rate methods
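A sketch of the last two corrections using statsmodels' `multipletests` helper, assuming a list of p-values from secondary metrics; the values shown are illustrative.

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.020, 0.041, 0.350, 0.800]  # illustrative metric p-values

# Bonferroni: compare each p-value against alpha / number of tests
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the False Discovery Rate instead
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(reject_bonf)  # only the strongest result survives Bonferroni
print(reject_bh)    # BH also keeps the second result here
```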
Result Analysis
Analysis Checklist
- Randomization balance: Verify control and treatment groups have similar baseline characteristics
- Point estimate: Measured lift or difference
- Confidence interval: Range of plausible effect sizes
- Statistical significance: p-value relative to alpha threshold
- Guardrail metrics: Verify no negative impact on critical metrics
- Segment analysis: Check for heterogeneous effects by user type, platform, geography
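A minimal sketch of the point estimate, confidence interval, and p-value items for a conversion metric, using the normal approximation for a difference in proportions; the function name and counts are illustrative.

```python
import numpy as np
from scipy.stats import norm

def analyze_proportions(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Difference in conversion rates with a Wald confidence interval."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    p_value = 2 * norm.sf(abs(diff / se))       # two-sided test
    half_width = norm.ppf(1 - alpha / 2) * se   # CI half-width
    return diff, (diff - half_width, diff + half_width), p_value

diff, ci, p = analyze_proportions(conv_c=4_850, n_c=100_000,
                                  conv_t=5_150, n_t=100_000)
print(f"lift={diff:+.4f}, 95% CI=({ci[0]:+.4f}, {ci[1]:+.4f}), p={p:.3f}")
```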
Result Interpretation
| Observation | Interpretation | Recommended Action |
|---|---|---|
| Significant positive lift | Effect likely real | Consider shipping; monitor post-launch |
| Significant negative lift | Negative effect detected | Do not ship; investigate cause |
| Not significant, narrow CI | Effect is likely small or zero | Conclude no meaningful effect |
| Not significant, wide CI | Insufficient statistical power | Extend duration or accept uncertainty |
The final case is common: "After 3 weeks, the effect could be +2% or -2%." Low-traffic experiments may not have sufficient power to detect small effects.