Statistics for Data Science Interviews

Statistics provides the framework for making decisions from data. This section covers the concepts and methods commonly tested in data science interviews.

Descriptive Statistics

Measures of Central Tendency

| Measure | Description | Properties |
|---------|-------------|------------|
| Mean | Sum divided by count | Sensitive to outliers |
| Median | Middle value when sorted | Robust to outliers |
| Mode | Most frequent value | Applicable to categorical data |

Selection guidance: Use median for skewed distributions (e.g., salary data). Use mean for symmetric distributions.
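The contrast above can be seen directly with Python's standard-library `statistics` module; the salary figures are invented for illustration:

```python
import statistics

# Hypothetical right-skewed salary sample (one high earner)
salaries = [42_000, 45_000, 48_000, 50_000, 52_000, 55_000, 250_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # robust to the outlier: 50000
top_category = statistics.mode(["a", "b", "b", "c"])  # mode works on categorical data: 'b'
```

Here the mean (about 77,400) sits well above the median (50,000), which is why median is the better summary for skewed salary data.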

Measures of Spread

| Measure | Description | Properties |
|---------|-------------|------------|
| Variance | Average squared deviation from the mean | Units are squared |
| Standard deviation | Square root of variance | Same units as the data |
| Interquartile range (IQR) | Distance between the 25th and 75th percentiles | Robust to outliers |
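All three measures are available in the stdlib; a small sketch with made-up data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # population variance: 4.0 (squared units)
std_dev = statistics.pstdev(data)      # 2.0, back in the original units
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                          # robust spread: unaffected by extreme values
```

Note that `pvariance`/`pstdev` are the population versions; `variance`/`stdev` apply the n-1 (sample) correction.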

Hypothesis Testing

Framework

  1. State hypotheses

    • H0 (null): No effect or no difference
    • H1 (alternative): Effect exists
  2. Select significance level (alpha)

    • Standard value: 0.05 (5% false positive rate)
  3. Collect data and compute test statistic

  4. Calculate p-value

    • Probability of observing data this extreme if H0 is true
  5. Decision

    • p < alpha: Reject H0
    • p >= alpha: Fail to reject H0
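The five steps can be run end to end with a hand-rolled two-sided z-test (a large-sample simplification of the t-test, kept stdlib-only; the sample data is invented):

```python
import math
import statistics

def two_sided_z_test(sample, mu0, alpha=0.05):
    """Steps 3-5: test statistic, p-value, and decision for H0: mean == mu0."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF (large-n approximation)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# Hypothetical measurements; H0: the true mean is 5.0
sample = [5.1, 4.9, 5.0, 5.2, 4.8] * 4
z, p, decision = two_sided_z_test(sample, mu0=5.0)
```

With the sample mean exactly at `mu0`, z is 0 and the p-value is 1, so H0 is not rejected; testing against `mu0=6.0` instead yields a huge |z| and a rejection.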

P-Value Definition

The p-value is the probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true.

P-value is NOT:

  • Probability that the null hypothesis is true
  • Probability of making an error
  • Probability that the effect is real

Example interpretation: p = 0.03 means "If there were truly no effect, data this extreme would occur only 3% of the time."

Error Types

| Decision | H0 True (No Effect) | H0 False (Effect Exists) |
|----------|---------------------|--------------------------|
| Reject H0 | Type I error (alpha) | Correct decision |
| Fail to reject H0 | Correct decision | Type II error (beta) |

| Error Type | Description | Controlled By |
|------------|-------------|---------------|
| Type I (false positive) | Detecting an effect that does not exist | alpha (significance level) |
| Type II (false negative) | Missing an effect that exists | Power (1 - beta) |

Trade-off: Decreasing alpha reduces false positives but increases false negatives. Both cannot be minimized simultaneously.

Common Statistical Tests

| Test | Application |
|------|-------------|
| One-sample t-test | Compare sample mean to known value |
| Two-sample t-test | Compare means of two independent groups |
| Paired t-test | Compare means of paired observations (before/after) |
| Chi-square test | Test independence of categorical variables |
| ANOVA | Compare means across 3+ groups |
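In practice these tests are one-liners, assuming SciPy is available; the group measurements and contingency counts below are invented:

```python
from scipy import stats

# Invented measurements for two independent groups
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
group_b = [13.5, 13.8, 13.2, 13.9, 13.6, 13.4]

t_stat, p_means = stats.ttest_ind(group_a, group_b)  # two-sample t-test

# Chi-square test of independence on a 2x2 contingency table
table = [[30, 10],
         [15, 25]]
chi2, p_indep, dof, expected = stats.chi2_contingency(table)
```

For the paired case, `stats.ttest_rel` takes the two paired samples; `stats.f_oneway` covers the ANOVA row.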

Confidence Intervals

A confidence interval provides a range of plausible values for a parameter.

Formula (95% CI for mean)

sample mean ± 1.96 × (standard deviation / √n)
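A stdlib sketch of the formula with made-up data (1.96 is the normal-distribution critical value; for small samples a t critical value gives a slightly wider interval):

```python
import math
import statistics

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]  # invented data
n = len(sample)
mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)  # standard deviation / sqrt(n)
lower = mean - 1.96 * std_err
upper = mean + 1.96 * std_err
```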

Interpretation

"95% confident" means: If the experiment were repeated 100 times, approximately 95 of the resulting confidence intervals would contain the true parameter value.

Note: This does NOT mean "95% probability that the true value is in this specific interval." The true value is fixed; it either is or is not in the interval.

Factors Affecting Width

| Factor | Effect on Width |
|--------|-----------------|
| Sample size (n) | Larger n → narrower interval |
| Data variability | More spread → wider interval |
| Confidence level | Higher confidence → wider interval |

Effect Size

Statistical significance does not indicate practical importance. Effect size measures the magnitude of an effect.

Cohen's d

d = (mean1 - mean2) / pooled standard deviation

| d Value | Interpretation |
|---------|----------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

Example: A treatment may produce a statistically significant (p < 0.001) reduction of 0.5 mmHg in blood pressure. The effect is statistically significant but may not be clinically meaningful.
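The formula is easy to implement directly; the two groups below are invented, shifted by one unit:

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

d = cohens_d([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])  # 1 / sqrt(2.5) ≈ 0.63
```

The resulting d of roughly 0.63 would read as a medium-to-large effect per the table above.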

Correlation vs Causation

| Concept | Definition |
|---------|------------|
| Correlation | Two variables move together |
| Causation | One variable directly influences another |

Reasons correlation does not imply causation:

  • Confounding variable affects both
  • Reverse causality
  • Coincidental correlation

Requirements for establishing causation:

  1. Correlation exists (necessary but not sufficient)
  2. Temporal order (cause precedes effect)
  3. No confounders (or adjusted for confounders)
  4. Ideally, randomized experiment

Regression

Linear regression models the relationship between variables:

Y = beta0 + beta1 × X + error term

| Component | Interpretation |
|-----------|----------------|
| beta0 (intercept) | Predicted Y when X = 0 |
| beta1 (slope) | Change in Y per unit increase in X |
| R-squared | Proportion of variance explained (0 to 1) |
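These components can be computed from scratch with the least-squares formulas; the four data points are invented and lie exactly on y = 1 + 2x:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot  # proportion of variance explained
    return slope, intercept, r_squared

slope, intercept, r2 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Because the points fall exactly on the line, the residual sum of squares is zero and R-squared is 1.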

Assumptions

| Assumption | Verification Method |
|------------|---------------------|
| Linear relationship | Scatter plot |
| Independence | Study design |
| Homoscedasticity | Residual plot (constant variance) |
| Normality of residuals | Q-Q plot |

When assumptions are violated, standard errors and p-values may be unreliable.

Regularization

| Method | Penalty | Use Case |
|--------|---------|----------|
| Ridge (L2) | Sum of squared coefficients | All features potentially relevant |
| Lasso (L1) | Sum of absolute coefficients | Feature selection desired |
| Elastic Net | Combination of L1 and L2 | Balance between ridge and lasso |
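The shrinkage effect of the L2 penalty is visible in the one-feature case, where ridge has a simple closed form (feature assumed centered; data invented):

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope for a single centered feature."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, lam):
    """The L2 penalty lam is added to the denominator, shrinking the slope toward zero."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [-2, -1, 0, 1, 2]   # centered feature
ys = [-4, -2, 0, 2, 4]   # true slope = 2
shrunk = ridge_slope(xs, ys, lam=10.0)  # half the OLS slope here
```

Lasso's L1 penalty has no such closed form; it soft-thresholds coefficients, driving some exactly to zero, which is what makes it a feature-selection tool.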

Multiple Testing Problem

Testing multiple hypotheses inflates the overall false positive rate.

Example: With 20 tests at alpha = 0.05, the probability of at least one false positive is approximately 64%.

Corrections

| Method | Approach | Properties |
|--------|----------|------------|
| Bonferroni | alpha / number of tests | Conservative (reduces power) |
| False Discovery Rate (FDR) | Controls expected proportion of false discoveries | Less conservative |
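The 64% figure from the example, and the effect of a Bonferroni correction, follow from independent-test arithmetic:

```python
m, alpha = 20, 0.05

# P(at least one false positive) across m independent tests at level alpha
fwer = 1 - (1 - alpha) ** m                 # ≈ 0.64

bonferroni = alpha / m                      # per-test threshold: 0.0025
fwer_corrected = 1 - (1 - bonferroni) ** m  # family-wise rate back under 0.05
```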

Common Interview Topics

Explain p-value to non-technical audience

"The p-value indicates how surprising the data would be if there were no real effect. A small p-value suggests the data would be unlikely to occur by chance alone, so we conclude there is probably a real effect."

Type I vs Type II error distinction

Type I error is a false alarm: concluding an effect exists when it does not. Type II error is a miss: failing to detect an effect that exists.

Application: In medical testing, Type I may lead to unnecessary treatment of healthy individuals. Type II may result in missed diagnoses.

When to use non-parametric tests

Non-parametric tests are appropriate when normality assumptions are violated, particularly with small samples or highly skewed data. The trade-off is reduced statistical power, requiring larger samples to detect the same effect size.
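For example, the Mann-Whitney U test (the non-parametric counterpart of the two-sample t-test) compares ranks rather than means, so outliers barely move it. A sketch assuming SciPy is available, with invented skewed samples:

```python
from scipy import stats

# Invented small, skewed samples with one extreme value each
skewed_a = [1, 2, 2, 3, 3, 4, 50]
skewed_b = [5, 6, 7, 8, 9, 10, 60]

u_stat, p = stats.mannwhitneyu(skewed_a, skewed_b, alternative="two-sided")
```

The extreme values (50, 60) would inflate the t-test's variance estimate, but they contribute only their ranks here.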

A/B test shows p = 0.06

At alpha = 0.05, the result is not statistically significant, but it is close to the threshold. Relevant considerations:

  • Effect size magnitude
  • Confidence interval range
  • Sample size adequacy (power analysis)
  • Option to extend test duration or acknowledge uncertainty

Common Pitfalls

| Pitfall | Description |
|---------|-------------|
| P-hacking | Running multiple analyses until finding a significant result |
| Simpson's Paradox | Aggregate trend reverses when data is segmented |
| Survivorship Bias | Analyzing only successful cases, missing failures |
| HARKing | Hypothesizing After Results are Known |

Quick Reference

| Scenario | Approach |
|----------|----------|
| Compare two means | t-test (Mann-Whitney U if non-normal) |
| Compare 3+ means | ANOVA |
| Relationship between variables | Correlation / regression |
| Categorical variable independence | Chi-square test |
| Before/after same subjects | Paired t-test |
| Feature selection with many variables | Lasso |
| Multiple hypothesis tests | Bonferroni or FDR correction |