Statistics for Data Science Interviews
Statistics provides the framework for making decisions from data. This section covers the concepts and methods commonly tested in data science interviews.
Descriptive Statistics
Measures of Central Tendency
| Measure | Description | Properties |
|---|---|---|
| Mean | Sum divided by count | Sensitive to outliers |
| Median | Middle value when sorted | Robust to outliers |
| Mode | Most frequent value | Applicable to categorical data |
Selection guidance: Use median for skewed distributions (e.g., salary data). Use mean for symmetric distributions.
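The salary example can be verified directly with the standard library (the numbers below are hypothetical):

```python
# Mean vs. median on right-skewed data (hypothetical salaries).
import statistics

salaries = [42_000, 45_000, 48_000, 50_000, 52_000, 55_000, 250_000]
mean_salary = statistics.mean(salaries)      # pulled upward by the one outlier
median_salary = statistics.median(salaries)  # middle value, robust to the outlier
```

Here the single $250,000 salary drags the mean well above every typical value, while the median stays at $50,000.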
Measures of Spread
| Measure | Description | Properties |
|---|---|---|
| Variance | Average squared deviation from mean | Units are squared |
| Standard deviation | Square root of variance | Same units as data |
| Interquartile range (IQR) | Distance between 25th and 75th percentile | Robust to outliers |
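The three spread measures above map to standard-library functions (data values are made up for illustration):

```python
# Variance, standard deviation, and IQR with the stdlib statistics module.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = statistics.pvariance(data)        # population variance, squared units
sd = statistics.pstdev(data)                 # square root of variance, same units as data
q1, _, q3 = statistics.quantiles(data, n=4)  # 25th and 75th percentiles
iqr = q3 - q1                                # robust to outliers
```

Note that `pvariance`/`pstdev` divide by n (population form); `variance`/`stdev` divide by n - 1 (sample form).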
Hypothesis Testing
Framework
1. State hypotheses
   - H0 (null): No effect or no difference
   - H1 (alternative): Effect exists
2. Select significance level (alpha)
   - Standard value: 0.05 (5% false positive rate)
3. Collect data and compute test statistic
4. Calculate p-value
   - Probability of observing data this extreme if H0 is true
5. Decision
   - p < alpha: Reject H0
   - p >= alpha: Fail to reject H0
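The framework can be sketched end to end with a one-sample z-test, using only the standard library. The sample values and the known population sigma are hypothetical:

```python
# Minimal sketch of the hypothesis-testing framework (one-sample z-test,
# population sigma assumed known; data is hypothetical).
import math
from statistics import NormalDist, mean

# Step 1: H0: population mean = 100; H1: mean != 100 (two-sided)
sample = [102, 98, 105, 110, 99, 104, 107, 101, 103, 106]
sigma = 5
# Step 2: significance level
alpha = 0.05
# Step 3: test statistic
z = (mean(sample) - 100) / (sigma / math.sqrt(len(sample)))
# Step 4: two-sided p-value under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
# Step 5: decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
```

With real data and unknown sigma, a t-test (e.g. `scipy.stats.ttest_1samp`) replaces the manual z calculation.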
P-Value Definition
The p-value is the probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true.
P-value is NOT:
- Probability that the null hypothesis is true
- Probability of making an error
- Probability that the effect is real
Example interpretation: p = 0.03 means "If there were truly no effect, data this extreme would occur only 3% of the time."
Error Types
| | H0 True (No Effect) | H0 False (Effect Exists) |
|---|---|---|
| Reject H0 | Type I error (alpha) | Correct decision |
| Fail to reject H0 | Correct decision | Type II error (beta) |
| Error Type | Description | Controlled By |
|---|---|---|
| Type I (false positive) | Detecting effect that does not exist | alpha (significance level) |
| Type II (false negative) | Missing effect that exists | beta; reduced by raising power (1 - beta) |
Trade-off: Decreasing alpha reduces false positives but increases false negatives. Both cannot be minimized simultaneously.
Common Statistical Tests
| Test | Application |
|---|---|
| One-sample t-test | Compare sample mean to known value |
| Two-sample t-test | Compare means of two independent groups |
| Paired t-test | Compare means of paired observations (before/after) |
| Chi-square test | Test independence of categorical variables |
| ANOVA | Compare means across 3+ groups |
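If SciPy is available, each test in the table maps to a single call; the samples and contingency counts below are hypothetical:

```python
# One SciPy call per test in the table above (hypothetical data).
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.6, 4.8, 4.5, 4.7, 4.9]
c = [5.5, 5.4, 5.6, 5.3, 5.5]

t1 = stats.ttest_1samp(a, popmean=5.0)  # one-sample: compare mean of a to 5.0
t2 = stats.ttest_ind(a, b)              # two-sample: independent groups a and b
tp = stats.ttest_rel(a, b)              # paired: same subjects, before/after
f = stats.f_oneway(a, b, c)             # ANOVA: three or more groups
chi2, p_chi, dof, expected = stats.chi2_contingency([[20, 30], [25, 25]])
```

Each t-test and ANOVA result exposes `.statistic` and `.pvalue`; `chi2_contingency` also returns the degrees of freedom and expected counts under independence.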
Confidence Intervals
A confidence interval provides a range of plausible values for a parameter.
Formula (95% CI for mean)
sample mean +/- 1.96 x (standard deviation / square root of n)
The quantity in parentheses is the standard error of the mean. For small samples with unknown population variance, replace 1.96 with the appropriate t critical value.
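The formula translates directly into standard-library code (sample values are hypothetical):

```python
# 95% confidence interval for a mean, following the z-based formula above
# (hypothetical data; for small n, a t critical value is more appropriate).
import math
from statistics import NormalDist, mean, stdev

data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 12.4]
n = len(data)
z = NormalDist().inv_cdf(0.975)    # ~1.96 for a 95% interval
se = stdev(data) / math.sqrt(n)    # standard error of the mean
ci = (mean(data) - z * se, mean(data) + z * se)
```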
Interpretation
"95% confident" means: If the experiment were repeated 100 times, approximately 95 of the resulting confidence intervals would contain the true parameter value.
Note: This does NOT mean "95% probability that the true value is in this specific interval." The true value is fixed; it either is or is not in the interval.
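The repeated-sampling interpretation can be demonstrated by simulation: draw many samples from a population with a known mean and count how often the interval covers it (population parameters below are arbitrary):

```python
# Simulate the repeated-sampling interpretation of "95% confident":
# roughly 95% of the intervals should contain the true mean.
import math
import random
from statistics import mean, stdev

random.seed(0)
true_mu, true_sigma, n, trials = 50.0, 10.0, 100, 1000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mu, true_sigma) for _ in range(n)]
    se = stdev(sample) / math.sqrt(n)
    lo, hi = mean(sample) - 1.96 * se, mean(sample) + 1.96 * se
    covered += lo <= true_mu <= hi
coverage = covered / trials  # close to 0.95
```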
Factors Affecting Width
| Factor | Effect on Width |
|---|---|
| Sample size (n) | Larger n = narrower interval |
| Data variability | More spread = wider interval |
| Confidence level | Higher confidence = wider interval |
Effect Size
Statistical significance does not indicate practical importance. Effect size measures the magnitude of an effect.
Cohen's d
d = (mean1 - mean2) / pooled standard deviation
| d Value | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
Example: A treatment may produce a statistically significant (p < 0.001) reduction of 0.5 mmHg in blood pressure. The effect is statistically significant but may not be clinically meaningful.
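The Cohen's d formula above can be computed from scratch; the two groups below are hypothetical:

```python
# Cohen's d with a pooled standard deviation (hypothetical groups,
# equal-variance assumption).
import math
from statistics import mean, stdev

group1 = [23, 25, 27, 24, 26, 28, 25, 27]
group2 = [20, 22, 21, 23, 19, 22, 21, 20]

n1, n2 = len(group1), len(group2)
s1, s2 = stdev(group1), stdev(group2)
pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (mean(group1) - mean(group2)) / pooled  # well above 0.8: large effect
```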
Correlation vs Causation
| Concept | Definition |
|---|---|
| Correlation | Two variables move together |
| Causation | One variable directly influences another |
Reasons correlation does not imply causation:
- Confounding variable affects both
- Reverse causality
- Coincidental correlation
Requirements for establishing causation:
- Correlation exists (necessary but not sufficient)
- Temporal order (cause precedes effect)
- No confounders (or adjusted for confounders)
- Ideally, randomized experiment
Regression
Linear regression models the relationship between variables:
Y = intercept + slope x X + error term
| Component | Interpretation |
|---|---|
| beta0 (intercept) | Predicted Y when X = 0 |
| beta1 (slope) | Change in Y per unit increase in X |
| R-squared | Proportion of variance explained (0 to 1) |
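Simple linear regression can be fit from the closed-form least-squares formulas, recovering all three components in the table (the x/y values are fabricated to follow Y ≈ 2X):

```python
# Ordinary least squares for one predictor, from the closed-form solution
# (hypothetical data generated near Y = 2X).
from statistics import mean

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 16.1]

xbar, ybar = mean(x), mean(y)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))   # beta1
intercept = ybar - slope * xbar                 # beta0

pred = [intercept + slope * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - ybar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot  # proportion of variance explained
```

In practice, libraries such as statsmodels or scikit-learn handle multiple predictors and report standard errors and p-values as well.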
Assumptions
| Assumption | Verification Method |
|---|---|
| Linear relationship | Scatter plot |
| Independence | Study design |
| Homoscedasticity | Residual plot (constant variance) |
| Normality of residuals | Q-Q plot |
When assumptions are violated, standard errors and p-values may be unreliable.
Regularization
| Method | Penalty | Use Case |
|---|---|---|
| Ridge (L2) | Sum of squared coefficients | All features potentially relevant |
| Lasso (L1) | Sum of absolute coefficients | Feature selection desired |
| Elastic Net | Combination of L1 and L2 | Balance between ridge and lasso |
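Ridge has a closed form, which makes the shrinkage effect easy to see. This sketch assumes NumPy is available and uses synthetic data with made-up true coefficients:

```python
# Ridge regression via its closed form, w = (X^T X + lam*I)^(-1) X^T y.
# Synthetic data; larger lam shrinks coefficients toward zero.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_small = ridge(X, y, lam=0.01)   # near the unpenalized OLS solution
w_large = ridge(X, y, lam=100.0)  # heavily shrunk coefficients
```

Lasso has no closed form (the L1 penalty is non-differentiable at zero) and is typically fit by coordinate descent, e.g. scikit-learn's `Lasso`.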
Multiple Testing Problem
Testing multiple hypotheses inflates the overall false positive rate.
Example: With 20 tests at alpha = 0.05, the probability of at least one false positive is approximately 64%.
Corrections
| Method | Approach | Properties |
|---|---|---|
| Bonferroni | alpha / number of tests | Conservative (reduces power) |
| False Discovery Rate (FDR) | Controls expected proportion of false discoveries | Less conservative |
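The 64% figure and the effect of the Bonferroni correction both follow from the complement rule, assuming independent tests:

```python
# Family-wise error rate for m independent tests, before and after Bonferroni.
m, alpha = 20, 0.05
fwer = 1 - (1 - alpha) ** m                        # ~0.64: P(at least one false positive)
bonferroni_alpha = alpha / m                       # 0.0025 per-test threshold
fwer_corrected = 1 - (1 - bonferroni_alpha) ** m   # back below 0.05
```

For FDR control, `statsmodels.stats.multitest.multipletests` implements the Benjamini-Hochberg procedure.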
Common Interview Topics
Explain p-value to non-technical audience
"The p-value indicates how surprising the data would be if there were no real effect. A small p-value means data this extreme would rarely occur by chance alone, which we take as evidence that a real effect exists."
Type I vs Type II error distinction
Type I error is a false alarm: concluding an effect exists when it does not. Type II error is a miss: failing to detect an effect that exists.
Application: In medical testing, Type I may lead to unnecessary treatment of healthy individuals. Type II may result in missed diagnoses.
When to use non-parametric tests
Non-parametric tests are appropriate when normality assumptions are violated, particularly with small samples or highly skewed data. The trade-off is reduced statistical power, requiring larger samples to detect the same effect size.
A/B test shows p = 0.06
At alpha = 0.05, the result is not statistically significant, but it is close to the threshold. Relevant considerations:
- Effect size magnitude
- Confidence interval range
- Sample size adequacy (power analysis)
- Option to extend test duration or acknowledge uncertainty
Common Pitfalls
| Pitfall | Description |
|---|---|
| P-hacking | Running multiple analyses until finding significant result |
| Simpson's Paradox | Aggregate trend reverses when data is segmented |
| Survivorship Bias | Analyzing only successful cases, missing failures |
| HARKing | Hypothesizing After Results are Known |
Quick Reference
| Scenario | Approach |
|---|---|
| Compare two means | t-test (Mann-Whitney if non-normal) |
| Compare 3+ means | ANOVA |
| Relationship between variables | Correlation / regression |
| Categorical variable independence | Chi-square test |
| Before/after same subjects | Paired t-test |
| Feature selection with many variables | Lasso |
| Multiple hypothesis tests | Bonferroni or FDR correction |