Data Science Concept Questions
This section covers concept questions that appear frequently in data science interviews, along with expected answer frameworks.
Statistics & Probability
Q1: Central Limit Theorem
Question: What is the Central Limit Theorem and why does it matter?
Expected answer:
The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the underlying population distribution.
Significance:
- Many common statistical methods assume normality (confidence intervals, hypothesis tests, regression inference)
- The CLT allows these methods to work even when the underlying data is non-normal
- Rule of thumb: n >= 30 is often sufficient, though heavily skewed distributions require larger samples
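A quick simulation makes this concrete (a minimal sketch assuming numpy; the exponential population and the sample sizes are arbitrary illustrations): means of samples drawn from a heavily skewed distribution look increasingly normal as n grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily right-skewed population: exponential with mean 1 (skewness = 2).
for n in (2, 30, 200):
    # Draw 10,000 samples of size n and record each sample's mean.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness of the sampling distribution shrinks toward 0 (normal) as n grows.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n={n:>3}  mean of means={means.mean():.3f}  skewness={skew:.2f}")
```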
Q2: Type I and Type II Errors
Question: Explain the difference between Type I and Type II errors.
Expected answer:
| Error Type | Definition | Example (Medical Testing) |
|---|---|---|
| Type I (false positive) | Detecting effect that does not exist | Treating healthy patient |
| Type II (false negative) | Missing effect that exists | Missing disease |
Relationship:
- alpha controls Type I error rate (typically 0.05)
- Power = 1 - beta = probability of detecting true effect
- Trade-off: At a fixed sample size, decreasing alpha increases the Type II error rate (lower power)
Context dependence: Which error is worse depends on consequences. For serious disease with good treatment, Type II (missing diagnosis) is worse. For risky treatment, Type I (unnecessary treatment) may be worse.
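A small simulation estimates both quantities directly (a sketch assuming scipy; the 0.5-standard-deviation effect and n = 50 per group are arbitrary choices): when the null is true, rejections are Type I errors and occur at roughly the alpha rate; when a real effect exists, the rejection rate estimates power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 50, 5_000

# Case 1: no true difference -> every rejection is a Type I error.
# Case 2: true difference of 0.5 SD -> every non-rejection is a Type II error.
for true_diff, label in [(0.0, "Type I error rate"), (0.5, "Power (1 - Type II rate)")]:
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    print(f"{label}: {rejections / trials:.3f}")
```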
Q3: P-Value
Question: What is a p-value? What does p = 0.03 mean?
Expected answer:
Definition: The p-value is the probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true.
Interpretation of p = 0.03: If there were truly no effect, data at least this extreme would occur only 3% of the time.
Common misconceptions (what p-value is NOT):
- Probability that the null hypothesis is true
- Probability that the effect is real
- Probability of making an error
- Indicator of effect size (large samples can produce small p-values for negligible effects)
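The definition can be checked directly by simulation (a sketch assuming scipy; the observed sample is made up): generate many datasets in which the null hypothesis is true and count how often the test statistic is at least as extreme as the observed one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical observed sample; H0: true mean = 0.
sample = rng.normal(0.4, 1.0, 25)
t_obs, p_scipy = stats.ttest_1samp(sample, popmean=0.0)

# The p-value by its definition: simulate datasets where H0 holds and count
# how often the t statistic is at least as extreme as the observed one.
null_t = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, 25), popmean=0.0).statistic
    for _ in range(20_000)
])
p_sim = np.mean(np.abs(null_t) >= abs(t_obs))
print(f"scipy p-value: {p_scipy:.4f}   simulated p-value: {p_sim:.4f}")
```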
Q4: T-test vs Z-test
Question: When would you use a t-test vs a z-test?
Expected answer:
| Test | Use When |
|---|---|
| Z-test | Population standard deviation known (rare) |
| T-test | Population standard deviation estimated from sample (common) |
Practical guidance: Use t-tests. Population standard deviation is rarely known. With large samples (n > 30), t-tests and z-tests produce nearly identical results.
T-test variants:
| Type | Application |
|---|---|
| One-sample | Compare sample mean to known value |
| Two-sample | Compare means of two independent groups |
| Paired | Compare before/after on same subjects |
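Each variant maps to a scipy function (a sketch with made-up measurements; scipy is an assumed dependency).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(100, 15, 40)          # e.g. control measurements
group_b = rng.normal(108, 15, 40)          # e.g. treatment measurements
after   = group_a + rng.normal(5, 3, 40)   # same subjects, remeasured later

print(stats.ttest_1samp(group_a, popmean=100))  # one-sample: mean vs known value
print(stats.ttest_ind(group_a, group_b))        # two-sample: independent groups
print(stats.ttest_rel(group_a, after))          # paired: before/after, same subjects
```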
Q5: Bayes' Theorem
Question: What is Bayes' theorem? Provide a practical example.
Expected answer:
Formula:
P(A|B) = P(B|A) x P(A) / P(B)
Medical test example:
- Disease prevalence: 1%
- Test sensitivity (true positive rate): 95%
- Test specificity (true negative rate): 90%
- Question: Given positive test, what is probability of disease?
Calculation:
Using Bayes' theorem: P(Disease|Positive) = P(Positive|Disease) x P(Disease) / P(Positive)
First, calculate P(Positive) = (0.95 x 0.01) + (0.10 x 0.99) = 0.1085
Then, P(Disease|Positive) = (0.95 x 0.01) / 0.1085 = approximately 8.8%
Interpretation: Despite positive test, probability of disease is approximately 8.8%. The low base rate (1% prevalence) means most positive tests are false positives.
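The same arithmetic in a few lines of Python, using the numbers above:

```python
# Medical test example from above.
prevalence = 0.01      # P(Disease)
sensitivity = 0.95     # P(Positive | Disease)
specificity = 0.90     # P(Negative | No Disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(Positive) = {p_positive:.4f}")                          # 0.1085
print(f"P(Disease | Positive) = {p_disease_given_positive:.3f}")  # ~0.088
```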
A/B Testing & Experimentation
Q6: Sample Size Calculation
Question: How do you calculate sample size for an A/B test?
Expected answer:
Required inputs:
- Baseline rate (current metric value)
- Minimum Detectable Effect (MDE)
- Significance level (alpha, typically 0.05)
- Power (1-beta, typically 0.80)
Relationships:
- Smaller effects require larger samples
- Higher power requires larger samples
- Higher variance requires larger samples
Practical constraint: Sample size is often the bottleneck. If detecting a 1% lift requires 2 million users but traffic is 100K/week, either run for 20 weeks or accept detection of only larger effects.
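One way to run the calculation (a sketch assuming statsmodels; the 10% baseline and 1-percentage-point MDE are made-up inputs):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # current conversion rate (illustrative)
mde = 0.01        # detect an absolute lift of 1 percentage point

effect_size = proportion_effectsize(baseline, baseline + mde)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"Required sample size per variant: {n_per_group:,.0f}")
```

Re-running with a smaller MDE shows how quickly the required sample size grows.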
Q7: Statistical vs Practical Significance
Question: A/B test shows p = 0.04 but effect size is small. What do you recommend?
Expected answer:
Statistical significance indicates the result is unlikely due to chance. It does not indicate practical importance.
Evaluation framework:
- Confidence interval: If the CI is [0.01%, 0.5%], the true effect is small regardless of statistical significance
- Business impact: 0.1% lift on billion-dollar metric = $1M; on small feature = $100
- Implementation cost: Engineering time, maintenance, complexity
- Compounding potential: Many small improvements may compound, but distinguish from noise
Recommendation: Report finding accurately. Shipping decision depends on implementation cost relative to expected value.
Q8: Multiple Testing Problem
Question: What is the multiple testing problem and how do you address it?
Expected answer:
Testing multiple hypotheses inflates the overall false positive rate. With 20 independent tests at alpha = 0.05 and all null hypotheses true, the probability of at least one false positive is approximately 64% (1 - 0.95^20).
Solutions:
| Method | Approach |
|---|---|
| Bonferroni correction | alpha_adjusted = alpha / number_of_tests (conservative) |
| False Discovery Rate (FDR) | Controls expected proportion of false discoveries (less conservative) |
| Pre-registration | Define primary hypothesis before analysis |
| Hierarchical testing | Test primary metric first |
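Both corrections are available in statsmodels (a sketch; the 20 p-values are hypothetical stand-ins for real metric results):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 20 metrics in one experiment.
p_values = [0.003, 0.004, 0.02, 0.04, 0.06, 0.11, 0.18, 0.22, 0.25, 0.31,
            0.38, 0.42, 0.47, 0.55, 0.63, 0.68, 0.74, 0.81, 0.88, 0.95]

print("Nominally significant at 0.05:", sum(p < 0.05 for p in p_values))

# Bonferroni is the most conservative; Benjamini-Hochberg controls the FDR.
for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: significant after correction = {reject.sum()}")
```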
Q9: Simpson's Paradox
Question: Explain Simpson's Paradox with an example.
Expected answer:
A trend in aggregated data reverses when split by a confounding variable.
Example: UC Berkeley admissions
- Overall: Men admitted at higher rate
- By department: Women admitted at higher rate in most departments
- Explanation: Women applied to more competitive departments
Implication: Always check for confounders. Aggregate data can be misleading.
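A small pandas example reproduces the reversal (illustrative numbers, not the actual Berkeley data): each department admits women at a higher rate, yet women apply mostly to the harder department, so the aggregate rate flips.

```python
import pandas as pd

df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["Men", "Women", "Men", "Women"],
    "applied":  [80, 20, 20, 80],
    "admitted": [60, 16, 4, 20],
})

overall = df.groupby("gender")[["applied", "admitted"]].sum()
print(overall["admitted"] / overall["applied"])   # Men 0.64 vs Women 0.36

per_dept = df.set_index(["dept", "gender"])
print(per_dept["admitted"] / per_dept["applied"]) # Women higher in both departments
```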
Q10: Novelty and Primacy Effects
Question: What are novelty and primacy effects in A/B testing?
Expected answer:
| Effect | Description | Duration |
|---|---|---|
| Novelty | Increased engagement with new features because they are new | Fades over 2-3 weeks |
| Primacy | Resistance to change; initial negative reaction | Typically 1-2 weeks |
Mitigation strategies:
- Run tests for at least 2 weeks
- Exclude returning users initially
- Monitor metrics over time, not just aggregate
- Use holdout groups for long-term measurement
Machine Learning
Q11: Bias-Variance Trade-off
Question: Explain the bias-variance trade-off.
Expected answer:
Model error decomposes into:
Total Error = Bias^2 + Variance + Irreducible Noise
| Component | Definition | Symptom |
|---|---|---|
| Bias | Model too simple, misses patterns | Underfitting |
| Variance | Model too sensitive to training data, fits noise | Overfitting |
Diagnosis:
| Observation | Problem | Solution |
|---|---|---|
| High training error, high test error | High bias | More features, more complex model, less regularization |
| Low training error, high test error | High variance | More data, simpler model, more regularization |
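A polynomial-fitting sketch (assuming numpy; the sine-plus-noise data and the degrees are arbitrary) shows the diagnostic pattern: the underfit model has high error everywhere, while the overfit model has low training error but much higher test error.

```python
import numpy as np

rng = np.random.default_rng(4)

# True relationship: a smooth sine curve plus noise.
x_train = np.sort(rng.uniform(0, 1, 30))
x_test = np.sort(rng.uniform(0, 1, 200))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

for degree in (1, 4, 15):   # too simple, about right, too flexible
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree:>2}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```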
Q12: Class Imbalance
Question: How do you handle a dataset with 95% class A and 5% class B?
Expected answer:
First: Replace accuracy with appropriate metrics (precision, recall, F1, PR-AUC). Accuracy is misleading for imbalanced data.
Techniques:
| Technique | Method |
|---|---|
| Oversampling (SMOTE) | Generate synthetic minority-class examples |
| Undersampling | Reduce majority class examples |
| Class weights | Penalize minority misclassification more heavily |
| Threshold tuning | Adjust the decision threshold (e.g., below 0.5) to favor the minority class |
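A sketch of two of these techniques with scikit-learn (synthetic data; the 0.30 threshold is an arbitrary illustration, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced dataset; class 1 is the minority.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: penalize minority-class mistakes more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold tuning: a 0.30 cutoff (instead of 0.5) further favors the minority class.
proba = model.predict_proba(X_te)[:, 1]
preds = (proba >= 0.30).astype(int)
print(classification_report(y_te, preds, digits=3))
```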
Q13: Precision vs Recall vs F1
Question: When would you use precision vs recall vs F1?
Expected answer:
| Metric | Formula | Optimize When |
|---|---|---|
| Precision | TP / (TP + FP) | False positives are costly (spam filter: do not lose real emails) |
| Recall | TP / (TP + FN) | False negatives are costly (cancer screening: do not miss cases) |
| F1 | 2 * (P * R) / (P + R) | Balance between precision and recall; imbalanced classes |
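The formulas in code, on a toy confusion matrix (a sketch assuming scikit-learn):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = positive class. Predictions give TP=2, FN=2, FP=1, TN=5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print("precision:", precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print("recall:   ", recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.500
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean ~ 0.571
```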
Q14: Regularization (L1 vs L2)
Question: Explain regularization. When would you use L1 vs L2?
Expected answer:
Regularization adds penalty to loss function to prevent overfitting.
| Type | Penalty | Effect | Use Case |
|---|---|---|---|
| L1 (Lasso) | Sum of absolute coefficients | Can shrink coefficients to exactly zero | Feature selection |
| L2 (Ridge) | Sum of squared coefficients | Shrinks coefficients toward zero | All features potentially relevant |
| Elastic Net | Combination | Both effects | Balance between L1 and L2 |
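A quick comparison with scikit-learn (synthetic data in which only 5 of 20 features matter; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 tends to drive uninformative coefficients to exactly zero
# (implicit feature selection); L2 shrinks them but keeps them nonzero.
print("Lasso coefficients equal to zero:", np.sum(lasso.coef_ == 0))
print("Ridge coefficients equal to zero:", np.sum(ridge.coef_ == 0))
```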
Q15: Cross-Validation
Question: What is cross-validation and when would you use k-fold vs leave-one-out?
Expected answer:
Cross-validation splits the data into k parts, trains on k-1 parts, tests on the remaining part, and rotates so each part serves as the test set once.
| Type | Description | Use Case |
|---|---|---|
| k-fold (k=5 or 10) | Standard, good balance of variance and computation | General purpose |
| Leave-one-out | k = n, high variance, expensive | Very small datasets only |
| Stratified k-fold | Maintains class distribution | Imbalanced data |
| Time-series split | Respects temporal order | Temporal data (prevents future data leakage) |
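The splitters share one interface in scikit-learn (a sketch on synthetic, non-temporal data, so TimeSeriesSplit appears only to show its API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     cross_val_score)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

splitters = [
    ("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
    ("TimeSeriesSplit", TimeSeriesSplit(n_splits=5)),  # only past data in each train fold
]
for name, cv in splitters:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name:<16} mean accuracy = {scores.mean():.3f}")
```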
Product Analytics
Q16: Feature Success Metrics
Question: How would you define success metrics for a new feature?
Expected answer:
Framework:
- Identify feature goal (engagement, revenue, retention)
- Define target user segment
- Identify expected behavior change
Metric categories:
| Type | Purpose | Example |
|---|---|---|
| Primary | Determines success | Day 7 retention |
| Secondary | Additional context | Time to first action, features used |
| Guardrail | Should not degrade | Support tickets, drop-off rate |
Q17: Metric Drop Investigation
Question: DAU dropped 5% this week. How do you investigate?
Expected answer:
Structured approach:
- Validate data: Is drop real or measurement issue?
- Segment: Which users/platforms/regions affected?
- Timeline: When exactly did it start? Any deployments?
- External factors: Holidays, competitor launches, seasonality?
- Correlated metrics: What else changed?
Common causes:
- Bug or deployment issue
- Seasonality
- One-time event (outage, news)
- Actual user behavior change
Q18: Good Metric Criteria
Question: What makes a good metric?
Expected answer:
| Criterion | Description |
|---|---|
| Measurable | Can compute from available data |
| Understandable | Team agrees on definition |
| Actionable | Can influence with product changes |
| Attributable | Can tie changes to specific actions |
| Timely | Available quickly enough to act on |
| Comparable | Meaningful across time/segments |
Poor metric examples:
- "User satisfaction" (difficult to measure)
- Lifetime revenue per user (too delayed to act on)
- Total users ever (only increases, not actionable)
Q19: Correlation vs Causation
Question: Explain the difference between correlation and causation.
Expected answer:
| Concept | Definition |
|---|---|
| Correlation | Two variables move together |
| Causation | One variable directly influences another |
Why correlation does not imply causation:
- Confounding: Third variable affects both (ice cream and drowning both caused by summer)
- Reverse causality: Direction is opposite (hospitals have high death rates because sick people go there)
- Coincidence: Spurious correlation (divorce rate in Maine correlates with margarine consumption)
Requirements for establishing causation:
- Correlation exists
- Temporal order (cause precedes effect)
- No confounders (or adjusted for them)
- Plausible mechanism
- Ideally, randomized experiment (gold standard)
Q20: Survivorship Bias
Question: What is survivorship bias?
Expected answer:
Survivorship bias occurs when analysis includes only successful cases because failures are not visible.
Example: WWII aircraft analysis examined planes returning from missions and found bullet holes in certain areas. Initial recommendation: add armor there. Correct insight: armor where holes are NOT present. Planes hit in those areas did not return.
Data science applications:
- Studying successful startups without examining failures
- Analyzing active users without considering churned users
- Examining completed purchases without abandoned carts
Mitigation: Actively seek "missing" data. Analyze churned users, failed experiments, abandoned processes.
SQL & Data Manipulation
Q21: Window Functions vs GROUP BY
Question: Explain window functions vs GROUP BY.
Expected answer:
| Feature | GROUP BY | Window Functions |
|---|---|---|
| Output rows | One per group | All original rows |
| Aggregation | Collapses rows | Computes across related rows |
GROUP BY example: Selecting department and average salary grouped by department returns one row per department.
Window function example: Selecting name, salary, and average salary partitioned by department returns every employee row with their department's average added as a new column.
Window function applications:
- Running totals
- Rankings within groups
- Comparing to group averages
- Lag/lead analysis
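A runnable comparison using an in-memory SQLite database from Python (a sketch; the table and column names are made up, and window functions require SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'Eng', 120), ('Bo', 'Eng', 100),
        ('Cy', 'Sales', 80), ('Di', 'Sales', 90);
""")

# GROUP BY: one row per department.
print(conn.execute(
    "SELECT department, AVG(salary) FROM employees GROUP BY department"
).fetchall())

# Window function: every employee row, with the department average alongside.
print(conn.execute("""
    SELECT name, salary, AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
""").fetchall())
```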
Q22: Second Highest Salary
Question: How do you find the second highest salary?
Expected answer:
Using DENSE_RANK (handles ties): Create a subquery that assigns DENSE_RANK ordered by salary descending, then filter for rank = 2 in the outer query.
Using LIMIT/OFFSET: Select distinct salaries ordered descending, skip the first one (OFFSET 1), and take one result (LIMIT 1).
Using MAX: Find the maximum salary that is less than the overall maximum salary.
DENSE_RANK is most robust for "Nth highest" questions with potential ties.
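The first two approaches, run against toy data containing a tie (a sketch using sqlite3 from Python; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ana', 120), ('Bo', 110), ('Cy', 110), ('Di', 90);
""")

# DENSE_RANK handles the tie at 110: both rows get rank 2.
print(conn.execute("""
    SELECT name, salary FROM (
        SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) WHERE rnk = 2
""").fetchall())

# LIMIT/OFFSET on distinct salaries returns the value only.
print(conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchall())
```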
Q23: CTEs
Question: Explain CTEs and when to use them.
Expected answer:
CTEs (Common Table Expressions) create named temporary result sets.
Example: Define a CTE called active_users that selects user IDs where status is 'active', then use it in the main query to filter orders to only those from active users.
Use cases:
- Query has multiple steps (improves readability)
- Same subquery used multiple times
- Recursive queries
- Breaking down complex logic
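A minimal version of the active_users example, run with sqlite3 from Python (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, status TEXT);
    CREATE TABLE orders (id INTEGER, user_id INTEGER, amount INTEGER);
    INSERT INTO users VALUES (1, 'active'), (2, 'inactive'), (3, 'active');
    INSERT INTO orders VALUES (10, 1, 50), (11, 2, 75), (12, 3, 20);
""")

# The CTE names the intermediate result, keeping the main query readable.
print(conn.execute("""
    WITH active_users AS (
        SELECT id FROM users WHERE status = 'active'
    )
    SELECT o.id, o.amount
    FROM orders o
    JOIN active_users a ON o.user_id = a.id
""").fetchall())
```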
Q24: WHERE vs HAVING
Question: What is the difference between WHERE and HAVING?
Expected answer:
| Clause | Timing | Purpose |
|---|---|---|
| WHERE | Before grouping | Filters individual rows |
| HAVING | After aggregation | Filters groups |
Example: A query that counts employees per department would use WHERE to filter for active employees (before grouping), then HAVING to keep only departments with more than 10 active employees (after grouping).
Execution order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
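The same example as runnable code (sqlite3 from Python; the HAVING threshold is lowered so the toy data returns a row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, status TEXT);
    INSERT INTO employees VALUES
        ('Ana', 'Eng', 'active'), ('Bo', 'Eng', 'active'),
        ('Cy', 'Sales', 'active'), ('Di', 'Sales', 'inactive');
""")

# WHERE filters rows before grouping; HAVING filters the resulting groups.
print(conn.execute("""
    SELECT department, COUNT(*) AS active_count
    FROM employees
    WHERE status = 'active'
    GROUP BY department
    HAVING COUNT(*) > 1
""").fetchall())
```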
Q25: NULL Handling
Question: How do you handle NULL values in SQL?
Expected answer:
NULL behavior:
- NULL = NULL returns NULL (not TRUE)
- Use IS NULL or IS NOT NULL for comparisons
- COUNT(*) counts all rows; COUNT(column) excludes NULLs
- SUM/AVG ignore NULLs
- NULL in comparisons produces NULL (unknown)
Important: A query filtering for status not equal to 'active' will NOT find rows where status is NULL. To include NULLs, explicitly add OR status IS NULL.
Default values: Use COALESCE to provide default values for NULLs (e.g., replace NULL middle_name with empty string).
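These pitfalls are easy to demonstrate with sqlite3 from Python (toy data; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, status TEXT, middle_name TEXT);
    INSERT INTO users VALUES
        ('Ana', 'active', NULL), ('Bo', 'inactive', 'J'), ('Cy', NULL, NULL);
""")

# The NULL-status row is silently excluded: NULL <> 'active' evaluates to NULL.
print(conn.execute("SELECT name FROM users WHERE status <> 'active'").fetchall())

# Explicitly include NULLs.
print(conn.execute(
    "SELECT name FROM users WHERE status <> 'active' OR status IS NULL"
).fetchall())

# COALESCE supplies a default in place of NULL (here, an empty string).
print(conn.execute("SELECT name, COALESCE(middle_name, '') FROM users").fetchall())
```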