Data Science Concept Questions
This section covers concept questions that appear frequently in data science interviews, along with expected answer frameworks.
Statistics & Probability
Q1: Central Limit Theorem
Question: What is the Central Limit Theorem and why does it matter?
Expected answer:
The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the underlying population distribution.
Significance:
- Many common statistical methods assume normality (confidence intervals, hypothesis tests, regression inference)
- The CLT allows these methods to work even when the underlying data is non-normal
- Rule of thumb: n >= 30 is often sufficient, though heavily skewed distributions require larger samples
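A quick simulation makes this concrete (a minimal sketch assuming numpy; the exponential population and the sample sizes are arbitrary illustrations): means of samples drawn from a heavily skewed distribution look increasingly normal as n grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily right-skewed population: exponential with mean 1 (skewness = 2).
for n in (2, 30, 200):
    # Draw 10,000 samples of size n and record each sample's mean.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness of the sampling distribution shrinks toward 0 (normal) as n grows.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n={n:>3}  mean of means={means.mean():.3f}  skewness={skew:.2f}")
```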
Q2: Type I and Type II Errors
Question: Explain the difference between Type I and Type II errors.
Expected answer:
| Error Type | Definition | Example (Medical Testing) |
|---|---|---|
| Type I (false positive) | Detecting effect that does not exist | Treating healthy patient |
| Type II (false negative) | Missing effect that exists | Missing disease |
Relationship:
- alpha controls Type I error rate (typically 0.05)
- Power = 1 - beta = probability of detecting true effect
- Trade-off: At a fixed sample size, decreasing alpha increases the Type II error rate (lower power)
Context dependence: Which error is worse depends on consequences. For serious disease with good treatment, Type II (missing diagnosis) is worse. For risky treatment, Type I (unnecessary treatment) may be worse.
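A small simulation estimates both quantities directly (a sketch assuming scipy; the 0.5-standard-deviation effect and n = 50 per group are arbitrary choices): when the null is true, rejections are Type I errors and occur at roughly the alpha rate; when a real effect exists, the rejection rate estimates power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 50, 5_000

# Case 1: no true difference -> every rejection is a Type I error.
# Case 2: true difference of 0.5 SD -> every non-rejection is a Type II error.
for true_diff, label in [(0.0, "Type I error rate"), (0.5, "Power (1 - Type II rate)")]:
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    print(f"{label}: {rejections / trials:.3f}")
```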
Q3: P-Value
Question: What is a p-value? What does p = 0.03 mean?
Expected answer:
Definition: The p-value is the probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true.
Interpretation of p = 0.03: If there were truly no effect, data at least this extreme would occur only 3% of the time.
Common misconceptions (what p-value is NOT):
- Probability that the null hypothesis is true
- Probability that the effect is real
- Probability of making an error
- Indicator of effect size (large samples can produce small p-values for negligible effects)
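The definition can be checked directly by simulation (a sketch assuming scipy; the observed sample is made up): generate many datasets in which the null hypothesis is true and count how often the test statistic is at least as extreme as the observed one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical observed sample; H0: true mean = 0.
sample = rng.normal(0.4, 1.0, 25)
t_obs, p_scipy = stats.ttest_1samp(sample, popmean=0.0)

# The p-value by its definition: simulate datasets where H0 holds and count
# how often the t statistic is at least as extreme as the observed one.
null_t = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, 25), popmean=0.0).statistic
    for _ in range(20_000)
])
p_sim = np.mean(np.abs(null_t) >= abs(t_obs))
print(f"scipy p-value: {p_scipy:.4f}   simulated p-value: {p_sim:.4f}")
```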
Q4: T-test vs Z-test
Question: When would you use a t-test vs a z-test?
Expected answer:
| Test | Use When |
|---|---|
| Z-test | Population standard deviation known (rare) |
| T-test | Population standard deviation estimated from sample (common) |
Practical guidance: Use t-tests. Population standard deviation is rarely known. With large samples (n > 30), t-tests and z-tests produce nearly identical results.
T-test variants:
| Type | Application |
|---|---|
| One-sample | Compare sample mean to known value |
| Two-sample | Compare means of two independent groups |
| Paired | Compare before/after on same subjects |
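Each variant maps to a scipy function (a sketch with made-up measurements; scipy is an assumed dependency).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(100, 15, 40)          # e.g. control measurements
group_b = rng.normal(108, 15, 40)          # e.g. treatment measurements
after   = group_a + rng.normal(5, 3, 40)   # same subjects, remeasured later

print(stats.ttest_1samp(group_a, popmean=100))  # one-sample: mean vs known value
print(stats.ttest_ind(group_a, group_b))        # two-sample: independent groups
print(stats.ttest_rel(group_a, after))          # paired: before/after, same subjects
```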
Q5: Bayes' Theorem
Question: What is Bayes' theorem? Provide a practical example.
Expected answer:
Formula:
P(A|B) = P(B|A) x P(A) / P(B)
Medical test example:
- Disease prevalence: 1%
- Test sensitivity (true positive rate): 95%
- Test specificity (true negative rate): 90%
- Question: Given positive test, what is probability of disease?
Calculation:
Using Bayes' theorem: P(Disease|Positive) = P(Positive|Disease) x P(Disease) / P(Positive)
First, calculate P(Positive) = (0.95 x 0.01) + (0.10 x 0.99) = 0.1085
Then, P(Disease|Positive) = (0.95 x 0.01) / 0.1085 = approximately 8.8%
Interpretation: Despite positive test, probability of disease is approximately 8.8%. The low base rate (1% prevalence) means most positive tests are false positives.
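The same arithmetic in a few lines of Python, using the numbers above:

```python
# Medical test example from above.
prevalence = 0.01      # P(Disease)
sensitivity = 0.95     # P(Positive | Disease)
specificity = 0.90     # P(Negative | No Disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(Positive) = {p_positive:.4f}")                          # 0.1085
print(f"P(Disease | Positive) = {p_disease_given_positive:.3f}")  # ~0.088
```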
A/B Testing & Experimentation
Q6: Sample Size Calculation
Question: How do you calculate sample size for an A/B test?
Expected answer:
Required inputs:
- Baseline rate (current metric value)
- Minimum Detectable Effect (MDE)
- Significance level (alpha, typically 0.05)
- Power (1-beta, typically 0.80)
Relationships:
- Smaller effects require larger samples
- Higher power requires larger samples
- Higher variance requires larger samples
Practical constraint: Sample size is often the bottleneck. If detecting a 1% lift requires 2 million users but traffic is 100K/week, either run for 20 weeks or accept detection of only larger effects.
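One way to run the calculation (a sketch assuming statsmodels; the 10% baseline and 1-percentage-point MDE are made-up inputs):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # current conversion rate (illustrative)
mde = 0.01        # detect an absolute lift of 1 percentage point

effect_size = proportion_effectsize(baseline, baseline + mde)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"Required sample size per variant: {n_per_group:,.0f}")
```

Re-running with a smaller MDE shows how quickly the required sample size grows.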
Q7: Statistical vs Practical Significance
Question: A/B test shows p = 0.04 but effect size is small. What do you recommend?
Expected answer:
Statistical significance indicates the result is unlikely due to chance. It does not indicate practical importance.
Evaluation framework:
- Confidence interval: If the CI is [0.01%, 0.5%], the true effect is small regardless of statistical significance
- Business impact: 0.1% lift on billion-dollar metric = $1M; on small feature = $100
- Implementation cost: Engineering time, maintenance, complexity
- Compounding potential: Many small improvements may compound, but distinguish from noise
Recommendation: Report finding accurately. Shipping decision depends on implementation cost relative to expected value.
Q8: Multiple Testing Problem
Question: What is the multiple testing problem and how do you address it?
Expected answer:
Testing multiple hypotheses inflates the overall false positive rate. With 20 independent tests at alpha = 0.05 and all null hypotheses true, the probability of at least one false positive is approximately 64% (1 - 0.95^20).
Solutions:
| Method | Approach |
|---|---|
| Bonferroni correction | alpha_adjusted = alpha / number_of_tests (conservative) |
| False Discovery Rate (FDR) | Controls expected proportion of false discoveries (less conservative) |
| Pre-registration | Define primary hypothesis before analysis |
| Hierarchical testing | Test primary metric first |
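Both corrections are available in statsmodels (a sketch; the 20 p-values are hypothetical stand-ins for real metric results):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 20 metrics in one experiment.
p_values = [0.003, 0.004, 0.02, 0.04, 0.06, 0.11, 0.18, 0.22, 0.25, 0.31,
            0.38, 0.42, 0.47, 0.55, 0.63, 0.68, 0.74, 0.81, 0.88, 0.95]

print("Nominally significant at 0.05:", sum(p < 0.05 for p in p_values))

# Bonferroni is the most conservative; Benjamini-Hochberg controls the FDR.
for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: significant after correction = {reject.sum()}")
```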
Q9: Simpson's Paradox
Question: Explain Simpson's Paradox with an example.
Expected answer:
A trend in aggregated data reverses when split by a confounding variable.
Example: UC Berkeley admissions
- Overall: Men admitted at higher rate
- By department: Women admitted at higher rate in most departments
- Explanation: Women applied to more competitive departments
Implication: Always check for confounders. Aggregate data can be misleading.
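A small pandas example reproduces the reversal (illustrative numbers, not the actual Berkeley data): each department admits women at a higher rate, yet women apply mostly to the harder department, so the aggregate rate flips.

```python
import pandas as pd

df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["Men", "Women", "Men", "Women"],
    "applied":  [80, 20, 20, 80],
    "admitted": [60, 16, 4, 20],
})

overall = df.groupby("gender")[["applied", "admitted"]].sum()
print(overall["admitted"] / overall["applied"])   # Men 0.64 vs Women 0.36

per_dept = df.set_index(["dept", "gender"])
print(per_dept["admitted"] / per_dept["applied"]) # Women higher in both departments
```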
Q10: Novelty and Primacy Effects
Question: What are novelty and primacy effects in A/B testing?
Expected answer:
| Effect | Description | Duration |
|---|---|---|
| Novelty | Increased engagement with new features because they are new | Fades over 2-3 weeks |
| Primacy | Resistance to change; initial negative reaction | Typically 1-2 weeks |
Mitigation strategies:
- Run tests for at least 2 weeks
- Exclude returning users initially
- Monitor metrics over time, not just aggregate
- Use holdout groups for long-term measurement
Machine Learning
Q11: Bias-Variance Trade-off
Question: Explain the bias-variance trade-off.
Expected answer:
Model error decomposes into:
Total Error = Bias^2 + Variance + Irreducible Noise
| Component | Definition | Symptom |
|---|---|---|
| Bias | Model too simple, misses patterns | Underfitting |
| Variance | Model too sensitive to training data, fits noise | Overfitting |
Diagnosis:
| Observation | Problem | Solution |
|---|---|---|
| High training error, high test error | High bias | More features, more complex model, less regularization |
| Low training error, high test error | High variance | More data, simpler model, more regularization |
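A polynomial-fitting sketch (assuming numpy; the sine-plus-noise data and the degrees are arbitrary) shows the diagnostic pattern: the underfit model has high error everywhere, while the overfit model has low training error but much higher test error.

```python
import numpy as np

rng = np.random.default_rng(4)

# True relationship: a smooth sine curve plus noise.
x_train = np.sort(rng.uniform(0, 1, 30))
x_test = np.sort(rng.uniform(0, 1, 200))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

for degree in (1, 4, 15):   # too simple, about right, too flexible
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree:>2}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```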
Q12: Class Imbalance
Question: How do you handle a dataset with 95% class A and 5% class B?
Expected answer:
First: Replace accuracy with appropriate metrics (precision, recall, F1, PR-AUC). Accuracy is misleading for imbalanced data.
Techniques:
| Technique | Method |
|---|---|
| Oversampling (SMOTE) | Generate synthetic minority-class examples |
| Undersampling | Reduce majority class examples |
| Class weights | Penalize minority misclassification more heavily |
| Threshold tuning | Adjust the decision threshold (e.g., below 0.5) to favor the minority class |
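A sketch of two of these techniques with scikit-learn (synthetic data; the 0.30 threshold is an arbitrary illustration, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced dataset; class 1 is the minority.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: penalize minority-class mistakes more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold tuning: a 0.30 cutoff (instead of 0.5) further favors the minority class.
proba = model.predict_proba(X_te)[:, 1]
preds = (proba >= 0.30).astype(int)
print(classification_report(y_te, preds, digits=3))
```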
Q13: Precision vs Recall vs F1
Question: When would you use precision vs recall vs F1?
Expected answer:
| Metric | Formula | Optimize When |
|---|---|---|
| Precision | TP / (TP + FP) | False positives are costly (spam filter: do not lose real emails) |
| Recall | TP / (TP + FN) | False negatives are costly (cancer screening: do not miss cases) |
| F1 | 2 * (P * R) / (P + R) | Balance between precision and recall; imbalanced classes |
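The formulas in code, on a toy confusion matrix (a sketch assuming scikit-learn):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = positive class. Predictions give TP=2, FN=2, FP=1, TN=5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print("precision:", precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print("recall:   ", recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.500
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean ~ 0.571
```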
Q14: Regularization (L1 vs L2)
Question: Explain regularization. When would you use L1 vs L2?
Expected answer:
Regularization adds penalty to loss function to prevent overfitting.
| Type | Penalty | Effect | Use Case |
|---|---|---|---|
| L1 (Lasso) | Sum of absolute coefficients | Can shrink coefficients to exactly zero | Feature selection |
| L2 (Ridge) | Sum of squared coefficients | Shrinks coefficients toward zero | All features potentially relevant |
| Elastic Net | Combination | Both effects | Balance between L1 and L2 |
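A quick comparison with scikit-learn (synthetic data in which only 5 of 20 features matter; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 tends to drive uninformative coefficients to exactly zero
# (implicit feature selection); L2 shrinks them but keeps them nonzero.
print("Lasso coefficients equal to zero:", np.sum(lasso.coef_ == 0))
print("Ridge coefficients equal to zero:", np.sum(ridge.coef_ == 0))
```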
Q15: Cross-Validation
Question: What is cross-validation and when would you use k-fold vs leave-one-out?
Expected answer:
Cross-validation splits the data into k parts, trains on k-1 parts, tests on the remaining part, and rotates so each part serves as the test set once.
| Type | Description | Use Case |
|---|---|---|
| k-fold (k=5 or 10) | Standard, good balance of variance and computation | General purpose |
| Leave-one-out | k = n, high variance, expensive | Very small datasets only |
| Stratified k-fold | Maintains class distribution | Imbalanced data |
| Time-series split | Respects temporal order | Temporal data (prevents future data leakage) |
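The splitters share one interface in scikit-learn (a sketch on synthetic, non-temporal data, so TimeSeriesSplit appears only to show its API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     cross_val_score)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

splitters = [
    ("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
    ("TimeSeriesSplit", TimeSeriesSplit(n_splits=5)),  # only past data in each train fold
]
for name, cv in splitters:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name:<16} mean accuracy = {scores.mean():.3f}")
```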
Product Analytics
Q16: Feature Success Metrics
Question: How would you define success metrics for a new feature?
Expected answer:
Framework:
- Identify feature goal (engagement, revenue, retention)
- Define target user segment
- Identify expected behavior change
Metric categories:
| Type | Purpose | Example |
|---|---|---|
| Primary | Determines success | Day 7 retention |
| Secondary | Additional context | Time to first action, features used |
| Guardrail | Should not degrade | Support tickets, drop-off rate |
Q17: Metric Drop Investigation
Question: DAU dropped 5% this week. How do you investigate?
Expected answer:
Structured approach:
- Validate data: Is drop real or measurement issue?
- Segment: Which users/platforms/regions affected?
- Timeline: When exactly did it start? Any deployments?
- External factors: Holidays, competitor launches, seasonality?
- Correlated metrics: What else changed?
Common causes:
- Bug or deployment issue
- Seasonality
- One-time event (outage, news)
- Actual user behavior change
Q18: Good Metric Criteria
Question: What makes a good metric?
Expected answer:
| Criterion | Description |
|---|---|
| Measurable | Can compute from available data |
| Understandable | Team agrees on definition |
| Actionable | Can influence with product changes |
| Attributable | Can tie changes to specific actions |
| Timely | Available quickly enough to act on |
| Comparable | Meaningful across time/segments |
Poor metric examples:
- "User satisfaction" (difficult to measure)
- Lifetime revenue per user (too delayed to act on)
- Total users ever (only increases, not actionable)
Q19: Correlation vs Causation
Question: Explain the difference between correlation and causation.
Expected answer:
| Concept | Definition |
|---|---|
| Correlation | Two variables move together |
| Causation | One variable directly influences another |
Why correlation does not imply causation:
- Confounding: Third variable affects both (ice cream and drowning both caused by summer)
- Reverse causality: Direction is opposite (hospitals have high death rates because sick people go there)
- Coincidence: Spurious correlation (divorce rate in Maine correlates with margarine consumption)
Requirements for establishing causation:
- Correlation exists
- Temporal order (cause precedes effect)
- No confounders (or adjusted for them)
- Plausible mechanism
- Ideally, randomized experiment (gold standard)
Q20: Survivorship Bias
Question: What is survivorship bias?
Expected answer:
Survivorship bias occurs when analysis includes only successful cases because failures are not visible.
Example: WWII aircraft analysis examined planes returning from missions and found bullet holes in certain areas. Initial recommendation: add armor there. Correct insight: armor where holes are NOT present. Planes hit in those areas did not return.
Data science applications:
- Studying successful startups without examining failures
- Analyzing active users without considering churned users
- Examining completed purchases without abandoned carts
Mitigation: Actively seek "missing" data. Analyze churned users, failed experiments, abandoned processes.
SQL & Data Manipulation
Q21: Window Functions vs GROUP BY
Question: Explain window functions vs GROUP BY.
Expected answer:
| Feature | GROUP BY | Window Functions |
|---|---|---|
| Output rows | One per group | All original rows |
| Aggregation | Collapses rows | Computes across related rows |
GROUP BY example: Selecting department and average salary grouped by department returns one row per department.
Window function example: Selecting name, salary, and average salary partitioned by department returns every employee row with their department's average added as a new column.
Window function applications:
- Running totals
- Rankings within groups
- Comparing to group averages
- Lag/lead analysis
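A runnable comparison using an in-memory SQLite database from Python (a sketch; the table and column names are made up, and window functions require SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'Eng', 120), ('Bo', 'Eng', 100),
        ('Cy', 'Sales', 80), ('Di', 'Sales', 90);
""")

# GROUP BY: one row per department.
print(conn.execute(
    "SELECT department, AVG(salary) FROM employees GROUP BY department"
).fetchall())

# Window function: every employee row, with the department average alongside.
print(conn.execute("""
    SELECT name, salary, AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
""").fetchall())
```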
Q22: Second Highest Salary
Question: How do you find the second highest salary?
Expected answer:
Using DENSE_RANK (handles ties): Create a subquery that assigns DENSE_RANK ordered by salary descending, then filter for rank = 2 in the outer query.
Using LIMIT/OFFSET: Select distinct salaries ordered descending, skip the first one (OFFSET 1), and take one result (LIMIT 1).
Using MAX: Find the maximum salary that is less than the overall maximum salary.
DENSE_RANK is most robust for "Nth highest" questions with potential ties.
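The first two approaches, run against toy data containing a tie (a sketch using sqlite3 from Python; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ana', 120), ('Bo', 110), ('Cy', 110), ('Di', 90);
""")

# DENSE_RANK handles the tie at 110: both rows get rank 2.
print(conn.execute("""
    SELECT name, salary FROM (
        SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) WHERE rnk = 2
""").fetchall())

# LIMIT/OFFSET on distinct salaries returns the value only.
print(conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchall())
```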
Q23: CTEs
Question: Explain CTEs and when to use them.
Expected answer:
CTEs (Common Table Expressions) create named temporary result sets.
Example: Define a CTE called active_users that selects user IDs where status is 'active', then use it in the main query to filter orders to only those from active users.
Use cases:
- Query has multiple steps (improves readability)
- Same subquery used multiple times
- Recursive queries
- Breaking down complex logic
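A minimal version of the active_users example, run with sqlite3 from Python (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, status TEXT);
    CREATE TABLE orders (id INTEGER, user_id INTEGER, amount INTEGER);
    INSERT INTO users VALUES (1, 'active'), (2, 'inactive'), (3, 'active');
    INSERT INTO orders VALUES (10, 1, 50), (11, 2, 75), (12, 3, 20);
""")

# The CTE names the intermediate result, keeping the main query readable.
print(conn.execute("""
    WITH active_users AS (
        SELECT id FROM users WHERE status = 'active'
    )
    SELECT o.id, o.amount
    FROM orders o
    JOIN active_users a ON o.user_id = a.id
""").fetchall())
```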
Q24: WHERE vs HAVING
Question: What is the difference between WHERE and HAVING?
Expected answer:
| Clause | Timing | Purpose |
|---|---|---|
| WHERE | Before grouping | Filters individual rows |
| HAVING | After aggregation | Filters groups |
Example: A query that counts employees per department would use WHERE to filter for active employees (before grouping), then HAVING to keep only departments with more than 10 active employees (after grouping).
Execution order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
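The same example as runnable code (sqlite3 from Python; the HAVING threshold is lowered so the toy data returns a row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, status TEXT);
    INSERT INTO employees VALUES
        ('Ana', 'Eng', 'active'), ('Bo', 'Eng', 'active'),
        ('Cy', 'Sales', 'active'), ('Di', 'Sales', 'inactive');
""")

# WHERE filters rows before grouping; HAVING filters the resulting groups.
print(conn.execute("""
    SELECT department, COUNT(*) AS active_count
    FROM employees
    WHERE status = 'active'
    GROUP BY department
    HAVING COUNT(*) > 1
""").fetchall())
```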
Q25: NULL Handling
Question: How do you handle NULL values in SQL?
Expected answer:
NULL behavior:
- NULL = NULL returns NULL (not TRUE)
- Use IS NULL or IS NOT NULL for comparisons
- COUNT(*) counts all rows; COUNT(column) excludes NULLs
- SUM/AVG ignore NULLs
- NULL in comparisons produces NULL (unknown)
Important: A query filtering for status not equal to 'active' will NOT find rows where status is NULL. To include NULLs, explicitly add OR status IS NULL.
Default values: Use COALESCE to provide default values for NULLs (e.g., replace NULL middle_name with empty string).
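These pitfalls are easy to demonstrate with sqlite3 from Python (toy data; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, status TEXT, middle_name TEXT);
    INSERT INTO users VALUES
        ('Ana', 'active', NULL), ('Bo', 'inactive', 'J'), ('Cy', NULL, NULL);
""")

# The NULL-status row is silently excluded: NULL <> 'active' evaluates to NULL.
print(conn.execute("SELECT name FROM users WHERE status <> 'active'").fetchall())

# Explicitly include NULLs.
print(conn.execute(
    "SELECT name FROM users WHERE status <> 'active' OR status IS NULL"
).fetchall())

# COALESCE supplies a default in place of NULL (here, an empty string).
print(conn.execute("SELECT name, COALESCE(middle_name, '') FROM users").fetchall())
```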