Data Science Concept Questions

This section covers concept questions that appear frequently in data science interviews, along with expected answer frameworks.

Statistics & Probability

Q1: Central Limit Theorem

Question: What is the Central Limit Theorem and why does it matter?

Expected answer:

The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the underlying population distribution.

Significance:

  • Many common statistical methods assume normality (confidence intervals, hypothesis tests, regression)
  • The CLT allows these methods to work even when underlying data is non-normal
  • Rule of thumb: n >= 30, though heavily skewed distributions require larger samples (see the simulation sketch below)
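
A minimal simulation sketch (assuming NumPy is available) of the rule above, using a heavily skewed exponential population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential(mean=1), far from normal
n = 30                                   # sample size
samples = rng.exponential(1.0, size=(5_000, n))
sample_means = samples.mean(axis=1)      # 5,000 sample means

# The sample means are approximately normal, centered at the
# population mean, with std close to sigma / sqrt(n)
print(f"mean of sample means: {sample_means.mean():.3f}")  # ~1.0
print(f"std of sample means:  {sample_means.std():.3f}")   # ~1/sqrt(30) ~= 0.183
```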

Q2: Type I and Type II Errors

Question: Explain the difference between Type I and Type II errors.

Expected answer:

Error Type               | Definition                               | Example (Medical Testing)
-------------------------|------------------------------------------|---------------------------
Type I (false positive)  | Detecting an effect that does not exist  | Treating a healthy patient
Type II (false negative) | Missing an effect that exists            | Missing a disease

Relationship:

  • alpha controls Type I error rate (typically 0.05)
  • Power = 1 - beta = probability of detecting true effect
  • Trade-off: At a fixed sample size, decreasing alpha increases the Type II error rate

Context dependence: Which error is worse depends on consequences. For serious disease with good treatment, Type II (missing diagnosis) is worse. For risky treatment, Type I (unnecessary treatment) may be worse.
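
A small sketch of the alpha/power trade-off (assuming SciPy; the effect size, sigma, and sample size are invented for illustration), computing two-sided z-test power at several alpha levels:

```python
from scipy.stats import norm

def power_z_test(effect, sigma, n, alpha):
    """Approximate power of a two-sided one-sample z-test."""
    z_crit = norm.ppf(1 - alpha / 2)       # critical value
    shift = effect / (sigma / n ** 0.5)    # standardized true effect
    # Probability the test statistic lands beyond either boundary
    return norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)

for alpha in (0.10, 0.05, 0.01):
    p = power_z_test(effect=0.5, sigma=1.0, n=30, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={p:.3f}  beta={1 - p:.3f}")
# Decreasing alpha (stricter Type I control) lowers power,
# i.e. raises the Type II error rate, at fixed n.
```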

Q3: P-Value

Question: What is a p-value? What does p = 0.03 mean?

Expected answer:

Definition: The p-value is the probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true.

Interpretation of p = 0.03: If there were truly no effect, data at least this extreme would occur only 3% of the time.

Common misconceptions (what p-value is NOT):

  • Probability that the null hypothesis is true
  • Probability that the effect is real
  • Probability of making an error
  • Indicator of effect size (large samples can produce small p-values for negligible effects)
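
A simulation sketch (assuming NumPy and SciPy) that makes the definition concrete: under a true null, data at least as extreme as an observed statistic arises about p of the time. The observed t = 2.2 and n = 25 are invented values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, t_observed = 25, 2.2

# Simulate many datasets under the null (true mean = 0) and count
# how often the t-statistic is at least as extreme as observed
t_null = np.array([
    stats.ttest_1samp(rng.normal(0, 1, n), 0).statistic
    for _ in range(20_000)
])
sim_p = np.mean(np.abs(t_null) >= t_observed)

# Matches the analytic two-sided p-value from the t distribution
analytic_p = 2 * stats.t.sf(t_observed, df=n - 1)
print(f"simulated p ~= {sim_p:.3f}, analytic p = {analytic_p:.3f}")
```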

Q4: T-test vs Z-test

Question: When would you use a t-test vs a z-test?

Expected answer:

Test   | Use When
-------|--------------------------------------------------------------
Z-test | Population standard deviation known (rare)
T-test | Population standard deviation estimated from sample (common)

Practical guidance: Use t-tests. Population standard deviation is rarely known. With large samples (n > 30), t-tests and z-tests produce nearly identical results.

T-test variants:

Type       | Application
-----------|------------------------------------------
One-sample | Compare sample mean to a known value
Two-sample | Compare means of two independent groups
Paired     | Compare before/after on the same subjects
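
A short sketch of the three variants using SciPy (synthetic data; the means and sample sizes are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(100, 15, size=40)
after = before + rng.normal(3, 5, size=40)   # small paired improvement
group_b = rng.normal(105, 15, size=40)

# One-sample: is the mean of `before` different from 100?
print(stats.ttest_1samp(before, popmean=100))

# Two-sample: do `before` and `group_b` have different means?
print(stats.ttest_ind(before, group_b, equal_var=False))  # Welch's t-test

# Paired: did the same subjects change between before and after?
print(stats.ttest_rel(before, after))
```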

Q5: Bayes' Theorem

Question: What is Bayes' theorem? Provide a practical example.

Expected answer:

Formula:

P(A|B) = P(B|A) x P(A) / P(B)

Medical test example:

  • Disease prevalence: 1%
  • Test sensitivity (true positive rate): 95%
  • Test specificity (true negative rate): 90%
  • Question: Given positive test, what is probability of disease?

Calculation:

Using Bayes' theorem: P(Disease|Positive) = P(Positive|Disease) x P(Disease) / P(Positive)

First, calculate P(Positive) = (0.95 x 0.01) + (0.10 x 0.99) = 0.0095 + 0.099 = 0.1085

Then, P(Disease|Positive) = (0.95 x 0.01) / 0.1085 = approximately 8.8%

Interpretation: Despite positive test, probability of disease is approximately 8.8%. The low base rate (1% prevalence) means most positive tests are false positives.
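
The same calculation as a short plain-Python script:

```python
# Medical test example from above
prevalence = 0.01      # P(Disease)
sensitivity = 0.95     # P(Positive | Disease)
specificity = 0.90     # P(Negative | No Disease)

false_positive_rate = 1 - specificity   # 0.10

# Total probability of a positive test (law of total probability)
p_positive = (sensitivity * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes' theorem
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(Positive) = {p_positive:.4f}")                          # 0.1085
print(f"P(Disease | Positive) = {p_disease_given_positive:.3f}")  # ~0.088
```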

A/B Testing & Experimentation

Q6: Sample Size Calculation

Question: How do you calculate sample size for an A/B test?

Expected answer:

Required inputs:

  1. Baseline rate (current metric value)
  2. Minimum Detectable Effect (MDE)
  3. Significance level (alpha, typically 0.05)
  4. Power (1-beta, typically 0.80)

Relationships:

  • Smaller effects require larger samples
  • Higher power requires larger samples
  • Higher variance requires larger samples

Practical constraint: Sample size is often the bottleneck. If detecting a 1% lift requires 2 million users but traffic is 100K/week, either run for 20 weeks or accept detection of only larger effects.
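
A sketch of the calculation for a two-proportion test, assuming statsmodels is available; the baseline rate and MDE are illustrative values:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # current conversion rate
mde = 0.01        # want to detect a 1 percentage-point absolute lift

# Convert the two proportions into a standardized effect size
effect = proportion_effectsize(baseline + mde, baseline)

# Solve for the required sample size per group
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(f"~{n_per_group:,.0f} users per group")
```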

Q7: Statistical vs Practical Significance

Question: A/B test shows p = 0.04 but effect size is small. What do you recommend?

Expected answer:

Statistical significance indicates the result is unlikely due to chance. It does not indicate practical importance.

Evaluation framework:

  1. Confidence interval: If CI is [0.01%, 0.5%], true effect is small regardless
  2. Business impact: 0.1% lift on billion-dollar metric = $1M; on small feature = $100
  3. Implementation cost: Engineering time, maintenance, complexity
  4. Compounding potential: Many small improvements may compound, but distinguish from noise

Recommendation: Report finding accurately. Shipping decision depends on implementation cost relative to expected value.

Q8: Multiple Testing Problem

Question: What is the multiple testing problem and how do you address it?

Expected answer:

Testing multiple hypotheses inflates the false positive rate. With 20 independent tests at alpha = 0.05, there is an approximately 64% probability (1 - 0.95^20) of at least one false positive.

Solutions:

Method                     | Approach
---------------------------|-----------------------------------------------------------------------
Bonferroni correction      | alpha_adjusted = alpha / number_of_tests (conservative)
False Discovery Rate (FDR) | Controls expected proportion of false discoveries (less conservative)
Pre-registration           | Define primary hypothesis before analysis
Hierarchical testing       | Test primary metric first
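
A quick sketch verifying the 64% figure and applying the first two corrections (assuming statsmodels; the p-values are hypothetical):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# With 20 independent tests at alpha = 0.05:
print(f"P(at least one false positive) = {1 - 0.95 ** 20:.2f}")  # ~0.64

# Apply corrections to a set of hypothetical p-values
p_values = np.array([0.001, 0.008, 0.020, 0.040, 0.049, 0.350])
for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: reject = {reject}")
# Bonferroni rejects fewer hypotheses than FDR (more conservative).
```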

Q9: Simpson's Paradox

Question: Explain Simpson's Paradox with an example.

Expected answer:

A trend in aggregated data reverses when split by a confounding variable.

Example: UC Berkeley admissions

  • Overall: Men admitted at higher rate
  • By department: Women admitted at higher rate in most departments
  • Explanation: Women applied to more competitive departments

Implication: Always check for confounders. Aggregate data can be misleading.
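
A minimal pandas sketch with invented admission counts that reproduce the reversal:

```python
import pandas as pd

# Hypothetical admissions data exhibiting Simpson's Paradox
df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["M", "F", "M", "F"],
    "admitted": [80, 9, 2, 30],
    "applied":  [100, 10, 10, 100],
})

# Per department, women have the higher admit rate...
by_dept = df.assign(rate=df.admitted / df.applied)
print(by_dept[["dept", "gender", "rate"]])

# ...but in aggregate men do, because women applied mostly to "Hard"
overall = df.groupby("gender")[["admitted", "applied"]].sum()
print(overall.admitted / overall.applied)   # M ~0.745, F ~0.355
```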

Q10: Novelty and Primacy Effects

Question: What are novelty and primacy effects in A/B testing?

Expected answer:

Effect  | Description                                                  | Duration
--------|--------------------------------------------------------------|----------------------
Novelty | Increased engagement with new features because they are new  | Fades over 2-3 weeks
Primacy | Resistance to change; initial negative reaction              | Typically 1-2 weeks

Mitigation strategies:

  • Run tests for at least 2 weeks
  • Exclude returning users initially
  • Monitor metrics over time, not just aggregate
  • Use holdout groups for long-term measurement

Machine Learning

Q11: Bias-Variance Trade-off

Question: Explain the bias-variance trade-off.

Expected answer:

Model error decomposes into:

Total Error = Bias^2 + Variance + Irreducible Noise

Component | Definition                                        | Symptom
----------|---------------------------------------------------|--------------
Bias      | Model too simple, misses patterns                 | Underfitting
Variance  | Model too sensitive to training data, fits noise  | Overfitting

Diagnosis:

Observation                          | Problem       | Solution
-------------------------------------|---------------|--------------------------------------------------------
High training error, high test error | High bias     | More features, more complex model, less regularization
Low training error, high test error  | High variance | More data, simpler model, more regularization
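
A compact scikit-learn sketch (synthetic sine data; the noise level and polynomial degrees are invented) showing both failure modes:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
# degree 1: both errors high (bias); degree 15: train error low,
# test error typically higher (variance)
```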

Q12: Class Imbalance

Question: How do you handle a dataset with 95% class A and 5% class B?

Expected answer:

First: Replace accuracy with appropriate metrics (precision, recall, F1, PR-AUC). Accuracy is misleading for imbalanced data.

Techniques:

Technique          | Method
-------------------|--------------------------------------------------
Resampling (SMOTE) | Generate synthetic minority examples
Undersampling      | Reduce majority class examples
Class weights      | Penalize minority misclassification more heavily
Threshold tuning   | Adjust classification threshold below 0.5
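
A scikit-learn sketch (synthetic 95/5 data) combining class weights with threshold tuning; the 0.3 threshold is an illustrative choice, and a real analysis would tune it on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic 95/5 imbalanced dataset
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

# Class weights: penalize minority misclassification more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Threshold tuning: classify as positive below the default 0.5 cutoff
proba = clf.predict_proba(X)[:, 1]
y_pred = (proba >= 0.3).astype(int)
print(classification_report(y, y_pred, digits=3))
```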

Q13: Precision vs Recall vs F1

Question: When would you use precision vs recall vs F1?

Expected answer:

Metric    | Formula               | Optimize When
----------|-----------------------|--------------------------------------------------------------------
Precision | TP / (TP + FP)        | False positives are costly (spam filter: do not lose real emails)
Recall    | TP / (TP + FN)        | False negatives are costly (cancer screening: do not miss cases)
F1        | 2 * (P * R) / (P + R) | Balance between precision and recall; imbalanced classes
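
A tiny sketch computing the three metrics with scikit-learn on a hand-made confusion case (TP=2, FP=1, FN=2):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # TP=2, FN=2, FP=1

print(f"precision = {precision_score(y_true, y_pred):.2f}")  # 2/3 ~= 0.67
print(f"recall    = {recall_score(y_true, y_pred):.2f}")     # 2/4 = 0.50
print(f"f1        = {f1_score(y_true, y_pred):.2f}")         # ~0.57
```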

Q14: Regularization (L1 vs L2)

Question: Explain regularization. When would you use L1 vs L2?

Expected answer:

Regularization adds penalty to loss function to prevent overfitting.

Type        | Penalty                      | Effect                                   | Use Case
------------|------------------------------|------------------------------------------|-----------------------------------
L1 (Lasso)  | Sum of absolute coefficients | Can shrink coefficients to exactly zero  | Feature selection
L2 (Ridge)  | Sum of squared coefficients  | Shrinks coefficients toward zero         | All features potentially relevant
Elastic Net | Combination of L1 and L2     | Both effects                             | Balance between L1 and L2
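
A scikit-learn sketch (synthetic regression with 5 informative features out of 20; alpha=1.0 is an illustrative setting) showing L1's exact zeros versus L2's shrinkage:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives irrelevant coefficients to exactly zero; L2 only shrinks them
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # most of the 15
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
```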

Q15: Cross-Validation

Question: What is cross-validation and when would you use k-fold vs leave-one-out?

Expected answer:

Cross-validation splits data into k parts, trains on k-1 of them, tests on the remaining part, and rotates.

Type               | Description                                        | Use Case
-------------------|----------------------------------------------------|----------------------------------------------
k-fold (k=5 or 10) | Standard; good balance of variance and computation | General purpose
Leave-one-out      | k = n; high variance, expensive                    | Very small datasets only
Stratified k-fold  | Maintains class distribution                      | Imbalanced data
Time-series split  | Respects temporal order                           | Temporal data (prevents future data leakage)
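
A scikit-learn sketch of the splitters (synthetic data; the time-series split is shown mechanically even though this toy data has no real temporal order):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

splitters = [
    ("k-fold",      KFold(n_splits=5, shuffle=True, random_state=0)),
    ("stratified",  StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
    ("time-series", TimeSeriesSplit(n_splits=5)),
]
for name, cv in splitters:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name:12s} mean accuracy = {scores.mean():.3f}")
```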

Product Analytics

Q16: Feature Success Metrics

Question: How would you define success metrics for a new feature?

Expected answer:

Framework:

  1. Identify feature goal (engagement, revenue, retention)
  2. Define target user segment
  3. Identify expected behavior change

Metric categories:

Type      | Purpose            | Example
----------|--------------------|--------------------------------------
Primary   | Determines success | Day 7 retention
Secondary | Additional context | Time to first action, features used
Guardrail | Should not degrade | Support tickets, drop-off rate

Q17: Metric Drop Investigation

Question: DAU dropped 5% this week. How do you investigate?

Expected answer:

Structured approach:

  1. Validate data: Is drop real or measurement issue?
  2. Segment: Which users/platforms/regions affected?
  3. Timeline: When exactly did it start? Any deployments?
  4. External factors: Holidays, competitor launches, seasonality?
  5. Correlated metrics: What else changed?

Common causes:

  • Bug or deployment issue
  • Seasonality
  • One-time event (outage, news)
  • Actual user behavior change

Q18: Good Metric Criteria

Question: What makes a good metric?

Expected answer:

Criterion      | Description
---------------|------------------------------------------
Measurable     | Can compute from available data
Understandable | Team agrees on definition
Actionable     | Can influence with product changes
Attributable   | Can tie changes to specific actions
Timely         | Available quickly enough to act on
Comparable     | Meaningful across time/segments

Poor metric examples:

  • "User satisfaction" (difficult to measure)
  • Lifetime revenue per user (too delayed)
  • Total users ever (only increases, not actionable)

Q19: Correlation vs Causation

Question: Explain the difference between correlation and causation.

Expected answer:

Concept     | Definition
------------|------------------------------------------
Correlation | Two variables move together
Causation   | One variable directly influences another

Why correlation does not imply causation:

  • Confounding: Third variable affects both (ice cream and drowning both caused by summer)
  • Reverse causality: Direction is opposite (hospitals have high death rates because sick people go there)
  • Coincidence: Spurious correlation (divorce rate in Maine correlates with margarine consumption)

Requirements for establishing causation:

  1. Correlation exists
  2. Temporal order (cause precedes effect)
  3. No confounders (or adjusted for them)
  4. Plausible mechanism
  5. Ideally, randomized experiment (gold standard)

Q20: Survivorship Bias

Question: What is survivorship bias?

Expected answer:

Survivorship bias occurs when analysis includes only successful cases because failures are not visible.

Example: WWII aircraft analysis examined planes returning from missions and found bullet holes in certain areas. Initial recommendation: add armor there. Correct insight: armor where holes are NOT present. Planes hit in those areas did not return.

Data science applications:

  • Studying successful startups without examining failures
  • Analyzing active users without considering churned users
  • Examining completed purchases without abandoned carts

Mitigation: Actively seek "missing" data. Analyze churned users, failed experiments, abandoned processes.

SQL & Data Manipulation

Q21: Window Functions vs GROUP BY

Question: Explain window functions vs GROUP BY.

Expected answer:

Feature     | GROUP BY       | Window Functions
------------|----------------|------------------------------
Output rows | One per group  | All original rows
Aggregation | Collapses rows | Computes across related rows

GROUP BY example: Selecting department and average salary grouped by department returns one row per department.

Window function example: Selecting name, salary, and average salary partitioned by department returns every employee row with their department's average added as a new column.
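
Both examples as a runnable sketch using Python's built-in sqlite3 module (window functions require SQLite 3.25+, which recent Python builds bundle); the table and values are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INT);
    INSERT INTO employees VALUES
        ('Ann', 'Eng', 120), ('Bob', 'Eng', 100),
        ('Cal', 'Sales', 80), ('Dee', 'Sales', 90);
""")

# GROUP BY: one row per department
print(con.execute("""
    SELECT department, AVG(salary) FROM employees GROUP BY department
""").fetchall())

# Window function: every employee row, with the department average attached
print(con.execute("""
    SELECT name, salary,
           AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
""").fetchall())
```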

Window function applications:

  • Running totals
  • Rankings within groups
  • Comparing to group averages
  • Lag/lead analysis

Q22: Second Highest Salary

Question: How do you find the second highest salary?

Expected answer:

Using DENSE_RANK (handles ties): Create a subquery that assigns DENSE_RANK ordered by salary descending, then filter for rank = 2 in the outer query.

Using LIMIT/OFFSET: Select distinct salaries ordered descending, skip the first one (OFFSET 1), and take one result (LIMIT 1).

Using MAX: Find the maximum salary that is less than the overall maximum salary.

DENSE_RANK is most robust for "Nth highest" questions with potential ties.
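
All three approaches against an invented table with a tie at the top, using Python's sqlite3 for a runnable demo:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (name TEXT, salary INT);
    INSERT INTO employees VALUES ('A', 100), ('B', 200), ('C', 200), ('D', 300);
""")

# DENSE_RANK handles the tie at 200 correctly
print(con.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) WHERE rnk = 2
""").fetchall())   # [(200,), (200,)]

# LIMIT/OFFSET on distinct salaries
print(con.execute("""
    SELECT DISTINCT salary FROM employees
    ORDER BY salary DESC LIMIT 1 OFFSET 1
""").fetchall())   # [(200,)]

# MAX below the overall MAX
print(con.execute("""
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchall())   # [(200,)]
```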

Q23: CTEs

Question: Explain CTEs and when to use them.

Expected answer:

CTEs (Common Table Expressions) create named temporary result sets.

Example: Define a CTE called active_users that selects user IDs where status is 'active', then use it in the main query to filter orders to only those from active users.
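
A runnable sqlite3 sketch of that example; the table and column names follow the description above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INT, status TEXT);
    CREATE TABLE orders (id INT, user_id INT, amount INT);
    INSERT INTO users VALUES (1, 'active'), (2, 'inactive'), (3, 'active');
    INSERT INTO orders VALUES (10, 1, 50), (11, 2, 75), (12, 3, 20);
""")

# CTE: name the intermediate result, then use it like a table
print(con.execute("""
    WITH active_users AS (
        SELECT id FROM users WHERE status = 'active'
    )
    SELECT o.* FROM orders o
    JOIN active_users a ON o.user_id = a.id
""").fetchall())   # orders 10 and 12 only
```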

Use cases:

  • Query has multiple steps (improves readability)
  • Same subquery used multiple times
  • Recursive queries
  • Breaking down complex logic

Q24: WHERE vs HAVING

Question: What is the difference between WHERE and HAVING?

Expected answer:

Clause | Timing            | Purpose
-------|-------------------|-------------------------
WHERE  | Before grouping   | Filters individual rows
HAVING | After aggregation | Filters groups

Example: A query that counts employees per department would use WHERE to filter for active employees (before grouping), then HAVING to keep only departments with more than 10 active employees (after grouping).
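
A runnable sqlite3 sketch of that query; the counts are invented so that only one department survives the HAVING filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (department TEXT, active INT)")
rows = [("Eng", 1)] * 12 + [("Eng", 0)] * 3 + [("Sales", 1)] * 4
con.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# WHERE filters rows before grouping; HAVING filters groups after
print(con.execute("""
    SELECT department, COUNT(*) AS n
    FROM employees
    WHERE active = 1          -- row filter: only active employees counted
    GROUP BY department
    HAVING COUNT(*) > 10      -- group filter: keep large departments
""").fetchall())   # [('Eng', 12)] -- Sales (4 active) is filtered out
```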

Execution order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY

Q25: NULL Handling

Question: How do you handle NULL values in SQL?

Expected answer:

NULL behavior:

  • NULL = NULL returns NULL (not TRUE)
  • Use IS NULL or IS NOT NULL for comparisons
  • COUNT(*) counts all rows; COUNT(column) excludes NULLs
  • SUM/AVG ignore NULLs
  • NULL in comparisons produces NULL (unknown)

Important: A query filtering for status not equal to 'active' will NOT find rows where status is NULL. To include NULLs, explicitly add OR status IS NULL.

Default values: Use COALESCE to provide default values for NULLs (e.g., replace NULL middle_name with empty string).
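
A sqlite3 sketch demonstrating these behaviors on an invented users table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INT, status TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "active"), (2, "inactive"), (3, None)])

# <> 'active' does NOT match the NULL row
print(con.execute(
    "SELECT id FROM users WHERE status <> 'active'"
).fetchall())   # [(2,)]

# Explicitly include NULLs
print(con.execute(
    "SELECT id FROM users WHERE status <> 'active' OR status IS NULL"
).fetchall())   # [(2,), (3,)]

# COUNT(*) counts all rows; COUNT(column) skips NULLs;
# COALESCE substitutes a default for NULL
print(con.execute(
    "SELECT COUNT(*), COUNT(status), COALESCE(MAX(status), 'none') FROM users"
).fetchall())   # [(3, 2, 'inactive')]
```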