Machine Learning for Data Science Interviews

Machine learning interviews for data science roles focus on algorithm selection, understanding trade-offs, and explaining concepts clearly. This section covers the core knowledge areas.

Learning Paradigms

| Paradigm | Description | Examples |
| --- | --- | --- |
| Supervised learning | Labeled training data | Classification, regression |
| Unsupervised learning | No labels | Clustering, dimensionality reduction |

Supervised Learning Tasks

| Task | Output | Examples |
| --- | --- | --- |
| Classification | Discrete category | Spam detection, fraud classification |
| Regression | Continuous value | Price prediction, demand forecasting |

Classification Algorithms

Logistic Regression

Binary classification algorithm that predicts class probabilities.

| Characteristic | Description |
| --- | --- |
| Use case | Binary classification, baseline model |
| Advantages | Fast training, interpretable coefficients, minimal hyperparameters |
| Limitations | Assumes linear decision boundary, limited pattern complexity |
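
A minimal baseline sketch, assuming scikit-learn is available; the data here is synthetic (make_classification) and stands in for a real feature matrix:

```python
# Logistic regression baseline: fast to train, interpretable coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:5]))  # predicted class probabilities
print(clf.score(X_test, y_test))      # test-set accuracy
```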

Decision Trees

Recursive partitioning based on feature thresholds.

| Characteristic | Description |
| --- | --- |
| Use case | Interpretable models, mixed feature types |
| Advantages | Handles non-linearity, no feature scaling required |
| Limitations | Prone to overfitting, unstable with data changes |
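
A short sketch of the threshold-splitting idea, again assuming scikit-learn on synthetic data; export_text prints the learned rules:

```python
# Decision tree: recursive if/else splits on feature thresholds.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # cap depth to curb overfitting
tree.fit(X, y)
print(export_text(tree))  # human-readable threshold rules
```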

Random Forest

Ensemble of decision trees, each trained on a random subset of the rows and a random subset of the features.

| Characteristic | Description |
| --- | --- |
| Use case | General-purpose classification, feature importance |
| Advantages | Robust, handles high dimensions, reduced overfitting |
| Limitations | Slower than single tree, less interpretable |
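
A sketch of the ensemble and its feature-importance scores, assuming scikit-learn and synthetic data:

```python
# Random forest: many decorrelated trees, predictions averaged.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X, y)
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```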

Gradient Boosting (XGBoost, LightGBM)

Sequential tree training where each tree corrects previous errors.

| Characteristic | Description |
| --- | --- |
| Use case | Tabular data, maximum accuracy requirements |
| Advantages | High accuracy on structured data |
| Limitations | Requires hyperparameter tuning, can overfit, slower training |
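
A sketch using scikit-learn's HistGradientBoostingClassifier, a LightGBM-style histogram implementation chosen here to avoid extra dependencies; XGBoost and LightGBM expose similar fit/predict APIs, though their exact parameter names differ:

```python
# Gradient boosting: trees added sequentially, each fitting the
# current ensemble's errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = HistGradientBoostingClassifier(
    learning_rate=0.1,    # key tuning knob: smaller is slower but often better
    max_iter=200,         # number of boosting rounds (trees)
    early_stopping=True,  # stop when the validation score stops improving
    random_state=0,
)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```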

Algorithm Selection Guide

| Scenario | Recommended Algorithm |
| --- | --- |
| Baseline model | Logistic regression |
| Interpretability required | Logistic regression, decision tree |
| General purpose | Random forest |
| Maximum accuracy on tabular data | XGBoost/LightGBM |
| Text classification | Naive Bayes, logistic regression, transformer models |

Bias-Variance Trade-off

Model error decomposes into three components:

Total Error = Bias^2 + Variance + Irreducible Noise

| Component | Description | Symptom |
| --- | --- | --- |
| Bias | Model too simple, misses patterns | Underfitting |
| Variance | Model too complex, fits noise | Overfitting |

Diagnosis

| Observation | Problem | Solutions |
| --- | --- | --- |
| High training error, high test error | High bias | More features, complex model, less regularization |
| Low training error, high test error | High variance | More data, simpler model, more regularization |
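
One practical diagnostic is a learning curve: compare training and validation scores as the training set grows. A sketch with scikit-learn's learning_curve on synthetic data:

```python
# Learning curve: a large train/validation gap suggests high variance;
# low scores on both suggest high bias.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1500, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.2f}, val={va:.2f}")
```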

Model Evaluation

Classification Metrics

| Metric | Formula | Use Case |
| --- | --- | --- |
| Accuracy | Correct / Total | Balanced classes only |
| Precision | TP / (TP + FP) | False positives are costly |
| Recall | TP / (TP + FN) | False negatives are costly |
| F1 Score | 2 * (P * R) / (P + R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Threshold-independent ranking |

Note: Accuracy is misleading for imbalanced datasets. A model predicting the majority class achieves high accuracy without identifying minority cases.
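
All of these metrics are available in sklearn.metrics; a small sketch with toy labels:

```python
# Computing the classification metrics above on toy predictions.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc-roc  :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```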

Regression Metrics

| Metric | Description |
| --- | --- |
| MSE | Mean squared error, penalizes large errors |
| RMSE | Square root of MSE, same units as target |
| MAE | Mean absolute error, robust to outliers |
| R² | Proportion of variance explained |
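
A matching sketch for the regression metrics, computed on toy values:

```python
# Computing the regression metrics above.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))  # same units as the target
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
```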

Cross-Validation

Cross-validation provides reliable performance estimates by using multiple train/test splits.

K-Fold Cross-Validation

  1. Split data into k equal parts
  2. Train on k-1 parts, test on remaining part
  3. Rotate and repeat k times
  4. Average results across folds
| Parameter | Common Values |
| --- | --- |
| k | 5 or 10 |
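
The whole procedure is one call with scikit-learn's cross_val_score; a minimal sketch on synthetic data:

```python
# 5-fold cross-validation: five train/test rotations, scores averaged.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average performance and its spread
```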

Cross-Validation Variants

| Type | Use Case |
| --- | --- |
| Stratified k-fold | Imbalanced classes |
| Time series split | Temporal data |
| Leave-one-out | Small datasets |
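
A sketch of the first two variants as scikit-learn splitters, on toy data:

```python
# StratifiedKFold preserves class ratios per fold; TimeSeriesSplit
# guarantees training data always precedes test data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced labels

for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    print("stratified test fold:", test_idx)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("time series:", train_idx, "->", test_idx)
```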

Feature Engineering

Feature Scaling

| Method | Formula | Use Case |
| --- | --- | --- |
| StandardScaler | (x - mean) / std | SVM, neural networks, KNN |
| MinMaxScaler | (x - min) / (max - min) | Bounded output required |
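
Scaling belongs inside a Pipeline so the scaler's statistics come from training data only, which avoids the leakage pitfall discussed later. A sketch:

```python
# Fit the scaler on training data only by wrapping it in a Pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), SVC())  # scaler fit during model.fit
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```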

Missing Value Handling

| Approach | Description |
| --- | --- |
| Drop rows | Suitable when few values are missing |
| Mean/median imputation | Numeric features |
| Missing indicator | Create binary flag for missingness |
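
A sketch of median imputation with a missingness flag, via scikit-learn's SimpleImputer and its add_indicator option:

```python
# Median imputation plus binary missing-value indicators.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
imp = SimpleImputer(strategy="median", add_indicator=True)
print(imp.fit_transform(X))  # imputed columns followed by indicator flags
```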

Categorical Variable Encoding

| Method | Description | Use Case |
| --- | --- | --- |
| One-hot encoding | Binary column per category | Nominal categories |
| Label encoding | Map each category to an integer | Ordinal categories |
| Target encoding | Replace with target mean | High cardinality (with leakage precautions) |
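
A sketch of the first two encodings with scikit-learn; the explicit category order passed to OrdinalEncoder is an illustrative assumption:

```python
# One-hot encoding for nominal categories, ordinal encoding for ordered ones.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["blue"], ["red"]])        # nominal: no order
sizes = np.array([["small"], ["large"], ["medium"]])   # ordinal: has order

print(OneHotEncoder().fit_transform(colors).toarray())  # one binary column per category
print(OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes))
```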

Feature Selection

| Method | Description |
| --- | --- |
| Correlation analysis | Remove features with low correlation to the target |
| Tree feature importance | Use random forest importance scores |
| Lasso (L1) | Automatic selection via coefficient zeroing |
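
A sketch of Lasso-based selection on synthetic data; features whose coefficients are driven to exactly zero are dropped:

```python
# L1 regularization zeroes out weak coefficients, selecting features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print("kept features:", np.flatnonzero(lasso.coef_))  # nonzero-coefficient indices
```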

Handling Imbalanced Data

| Technique | Description |
| --- | --- |
| Oversampling (SMOTE) | Generate synthetic minority examples |
| Undersampling | Reduce majority class examples |
| Class weights | Increase penalty for minority misclassification |
| Threshold tuning | Adjust classification threshold below 0.5 |
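
A sketch of the class-weight technique with scikit-learn; SMOTE itself lives in the separate imbalanced-learn package (imblearn.over_sampling.SMOTE) and is not shown here:

```python
# class_weight="balanced" raises the penalty on minority-class errors.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic 95/5 imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(X, y)
    print(cw, "minority recall:", round(recall_score(y, clf.predict(X)), 2))
```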

Regularization

Regularization prevents overfitting by penalizing model complexity.

| Type | Penalty | Effect |
| --- | --- | --- |
| L1 (Lasso) | Sum of absolute coefficients | Can zero out features |
| L2 (Ridge) | Sum of squared coefficients | Shrinks all coefficients |
| Dropout | Random neuron deactivation | Neural network regularization |
| Early stopping | Halt training when validation error increases | Prevents overtraining |
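
A sketch contrasting the L1 and L2 penalties on the same synthetic regression task; only L1 produces exact zeros:

```python
# L1 (Lasso) zeroes coefficients; L2 (Ridge) shrinks them toward zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
print("L1 zero coefficients:", np.sum(Lasso(alpha=1.0).fit(X, y).coef_ == 0))
print("L2 zero coefficients:", np.sum(Ridge(alpha=1.0).fit(X, y).coef_ == 0))
```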

Common Interview Topics

Class Imbalance Handling

Approach for 95% class A, 5% class B scenario:

  1. Replace accuracy with precision/recall or AUC-ROC
  2. Apply class weights to penalize minority misclassification
  3. Use SMOTE for synthetic minority sample generation
  4. Evaluate threshold values below 0.5 (sketched below)
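
A sketch of step 4, sweeping thresholds over predicted probabilities on a synthetic imbalanced dataset (evaluated on the training data purely for illustration):

```python
# Lowering the threshold trades precision for minority-class recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]  # probability of the minority class
for t in (0.5, 0.3, 0.1):
    preds = (proba >= t).astype(int)
    print(f"threshold {t}: recall={recall_score(y, preds):.2f}")
```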

High Training / Low Test Accuracy

Indicates overfitting. Solutions:

  • Increase training data
  • Apply regularization
  • Use simpler model
  • Reduce feature count
  • Implement early stopping for neural networks

Random Forest vs Gradient Boosting

| Factor | Random Forest | Gradient Boosting |
| --- | --- | --- |
| Tuning required | Minimal | Significant |
| Overfitting risk | Lower | Higher |
| Training speed | Faster | Slower |
| Accuracy ceiling | Good | Higher potential |

Feature Selection Methods

  1. Analyze correlation with target variable
  2. Remove highly correlated redundant features
  3. Use tree-based feature importance
  4. Apply Lasso for automatic selection
  5. Incorporate domain knowledge

Common Pitfalls

| Pitfall | Description |
| --- | --- |
| Data leakage | Using test information during training |
| Single-split evaluation | Results depend on one specific random split |
| Ignoring class imbalance | High accuracy masks poor minority detection |
| Pre-split feature engineering | Scalers or encoders fit on the full dataset before the train/test split |
| Validation overfitting | Excessive hyperparameter tuning on the validation set |

Algorithm Reference

| Task | Initial Model | Advanced Model |
| --- | --- | --- |
| Binary classification | Logistic regression | Random forest, XGBoost |
| Multi-class classification | Logistic regression | Random forest, XGBoost |
| Regression | Linear regression | Random forest, XGBoost |
| High-dimensional data | Lasso | Random forest |
| Interpretability required | Logistic regression | Decision tree |
| Text data | Naive Bayes | Logistic regression + TF-IDF, transformer models |