Machine Learning for Data Science Interviews
Machine learning interviews for data science roles focus on algorithm selection, understanding trade-offs, and explaining concepts clearly. This section covers the core knowledge areas.
| Paradigm | Description | Examples |
|---|---|---|
| Supervised learning | Labeled training data | Classification, regression |
| Unsupervised learning | No labels | Clustering, dimensionality reduction |
| Task | Output | Examples |
|---|---|---|
| Classification | Discrete category | Spam detection, fraud classification |
| Regression | Continuous value | Price prediction, demand forecasting |
Logistic regression: a binary classification algorithm that predicts class probabilities.
| Characteristic | Description |
|---|---|
| Use case | Binary classification, baseline model |
| Advantages | Fast training, interpretable coefficients, minimal hyperparameters |
| Limitations | Assumes linear decision boundary, limited pattern complexity |
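A minimal scikit-learn sketch of a logistic regression baseline. The dataset comes from `make_classification`, and the sample size, split, and `max_iter` value are illustrative choices, not requirements.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the baseline, then inspect probabilities and coefficients
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # predicted class-1 probabilities
print("Test accuracy:", model.score(X_test, y_test))
print("Coefficients:", model.coef_[0])      # interpretable per-feature weights
```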
Decision trees: recursive partitioning of the feature space based on feature thresholds.
| Characteristic | Description |
|---|---|
| Use case | Interpretable models, mixed feature types |
| Advantages | Handles non-linearity, no feature scaling required |
| Limitations | Prone to overfitting, unstable with data changes |
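A similar sketch for a single decision tree on synthetic data; `max_depth=3` is an arbitrary cap used here to keep the printed rules readable and limit overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree: each split is a single feature threshold
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # human-readable threshold rules
```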
Random forest: an ensemble of decision trees trained on random subsets of the data and features.
| Characteristic | Description |
|---|---|
| Use case | General-purpose classification, feature importance |
| Advantages | Robust, handles high dimensions, reduced overfitting |
| Limitations | Slower than single tree, less interpretable |
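A random forest sketch on synthetic data, showing the impurity-based feature importances mentioned above; `n_estimators=300` and the data shape are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))

# Impurity-based importances: a quick ranking of the most useful features
for idx in forest.feature_importances_.argsort()[::-1][:5]:
    print(f"feature {idx}: {forest.feature_importances_[idx]:.3f}")
```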
Gradient boosting (e.g., XGBoost, LightGBM): sequential tree training where each new tree corrects the errors of the previous trees.
| Characteristic | Description |
|---|---|
| Use case | Tabular data, maximum accuracy requirements |
| Advantages | High accuracy on structured data |
| Limitations | Requires hyperparameter tuning, can overfit, slower training |
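XGBoost and LightGBM are separate libraries; as a stand-in, this sketch uses scikit-learn's `GradientBoostingClassifier` with illustrative values for the usual knobs (`n_estimators`, `learning_rate`, `max_depth`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate, n_estimators, and max_depth are usually tuned together
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)
gbm.fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))
```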
| Scenario | Recommended Algorithm |
|---|---|
| Baseline model | Logistic regression |
| Interpretability required | Logistic regression, decision tree |
| General purpose | Random forest |
| Maximum accuracy on tabular data | XGBoost/LightGBM |
| Text classification | Naive Bayes, logistic regression, transformer models |
Model error decomposes into three components:
Total Error = Bias^2 + Variance + Irreducible Noise
| Component | Description | Symptom |
|---|---|---|
| Bias | Model too simple, misses patterns | Underfitting |
| Variance | Model too complex, fits noise | Overfitting |
| Observation | Problem | Solutions |
|---|---|---|
| High training error, high test error | High bias | More features, complex model, less regularization |
| Low training error, high test error | High variance | More data, simpler model, more regularization |
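A quick way to see both failure modes is to vary model complexity on noisy synthetic data: a degree-1 polynomial underfits (high bias, both errors high), while a degree-15 polynomial overfits (low training error, high test error). The data-generating function and the degrees below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data: the true relationship is non-linear
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```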
| Metric | Formula | Use Case |
|---|---|---|
| Accuracy | Correct / Total | Balanced classes only |
| Precision | TP / (TP + FP) | False positives are costly |
| Recall | TP / (TP + FN) | False negatives are costly |
| F1 Score | 2 * (P * R) / (P + R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Threshold-independent ranking |
Note: Accuracy is misleading for imbalanced datasets. A model predicting the majority class achieves high accuracy without identifying minority cases.
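A sketch computing these metrics with scikit-learn on a synthetic, roughly 90/10 imbalanced dataset (illustrative); note that AUC-ROC takes probability scores rather than hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced data to show why accuracy alone is misleading
X, y = make_classification(n_samples=5_000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))
print("auc-roc  :", roc_auc_score(y_test, proba))  # uses scores, not labels
```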
| Metric | Description |
|---|---|
| MSE | Mean squared error, penalizes large errors |
| RMSE | Square root of MSE, same units as target |
| MAE | Mean absolute error, robust to outliers |
| R² | Proportion of variance explained |
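A small worked example with scikit-learn's regression metrics on made-up numbers; RMSE is computed as the square root of MSE so it lands in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 12.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))          # same units as the target
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R²  :", r2_score(y_true, y_pred))
```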
Cross-validation provides reliable performance estimates by averaging over multiple train/test splits. The standard k-fold procedure:
- Split data into k equal parts
- Train on k-1 parts, test on remaining part
- Rotate and repeat k times
- Average results across folds
| Parameter | Common Values |
|---|---|
| k | 5 or 10 |
| Type | Use Case |
|---|---|
| Stratified k-fold | Imbalanced classes |
| Time series split | Temporal data |
| Leave-one-out | Small datasets |
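A sketch of stratified 5-fold cross-validation with scikit-learn; the model, scoring metric, and synthetic imbalanced data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the class ratio stable across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")
print("F1 per fold:", scores.round(3))
print("Mean / std :", scores.mean().round(3), scores.std().round(3))
```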
| Method | Formula | Use Case |
|---|---|---|
| StandardScaler | (x - mean) / std | SVM, neural networks, KNN |
| MinMaxScaler | (x - min) / (max - min) | Bounded output required |
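A short comparison of the two scalers on a toy array. In practice the scaler should be fit on the training split only and then applied to the test split, to avoid leaking test statistics.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# StandardScaler: zero mean, unit variance per column
print(StandardScaler().fit_transform(X))

# MinMaxScaler: rescales each column to [0, 1]
print(MinMaxScaler().fit_transform(X))
```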
| Approach | Description |
|---|---|
| Drop rows | Few missing values |
| Mean/median imputation | Numeric features |
| Missing indicator | Create binary flag for missingness |
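A sketch of median imputation combined with a missingness indicator, using scikit-learn's `SimpleImputer` on a toy array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])

# Median imputation plus a binary "was missing" flag per affected column
imputer = SimpleImputer(strategy="median", add_indicator=True)
print(imputer.fit_transform(X))
```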
| Method | Description | Use Case |
|---|---|---|
| One-hot encoding | Binary column per category | Nominal categories |
| Label encoding | Map each category to an integer | Ordinal categories |
| Target encoding | Replace with target mean | High cardinality (with leakage precautions) |
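A sketch of one-hot and ordinal encoding on a toy frame; the category names and ordering are made up for illustration. Target encoding is omitted here because it requires leakage precautions (e.g., out-of-fold encoding).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "green"],       # nominal feature
    "size": ["small", "large", "medium"],    # ordinal feature
})

# One-hot: a binary column per nominal category
print(pd.get_dummies(df, columns=["color"]))

# Ordinal: preserve the small < medium < large ordering explicitly
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform(df[["size"]]))
```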
| Method | Description |
|---|---|
| Correlation analysis | Remove features with low correlation to the target |
| Tree feature importance | Use random forest importance scores |
| Lasso (L1) | Automatic selection via coefficient zeroing |
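A sketch of Lasso-based selection via `SelectFromModel` on synthetic regression data where only a few features carry signal; `alpha=1.0` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 20 features, only 5 carry signal; L1 should zero out most of the rest
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0, max_iter=10_000)).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```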
| Technique | Description |
|---|---|
| Oversampling (SMOTE) | Generate synthetic minority examples |
| Undersampling | Reduce majority class examples |
| Class weights | Increase penalty for minority misclassification |
| Threshold tuning | Adjust classification threshold below 0.5 |
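A sketch combining class weights and threshold tuning on a synthetic 95/5 dataset; the 0.3 threshold is illustrative. SMOTE is provided by the separate imbalanced-learn package and is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights minority-class errors during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Threshold tuning: lowering the cutoff trades precision for recall
proba = clf.predict_proba(X_test)[:, 1]
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}  F1={f1_score(y_test, pred):.3f}")
```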
Regularization prevents overfitting by penalizing model complexity.
| Type | Penalty | Effect |
|---|---|---|
| L1 (Lasso) | Sum of absolute coefficients | Can zero out features |
| L2 (Ridge) | Sum of squared coefficients | Shrinks all coefficients |
| Dropout | Random neuron deactivation | Neural network regularization |
| Early stopping | Halt training when validation error increases | Prevents overtraining |
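A sketch contrasting the L1 and L2 penalties on synthetic data: Lasso drives most uninformative coefficients exactly to zero, while Ridge only shrinks them. The `alpha` values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       noise=5, random_state=0)

# Same data, two penalties: count how many coefficients land at exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```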
Approach for 95% class A, 5% class B scenario:
- Replace accuracy with precision/recall or AUC-ROC
- Apply class weights to penalize minority misclassification
- Use SMOTE for synthetic minority sample generation
- Evaluate threshold values below 0.5
A model that performs well on training data but poorly on test data indicates overfitting. Solutions:
- Increase training data
- Apply regularization
- Use simpler model
- Reduce feature count
- Implement early stopping for neural networks
| Factor | Random Forest | Gradient Boosting |
|---|---|---|
| Tuning required | Minimal | Significant |
| Overfitting risk | Lower | Higher |
| Training speed | Faster | Slower |
| Accuracy ceiling | Good | Higher potential |
Approach to selecting features from a large candidate set:
- Analyze correlation with the target variable
- Remove highly correlated redundant features
- Use tree-based feature importance
- Apply Lasso for automatic selection
- Incorporate domain knowledge
| Pitfall | Description |
|---|---|
| Data leakage | Using test information during training |
| Single split evaluation | Results depend on specific random split |
| Ignoring class imbalance | High accuracy masks poor minority detection |
| Pre-split feature engineering | Scaling or other statistics computed on the full dataset before the train/test split |
| Validation overfitting | Excessive hyperparameter tuning on validation set |
| Task | Initial Model | Advanced Model |
|---|---|---|
| Binary classification | Logistic regression | Random forest, XGBoost |
| Multi-class classification | Logistic regression | Random forest, XGBoost |
| Regression | Linear regression | Random forest, XGBoost |
| High-dimensional data | Lasso | Random forest |
| Interpretability required | Logistic regression | Decision tree |
| Text data | Naive Bayes | Logistic regression + TF-IDF, transformer models |