
ML Concepts Interview Questions

This document covers common ML theory questions that appear in interviews, including expected answers and relevant context.

Supervised Learning

Q1: Difference between classification and regression

Aspect | Classification | Regression
Output | Discrete class labels | Continuous values
Examples | Spam/not spam, cat/dog/bird | House price, temperature
Loss functions | Cross-entropy | MSE, MAE

Shared algorithms: decision trees and neural networks can perform both tasks.

Q2: Parametric vs non-parametric models

Aspect | Parametric | Non-parametric
Parameters | Fixed count regardless of data size | Grows with data
Examples | Linear regression, logistic regression, naive Bayes | KNN, decision trees, SVM with RBF kernel
Assumptions | Distribution assumptions about data | Fewer assumptions
Training/Inference | Faster | Slower
Complexity | May underfit complex patterns | Risk of overfitting

Q3: Logistic regression vs linear regression

Aspect | Linear Regression | Logistic Regression
Output | Continuous values | Probabilities (0-1)
Function | Linear: y = wx + b | Sigmoid: 1/(1 + e^(-z))
Loss | MSE | Log loss (cross-entropy)
Assumptions | Normal errors | No normality assumption
Bounds | Unbounded | Bounded [0, 1]

Q4: Decision tree splitting

Decision trees evaluate all features and possible split points, selecting the split that maximizes information gain or minimizes impurity.

Criterion | Use Case | Formula
Gini impurity | Classification | 1 - Sum(pi^2)
Entropy / Information Gain | Classification | -Sum(pi * log2(pi))
Variance reduction | Regression | Variance before - weighted variance after

The algorithm is greedy and does not guarantee globally optimal splits.
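
As a rough illustration of the impurity calculation, here is a minimal NumPy sketch that computes Gini impurity and the weighted impurity of a candidate threshold split; the function names and toy data are made up for the example.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    # Weighted Gini impurity of the two children produced by a threshold split
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Lower weighted impurity after the split means a better split
y = np.array([0, 0, 1, 1, 1, 0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1.5])
print(gini(y), split_impurity(x, y, threshold=2.5))
```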

Q5: Random forests vs single decision trees

Mechanism | Description
Bagging | Each tree trains on a bootstrap sample (random subset with replacement)
Feature randomization | Each split considers only a random subset of features
Ensemble averaging | Reduces variance while maintaining bias

Individual trees overfit in different ways; averaging cancels noise. Trade-off: reduced interpretability compared to single trees.

Ensemble Methods

Q6: Bagging vs boosting

Aspect | Bagging | Boosting
Training | Parallel, independent | Sequential, each learns from previous errors
Data sampling | Bootstrap (with replacement) | Weighted based on previous errors
Goal | Reduce variance | Reduce bias
Examples | Random Forest | XGBoost, AdaBoost
Overfitting risk | Lower | Higher with too many rounds

Q7: Gradient boosting mechanism

  1. Start with initial prediction (often mean for regression)
  2. Calculate residuals (errors) from current predictions
  3. Fit new weak learner to predict residuals
  4. Add new learner's predictions (scaled by learning rate) to ensemble
  5. Repeat steps 2-4

Hyperparameter | Effect
Learning rate | Smaller values require more trees but improve generalization
Number of trees | More trees increase overfitting risk
Max depth | Shallower trees create simpler weak learners
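
A minimal sketch of the residual-fitting loop above for squared-error regression, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the hyperparameter values and function names are illustrative, not a production implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Step 1: initial prediction is the mean of the target
    init = y.mean()
    pred = np.full_like(y, init, dtype=float)
    trees = []
    for _ in range(n_trees):
        # Step 2: residuals of the current ensemble (negative gradient of MSE)
        residuals = y - pred
        # Step 3: fit a shallow tree to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 4: add the scaled tree predictions to the ensemble
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def gradient_boost_predict(X, init, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```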

Q8: XGBoost vs LightGBM

Aspect | XGBoost | LightGBM
Tree growth | Level-wise (complete level before next) | Leaf-wise (grows leaf with largest loss reduction)
Speed | Generally slower | Generally faster, especially on large datasets
Categorical features | Requires encoding | Native handling
Memory | Standard | More efficient with histogram-based approach

Common features: both support regularization, early stopping, and GPU training.

Neural Networks

Q9: Backpropagation

  1. Forward pass: Compute predictions by passing input through network
  2. Calculate loss: Measure prediction error
  3. Backward pass: Compute gradients using chain rule, starting from loss and working backwards through each layer
    • Gradient formula: dL/dw = dL/doutput * doutput/dw
  4. Update weights: w_new = w_old - learning_rate * gradient

The chain rule enables computing all gradients in one backward pass.
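
A minimal worked example of one forward/backward pass for a single sigmoid neuron with squared-error loss, following the chain rule as stated above; the inputs and weights are made-up toy values.

```python
import numpy as np

# Toy setup: one sigmoid neuron, squared-error loss, one training step
x, y_true = np.array([0.5, -1.0]), 1.0
w, b, lr = np.array([0.1, 0.2]), 0.0, 0.1

# Forward pass
z = w @ x + b
y_pred = 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
loss = 0.5 * (y_pred - y_true) ** 2

# Backward pass (chain rule): dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = y_pred - y_true
dy_dz = y_pred * (1.0 - y_pred)              # sigmoid derivative
dL_dz = dL_dy * dy_dz
dL_dw = dL_dz * x
dL_db = dL_dz

# Weight update: w_new = w_old - learning_rate * gradient
w -= lr * dL_dw
b -= lr * dL_db
```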

Q10: Vanishing gradient problem

In deep networks, gradients can become extremely small in early layers.

Cause | Description
Sigmoid/tanh activation | Derivatives less than 1
Deep networks | Many layers multiply small values

Symptom | Description
Early layers not learning | Gradients too small for meaningful updates
Training stalls | Loss stops decreasing

Solution | Mechanism
ReLU activation | Gradient is 1 for positive values
Batch normalization | Normalizes layer inputs
Residual connections | Gradient flows directly through skip connections
Better initialization | Xavier/He initialization

Q11: Activation functions comparison

Function | Formula | Pros | Cons | Use Case
Sigmoid | 1/(1+e^-x) | Bounded (0,1), interpretable | Vanishing gradients, not zero-centered | Output layer (binary)
Tanh | (e^x - e^-x)/(e^x + e^-x) | Zero-centered | Vanishing gradients | RNNs (historically)
ReLU | max(0, x) | No vanishing gradient, fast | Dead neurons, not zero-centered | Hidden layers (default)
Leaky ReLU | max(0.01x, x) | No dead neurons | Slightly more complex | Alternative to ReLU
Softmax | e^xi / Sum(e^xj) | Probabilities sum to 1 | Only for output | Multi-class output

Q12: Dropout mechanism

During training, randomly zero out a fraction of neurons. Different neurons are dropped each forward pass. At test time, scale activations to compensate (or use inverted dropout during training).

Mechanism | Effect
Prevents co-adaptation | No neuron can rely on specific other neurons
Ensemble effect | Trains exponentially many different sub-networks
Regularization | Noise prevents overfitting

Typical rates: 0.2-0.5 for hidden layers, lower for input layers.
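
A minimal NumPy sketch of inverted dropout as described above, where scaling by the keep probability happens at training time so inference needs no change; the rate and shapes are illustrative.

```python
import numpy as np

def inverted_dropout(activations, rate=0.5, training=True):
    # Inverted dropout: scale at training time so test time is the identity
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

h = np.random.randn(4, 8)                                  # activations for a batch of 4
h_train = inverted_dropout(h, rate=0.5, training=True)     # random neurons zeroed, rest scaled up
h_test = inverted_dropout(h, rate=0.5, training=False)     # unchanged at test time
```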

Q13: Batch normalization

Normalizes layer inputs: x_norm = (x - mu_batch) / sigma_batch

Includes learnable scale (gamma) and shift (beta) parameters. Applied after linear transformation, before activation.

Benefit | Description
Faster training | Higher learning rates possible
Reduces internal covariate shift | Stabilizes distribution of inputs
Regularization effect | Batch statistics add noise
Reduces initialization sensitivity | Less dependent on initial weights

At inference: Uses running mean/variance from training, not batch statistics.
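
A minimal NumPy sketch of the batch-norm forward pass, using batch statistics during training and running statistics at inference as described above; the momentum and epsilon values are common defaults chosen here for illustration.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    if training:
        mu = x.mean(axis=0)                   # batch mean per feature
        var = x.var(axis=0)                   # batch variance per feature
        # update running statistics used later at inference
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var   # inference uses running statistics
    x_norm = (x - mu) / np.sqrt(var + eps)
    # learnable scale (gamma) and shift (beta)
    return gamma * x_norm + beta, running_mean, running_var
```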

Regularization

Q14: L1 vs L2 regularization

Aspect | L1 (Lasso) | L2 (Ridge)
Penalty | Sum(abs(w)) | Sum(w^2)
Effect on weights | Pushes to exactly zero | Shrinks toward zero
Feature selection | Yes (sparse solutions) | No
Geometry | Diamond constraint | Circle constraint
Correlated features | Picks one arbitrarily | Spreads weight among them
Use case | Feature selection, interpretability | When all features might matter

Q15: Early stopping

Stop training when validation loss stops improving. Monitor validation metric and stop after N epochs with no improvement.

Benefit | Description
Limits model complexity | Without adding an explicit hyperparameter
Prevents overfitting | Stops before model memorizes training noise
Computational savings | No wasted training epochs

Early training: Model learns real patterns. Later training: Model starts memorizing noise.

Bias-Variance Trade-off

Q16: Bias-variance decomposition

Total Error = Bias^2 + Variance + Irreducible Noise

Component | Description
Bias | Error from overly simple assumptions. Model consistently misses the pattern.
Variance | Error from sensitivity to training data. Model changes significantly with different samples.

Simple Model | Complex Model
High bias | Low bias
Low variance | High variance
Underfitting | Overfitting

Increasing model complexity decreases bias but increases variance. Goal: find the minimum total error.

Q17: Diagnosing high bias vs high variance

Metric | High Bias | High Variance
Training error | High | Low
Validation error | High | High
Gap between train/val | Small | Large

Condition | Solutions
High bias | More features, more complex model, less regularization, longer training
High variance | More data, simpler model, more regularization, dropout, early stopping

Model Evaluation

Q18: ROC-AUC vs precision-recall AUC

Metric | Description | Best For
ROC-AUC | True positive rate vs false positive rate | Balanced classes
PR-AUC | Precision vs recall | Imbalanced datasets, focus on minority class

With imbalanced data, ROC-AUC can appear high even with poor minority class performance. PR-AUC is more sensitive to improvements on the minority class.

Q19: Precision, recall, and F1 score

Metric | Formula | Interpretation | Optimize When
Precision | TP / (TP + FP) | Of predicted positives, how many are correct | False positives are costly (spam filter)
Recall | TP / (TP + FN) | Of actual positives, how many were found | False negatives are costly (cancer detection)
F1 | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean balancing both | Both matter equally
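
A quick worked example applying the formulas above to made-up confusion-matrix counts:

```python
# Made-up counts for a binary classifier
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)                             # 0.80
recall = tp / (tp + fn)                                # ~0.667
f1 = 2 * precision * recall / (precision + recall)     # ~0.727
print(precision, recall, f1)
```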

Q20: K-fold cross-validation

  1. Split data into k folds
  2. Train on k-1 folds, validate on 1
  3. Rotate and repeat k times
  4. Average results

Benefit | Description
Reliable estimate | More stable than a single split
Full data usage | All data used for both training and validation
Reduced variance | Performance estimate has lower variance

Typical k: 5 or 10. Stratified k-fold maintains class distribution in each fold.
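
A minimal scikit-learn sketch of stratified 5-fold cross-validation; the dataset and estimator are placeholders chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Stratified 5-fold keeps the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```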

Feature Engineering

Q21: Handling missing values

Strategy | When to Use | Trade-offs
Drop rows | Few missing, truly random | Simple but loses data
Drop columns | More than 50% missing | Loses entire feature
Mean/median imputation | Numerical, random missing | Simple but reduces variance
Mode imputation | Categorical | Simple but may add bias
Model-based (KNN, regression) | Missing depends on other features | Captures relationships but can overfit
Missing indicator | Missingness itself is informative | Preserves signal, adds features

Consider whether data is missing randomly or with a pattern. If users with high income skip the income field, that is not random.

Q22: Handling categorical variables

Method | When to Use | Pros | Cons
One-hot encoding | Low cardinality, nominal | No ordering assumed | High dimensions
Label encoding | Ordinal, tree models | Compact | Implies ordering
Target encoding | High cardinality | Compact, uses target | Leakage risk
Frequency encoding | Many categories | No leakage | Loses information
Embedding | Very high cardinality | Learns representations | Requires neural network

Always encode after train/test split to prevent leakage.

Q23: Feature scaling

When to scale:

  • Distance-based algorithms (KNN, SVM, k-means)
  • Gradient descent optimization (neural networks, logistic regression)
  • Regularized models (Lasso, Ridge)

When not needed:

  • Tree-based models (random forest, XGBoost) - splits are scale-invariant

Method | Formula | Use Case
StandardScaler | (x - mean) / std | Assumes normal distribution
MinMaxScaler | (x - min) / (max - min) | Scales to [0, 1]
RobustScaler | Uses median and IQR | Robust to outliers

Always fit scaler on training data only, then transform both train and test.
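
A minimal sketch of that fit-on-train-only pattern with scikit-learn's StandardScaler; the toy arrays are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 350.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit statistics on training data only
X_test_scaled = scaler.transform(X_test)         # reuse training mean/std on test data
```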

Imbalanced Data

Q24: Class imbalance handling strategies

Category | Techniques
Resampling | Oversample minority (SMOTE, ADASYN), undersample majority (random, Tomek links), combination (SMOTE + Tomek)
Algorithm-level | Class weights (penalize minority misclassification more), cost-sensitive learning
Threshold adjustment | Default 0.5 may not be optimal; tune based on precision-recall trade-off
Evaluation | Use precision, recall, F1, PR-AUC instead of accuracy
Ensemble methods | BalancedRandomForest, EasyEnsemble

Q25: SMOTE (Synthetic Minority Over-sampling Technique)

  1. For each minority sample, find k nearest minority neighbors
  2. Randomly select one neighbor
  3. Create synthetic sample along the line between the two points
  4. x_synthetic = x_original + rand(0,1) * (x_neighbor - x_original)

Advantage | Limitation
Creates new examples rather than duplicating | Can create noisy samples if minority class is scattered
Helps classifier learn decision boundaries | Does not consider majority class, so it can create overlap

Variants: SMOTE-ENN, SMOTE-Tomek, ADASYN.
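
A minimal NumPy sketch that generates one synthetic sample with the interpolation formula above (in practice a library implementation such as imbalanced-learn is typically used); the helper name and toy data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, rng=np.random.default_rng(0)):
    # Pick a random minority point and one of its k nearest minority neighbors
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))
    _, idx = nn.kneighbors(X_minority[i:i + 1])
    j = rng.choice(idx[0][1:])                 # skip the point itself
    # x_synthetic = x_original + rand(0,1) * (x_neighbor - x_original)
    gap = rng.random()
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])

X_min = np.random.randn(20, 3)                 # toy minority-class points
print(smote_sample(X_min))
```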

Dimensionality Reduction

Q26: PCA (Principal Component Analysis)

  1. Center the data (subtract mean)
  2. Compute covariance matrix
  3. Find eigenvectors (principal components) and eigenvalues
  4. Project data onto top k eigenvectors

Property | Description
Components are orthogonal | Capture independent sources of variation
First component | Captures the most variance
Eigenvalue | Indicates how much variance that component explains

Use Case | Description
Visualization | Reduce 100 features to 2-3 for plotting
Remove multicollinearity | Before regression
Noise reduction | Small components are often noise
Speed up training | Fewer features

Limitation | Description
Linear only | Cannot capture non-linear relationships
Interpretability | Principal components are not original features
Variance assumption | Assumes variance equals importance
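
A minimal NumPy sketch following the four steps above via eigendecomposition of the covariance matrix; library implementations usually use SVD, but the projection is equivalent for this purpose.

```python
import numpy as np

def pca(X, k=2):
    # 1. Center the data
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvectors (components) and eigenvalues (explained variance)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # sort by descending variance
    components = eigvecs[:, order[:k]]
    explained = eigvals[order[:k]] / eigvals.sum()
    # 4. Project data onto the top k components
    return X_centered @ components, explained

X = np.random.randn(100, 5)
X_reduced, explained_ratio = pca(X, k=2)
```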

Q27: PCA vs t-SNE

Aspect | PCA | t-SNE
Goal | Maximize variance | Preserve local structure
Type | Linear | Non-linear
Deterministic | Yes | No (stochastic)
Scalability | Fast, scales well | Slow, O(n^2)
Inverse transform | Yes | No
Use case | Preprocessing, speed | Visualization

t-SNE focuses on keeping similar points close, not preserving global structure.

Unsupervised Learning

Q28: K-means clustering

  1. Initialize k centroids (randomly or k-means++)
  2. Assign each point to nearest centroid
  3. Update centroids to mean of assigned points
  4. Repeat 2-3 until convergence

Choosing k | Method
Elbow method | Plot inertia vs k, look for the elbow
Silhouette score | Measures cluster cohesion vs separation
Domain knowledge | Based on problem requirements

Limitation | Description
Assumes spherical clusters | Does not handle irregular shapes well
Sensitive to initialization | Use k-means++
Must specify k upfront | Cannot discover number of clusters
Sensitive to outliers | Can skew centroid positions
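
A minimal NumPy sketch of the assign/update loop above with plain random initialization (k-means++ is preferred in practice); empty clusters are not handled, and the toy data is made up.

```python
import numpy as np

def kmeans(X, k=3, n_iters=100, rng=np.random.default_rng(0)):
    # 1. Initialize centroids with k random points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)
```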

Q29: Hierarchical clustering vs k-means

Use Hierarchical When | Use K-means When
Number of clusters unknown | k is known or estimable
Exploring different granularities (dendrogram) | Large datasets (hierarchical is O(n^3))
Clusters may have irregular shapes | Spherical clusters expected
Deterministic results required | Speed matters

Advanced Topics

Q30: Attention mechanism in transformers

Attention computes weighted sum of values based on query-key similarity:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Component | Description
Query (Q) | What to look for
Key (K) | What to match against
Value (V) | What to retrieve
Scaling (sqrt(d_k)) | Prevents softmax saturation with large dimensions

Type | Description
Self-attention | Q, K, V all come from the same sequence
Multi-head | Multiple attention computations in parallel, captures different relationships

Benefit | Description
Long-range dependencies | Unlike RNNs
Parallelizable | Unlike sequential RNNs
Interpretable | Attention weights show what the model focuses on
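
A minimal NumPy sketch of scaled dot-product attention matching the formula above; masking and multi-head projections are omitted, and the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # query-key similarity
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V, weights

seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)         # self-attention: Q, K, V from the same sequence
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
output, attn_weights = scaled_dot_product_attention(Q, K, V)
```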

Q31: Word2Vec vs BERT embeddings

Aspect | Word2Vec | BERT
Context | Static (one embedding per word) | Contextual (varies by sentence)
Architecture | Shallow (1-2 layers) | Deep transformer (12-24 layers)
Training | Skip-gram or CBOW | Masked language modeling
Same word handling | Same embedding for "river bank" and "bank account" | Different based on context
Compute | Fast | Heavy

BERT produces contextual embeddings where the same word gets different representations based on surrounding words.

Q32: GAN (Generative Adversarial Network)

Two networks:

  • Generator (G): Creates fake samples from random noise
  • Discriminator (D): Distinguishes real from fake

Training (minimax game):

  • D tries to maximize: correctly classify real vs fake
  • G tries to minimize: fool D into thinking fakes are real

min_G max_D V(D, G) = E[log(D(x))] + E[log(1 - D(G(z)))]

Challenge | Description
Mode collapse | G produces limited variety
Training instability | Hard to balance G and D
No convergence guarantee | May not reach equilibrium

Variants: DCGAN (convolutional), WGAN (Wasserstein loss), StyleGAN (style-based).

Model Deployment

Q33: Model drift types and detection

Drift Type | Description
Data drift | Input distribution changes; features look different than training
Concept drift | Relationship between inputs and outputs changes; what used to predict fraud no longer does
Label drift | Target distribution changes; fraud rate goes from 1% to 5%

Detection Method | Description
Monitor feature distributions | KL divergence, PSI
Track prediction distribution | Are predictions changing?
Monitor performance metrics | Is accuracy dropping?
Statistical tests | Compare recent data to training data

Response | Description
Retrain on recent data | Update model with new patterns
Add new features | Capture the change
Triggered retraining | Retrain when drift exceeds threshold
Online learning | Continuous adaptation
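
A minimal sketch of one of the detection ideas above, the Population Stability Index (PSI) between a training-time feature and a recent window; the bin count and the 0.2 alert threshold are conventional choices, not universal rules.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    # Bin edges come from the training-time (expected) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    # PSI = sum((actual% - expected%) * ln(actual% / expected%))
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

train_feature = np.random.normal(0, 1, 10_000)
recent_feature = np.random.normal(0.3, 1, 10_000)   # shifted distribution
print(psi(train_feature, recent_feature))            # above ~0.2 is often treated as drift
```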

Q34: Batch vs online prediction

Factor | Batch | Online
Latency requirement | Can wait (minutes/hours) | Needed immediately (ms)
Volume | Many predictions at once | One at a time
Infrastructure | Simpler, scheduled jobs | API, always-on service
Feature freshness | Point-in-time features | Real-time features
Cost | Cheaper for large volume | Pay per request

Use Case | Prediction Mode
Recommendation precomputation | Batch
Churn prediction | Batch
Report generation | Batch
Fraud detection | Online
Search ranking | Online
Real-time bidding | Online

Q35: A/B testing in ML

Process:

  1. Define metric (conversion, engagement, revenue)
  2. Randomly split users into control (old model) and treatment (new model)
  3. Collect data for statistical significance
  4. Analyze results, make decision

Pitfall | Description
Peeking | Checking results before reaching the planned sample size
Multiple testing | Testing many variants inflates false positives
Network effects | Users influence each other (use cluster randomization)
Novelty/primacy effects | Short-term behavior differs from long-term
Simpson's paradox | Aggregate results hide segment differences

Sample size calculation based on baseline rate, minimum detectable effect, and power (usually 80%).
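
A minimal sketch of that sample-size calculation for a two-proportion test, assuming a two-sided 5% significance level and 80% power; the baseline rate and minimum detectable effect are made-up inputs.

```python
from scipy.stats import norm

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    # Two-proportion z-test, two-sided; mde is the absolute lift to detect
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# e.g. 5% baseline conversion, detect an absolute lift of 1 percentage point
print(round(sample_size_per_group(baseline=0.05, mde=0.01)))   # required users per group
```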

Quick Reference

Q36: Curse of dimensionality

As dimensions increase, data becomes sparse. In 1D, 10 points might cover the space. In 100D, astronomical amounts of data are required. Distances become meaningless; everything is far from everything. KNN and other distance-based methods break down.

Q37: Generative vs discriminative models

Type | Models | Learns
Generative | Naive Bayes, GANs | P(X, Y), the joint distribution (equivalently P(X|Y) and P(Y))
Discriminative | Logistic regression, SVM | P(Y|X), the decision boundary directly

Q38: Transfer learning

Using a model trained on one task as starting point for another. Freeze early layers (general features), fine-tune later layers (task-specific).

Q39: Epoch, batch, and iteration

Term | Definition
Epoch | One pass through the entire dataset
Batch | Subset of data for one forward/backward pass
Iteration | One batch processed

Q40: Input normalization for neural networks

Benefit | Description
Faster convergence | Similar gradient magnitudes across features
Prevents saturating activations | Keeps values in the active range
Smoother optimization landscape | Easier to navigate
Allows higher learning rates | More aggressive updates possible

Q41: Data leakage

Using information during training that will not be available at prediction time. Model learns to cheat and fails in production.

Prevention | Method
Time-based splits | Never train on the future
Fit on training only | Scalers and encoders fit on training data, then transform test
Feature audit | For each feature, verify it can be computed before seeing the outcome

Q42: SVM vs logistic regression

Use SVM when:

  • Decision boundary is non-linear (with kernels)
  • High-dimensional sparse data (text)
  • Margin maximization is important
  • Small to medium datasets (SVMs do not scale well)

Use logistic regression when:

  • Calibrated probability estimates are needed
  • The dataset is large and training speed matters
  • Interpretable coefficients are important

Q43: Kernel trick

Compute dot products in high-dimensional space without explicitly transforming data. Enables non-linear decision boundaries with linear algorithms.

Q44: Ensemble methods benefits

Benefit | Mechanism
Reduce variance | Bagging
Reduce bias | Boosting
Improve robustness | Multiple models
Combine strengths | Different model types

Q45: Parametric vs non-parametric statistics

Type | Description | Example
Parametric | Assumes a data distribution | t-test assumes normality
Non-parametric | No distribution assumption | Mann-Whitney

Use non-parametric when distribution is unknown or violated.
