ML Concepts Interview Questions
This document covers common ML theory questions that appear in interviews, including expected answers and relevant context.
| Aspect | Classification | Regression |
|---|---|---|
| Output | Discrete class labels | Continuous values |
| Examples | Spam/not spam, cat/dog/bird | House price, temperature |
| Loss functions | Cross-entropy | MSE, MAE |
| Shared algorithms | Decision trees, neural networks can perform both | |
| Aspect | Parametric | Non-parametric |
|---|---|---|
| Parameters | Fixed count regardless of data size | Grows with data |
| Examples | Linear regression, logistic regression, naive Bayes | KNN, decision trees, SVM with RBF kernel |
| Assumptions | Distribution assumptions about data | Fewer assumptions |
| Training/Inference | Faster | Slower |
| Complexity | May underfit complex patterns | Risk of overfitting |
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous values | Probabilities (0-1) |
| Function | Linear: y = wx + b | Sigmoid: 1/(1 + e^(-z)) |
| Loss | MSE | Log loss (cross-entropy) |
| Assumptions | Normal errors | No normality assumption |
| Bounds | Unbounded | Bounded [0, 1] |
Decision trees evaluate all features and possible split points, selecting the split that maximizes information gain or minimizes impurity.
| Criterion | Use Case | Formula |
|---|---|---|
| Gini impurity | Classification | 1 - Sum(pi^2) |
| Entropy / information gain | Classification | Entropy = -Sum(pi * log2(pi)); gain = entropy before - weighted entropy after |
| Variance reduction | Regression | Variance before - weighted variance after |
The algorithm is greedy and does not guarantee globally optimal splits.
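As a quick illustration of the impurity criteria above, here is a minimal sketch (toy data, not tied to any particular library) that scores a candidate split by the weighted Gini impurity of its children:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature, labels, threshold):
    # Weighted impurity of the two child nodes produced by the split
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(split_score(x, y, threshold=5.0))  # 0.0 - a perfect split
```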
| Mechanism | Description |
|---|---|
| Bagging | Each tree trains on a bootstrap sample (random subset with replacement) |
| Feature randomization | Each split considers only a random subset of features |
| Ensemble averaging | Averaging many de-correlated trees reduces variance without increasing bias |
Individual trees overfit in different ways; averaging cancels noise. Trade-off: reduced interpretability compared to single trees.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel, independent | Sequential, each learns from previous errors |
| Data sampling | Bootstrap (with replacement) | Weighted based on previous errors |
| Goal | Reduce variance | Reduce bias |
| Examples | Random Forest | XGBoost, AdaBoost |
| Overfitting risk | Lower | Higher with too many rounds |
1. Start with an initial prediction (often the mean for regression)
2. Calculate residuals (errors) from current predictions
3. Fit a new weak learner to predict the residuals
4. Add the new learner's predictions (scaled by the learning rate) to the ensemble
5. Repeat steps 2-4
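A minimal sketch of this loop for regression, using shallow scikit-learn trees as the weak learners; the synthetic data and hyperparameter values below are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 100

# Step 1: initial prediction is the mean of the targets
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: residuals of the current ensemble
    residuals = y - prediction
    # Step 3: fit a shallow weak learner to the residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    # Step 4: add the scaled predictions to the ensemble
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("train MSE:", np.mean((y - prediction) ** 2))
```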
| Hyperparameter | Effect |
|---|---|
| Learning rate | Smaller values require more trees but improve generalization |
| Number of trees | More trees increase overfitting risk |
| Max depth | Shallower trees create simpler weak learners |
| Aspect | XGBoost | LightGBM |
|---|---|---|
| Tree growth | Level-wise (complete level before next) | Leaf-wise (grows leaf with largest loss reduction) |
| Speed | Generally slower | Generally faster, especially on large datasets |
| Categorical features | Requires encoding | Native handling |
| Memory | Standard | More efficient with histogram-based approach |
| Common features | Both support regularization, early stopping, GPU training | |
- Forward pass: Compute predictions by passing input through network
- Calculate loss: Measure prediction error
- Backward pass: Compute gradients using chain rule, starting from loss and working backwards through each layer
- Gradient formula: dL/dw = dL/doutput * doutput/dw
- Update weights: w_new = w_old - learning_rate * gradient
The chain rule enables computing all gradients in one backward pass.
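To make the chain rule concrete, here is a hand-rolled single training step for a tiny one-hidden-layer network on made-up data (the shapes and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))          # batch of 32 examples, 3 features
y = rng.normal(size=(32, 1))
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.01

# Forward pass
h_pre = X @ W1 + b1
h = np.maximum(0, h_pre)              # ReLU
y_hat = h @ W2 + b2
loss = np.mean((y_hat - y) ** 2)

# Backward pass (chain rule, starting from the loss)
d_y_hat = 2 * (y_hat - y) / len(y)    # dL/dy_hat for MSE
d_W2 = h.T @ d_y_hat                  # dL/dW2 = dL/dy_hat * dy_hat/dW2
d_b2 = d_y_hat.sum(axis=0)
d_h = d_y_hat @ W2.T
d_h_pre = d_h * (h_pre > 0)           # ReLU gradient is 1 for positive inputs
d_W1 = X.T @ d_h_pre
d_b1 = d_h_pre.sum(axis=0)

# Gradient descent update: w_new = w_old - lr * gradient
W1 -= lr * d_W1; b1 -= lr * d_b1
W2 -= lr * d_W2; b2 -= lr * d_b2
print("loss:", loss)
```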
In deep networks, gradients can become extremely small in early layers.
| Cause | Description |
|---|---|
| Sigmoid/tanh activation | Derivatives are at most 1 (sigmoid's maximum is 0.25), so repeated multiplication shrinks gradients |
| Deep networks | Many layers multiply small values |
| Symptom | Description |
|---|---|
| Early layers not learning | Gradients too small for meaningful updates |
| Training stalls | Loss stops decreasing |
| Solution | Mechanism |
|---|---|
| ReLU activation | Gradient is 1 for positive values |
| Batch normalization | Normalizes layer inputs |
| Residual connections | Gradient flows directly through skip connections |
| Better initialization | Xavier/He initialization |
| Function | Formula | Pros | Cons | Use Case |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^-x) | Bounded (0,1), interpretable | Vanishing gradients, not zero-centered | Output layer (binary) |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Zero-centered | Vanishing gradients | RNNs (historically) |
| ReLU | max(0, x) | No vanishing gradient, fast | Dead neurons, not zero-centered | Hidden layers (default) |
| Leaky ReLU | max(0.01x, x) | No dead neurons | Slightly more complex | Alternative to ReLU |
| Softmax | e^xi / Sum(e^xj) | Probabilities sum to 1 | Only for output | Multi-class output |
During training, randomly zero out a fraction of neurons. Different neurons are dropped each forward pass. At test time, scale activations to compensate (or use inverted dropout during training).
| Mechanism | Effect |
|---|---|
| Prevents co-adaptation | No neuron can rely on specific other neurons |
| Ensemble effect | Training exponentially many different sub-networks |
| Regularization | Noise prevents overfitting |
Typical rates: 0.2-0.5 for hidden layers, lower for input layers.
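A minimal numpy sketch of inverted dropout (the rate and shapes below are arbitrary); kept activations are scaled by 1/(1 - p) during training so inference needs no adjustment:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero out a fraction p of units and rescale the rest."""
    if not training or p == 0.0:
        return activations                 # inference: identity
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)  # scale so the expected activation is unchanged

h = np.ones((4, 6))
print(dropout(h, p=0.5))          # roughly half the entries are 0, the rest are 2.0
print(dropout(h, training=False)) # unchanged at inference
```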
Normalizes layer inputs: x_norm = (x - mu_batch) / sigma_batch
Includes learnable scale (gamma) and shift (beta) parameters. Applied after linear transformation, before activation.
| Benefit | Description |
|---|---|
| Faster training | Higher learning rates possible |
| Reduces internal covariate shift | Stabilizes distribution of inputs |
| Regularization effect | Batch statistics add noise |
| Reduces initialization sensitivity | Less dependent on initial weights |
At inference: Uses running mean/variance from training, not batch statistics.
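A simplified forward pass showing both modes; the momentum value is illustrative, and a real implementation would also backpropagate through gamma and beta:

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)      # learnable scale
        self.beta = np.zeros(num_features)      # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)        # batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var  # inference: running statistics
        x_norm = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

bn = BatchNorm1d(3)
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(16, 3))
print(bn.forward(x, training=True).mean(axis=0))   # approximately 0 per feature
print(bn.forward(x, training=False).mean(axis=0))  # uses running statistics instead
```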
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | Sum(abs(w)) | Sum(w^2) |
| Effect on weights | Pushes to exactly zero | Shrinks toward zero |
| Feature selection | Yes (sparse solutions) | No |
| Geometry | Diamond constraint | Circle constraint |
| Correlated features | Picks one arbitrarily | Spreads weight among them |
| Use case | Feature selection, interpretability | When all features might matter |
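A quick scikit-learn comparison on synthetic data with some irrelevant features; the exact coefficients depend on the data and the alpha values chosen, but Lasso's zeros illustrate the sparsity effect:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 2))  # irrelevant coefficients pushed to exactly 0
print("Ridge:", np.round(ridge.coef_, 2))  # irrelevant coefficients small but nonzero
```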
Stop training when validation loss stops improving. Monitor validation metric and stop after N epochs with no improvement.
| Benefit | Description |
|---|---|
| Limits model complexity | Acts as implicit regularization without adding an explicit penalty term |
| Prevents overfitting | Stops before model memorizes training noise |
| Computational savings | No wasted training epochs |
Early training: Model learns real patterns. Later training: Model starts memorizing noise.
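A generic patience-based loop; the toy validation-loss curve below stands in for a real train/evaluate cycle:

```python
def early_stopping_train(val_losses, patience=3):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # would also checkpoint the model here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}, best val loss {best_loss:.3f}")
                return best_loss
    return best_loss

# Toy validation-loss curve: improves, then starts rising as the model overfits
early_stopping_train([1.0, 0.7, 0.55, 0.50, 0.51, 0.53, 0.56, 0.60])
```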
Total Error = Bias^2 + Variance + Irreducible Noise
| Component | Description |
|---|---|
| Bias | Error from overly simple assumptions. Model consistently misses pattern. |
| Variance | Error from sensitivity to training data. Model changes significantly with different samples. |
| Simple Model | Complex Model |
|---|---|
| High Bias | Low Bias |
| Low Variance | High Variance |
| Underfitting | Overfitting |
Increasing model complexity decreases bias but increases variance. Goal: find the minimum total error.
| Metric | High Bias | High Variance |
|---|---|---|
| Training error | High | Low |
| Validation error | High | High |
| Gap between train/val | Small | Large |
| Condition | Solutions |
|---|---|
| High bias | More features, more complex model, less regularization, longer training |
| High variance | More data, simpler model, more regularization, dropout, early stopping |
| Metric | Description | Best For |
|---|---|---|
| ROC-AUC | True positive rate vs false positive rate | Balanced classes |
| PR-AUC | Precision vs recall | Imbalanced datasets, focus on minority class |
With imbalanced data, ROC-AUC can appear high even with poor minority class performance. PR-AUC is more sensitive to improvements on the minority class.
| Metric | Formula | Interpretation | Optimize When |
|---|---|---|---|
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct | False positives are costly (spam filter) |
| Recall | TP / (TP + FN) | Of actual positives, how many were found | False negatives are costly (cancer detection) |
| F1 | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean balancing both | Both matter equally |
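The formulas above computed directly from hypothetical confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # 0.80, 0.67, 0.73
```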
- Split data into k folds
- Train on k-1 folds, validate on 1
- Rotate and repeat k times
- Average results
| Benefit | Description |
|---|---|
| Reliable estimate | More stable than single split |
| Full data usage | Every observation is used for training and serves as validation exactly once |
| Reduced variance | Performance estimate has lower variance |
Typical k: 5 or 10. Stratified k-fold maintains class distribution in each fold.
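A typical scikit-learn sketch; the synthetic dataset and the logistic regression model are placeholders for whatever is being evaluated:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratio per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores, scores.mean())  # one F1 score per fold, then the average
```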
| Strategy | When to Use | Trade-offs |
|---|---|---|
| Drop rows | Few missing, truly random | Simple but loses data |
| Drop columns | More than 50% missing | Loses entire feature |
| Mean/median imputation | Numerical, random missing | Simple but reduces variance |
| Mode imputation | Categorical | Simple but may add bias |
| Model-based (KNN, regression) | Missing depends on other features | Captures relationships but can overfit |
| Missing indicator | Missingness itself is informative | Preserves signal, adds features |
Consider whether data is missing randomly or with a pattern. If users with high income skip the income field, that is not random.
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| One-hot encoding | Low cardinality, nominal | No ordering assumed | High dimensions |
| Label encoding | Ordinal, tree models | Compact | Implies ordering |
| Target encoding | High cardinality | Compact, uses target | Leakage risk |
| Frequency encoding | Many categories | No leakage | Loses information |
| Embedding | Very high cardinality | Learns representations | Requires neural network |
Fit encoders on the training split only (i.e., after the train/test split) to prevent leakage; this matters most for target encoding.
When to scale:
- Distance-based algorithms (KNN, SVM, k-means)
- Gradient descent optimization (neural networks, logistic regression)
- Regularized models (Lasso, Ridge)
When not needed:
- Tree-based models (random forest, XGBoost) - splits are scale-invariant
| Method | Formula | Use Case |
|---|---|---|
| StandardScaler | (x - mean) / std | Assumes normal distribution |
| MinMaxScaler | (x - min) / (max - min) | Scales to [0, 1] |
| RobustScaler | Uses median and IQR | Robust to outliers |
Always fit scaler on training data only, then transform both train and test.
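A minimal scikit-learn sketch of that rule (the data and split ratio are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=10, scale=3, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the statistics on training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test set
```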
| Category | Techniques |
|---|---|
| Resampling | Oversample minority (SMOTE, ADASYN), undersample majority (Random, Tomek links), combination (SMOTE + Tomek) |
| Algorithm-level | Class weights (penalize minority misclassification more), cost-sensitive learning |
| Threshold adjustment | Default 0.5 may not be optimal; tune based on precision-recall trade-off |
| Evaluation | Use precision, recall, F1, PR-AUC instead of accuracy |
| Ensemble methods | BalancedRandomForest, EasyEnsemble |
- For each minority sample, find k nearest minority neighbors
- Randomly select one neighbor
- Create synthetic sample along the line between the two points
- x_synthetic = x_original + rand(0,1) * (x_neighbor - x_original)
| Advantage | Limitation |
|---|---|
| Creates new examples rather than duplicating | Can create noisy samples if minority class is scattered |
| Helps classifier learn decision boundaries | Does not consider majority class - can create overlap |
Variants: SMOTE-ENN, SMOTE-Tomek, ADASYN.
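A bare-bones sketch of the interpolation step described above; a production implementation such as imbalanced-learn's SMOTE adds efficient neighbour search, sampling strategies, and edge-case handling:

```python
import numpy as np

def smote_sample(minority_X, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by interpolating toward nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority_X))
        x = minority_X[i]
        # k nearest minority neighbours of x (index 0 is x itself, so skip it)
        dists = np.linalg.norm(minority_X - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = minority_X[rng.choice(neighbours)]
        # x_synthetic = x + rand(0, 1) * (x_neighbour - x)
        synthetic.append(x + rng.random() * (neighbour - x))
    return np.array(synthetic)

minority = np.random.default_rng(0).normal(size=(20, 2))
print(smote_sample(minority, k=5, n_new=3, rng=np.random.default_rng(1)))
```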
- Center the data (subtract mean)
- Compute covariance matrix
- Find eigenvectors (principal components) and eigenvalues
- Project data onto top k eigenvectors
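The same four steps in numpy on toy data; in practice sklearn.decomposition.PCA (which uses the SVD) is the usual choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated toy features

X_centered = X - X.mean(axis=0)                  # 1. center the data
cov = np.cov(X_centered, rowvar=False)           # 2. covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # 3. eigenvectors and eigenvalues
order = np.argsort(eigenvalues)[::-1]            #    sort by explained variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
X_reduced = X_centered @ eigenvectors[:, :k]     # 4. project onto the top k components
explained = eigenvalues[:k] / eigenvalues.sum()
print("explained variance ratio:", np.round(explained, 3))
```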
| Property | Description |
|---|---|
| Components are orthogonal | Capture independent sources of variation |
| First component | Captures the most variance |
| Eigenvalue | Indicates how much variance that component explains |
| Use Case | Description |
|---|---|
| Visualization | Reduce 100 features to 2-3 for plotting |
| Remove multicollinearity | Before regression |
| Noise reduction | Small components are often noise |
| Speed up training | Fewer features |
| Limitation | Description |
|---|---|
| Linear only | Cannot capture non-linear relationships |
| Interpretability | Principal components are not original features |
| Variance assumption | Assumes variance equals importance |
| Aspect | PCA | t-SNE |
|---|---|---|
| Goal | Maximize variance | Preserve local structure |
| Type | Linear | Non-linear |
| Deterministic | Yes | No (stochastic) |
| Scalability | Fast, scales well | Slow, O(n^2) |
| Inverse transform | Yes | No |
| Use case | Preprocessing, speed | Visualization |
t-SNE focuses on keeping similar points close, not preserving global structure.
1. Initialize k centroids (randomly or with k-means++)
2. Assign each point to its nearest centroid
3. Update each centroid to the mean of its assigned points
4. Repeat steps 2-3 until convergence
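A compact numpy version of the loop; it uses plain random initialization for brevity, whereas k-means++ would choose better-spread starting centroids:

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialize
    for _ in range(n_iters):
        # 2. assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # 4. stop at convergence
            break
        centroids = new_centroids
    return centroids, labels

# Three well-separated blobs around (0, 0), (5, 5), and (10, 10)
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
centroids, labels = kmeans(X, k=3)
print(np.round(centroids, 2))
```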
| Choosing k | Method |
|---|---|
| Elbow method | Plot inertia vs k, look for elbow |
| Silhouette score | Measures cluster cohesion vs separation |
| Domain knowledge | Based on problem requirements |
| Limitation | Description |
|---|---|
| Assumes spherical clusters | Does not handle irregular shapes well |
| Sensitive to initialization | Use k-means++ |
| Must specify k upfront | Cannot discover number of clusters |
| Sensitive to outliers | Can skew centroid positions |
| Use Hierarchical When | Use K-means When |
|---|---|
| Number of clusters unknown | k is known or estimable |
| Exploring different granularities (dendrogram) | Large datasets (hierarchical is O(n^3)) |
| Clusters may have irregular shapes | Spherical clusters expected |
| Deterministic results required | Speed matters |
Attention computes weighted sum of values based on query-key similarity:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
| Component | Description |
|---|---|
| Query (Q) | What to look for |
| Key (K) | What to match against |
| Value (V) | What to retrieve |
| Scaling (sqrt(d_k)) | Prevents softmax saturation with large dimensions |
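The formula in a few lines of numpy, for a single head with random toy tensors:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 16))  # one value vector per key
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 16) (4, 6)
```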
| Type | Description |
|---|---|
| Self-attention | Q, K, V all come from same sequence |
| Multi-head | Multiple attention computations in parallel, captures different relationships |
| Benefit | Description |
|---|---|
| Long-range dependencies | Every position attends directly to every other position, unlike RNNs |
| Parallelizable | All positions are processed at once, unlike sequential RNNs |
| Interpretable | Attention weights show what model focuses on |
| Aspect | Word2Vec | BERT |
|---|---|---|
| Context | Static (one embedding per word) | Contextual (varies by sentence) |
| Architecture | Shallow (1-2 layers) | Deep transformer (12-24 layers) |
| Training | Skip-gram or CBOW | Masked language modeling |
| Same word handling | Same embedding for "river bank" and "bank account" | Different based on context |
| Compute | Fast | Heavy |
BERT produces contextual embeddings where the same word gets different representations based on surrounding words.
Two networks:
- Generator (G): Creates fake samples from random noise
- Discriminator (D): Distinguishes real from fake
Training (minimax game):
- D tries to maximize: correctly classify real vs fake
- G tries to minimize: fool D into thinking fakes are real
min_G max_D V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
| Challenge | Description |
|---|---|
| Mode collapse | G produces limited variety |
| Training instability | Hard to balance G and D |
| No convergence guarantee | May not reach equilibrium |
Variants: DCGAN (convolutional), WGAN (Wasserstein loss), StyleGAN (style-based).
| Drift Type | Description |
|---|---|
| Data drift | Input distribution changes; features look different than training |
| Concept drift | Relationship between inputs and outputs changes; what used to predict fraud no longer does |
| Label drift | Target distribution changes; fraud rate goes from 1% to 5% |
| Detection Method | Description |
|---|---|
| Monitor feature distributions | KL divergence, PSI |
| Track prediction distribution | Are predictions changing? |
| Monitor performance metrics | Is accuracy dropping? |
| Statistical tests | Compare recent data to training data |
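A simple Population Stability Index (PSI) sketch for one numeric feature; the bin count and the 0.2 alert threshold are common rules of thumb rather than fixed constants:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and recent production data."""
    # Bin edges come from the training (expected) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

rng = np.random.default_rng(0)
training = rng.normal(loc=0.0, size=10_000)
production = rng.normal(loc=0.5, size=10_000)    # the feature has drifted
print(f"PSI = {psi(training, production):.3f}")  # > 0.2 is often treated as significant drift
```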
| Response | Description |
|---|---|
| Retrain on recent data | Update model with new patterns |
| Add new features | Capture the change |
| Triggered retraining | Retrain when drift exceeds threshold |
| Online learning | Continuous adaptation |
| Factor | Batch | Online |
|---|---|---|
| Latency requirement | Can wait (minutes/hours) | Needed immediately (ms) |
| Volume | Many predictions at once | One at a time |
| Infrastructure | Simpler, scheduled jobs | API, always-on service |
| Feature freshness | Point-in-time features | Real-time features |
| Cost | Cheaper for large volume | Pay per request |
| Use Case | Prediction Mode |
|---|---|
| Recommendation precomputation | Batch |
| Churn prediction | Batch |
| Report generation | Batch |
| Fraud detection | Online |
| Search ranking | Online |
| Real-time bidding | Online |
Process:
- Define metric (conversion, engagement, revenue)
- Randomly split users into control (old model) and treatment (new model)
- Collect data for statistical significance
- Analyze results, make decision
| Pitfall | Description |
|---|---|
| Peeking | Checking results before reaching sample size |
| Multiple testing | Testing many variants inflates false positives |
| Network effects | Users influence each other (use cluster randomization) |
| Novelty/primacy effects | Short-term behavior differs from long-term |
| Simpson's paradox | Aggregate results hide segment differences |
Sample size calculation based on baseline rate, minimum detectable effect, and power (usually 80%).
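A standard two-proportion sample-size sketch using the normal approximation; the baseline rate, minimum detectable effect, significance level, and power below are illustrative:

```python
from scipy.stats import norm

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde ** 2) + 1

# Example: 5% baseline conversion rate, detect an absolute +1% lift
print(sample_size_per_group(baseline=0.05, mde=0.01))  # required users per group (~8,000)
```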
As dimensions increase, data becomes sparse. In 1D, 10 points might cover the space. In 100D, astronomical amounts of data are required. Distances become meaningless; everything is far from everything. KNN and other distance-based methods break down.
| Type | Models | Learns |
|---|---|---|
| Generative | Naive Bayes, GANs | Joint distribution P(X, Y), i.e. how the data is generated |
| Discriminative | Logistic regression, SVM | Conditional P(Y \| X), i.e. the decision boundary directly |
Using a model trained on one task as starting point for another. Freeze early layers (general features), fine-tune later layers (task-specific).
| Term | Definition |
|---|---|
| Epoch | One pass through entire dataset |
| Batch | Subset of data for one forward/backward pass |
| Iteration | One batch processed |
| Benefit | Description |
|---|---|
| Faster convergence | Similar gradient magnitudes |
| Prevents saturating activations | Keeps values in active range |
| Smoother optimization landscape | Easier to navigate |
| Allows higher learning rates | More aggressive updates possible |
Using information during training that will not be available at prediction time. Model learns to cheat and fails in production.
| Prevention | Method |
|---|---|
| Time-based splits | Never train on the future |
| Fit on training only | Scalers and encoders fit on training data, then transform test |
| Feature audit | For each feature, verify it can be computed before seeing the outcome |
Use SVM when:
- Decision boundary is non-linear (with kernels)
- High-dimensional sparse data (text)
- Margin maximization is important
- Small to medium datasets (SVMs do not scale well)
Compute dot products in high-dimensional space without explicitly transforming data. Enables non-linear decision boundaries with linear algorithms.
| Benefit | Mechanism |
|---|---|
| Reduce variance | Bagging |
| Reduce bias | Boosting |
| Improve robustness | Multiple models |
| Combine strengths | Different model types |
| Type | Description | Example |
|---|---|---|
| Parametric | Assumes a data distribution | t-test assumes normality |
| Non-parametric | No distribution assumption | Mann-Whitney |
Use non-parametric when distribution is unknown or violated.