ML Concepts Interview Questions
This document covers common ML theory questions that appear in interviews, including expected answers and relevant context.
| Aspect | Classification | Regression |
|---|---|---|
| Output | Discrete class labels | Continuous values |
| Examples | Spam/not spam, cat/dog/bird | House price, temperature |
| Loss functions | Cross-entropy | MSE, MAE |
| Shared algorithms | Decision trees, neural networks can perform both | |
| Aspect | Parametric | Non-parametric |
|---|---|---|
| Parameters | Fixed count regardless of data size | Grows with data |
| Examples | Linear regression, logistic regression, naive Bayes | KNN, decision trees, SVM with RBF kernel |
| Assumptions | Distribution assumptions about data | Fewer assumptions |
| Training/Inference | Faster | Slower |
| Complexity | May underfit complex patterns | Risk of overfitting |
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous values | Probabilities (0-1) |
| Function | Linear: y = wx + b | Sigmoid: 1/(1 + e^(-z)) |
| Loss | MSE | Log loss (cross-entropy) |
| Assumptions | Normal errors | No normality assumption |
| Bounds | Unbounded | Bounded [0, 1] |
Decision trees evaluate all features and possible split points, selecting the split that maximizes information gain or minimizes impurity.
| Criterion | Use Case | Formula |
|---|---|---|
| Gini impurity | Classification | 1 - Sum(pi^2) |
| Entropy / information gain | Classification | Entropy = -Sum(pi * log2(pi)); gain = entropy before - weighted entropy after |
| Variance reduction | Regression | Variance before - weighted variance after |
The algorithm is greedy and does not guarantee globally optimal splits.
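As a quick illustration of the impurity criteria above, here is a minimal sketch (toy data, not tied to any particular library) that scores a candidate split by the weighted Gini impurity of its children:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature, labels, threshold):
    # Weighted impurity of the two child nodes produced by the split
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(split_score(x, y, threshold=5.0))  # 0.0 - a perfect split
```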
| Mechanism | Description |
|---|---|
| Bagging | Each tree trains on a bootstrap sample (random subset with replacement) |
| Feature randomization | Each split considers only a random subset of features |
| Ensemble averaging | Averaging many de-correlated trees reduces variance without increasing bias |
Individual trees overfit in different ways; averaging cancels noise. Trade-off: reduced interpretability compared to single trees.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel, independent | Sequential, each learns from previous errors |
| Data sampling | Bootstrap (with replacement) | Weighted based on previous errors |
| Goal | Reduce variance | Reduce bias |
| Examples | Random Forest | XGBoost, AdaBoost |
| Overfitting risk | Lower | Higher with too many rounds |
1. Start with an initial prediction (often the mean for regression)
2. Calculate residuals (errors) from current predictions
3. Fit a new weak learner to predict the residuals
4. Add the new learner's predictions (scaled by the learning rate) to the ensemble
5. Repeat steps 2-4
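A minimal sketch of this loop for regression, using shallow scikit-learn trees as the weak learners; the synthetic data and hyperparameter values below are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 100

# Step 1: initial prediction is the mean of the targets
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: residuals of the current ensemble
    residuals = y - prediction
    # Step 3: fit a shallow weak learner to the residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    # Step 4: add the scaled predictions to the ensemble
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("train MSE:", np.mean((y - prediction) ** 2))
```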
| Hyperparameter | Effect |
|---|---|
| Learning rate | Smaller values require more trees but improve generalization |
| Number of trees | More trees increase overfitting risk |
| Max depth | Shallower trees create simpler weak learners |
| Aspect | XGBoost | LightGBM |
|---|---|---|
| Tree growth | Level-wise (complete level before next) | Leaf-wise (grows leaf with largest loss reduction) |
| Speed | Generally slower | Generally faster, especially on large datasets |
| Categorical features | Requires encoding | Native handling |
| Memory | Standard | More efficient with histogram-based approach |
| Common features | Both support regularization, early stopping, GPU training | |
- Forward pass: Compute predictions by passing input through network
- Calculate loss: Measure prediction error
- Backward pass: Compute gradients using chain rule, starting from loss and working backwards through each layer
- Gradient formula: dL/dw = dL/doutput * doutput/dw
- Update weights: w_new = w_old - learning_rate * gradient
The chain rule enables computing all gradients in one backward pass.
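To make the chain rule concrete, here is a hand-rolled single training step for a tiny one-hidden-layer network on made-up data (the shapes and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))          # batch of 32 examples, 3 features
y = rng.normal(size=(32, 1))
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.01

# Forward pass
h_pre = X @ W1 + b1
h = np.maximum(0, h_pre)              # ReLU
y_hat = h @ W2 + b2
loss = np.mean((y_hat - y) ** 2)

# Backward pass (chain rule, starting from the loss)
d_y_hat = 2 * (y_hat - y) / len(y)    # dL/dy_hat for MSE
d_W2 = h.T @ d_y_hat                  # dL/dW2 = dL/dy_hat * dy_hat/dW2
d_b2 = d_y_hat.sum(axis=0)
d_h = d_y_hat @ W2.T
d_h_pre = d_h * (h_pre > 0)           # ReLU gradient is 1 for positive inputs
d_W1 = X.T @ d_h_pre
d_b1 = d_h_pre.sum(axis=0)

# Gradient descent update: w_new = w_old - lr * gradient
W1 -= lr * d_W1; b1 -= lr * d_b1
W2 -= lr * d_W2; b2 -= lr * d_b2
print("loss:", loss)
```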
In deep networks, gradients can become extremely small in early layers.
| Cause | Description |
|---|---|
| Sigmoid/tanh activation | Derivatives are at most 1 (sigmoid's maximum is 0.25), so repeated multiplication shrinks gradients |
| Deep networks | Many layers multiply small values |
| Symptom | Description |
|---|---|
| Early layers not learning | Gradients too small for meaningful updates |
| Training stalls | Loss stops decreasing |
| Solution | Mechanism |
|---|---|
| ReLU activation | Gradient is 1 for positive values |
| Batch normalization | Normalizes layer inputs |
| Residual connections | Gradient flows directly through skip connections |
| Better initialization | Xavier/He initialization |
| Function | Formula | Pros | Cons | Use Case |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^-x) | Bounded (0,1), interpretable | Vanishing gradients, not zero-centered | Output layer (binary) |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Zero-centered | Vanishing gradients | RNNs (historically) |
| ReLU | max(0, x) | No vanishing gradient, fast | Dead neurons, not zero-centered | Hidden layers (default) |
| Leaky ReLU | max(0.01x, x) | No dead neurons | Slightly more complex | Alternative to ReLU |
| Softmax | e^xi / Sum(e^xj) | Probabilities sum to 1 | Only for output | Multi-class output |
During training, randomly zero out a fraction of neurons. Different neurons are dropped each forward pass. At test time, scale activations to compensate (or use inverted dropout during training).
| Mechanism | Effect |
|---|---|
| Prevents co-adaptation | No neuron can rely on specific other neurons |
| Ensemble effect | Training exponentially many different sub-networks |
| Regularization | Noise prevents overfitting |
Typical rates: 0.2-0.5 for hidden layers, lower for input layers.
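A minimal numpy sketch of inverted dropout (the rate and shapes below are arbitrary); kept activations are scaled by 1/(1 - p) during training so inference needs no adjustment:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero out a fraction p of units and rescale the rest."""
    if not training or p == 0.0:
        return activations                 # inference: identity
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)  # scale so the expected activation is unchanged

h = np.ones((4, 6))
print(dropout(h, p=0.5))          # roughly half the entries are 0, the rest are 2.0
print(dropout(h, training=False)) # unchanged at inference
```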
Normalizes layer inputs: x_norm = (x - mu_batch) / sigma_batch
Includes learnable scale (gamma) and shift (beta) parameters. Applied after linear transformation, before activation.
| Benefit | Description |
|---|---|
| Faster training | Higher learning rates possible |
| Reduces internal covariate shift | Stabilizes distribution of inputs |
| Regularization effect | Batch statistics add noise |
| Reduces initialization sensitivity | Less dependent on initial weights |
At inference: Uses running mean/variance from training, not batch statistics.
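A simplified forward pass showing both modes; the momentum value is illustrative, and a real implementation would also backpropagate through gamma and beta:

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)      # learnable scale
        self.beta = np.zeros(num_features)      # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)        # batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var  # inference: running statistics
        x_norm = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

bn = BatchNorm1d(3)
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(16, 3))
print(bn.forward(x, training=True).mean(axis=0))   # approximately 0 per feature
print(bn.forward(x, training=False).mean(axis=0))  # uses running statistics instead
```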
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | Sum(abs(w)) | Sum(w^2) |
| Effect on weights | Pushes to exactly zero | Shrinks toward zero |
| Feature selection | Yes (sparse solutions) | No |
| Geometry | Diamond constraint | Circle constraint |
| Correlated features | Picks one arbitrarily | Spreads weight among them |
| Use case | Feature selection, interpretability | When all features might matter |
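A quick scikit-learn comparison on synthetic data with some irrelevant features; the exact coefficients depend on the data and the alpha values chosen, but Lasso's zeros illustrate the sparsity effect:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 2))  # irrelevant coefficients pushed to exactly 0
print("Ridge:", np.round(ridge.coef_, 2))  # irrelevant coefficients small but nonzero
```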
Stop training when validation loss stops improving. Monitor validation metric and stop after N epochs with no improvement.
| Benefit | Description |
|---|---|
| Limits model complexity | Acts as implicit regularization without adding an explicit penalty term |
| Prevents overfitting | Stops before model memorizes training noise |
| Computational savings | No wasted training epochs |
Early training: Model learns real patterns. Later training: Model starts memorizing noise.
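A generic patience-based loop; the toy validation-loss curve below stands in for a real train/evaluate cycle:

```python
def early_stopping_train(val_losses, patience=3):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # would also checkpoint the model here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}, best val loss {best_loss:.3f}")
                return best_loss
    return best_loss

# Toy validation-loss curve: improves, then starts rising as the model overfits
early_stopping_train([1.0, 0.7, 0.55, 0.50, 0.51, 0.53, 0.56, 0.60])
```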
Total Error = Bias^2 + Variance + Irreducible Noise
| Component | Description |
|---|---|
| Bias | Error from overly simple assumptions. Model consistently misses pattern. |
| Variance | Error from sensitivity to training data. Model changes significantly with different samples. |
| Simple Model | Complex Model |
|---|---|
| High Bias | Low Bias |
| Low Variance | High Variance |
| Underfitting | Overfitting |
Increasing model complexity decreases bias but increases variance. Goal: find the minimum total error.
| Metric | High Bias | High Variance |
|---|---|---|
| Training error | High | Low |
| Validation error | High | High |
| Gap between train/val | Small | Large |
| Condition | Solutions |
|---|---|
| High bias | More features, more complex model, less regularization, longer training |
| High variance | More data, simpler model, more regularization, dropout, early stopping |
| Metric | Description | Best For |
|---|---|---|
| ROC-AUC | True positive rate vs false positive rate | Balanced classes |
| PR-AUC | Precision vs recall | Imbalanced datasets, focus on minority class |
With imbalanced data, ROC-AUC can appear high even with poor minority class performance. PR-AUC is more sensitive to improvements on the minority class.
| Metric | Formula | Interpretation | Optimize When |
|---|---|---|---|
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct | False positives are costly (spam filter) |
| Recall | TP / (TP + FN) | Of actual positives, how many were found | False negatives are costly (cancer detection) |
| F1 | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean balancing both | Both matter equally |
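The formulas above computed directly from hypothetical confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # 0.80, 0.67, 0.73
```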
- Split data into k folds
- Train on k-1 folds, validate on 1
- Rotate and repeat k times
- Average results
| Benefit | Description |
|---|---|
| Reliable estimate | More stable than single split |
| Full data usage | Every observation is used for training and serves as validation exactly once |
| Reduced variance | Performance estimate has lower variance |
Typical k: 5 or 10. Stratified k-fold maintains class distribution in each fold.
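A typical scikit-learn sketch; the synthetic dataset and the logistic regression model are placeholders for whatever is being evaluated:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratio per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores, scores.mean())  # one F1 score per fold, then the average
```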
| Strategy | When to Use | Trade-offs |
|---|---|---|
| Drop rows | Few missing, truly random | Simple but loses data |
| Drop columns | More than 50% missing | Loses entire feature |
| Mean/median imputation | Numerical, random missing | Simple but reduces variance |
| Mode imputation | Categorical | Simple but may add bias |
| Model-based (KNN, regression) | Missing depends on other features | Captures relationships but can overfit |
| Missing indicator | Missingness itself is informative | Preserves signal, adds features |
Consider whether data is missing randomly or with a pattern. If users with high income skip the income field, that is not random.
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| One-hot encoding | Low cardinality, nominal | No ordering assumed | High dimensions |
| Label encoding | Ordinal, tree models | Compact | Implies ordering |
| Target encoding | High cardinality | Compact, uses target | Leakage risk |
| Frequency encoding | Many categories | No leakage | Loses information |
| Embedding | Very high cardinality | Learns representations | Requires neural network |
Fit encoders on the training split only (i.e., after the train/test split) to prevent leakage; this matters most for target encoding.
When to scale:
- Distance-based algorithms (KNN, SVM, k-means)
- Gradient descent optimization (neural networks, logistic regression)
- Regularized models (Lasso, Ridge)
When not needed:
- Tree-based models (random forest, XGBoost) - splits are scale-invariant
| Method | Formula | Use Case |
|---|---|---|
| StandardScaler | (x - mean) / std | Assumes normal distribution |
| MinMaxScaler | (x - min) / (max - min) | Scales to [0, 1] |
| RobustScaler | Uses median and IQR | Robust to outliers |
Always fit scaler on training data only, then transform both train and test.
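A minimal scikit-learn sketch of that rule (the data and split ratio are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=10, scale=3, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the statistics on training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test set
```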
| Category | Techniques |
|---|---|
| Resampling | Oversample minority (SMOTE, ADASYN), undersample majority (Random, Tomek links), combination (SMOTE + Tomek) |
| Algorithm-level | Class weights (penalize minority misclassification more), cost-sensitive learning |
| Threshold adjustment | Default 0.5 may not be optimal; tune based on precision-recall trade-off |
| Evaluation | Use precision, recall, F1, PR-AUC instead of accuracy |
| Ensemble methods | BalancedRandomForest, EasyEnsemble |
- For each minority sample, find k nearest minority neighbors
- Randomly select one neighbor
- Create synthetic sample along the line between the two points
- x_synthetic = x_original + rand(0,1) * (x_neighbor - x_original)
| Advantage | Limitation |
|---|---|
| Creates new examples rather than duplicating | Can create noisy samples if minority class is scattered |
| Helps classifier learn decision boundaries | Does not consider majority class - can create overlap |
Variants: SMOTE-ENN, SMOTE-Tomek, ADASYN.
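A bare-bones sketch of the interpolation step described above; a production implementation such as imbalanced-learn's SMOTE adds efficient neighbour search, sampling strategies, and edge-case handling:

```python
import numpy as np

def smote_sample(minority_X, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by interpolating toward nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority_X))
        x = minority_X[i]
        # k nearest minority neighbours of x (index 0 is x itself, so skip it)
        dists = np.linalg.norm(minority_X - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = minority_X[rng.choice(neighbours)]
        # x_synthetic = x + rand(0, 1) * (x_neighbour - x)
        synthetic.append(x + rng.random() * (neighbour - x))
    return np.array(synthetic)

minority = np.random.default_rng(0).normal(size=(20, 2))
print(smote_sample(minority, k=5, n_new=3, rng=np.random.default_rng(1)))
```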
- Center the data (subtract mean)
- Compute covariance matrix
- Find eigenvectors (principal components) and eigenvalues
- Project data onto top k eigenvectors
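The same four steps in numpy on toy data; in practice sklearn.decomposition.PCA (which uses the SVD) is the usual choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated toy features

X_centered = X - X.mean(axis=0)                  # 1. center the data
cov = np.cov(X_centered, rowvar=False)           # 2. covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # 3. eigenvectors and eigenvalues
order = np.argsort(eigenvalues)[::-1]            #    sort by explained variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
X_reduced = X_centered @ eigenvectors[:, :k]     # 4. project onto the top k components
explained = eigenvalues[:k] / eigenvalues.sum()
print("explained variance ratio:", np.round(explained, 3))
```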
| Property | Description |
|---|---|
| Components are orthogonal | Capture independent sources of variation |
| First component | Captures the most variance |
| Eigenvalue | Indicates how much variance that component explains |
| Use Case | Description |
|---|---|
| Visualization | Reduce 100 features to 2-3 for plotting |
| Remove multicollinearity | Before regression |
| Noise reduction | Small components are often noise |
| Speed up training | Fewer features |
| Limitation | Description |
|---|---|
| Linear only | Cannot capture non-linear relationships |
| Interpretability | Principal components are not original features |
| Variance assumption | Assumes variance equals importance |
| Aspect | PCA | t-SNE |
|---|---|---|
| Goal | Maximize variance | Preserve local structure |
| Type | Linear | Non-linear |
| Deterministic | Yes | No (stochastic) |
| Scalability | Fast, scales well | Slow, O(n^2) |
| Inverse transform | Yes | No |
| Use case | Preprocessing, speed | Visualization |
t-SNE focuses on keeping similar points close, not preserving global structure.
1. Initialize k centroids (randomly or with k-means++)
2. Assign each point to its nearest centroid
3. Update each centroid to the mean of its assigned points
4. Repeat steps 2-3 until convergence
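A compact numpy version of the loop; it uses plain random initialization for brevity, whereas k-means++ would choose better-spread starting centroids:

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialize
    for _ in range(n_iters):
        # 2. assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # 4. stop at convergence
            break
        centroids = new_centroids
    return centroids, labels

# Three well-separated blobs around (0, 0), (5, 5), and (10, 10)
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
centroids, labels = kmeans(X, k=3)
print(np.round(centroids, 2))
```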
| Choosing k | Method |
|---|---|
| Elbow method | Plot inertia vs k, look for elbow |
| Silhouette score | Measures cluster cohesion vs separation |
| Domain knowledge | Based on problem requirements |
| Limitation | Description |
|---|---|
| Assumes spherical clusters | Does not handle irregular shapes well |
| Sensitive to initialization | Use k-means++ |
| Must specify k upfront | Cannot discover number of clusters |
| Sensitive to outliers | Can skew centroid positions |
| Use Hierarchical When | Use K-means When |
|---|---|
| Number of clusters unknown | k is known or estimable |
| Exploring different granularities (dendrogram) | Large datasets (hierarchical is O(n^3)) |
| Clusters may have irregular shapes | Spherical clusters expected |
| Deterministic results required | Speed matters |
Attention computes weighted sum of values based on query-key similarity:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
| Component | Description |
|---|---|
| Query (Q) | What to look for |
| Key (K) | What to match against |
| Value (V) | What to retrieve |
| Scaling (sqrt(d_k)) | Prevents softmax saturation with large dimensions |
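The formula in a few lines of numpy, for a single head with random toy tensors:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 16))  # one value vector per key
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 16) (4, 6)
```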
| Type | Description |
|---|---|
| Self-attention | Q, K, V all come from same sequence |
| Multi-head | Multiple attention computations in parallel, captures different relationships |
| Benefit | Description |
|---|---|
| Long-range dependencies | Every position attends directly to every other position, unlike RNNs |
| Parallelizable | All positions are processed at once, unlike sequential RNNs |
| Interpretable | Attention weights show what model focuses on |
| Aspect | Word2Vec | BERT |
|---|---|---|
| Context | Static (one embedding per word) | Contextual (varies by sentence) |
| Architecture | Shallow (1-2 layers) | Deep transformer (12-24 layers) |
| Training | Skip-gram or CBOW | Masked language modeling |
| Same word handling | Same embedding for "river bank" and "bank account" | Different based on context |
| Compute | Fast | Heavy |
BERT produces contextual embeddings where the same word gets different representations based on surrounding words.
Two networks:
- Generator (G): Creates fake samples from random noise
- Discriminator (D): Distinguishes real from fake
Training (minimax game):
- D tries to maximize: correctly classify real vs fake
- G tries to minimize: fool D into thinking fakes are real
min_G max_D V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
| Challenge | Description |
|---|---|
| Mode collapse | G produces limited variety |
| Training instability | Hard to balance G and D |
| No convergence guarantee | May not reach equilibrium |
Variants: DCGAN (convolutional), WGAN (Wasserstein loss), StyleGAN (style-based).
| Drift Type | Description |
|---|---|
| Data drift | Input distribution changes; features look different than training |
| Concept drift | Relationship between inputs and outputs changes; what used to predict fraud no longer does |
| Label drift | Target distribution changes; fraud rate goes from 1% to 5% |
| Detection Method | Description |
|---|---|
| Monitor feature distributions | KL divergence, PSI |
| Track prediction distribution | Are predictions changing? |
| Monitor performance metrics | Is accuracy dropping? |
| Statistical tests | Compare recent data to training data |
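A simple Population Stability Index (PSI) sketch for one numeric feature; the bin count and the 0.2 alert threshold are common rules of thumb rather than fixed constants:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and recent production data."""
    # Bin edges come from the training (expected) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

rng = np.random.default_rng(0)
training = rng.normal(loc=0.0, size=10_000)
production = rng.normal(loc=0.5, size=10_000)    # the feature has drifted
print(f"PSI = {psi(training, production):.3f}")  # > 0.2 is often treated as significant drift
```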
| Response | Description |
|---|---|
| Retrain on recent data | Update model with new patterns |
| Add new features | Capture the change |
| Triggered retraining | Retrain when drift exceeds threshold |
| Online learning | Continuous adaptation |
| Factor | Batch | Online |
|---|---|---|
| Latency requirement | Can wait (minutes/hours) | Needed immediately (ms) |
| Volume | Many predictions at once | One at a time |
| Infrastructure | Simpler, scheduled jobs | API, always-on service |
| Feature freshness | Point-in-time features | Real-time features |
| Cost | Cheaper for large volume | Pay per request |
| Use Case | Prediction Mode |
|---|---|
| Recommendation precomputation | Batch |
| Churn prediction | Batch |
| Report generation | Batch |
| Fraud detection | Online |
| Search ranking | Online |
| Real-time bidding | Online |
Process:
- Define metric (conversion, engagement, revenue)
- Randomly split users into control (old model) and treatment (new model)
- Collect data for statistical significance
- Analyze results, make decision
| Pitfall | Description |
|---|---|
| Peeking | Checking results before reaching sample size |
| Multiple testing | Testing many variants inflates false positives |
| Network effects | Users influence each other (use cluster randomization) |
| Novelty/primacy effects | Short-term behavior differs from long-term |
| Simpson's paradox | Aggregate results hide segment differences |
Sample size calculation based on baseline rate, minimum detectable effect, and power (usually 80%).
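A standard two-proportion sample-size sketch using the normal approximation; the baseline rate, minimum detectable effect, significance level, and power below are illustrative:

```python
from scipy.stats import norm

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde ** 2) + 1

# Example: 5% baseline conversion rate, detect an absolute +1% lift
print(sample_size_per_group(baseline=0.05, mde=0.01))  # required users per group (~8,000)
```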
As dimensions increase, data becomes sparse. In 1D, 10 points might cover the space. In 100D, astronomical amounts of data are required. Distances become meaningless; everything is far from everything. KNN and other distance-based methods break down.
| Type | Models | Learns |
|---|---|---|
| Generative | Naive Bayes, GANs | Joint distribution P(X, Y), i.e. how the data is generated |
| Discriminative | Logistic regression, SVM | Conditional P(Y \| X), i.e. the decision boundary directly |
Using a model trained on one task as starting point for another. Freeze early layers (general features), fine-tune later layers (task-specific).
| Term | Definition |
|---|---|
| Epoch | One pass through entire dataset |
| Batch | Subset of data for one forward/backward pass |
| Iteration | One batch processed |
| Benefit | Description |
|---|---|
| Faster convergence | Similar gradient magnitudes |
| Prevents saturating activations | Keeps values in active range |
| Smoother optimization landscape | Easier to navigate |
| Allows higher learning rates | More aggressive updates possible |
Using information during training that will not be available at prediction time. Model learns to cheat and fails in production.
| Prevention | Method |
|---|---|
| Time-based splits | Never train on the future |
| Fit on training only | Scalers and encoders fit on training data, then transform test |
| Feature audit | For each feature, verify it can be computed before seeing the outcome |
Use SVM when:
- Decision boundary is non-linear (with kernels)
- High-dimensional sparse data (text)
- Margin maximization is important
- Small to medium datasets (SVMs do not scale well)
Compute dot products in high-dimensional space without explicitly transforming data. Enables non-linear decision boundaries with linear algorithms.
| Benefit | Mechanism |
|---|---|
| Reduce variance | Bagging |
| Reduce bias | Boosting |
| Improve robustness | Multiple models |
| Combine strengths | Different model types |
| Type | Description | Example |
|---|---|---|
| Parametric | Assumes a data distribution | t-test assumes normality |
| Non-parametric | No distribution assumption | Mann-Whitney |
Use non-parametric when distribution is unknown or violated.