How to Answer ML System Design Questions
ML system design interviews typically last about 45 minutes. A structured framework helps you cover all major components within that time constraint.
Framework Overview
| Phase | Duration | Focus |
|---|---|---|
| Problem clarification | 5-10 min | Requirements, scale, constraints |
| Metrics definition | 5 min | Offline and online metrics |
| Architecture | 10 min | System components and data flow |
| Data and features | 10-15 min | Data sources, feature engineering |
| Model and training | 10-15 min | Algorithm selection, training approach |
| Serving and monitoring | 5-10 min | Deployment, observability |
Phase 1: Problem Clarification
Gather requirements before designing.
Business Context
| Question Category | Examples |
|---|---|
| Goal | Specific optimization target, not vague objectives |
| Users | User segments, behavioral differences |
| Current state | Existing system or greenfield development |
Functional Requirements
| Question Category | Examples |
|---|---|
| Input/Output | Data format, response structure |
| Edge cases | First-time users, missing data handling |
Non-Functional Requirements
| Requirement | Considerations |
|---|---|
| Scale | Requests per second (100 vs 100,000 has different implications) |
| Latency | Real-time requirements vs batch processing tolerance |
| Accuracy | Required precision, acceptable error rate |
Constraints
| Constraint Type | Examples |
|---|---|
| Data availability | Accessible data sources |
| Privacy | Compliance requirements, data restrictions |
| Compute | Budget limitations, infrastructure constraints |
Phase 2: Metrics Definition
Offline Metrics
Measured on held-out test data:
| Problem Type | Metrics |
|---|---|
| Classification | Precision, recall, F1, AUC-ROC, AUC-PR |
| Ranking | NDCG, MRR, MAP |
| Regression | MSE, MAE, MAPE |
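For the classification case, these offline metrics can be computed directly with scikit-learn. A minimal sketch, using small placeholder arrays in place of real held-out predictions:

```python
# Minimal sketch: common offline classification metrics with scikit-learn.
# y_true, y_score, and y_pred are placeholders for held-out test data.
import numpy as np
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)

y_true = np.array([0, 1, 1, 0, 1])               # ground-truth labels
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9])    # model probabilities
y_pred = (y_score >= 0.5).astype(int)            # thresholded predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))
print("AUC-PR:   ", average_precision_score(y_true, y_score))
```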
Online Metrics
Measured in production:
| Category | Examples |
|---|---|
| Business outcomes | CTR, conversion, revenue |
| User experience | Latency, session length |
Example: Recommendation System
- Offline: Precision@10, NDCG (ranking quality of relevant items)
- Online: CTR, conversion rate, revenue per session (business impact)
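As an illustration of the offline side, Precision@K and NDCG@K can be computed from a single ranked list of binary relevance labels. A minimal sketch with made-up relevance data:

```python
# Minimal sketch of Precision@K and NDCG@K for one ranked result list.
# `ranked_relevance` holds hypothetical 0/1 relevance labels in rank order.
import math

def precision_at_k(ranked_relevance, k):
    return sum(ranked_relevance[:k]) / k

def ndcg_at_k(ranked_relevance, k):
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(precision_at_k(ranked_relevance, 10))  # 0.5
print(ndcg_at_k(ranked_relevance, 10))       # ~0.87
```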
Offline and online metrics do not always correlate. A model with strong test set performance may underperform in production.
Phase 3: High-Level Architecture
Document the system components and data flow.
Architecture Decisions
| Decision | Considerations |
|---|---|
| Inference mode | Online (real-time) vs batch (precomputed) based on latency requirements and data freshness needs |
| Feature computation | Real-time features vs pre-computed features |
| Model stages | Single model vs candidate generation + ranking |
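When a two-stage design is chosen, the flow is typically a cheap candidate generator followed by a heavier ranker. A minimal sketch, where the index and ranker classes are hypothetical stand-ins for a real retrieval system and trained model:

```python
# Sketch of a two-stage architecture: cheap candidate generation followed by
# a heavier ranking model. The stub classes are hypothetical placeholders.
import random

class CandidateIndex:
    def retrieve(self, user_id, n):
        # e.g., approximate nearest neighbors over item embeddings
        return [f"item_{i}" for i in range(n)]

class Ranker:
    def score(self, user_id, items):
        # e.g., a gradient-boosted or neural ranking model over rich features
        return {item: random.random() for item in items}

def recommend(user_id, index, ranker, k=10):
    candidates = index.retrieve(user_id, n=500)   # stage 1: narrow millions of items to hundreds
    scores = ranker.score(user_id, candidates)    # stage 2: score candidates with a richer model
    return sorted(candidates, key=scores.get, reverse=True)[:k]

print(recommend("user_42", CandidateIndex(), Ranker()))
```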
Phase 4: Data Pipeline
Data Sources
| Consideration | Questions |
|---|---|
| Required data | Specific data types needed |
| Storage location | Databases, logs, external APIs |
| Freshness requirements | Real-time streaming vs daily batch |
Data Processing
| Task | Approach |
|---|---|
| Missing values | Imputation, deletion, or flagging |
| Outliers | Detection and handling strategy |
| Joins | Cross-source joins without leakage |
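Leakage-safe joins deserve particular care: each training example should only see feature values computed before its event time. A minimal sketch with pandas, using hypothetical table and column names, where `merge_asof` performs a point-in-time join:

```python
# Sketch of basic cleaning and a point-in-time join with pandas.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 0],
})
user_stats = pd.DataFrame({
    "user_id": [1, 1, 2],
    "computed_at": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-01"]),
    "purchases_30d": [2, 5, 1],
})

# Missing values: flag then impute, rather than silently dropping rows.
user_stats["purchases_30d_missing"] = user_stats["purchases_30d"].isna()
user_stats["purchases_30d"] = user_stats["purchases_30d"].fillna(0)

# Point-in-time (as-of) join: each event only sees features computed before
# the event, which avoids leaking future information into training.
train = pd.merge_asof(
    events.sort_values("event_time"),
    user_stats.sort_values("computed_at"),
    left_on="event_time", right_on="computed_at",
    by="user_id", direction="backward",
)
print(train)
```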
Feature Engineering
| Category | Examples |
|---|---|
| User features | Demographics, account age, engagement patterns |
| Item features | Category, price, rating, popularity |
| Context features | Time of day, device type, location |
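A minimal sketch of deriving such user, item, and context features from a raw interaction log with pandas; the log schema and feature names here are hypothetical:

```python
# Sketch of feature engineering from an interaction log with pandas.
import pandas as pd

log = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "item_id": ["a", "b", "a", "c", "a"],
    "clicked": [1, 0, 1, 1, 0],
    "ts": pd.to_datetime(["2024-03-01 09:00", "2024-03-01 21:00",
                          "2024-03-02 08:30", "2024-03-02 12:00",
                          "2024-03-03 19:45"]),
})

# User features: engagement-pattern aggregates.
user_feats = log.groupby("user_id").agg(
    user_click_rate=("clicked", "mean"),
    user_event_count=("clicked", "size"),
)

# Item features: popularity and click-through.
item_feats = log.groupby("item_id").agg(
    item_ctr=("clicked", "mean"),
    item_impressions=("clicked", "size"),
)

# Context features: derived from the request itself.
log["hour_of_day"] = log["ts"].dt.hour
log["is_weekend"] = log["ts"].dt.dayofweek >= 5

print(user_feats, item_feats, log[["hour_of_day", "is_weekend"]], sep="\n")
```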
Feature Computation
| Type | Storage | Example |
|---|---|---|
| Batch | Offline store | Lifetime value |
| Streaming | Online store | Recent click count |
| On-demand | Computed at request | Current time features |
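At request time these three feature types are merged into a single feature vector. A minimal sketch, with plain dicts standing in for the offline and online stores:

```python
# Sketch of assembling a feature vector from batch, streaming, and on-demand
# features. The dicts are stand-ins for a real offline/online feature store.
import datetime

offline_store = {"user_42": {"lifetime_value": 310.0}}   # batch, refreshed daily
online_store = {"user_42": {"clicks_last_10m": 3}}       # streaming, near-real-time

def build_features(user_id):
    features = {}
    features.update(offline_store.get(user_id, {}))       # batch features
    features.update(online_store.get(user_id, {}))        # streaming features
    now = datetime.datetime.now(datetime.timezone.utc)    # on-demand features
    features["hour_of_day"] = now.hour
    features["is_weekend"] = now.weekday() >= 5
    return features

print(build_features("user_42"))
```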
Phase 5: Model Selection and Training
Baseline Approach
Start with simple models before adding complexity.
A logistic regression baseline provides a performance floor and shows whether the problem is learnable from the available data.
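A minimal baseline sketch with scikit-learn, using synthetic placeholder data in place of real training features:

```python
# Minimal baseline: scaled logistic regression evaluated with cross-validation.
# X and y are synthetic placeholders for real training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # placeholder features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # placeholder labels

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("baseline AUC-ROC: %.3f ± %.3f" % (scores.mean(), scores.std()))
```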
Model Selection Criteria
| Factor | Consideration |
|---|---|
| Problem type | Classification, regression, ranking |
| Data volume | Deep learning requires large datasets |
| Latency constraints | Model complexity affects inference time |
| Interpretability | Some applications require explainable predictions |
Common Model Choices
| Problem | Baseline | Production Model |
|---|---|---|
| Classification | Logistic Regression | XGBoost, Neural Network |
| Ranking | Pointwise logistic regression | Learning to Rank, Neural Ranker |
| Recommendation | Matrix factorization | Two-tower, Transformer |
| NLP | TF-IDF + logistic | BERT, fine-tuned LLM |
Training Considerations
| Consideration | Approach |
|---|---|
| Data splits | Time-based splits for temporal problems |
| Class imbalance | Sampling strategies, class weights |
| Hyperparameter tuning | Grid search for small spaces, Bayesian optimization for large |
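A minimal sketch combining a time-based split with class weighting; the dataframe and column names are hypothetical:

```python
# Sketch of a time-based split plus class weighting for imbalanced data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=365, freq="D"),
    "f1": range(365),
    "label": [1 if i % 20 == 0 else 0 for i in range(365)],  # ~5% positives
})

# Time-based split: train on the past, validate on the most recent period,
# which mirrors production use and avoids temporal leakage.
df = df.sort_values("ts")
split = int(len(df) * 0.8)
train, valid = df.iloc[:split], df.iloc[split:]

# Class imbalance: weight classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(train[["f1"]], train["label"])
print(model.score(valid[["f1"]], valid["label"]))
```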
Phase 6: Serving and Deployment
Serving Patterns
| Pattern | Description | Tradeoffs |
|---|---|---|
| Online | Real-time predictions computed at request time | Fresh results, higher infrastructure cost |
| Batch | Predictions precomputed ahead of time and served from storage | Simpler infrastructure, stale results |
| Hybrid | Precompute where possible, compute the remainder in real time | Balances freshness against cost and complexity |
Infrastructure Components
| Component | Purpose |
|---|---|
| Model serving | TensorFlow Serving, TorchServe, custom API |
| Feature serving | Feature store for training-serving consistency |
| Caching | Cache predictions for frequent queries |
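A minimal sketch of a custom serving endpoint with prediction caching, assuming FastAPI is available; model loading and feature lookup are stubbed out:

```python
# Sketch of a custom serving API with a simple prediction cache.
from functools import lru_cache
from fastapi import FastAPI

app = FastAPI()

def load_model():
    # Stand-in for loading a serialized model (e.g., from object storage).
    return lambda features: 0.42

model = load_model()

@lru_cache(maxsize=10_000)               # cache predictions for frequent queries
def predict_for_user(user_id: str) -> float:
    features = (1.0, 2.0)                # stand-in for a feature store lookup
    return model(features)

@app.get("/predict/{user_id}")
def predict(user_id: str):
    return {"user_id": user_id, "score": predict_for_user(user_id)}
```

Run under an ASGI server such as uvicorn; the `lru_cache` layer here stands in for a real prediction cache with an expiry policy.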
Deployment Strategy
| Strategy | Description |
|---|---|
| Shadow mode | Run new model in parallel, compare outputs, serve old model |
| Canary | Route small traffic percentage to new model, monitor metrics |
| A/B test | Statistical comparison before full rollout |
Always maintain rollback capability.
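A minimal sketch of deterministic canary routing, where setting the canary fraction to zero acts as the rollback switch; the model names are placeholders:

```python
# Sketch of canary routing: send a small, deterministic slice of traffic to
# the new model and keep the rest on the old one.
import hashlib

CANARY_FRACTION = 0.05   # 5% of users; set to 0.0 to roll back instantly

def bucket(user_id: str) -> float:
    # Hash to a stable value in [0, 1) so each user consistently sees one model.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000

def route(user_id: str) -> str:
    return "new_model" if bucket(user_id) < CANARY_FRACTION else "old_model"

print(route("user_42"))
```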
Phase 7: Monitoring and Iteration
Monitoring Categories
| Category | Metrics |
|---|---|
| Model | Accuracy, prediction distribution changes |
| Data | Missing features, distribution drift |
| System | Latency, error rates, throughput |
Alert Configuration
Set thresholds for automated alerts (e.g., 5% precision drop triggers investigation).
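A minimal sketch of such a threshold check, using hypothetical baseline and threshold values:

```python
# Sketch of a threshold alert matching the example above: flag when daily
# precision drops more than 5% (relative) below the launch baseline.
BASELINE_PRECISION = 0.96
DROP_THRESHOLD = 0.05   # relative drop that triggers investigation

def check_precision(daily_precision: float) -> None:
    relative_drop = (BASELINE_PRECISION - daily_precision) / BASELINE_PRECISION
    if relative_drop > DROP_THRESHOLD:
        # In production this would page on-call or open a ticket, not print.
        print(f"ALERT: precision {daily_precision:.3f} is {relative_drop:.1%} below baseline")

check_precision(0.90)   # triggers the alert
check_precision(0.95)   # within tolerance
```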
Retraining Strategy
| Strategy | Trigger |
|---|---|
| Scheduled | Daily, weekly, based on predictable patterns |
| Triggered | Drift threshold exceeded |
Feedback Loop
- Collect production labels (clicks, conversions, complaints)
- Retrain with new data
- A/B test retrained model before deployment
Example: Spam Detection
Problem: Design a spam detection system for email.
Clarification:
- Volume: 1 billion emails per day, 100K QPS peak
- Latency: Under 100ms (classify before inbox delivery)
- Constraints: False positives are costly (legitimate emails in spam create user frustration)
- Target: 99%+ precision, 90%+ recall
Metrics:
- Offline: Precision, Recall, F1, AUC-ROC
- Online: User-reported spam rate, false positive complaints
Architecture:
- Real-time classification at email arrival
- Feature store with sender reputation scores
- Daily batch retraining with new labeled data
Features:
| Category | Features |
|---|---|
| Text | Subject and body TF-IDF, suspicious phrases |
| Sender | Reputation score, past spam rate, account age |
| Structure | Link count, attachments, image-to-text ratio |
| Behavior | Sending velocity, recipient patterns |
Model:
- Baseline: Logistic regression with TF-IDF (fast, interpretable)
- Production: Ensemble of XGBoost (tabular features) + neural network (text embeddings)
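A minimal sketch of the ensemble scoring step, with hypothetical probabilities and weighting; the high decision threshold reflects the costly-false-positive constraint:

```python
# Sketch of combining a tabular-model score with a text-model score.
def spam_score(tabular_prob: float, text_prob: float, w_tabular: float = 0.5) -> float:
    return w_tabular * tabular_prob + (1 - w_tabular) * text_prob

def classify(tabular_prob: float, text_prob: float, threshold: float = 0.9) -> bool:
    # A high threshold favors precision over recall.
    return spam_score(tabular_prob, text_prob) >= threshold

print(classify(0.95, 0.97))   # True: both models confident it is spam
print(classify(0.80, 0.40))   # False: kept out of the spam folder
```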
Serving:
- Real-time inference with 50ms target latency
- Cached sender reputation lookups
- Async computation of heavy text features where possible
Monitoring:
- Daily precision and recall tracking by sender type
- Alert if false positive rate exceeds 0.1%
- Feature distribution monitoring for drift
Common Pitfalls
| Pitfall | Issue |
|---|---|
| Premature complexity | Proposing deep learning without justification |
| Data assumptions | Assuming clean, labeled data availability |
| Single solution | Presenting one approach without discussing alternatives |
| Latency oversight | Designing models too slow for production requirements |
| Missing monitoring | No plan for post-launch observability |