How to Answer ML System Design Questions

ML system design interviews typically last 45 minutes. A structured framework ensures coverage of all major components within the time constraint.

Framework Overview

| Phase | Duration | Focus |
|---|---|---|
| Problem clarification | 5-10 min | Requirements, scale, constraints |
| Metrics definition | 5 min | Offline and online metrics |
| Architecture | 10 min | System components and data flow |
| Data and features | 10-15 min | Data sources, feature engineering |
| Model and training | 10-15 min | Algorithm selection, training approach |
| Serving and monitoring | 5-10 min | Deployment, observability |

Phase 1: Problem Clarification

Gather requirements before designing.

Business Context

| Question Category | Examples |
|---|---|
| Goal | Specific optimization target, not vague objectives |
| Users | User segments, behavioral differences |
| Current state | Existing system or greenfield development |

Functional Requirements

| Question Category | Examples |
|---|---|
| Input/Output | Data format, response structure |
| Edge cases | First-time users, missing data handling |

Non-Functional Requirements

| Requirement | Considerations |
|---|---|
| Scale | Requests per second (100 vs 100,000 has very different implications) |
| Latency | Real-time requirements vs batch processing tolerance |
| Accuracy | Required precision, acceptable error rate |

Constraints

| Constraint Type | Examples |
|---|---|
| Data availability | Accessible data sources |
| Privacy | Compliance requirements, data restrictions |
| Compute | Budget limitations, infrastructure constraints |

Phase 2: Metrics Definition

Offline Metrics

Measured on held-out test data:

| Problem Type | Metrics |
|---|---|
| Classification | Precision, recall, F1, AUC-ROC, AUC-PR |
| Ranking | NDCG, MRR, MAP |
| Regression | MSE, MAE, MAPE |
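
For classification, these metrics come straight from standard libraries. A minimal sketch with scikit-learn, assuming placeholder labels and scores and an illustrative 0.5 decision threshold:

```python
# Minimal offline metric computation with scikit-learn.
# y_true, y_prob, and the 0.5 threshold are illustrative placeholders.
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # held-out labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]      # model scores
y_pred = [int(p >= 0.5) for p in y_prob]               # threshold for illustration

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("auc-roc:  ", roc_auc_score(y_true, y_prob))
```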

Online Metrics

Measured in production:

| Category | Examples |
|---|---|
| Business outcomes | CTR, conversion, revenue |
| User experience | Latency, session length |

Example: Recommendation System

  • Offline: Precision@10, NDCG - ranking quality of relevant items
  • Online: CTR, conversion rate, revenue per session - business impact

Offline and online metrics do not always correlate. A model with strong test set performance may underperform in production.

Phase 3: High-Level Architecture

Document the system components and data flow.

Architecture Decisions

| Decision | Considerations |
|---|---|
| Inference mode | Online (real-time) vs batch (precomputed), based on latency requirements and data freshness needs |
| Feature computation | Real-time features vs pre-computed features |
| Model stages | Single model vs candidate generation + ranking |
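
To make the "candidate generation + ranking" option concrete, here is a minimal sketch of a two-stage pipeline, assuming random embeddings and a simulated heavy scorer rather than any specific retrieval library:

```python
# A two-stage sketch: cheap candidate generation by dot-product similarity,
# then a heavier (here, simulated) ranking step over the retrieved candidates.
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, 32))    # illustrative item catalog
user_embedding = rng.normal(size=32)               # illustrative user vector

def generate_candidates(k: int = 500) -> np.ndarray:
    """Top-k items by dot product (stand-in for an ANN index lookup)."""
    scores = item_embeddings @ user_embedding
    return np.argpartition(-scores, k)[:k]

def rank(candidate_ids: np.ndarray, n: int = 10) -> np.ndarray:
    """Re-score only the candidates with a heavier model (simulated with noise)."""
    base = item_embeddings[candidate_ids] @ user_embedding
    heavy_scores = base + rng.normal(scale=0.1, size=len(candidate_ids))
    return candidate_ids[np.argsort(-heavy_scores)[:n]]

top_items = rank(generate_candidates())
```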

Phase 4: Data Pipeline

Data Sources

| Consideration | Questions |
|---|---|
| Required data | Specific data types needed |
| Storage location | Databases, logs, external APIs |
| Freshness requirements | Real-time streaming vs daily batch |

Data Processing

| Task | Approach |
|---|---|
| Missing values | Imputation, deletion, or flagging |
| Outliers | Detection and handling strategy |
| Joins | Cross-source joins without leakage |
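
One common way to keep cross-source joins leakage-free is a point-in-time join: each event only sees feature values computed at or before its timestamp. A minimal sketch with pandas merge_asof, using illustrative column names and data:

```python
# Point-in-time join: each event is matched to the most recent feature value
# known *before* the event, avoiding leakage from the future.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
}).sort_values("event_time")

features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "lifetime_value": [10.0, 25.0, 5.0],
}).sort_values("feature_time")

joined = pd.merge_asof(
    events, features,
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",   # only features known before the event
)
```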

Feature Engineering

| Category | Examples |
|---|---|
| User features | Demographics, account age, engagement patterns |
| Item features | Category, price, rating, popularity |
| Context features | Time of day, device type, location |

Feature Computation

| Type | Storage | Example |
|---|---|---|
| Batch | Offline store | Lifetime value |
| Streaming | Online store | Recent click count |
| On-demand | Computed at request | Current time features |
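
At request time these three sources are merged into one feature vector. A minimal sketch, with plain dictionaries standing in for the offline and online stores and illustrative feature names:

```python
# Assemble one feature vector from batch, streaming, and on-demand sources.
# The dictionaries are stand-ins for an offline store and an online store.
from datetime import datetime, timezone

offline_store = {"user_42": {"lifetime_value": 310.5}}    # batch features
online_store = {"user_42": {"clicks_last_hour": 7}}       # streaming features

def build_features(user_id: str, request_time: datetime) -> dict:
    features = {}
    features.update(offline_store.get(user_id, {}))
    features.update(online_store.get(user_id, {}))
    features["hour_of_day"] = request_time.hour           # on-demand feature
    return features

vector = build_features("user_42", datetime.now(timezone.utc))
```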

Phase 5: Model Selection and Training

Baseline Approach

Start with simple models before adding complexity.

A logistic regression baseline establishes a performance floor and shows whether the problem is learnable before heavier models are justified.
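
Such a baseline takes only a few lines. A sketch with scikit-learn on synthetic, illustrative data:

```python
# Logistic regression baseline: establishes a performance floor before
# heavier models are considered. Data here is synthetic and illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("baseline AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
```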

Model Selection Criteria

| Factor | Consideration |
|---|---|
| Problem type | Classification, regression, ranking |
| Data volume | Deep learning requires large datasets |
| Latency constraints | Model complexity affects inference time |
| Interpretability | Some applications require explainable predictions |

Common Model Choices

| Problem | Baseline | Production Model |
|---|---|---|
| Classification | Logistic regression | XGBoost, neural network |
| Ranking | Pointwise logistic | Learning to rank, neural ranker |
| Recommendation | Matrix factorization | Two-tower, Transformer |
| NLP | TF-IDF + logistic | BERT, fine-tuned LLM |

Training Considerations

| Consideration | Approach |
|---|---|
| Data splits | Time-based splits for temporal problems |
| Class imbalance | Sampling strategies, class weights |
| Hyperparameter tuning | Grid search for small spaces, Bayesian optimization for large |
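
The sketch below illustrates two of these considerations, a time-based split and class weights for imbalance, on an illustrative DataFrame with an arbitrary 80/20 cutoff:

```python
# Time-based split (train on the past, validate on the most recent slice)
# plus class weights for a ~10%-positive label. Data is illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1_000, freq="h"),
    "feature": range(1_000),
    "label": [i % 10 == 0 for i in range(1_000)],
})

cutoff = df["timestamp"].quantile(0.8)                 # hold out the last 20% by time
train, valid = df[df["timestamp"] <= cutoff], df[df["timestamp"] > cutoff]

model = LogisticRegression(class_weight="balanced", max_iter=1_000)
model.fit(train[["feature"]], train["label"])
```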

Phase 6: Serving and Deployment

Serving Patterns

| Pattern | Description | Use Case |
|---|---|---|
| Online | Real-time predictions | Fresh results, higher infrastructure cost |
| Batch | Precomputed predictions | Simpler infrastructure, stale results |
| Hybrid | Combined approach | Precompute where possible, real-time for the remainder |

Infrastructure Components

| Component | Purpose |
|---|---|
| Model serving | TensorFlow Serving, TorchServe, custom API |
| Feature serving | Feature store for training-serving consistency |
| Caching | Cache predictions for frequent queries |
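
For the caching row, one lightweight option is an in-process LRU cache keyed on the request. A minimal sketch with functools.lru_cache, where model_predict is a hypothetical stand-in for a real model call and the cache key is assumed to be hashable:

```python
# Cache predictions for frequent queries with an in-process LRU cache.
# model_predict is a hypothetical stand-in for an expensive model call.
from functools import lru_cache

def model_predict(user_id: str, item_id: str) -> float:
    return (hash((user_id, item_id)) % 100) / 100.0    # placeholder score

@lru_cache(maxsize=100_000)
def cached_predict(user_id: str, item_id: str) -> float:
    return model_predict(user_id, item_id)

score = cached_predict("user_42", "item_7")   # repeat calls with same args hit the cache
```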

Deployment Strategy

| Strategy | Description |
|---|---|
| Shadow mode | Run new model in parallel, compare outputs, serve old model |
| Canary | Route small traffic percentage to new model, monitor metrics |
| A/B test | Statistical comparison before full rollout |

Always maintain rollback capability.
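
A canary split is often implemented by hashing a stable request or user identifier into buckets; the sketch below assumes a 5% canary fraction and returns model names instead of calling real models:

```python
# Deterministic canary routing: the same identifier always lands in the same
# bucket, so a user consistently sees one model. The fraction is illustrative.
import hashlib

CANARY_FRACTION = 0.05

def pick_model(request_id: str) -> str:
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "new_model" if bucket < CANARY_FRACTION * 100 else "old_model"

print(pick_model("user-123"))   # "new_model" for roughly 5% of identifiers
```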

Phase 7: Monitoring and Iteration

Monitoring Categories

| Category | Metrics |
|---|---|
| Model | Accuracy, prediction distribution changes |
| Data | Missing features, distribution drift |
| System | Latency, error rates, throughput |

Alert Configuration

Set thresholds for automated alerts (e.g., 5% precision drop triggers investigation).

Retraining Strategy

| Strategy | Trigger |
|---|---|
| Scheduled | Daily, weekly, based on predictable patterns |
| Triggered | Drift threshold exceeded |
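
A drift trigger can be as simple as a two-sample test on a feature's training versus production distribution. A minimal sketch with SciPy's Kolmogorov-Smirnov test, using synthetic data and an illustrative 0.1 threshold (a real system would check many features with tuned thresholds):

```python
# Drift check: compare a feature's training-time and production distributions
# with a two-sample KS test. Data and the 0.1 threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, size=10_000)     # feature at training time
production_values = rng.normal(loc=0.3, size=10_000)   # same feature in production

statistic, p_value = ks_2samp(training_values, production_values)
if statistic > 0.1:                                     # drift threshold exceeded
    print("drift detected: trigger retraining")
```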

Feedback Loop

  1. Collect production labels (clicks, conversions, complaints)
  2. Retrain with new data
  3. A/B test retrained model before deployment

Example: Spam Detection

Problem: Design a spam detection system for email.

Clarification:

  • Volume: 1 billion emails per day, 100K QPS peak
  • Latency: Under 100ms (classify before inbox delivery)
  • Constraints: False positives are costly (legitimate emails in spam create user frustration)
  • Target: 99%+ precision, 90%+ recall

Metrics:

  • Offline: Precision, Recall, F1, AUC-ROC
  • Online: User-reported spam rate, false positive complaints

Architecture:

  • Real-time classification at email arrival
  • Feature store with sender reputation scores
  • Daily batch retraining with new labeled data

Features:

| Category | Features |
|---|---|
| Text | Subject and body TF-IDF, suspicious phrases |
| Sender | Reputation score, past spam rate, account age |
| Structure | Link count, attachments, image-to-text ratio |
| Behavior | Sending velocity, recipient patterns |

Model:

  • Baseline: Logistic regression with TF-IDF (fast, interpretable)
  • Production: Ensemble of XGBoost (tabular features) + neural network (text embeddings)
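
The baseline can be sketched as a scikit-learn pipeline; the tiny email list and labels below are purely illustrative:

```python
# TF-IDF + logistic regression spam baseline on a toy, illustrative dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now, click here",
    "Meeting moved to 3pm, see agenda attached",
    "Limited time offer, claim your reward",
    "Quarterly report draft for your review",
]
labels = [1, 0, 1, 0]   # 1 = spam

spam_baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
spam_baseline.fit(emails, labels)
print(spam_baseline.predict_proba(["free reward, click now"])[:, 1])
```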

Serving:

  • Real-time inference with 50ms target latency
  • Cached sender reputation lookups
  • Async computation of heavy text features where possible

Monitoring:

  • Daily precision and recall tracking by sender type
  • Alert if false positive rate exceeds 0.1%
  • Feature distribution monitoring for drift

Common Pitfalls

| Pitfall | Issue |
|---|---|
| Premature complexity | Proposing deep learning without justification |
| Data assumptions | Assuming clean, labeled data availability |
| Single solution | Presenting one approach without discussing alternatives |
| Latency oversight | Designing models too slow for production requirements |
| Missing monitoring | No plan for post-launch observability |