How to Answer ML System Design Questions
ML system design interviews typically last about 45 minutes. A structured framework helps you cover all major components within that time constraint.
Framework Overview
| Phase | Duration | Focus |
|---|---|---|
| Problem clarification | 5-10 min | Requirements, scale, constraints |
| Metrics definition | 5 min | Offline and online metrics |
| Architecture | 10 min | System components and data flow |
| Data and features | 10-15 min | Data sources, feature engineering |
| Model and training | 10-15 min | Algorithm selection, training approach |
| Serving and monitoring | 5-10 min | Deployment, observability |
Phase 1: Problem Clarification
Gather requirements before designing.
Business Context
| Question Category | Examples |
|---|---|
| Goal | Specific optimization target, not vague objectives |
| Users | User segments, behavioral differences |
| Current state | Existing system or greenfield development |
Functional Requirements
| Question Category | Examples |
|---|---|
| Input/Output | Data format, response structure |
| Edge cases | First-time users, missing data handling |
Non-Functional Requirements
| Requirement | Considerations |
|---|---|
| Scale | Requests per second (100 vs 100,000 has different implications) |
| Latency | Real-time requirements vs batch processing tolerance |
| Accuracy | Required precision, acceptable error rate |
Constraints
| Constraint Type | Examples |
|---|---|
| Data availability | Accessible data sources |
| Privacy | Compliance requirements, data restrictions |
| Compute | Budget limitations, infrastructure constraints |
Phase 2: Metrics Definition
Offline Metrics
Measured on held-out test data:
| Problem Type | Metrics |
|---|---|
| Classification | Precision, recall, F1, AUC-ROC, AUC-PR |
| Ranking | NDCG, MRR, MAP |
| Regression | MSE, MAE, MAPE |
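For the classification case, these offline metrics can be computed directly with scikit-learn. A minimal sketch, using small placeholder arrays in place of real held-out predictions:

```python
# Minimal sketch: common offline classification metrics with scikit-learn.
# y_true, y_score, and y_pred are placeholders for held-out test data.
import numpy as np
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)

y_true = np.array([0, 1, 1, 0, 1])               # ground-truth labels
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9])    # model probabilities
y_pred = (y_score >= 0.5).astype(int)            # thresholded predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))
print("AUC-PR:   ", average_precision_score(y_true, y_score))
```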
Online Metrics
Measured in production:
| Category | Examples |
|---|---|
| Business outcomes | CTR, conversion, revenue |
| User experience | Latency, session length |
Example: Recommendation System
- Offline: Precision@10, NDCG (ranking quality of relevant items)
- Online: CTR, conversion rate, revenue per session (business impact)
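As an illustration of the offline side, Precision@K and NDCG@K can be computed from a single ranked list of binary relevance labels. A minimal sketch with made-up relevance data:

```python
# Minimal sketch of Precision@K and NDCG@K for one ranked result list.
# `ranked_relevance` holds hypothetical 0/1 relevance labels in rank order.
import math

def precision_at_k(ranked_relevance, k):
    return sum(ranked_relevance[:k]) / k

def ndcg_at_k(ranked_relevance, k):
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(precision_at_k(ranked_relevance, 10))  # 0.5
print(ndcg_at_k(ranked_relevance, 10))       # ~0.87
```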
Offline and online metrics do not always correlate. A model with strong test set performance may underperform in production.
Phase 3: High-Level Architecture
Document the system components and data flow.
Architecture Decisions
| Decision | Considerations |
|---|---|
| Inference mode | Online (real-time) vs batch (precomputed) based on latency requirements and data freshness needs |
| Feature computation | Real-time features vs pre-computed features |
| Model stages | Single model vs candidate generation + ranking |
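When a two-stage design is chosen, the flow is typically a cheap candidate generator followed by a heavier ranker. A minimal sketch, where the index and ranker classes are hypothetical stand-ins for a real retrieval system and trained model:

```python
# Sketch of a two-stage architecture: cheap candidate generation followed by
# a heavier ranking model. The stub classes are hypothetical placeholders.
import random

class CandidateIndex:
    def retrieve(self, user_id, n):
        # e.g., approximate nearest neighbors over item embeddings
        return [f"item_{i}" for i in range(n)]

class Ranker:
    def score(self, user_id, items):
        # e.g., a gradient-boosted or neural ranking model over rich features
        return {item: random.random() for item in items}

def recommend(user_id, index, ranker, k=10):
    candidates = index.retrieve(user_id, n=500)   # stage 1: narrow millions of items to hundreds
    scores = ranker.score(user_id, candidates)    # stage 2: score candidates with a richer model
    return sorted(candidates, key=scores.get, reverse=True)[:k]

print(recommend("user_42", CandidateIndex(), Ranker()))
```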
Phase 4: Data Pipeline
Data Sources
| Consideration | Questions |
|---|---|
| Required data | Specific data types needed |
| Storage location | Databases, logs, external APIs |
| Freshness requirements | Real-time streaming vs daily batch |
Data Processing
| Task | Approach |
|---|---|
| Missing values | Imputation, deletion, or flagging |
| Outliers | Detection and handling strategy |
| Joins | Cross-source joins without leakage |
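Leakage-safe joins deserve particular care: each training example should only see feature values computed before its event time. A minimal sketch with pandas, using hypothetical table and column names, where `merge_asof` performs a point-in-time join:

```python
# Sketch of basic cleaning and a point-in-time join with pandas.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 0],
})
user_stats = pd.DataFrame({
    "user_id": [1, 1, 2],
    "computed_at": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-01"]),
    "purchases_30d": [2, 5, 1],
})

# Missing values: flag then impute, rather than silently dropping rows.
user_stats["purchases_30d_missing"] = user_stats["purchases_30d"].isna()
user_stats["purchases_30d"] = user_stats["purchases_30d"].fillna(0)

# Point-in-time (as-of) join: each event only sees features computed before
# the event, which avoids leaking future information into training.
train = pd.merge_asof(
    events.sort_values("event_time"),
    user_stats.sort_values("computed_at"),
    left_on="event_time", right_on="computed_at",
    by="user_id", direction="backward",
)
print(train)
```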
Feature Engineering
| Category | Examples |
|---|---|
| User features | Demographics, account age, engagement patterns |
| Item features | Category, price, rating, popularity |
| Context features | Time of day, device type, location |
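A minimal sketch of deriving such user, item, and context features from a raw interaction log with pandas; the log schema and feature names here are hypothetical:

```python
# Sketch of feature engineering from an interaction log with pandas.
import pandas as pd

log = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "item_id": ["a", "b", "a", "c", "a"],
    "clicked": [1, 0, 1, 1, 0],
    "ts": pd.to_datetime(["2024-03-01 09:00", "2024-03-01 21:00",
                          "2024-03-02 08:30", "2024-03-02 12:00",
                          "2024-03-03 19:45"]),
})

# User features: engagement-pattern aggregates.
user_feats = log.groupby("user_id").agg(
    user_click_rate=("clicked", "mean"),
    user_event_count=("clicked", "size"),
)

# Item features: popularity and click-through.
item_feats = log.groupby("item_id").agg(
    item_ctr=("clicked", "mean"),
    item_impressions=("clicked", "size"),
)

# Context features: derived from the request itself.
log["hour_of_day"] = log["ts"].dt.hour
log["is_weekend"] = log["ts"].dt.dayofweek >= 5

print(user_feats, item_feats, log[["hour_of_day", "is_weekend"]], sep="\n")
```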
Feature Computation
| Type | Storage | Example |
|---|---|---|
| Batch | Offline store | Lifetime value |
| Streaming | Online store | Recent click count |
| On-demand | Computed at request | Current time features |
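At request time these three feature types are merged into a single feature vector. A minimal sketch, with plain dicts standing in for the offline and online stores:

```python
# Sketch of assembling a feature vector from batch, streaming, and on-demand
# features. The dicts are stand-ins for a real offline/online feature store.
import datetime

offline_store = {"user_42": {"lifetime_value": 310.0}}   # batch, refreshed daily
online_store = {"user_42": {"clicks_last_10m": 3}}       # streaming, near-real-time

def build_features(user_id):
    features = {}
    features.update(offline_store.get(user_id, {}))       # batch features
    features.update(online_store.get(user_id, {}))        # streaming features
    now = datetime.datetime.now(datetime.timezone.utc)    # on-demand features
    features["hour_of_day"] = now.hour
    features["is_weekend"] = now.weekday() >= 5
    return features

print(build_features("user_42"))
```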
Phase 5: Model Selection and Training
Baseline Approach
Start with simple models before adding complexity.
A logistic regression baseline provides a performance floor and shows whether the problem is learnable from the available data.
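A minimal baseline sketch with scikit-learn, using synthetic placeholder data in place of real training features:

```python
# Minimal baseline: scaled logistic regression evaluated with cross-validation.
# X and y are synthetic placeholders for real training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # placeholder features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # placeholder labels

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("baseline AUC-ROC: %.3f ± %.3f" % (scores.mean(), scores.std()))
```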
Model Selection Criteria
| Factor | Consideration |
|---|---|
| Problem type | Classification, regression, ranking |
| Data volume | Deep learning requires large datasets |
| Latency constraints | Model complexity affects inference time |
| Interpretability | Some applications require explainable predictions |
Common Model Choices
| Problem | Baseline | Production Model |
|---|---|---|
| Classification | Logistic Regression | XGBoost, Neural Network |
| Ranking | Pointwise logistic regression | Learning to Rank, Neural Ranker |
| Recommendation | Matrix factorization | Two-tower, Transformer |
| NLP | TF-IDF + logistic | BERT, fine-tuned LLM |
Training Considerations
| Consideration | Approach |
|---|---|
| Data splits | Time-based splits for temporal problems |
| Class imbalance | Sampling strategies, class weights |
| Hyperparameter tuning | Grid search for small spaces, Bayesian optimization for large |
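A minimal sketch combining a time-based split with class weighting; the dataframe and column names are hypothetical:

```python
# Sketch of a time-based split plus class weighting for imbalanced data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=365, freq="D"),
    "f1": range(365),
    "label": [1 if i % 20 == 0 else 0 for i in range(365)],  # ~5% positives
})

# Time-based split: train on the past, validate on the most recent period,
# which mirrors production use and avoids temporal leakage.
df = df.sort_values("ts")
split = int(len(df) * 0.8)
train, valid = df.iloc[:split], df.iloc[split:]

# Class imbalance: weight classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(train[["f1"]], train["label"])
print(model.score(valid[["f1"]], valid["label"]))
```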
Phase 6: Serving and Deployment
Serving Patterns
| Pattern | Description | Tradeoffs |
|---|---|---|
| Online | Real-time predictions computed at request time | Fresh results, higher infrastructure cost |
| Batch | Predictions precomputed ahead of time and served from storage | Simpler infrastructure, stale results |
| Hybrid | Precompute where possible, compute the remainder in real time | Balances freshness against cost and complexity |
Infrastructure Components
| Component | Purpose |
|---|---|
| Model serving | TensorFlow Serving, TorchServe, custom API |
| Feature serving | Feature store for training-serving consistency |
| Caching | Cache predictions for frequent queries |
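A minimal sketch of a custom serving endpoint with prediction caching, assuming FastAPI is available; model loading and feature lookup are stubbed out:

```python
# Sketch of a custom serving API with a simple prediction cache.
from functools import lru_cache
from fastapi import FastAPI

app = FastAPI()

def load_model():
    # Stand-in for loading a serialized model (e.g., from object storage).
    return lambda features: 0.42

model = load_model()

@lru_cache(maxsize=10_000)               # cache predictions for frequent queries
def predict_for_user(user_id: str) -> float:
    features = (1.0, 2.0)                # stand-in for a feature store lookup
    return model(features)

@app.get("/predict/{user_id}")
def predict(user_id: str):
    return {"user_id": user_id, "score": predict_for_user(user_id)}
```

Run under an ASGI server such as uvicorn; the `lru_cache` layer here stands in for a real prediction cache with an expiry policy.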
Deployment Strategy
| Strategy | Description |
|---|---|
| Shadow mode | Run new model in parallel, compare outputs, serve old model |
| Canary | Route small traffic percentage to new model, monitor metrics |
| A/B test | Statistical comparison before full rollout |
Always maintain rollback capability.
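A minimal sketch of deterministic canary routing, where setting the canary fraction to zero acts as the rollback switch; the model names are placeholders:

```python
# Sketch of canary routing: send a small, deterministic slice of traffic to
# the new model and keep the rest on the old one.
import hashlib

CANARY_FRACTION = 0.05   # 5% of users; set to 0.0 to roll back instantly

def bucket(user_id: str) -> float:
    # Hash to a stable value in [0, 1) so each user consistently sees one model.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000

def route(user_id: str) -> str:
    return "new_model" if bucket(user_id) < CANARY_FRACTION else "old_model"

print(route("user_42"))
```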
Phase 7: Monitoring and Iteration
Monitoring Categories
| Category | Metrics |
|---|---|
| Model | Accuracy, prediction distribution changes |
| Data | Missing features, distribution drift |
| System | Latency, error rates, throughput |
Alert Configuration
Set thresholds for automated alerts (e.g., 5% precision drop triggers investigation).
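A minimal sketch of such a threshold check, using hypothetical baseline and threshold values:

```python
# Sketch of a threshold alert matching the example above: flag when daily
# precision drops more than 5% (relative) below the launch baseline.
BASELINE_PRECISION = 0.96
DROP_THRESHOLD = 0.05   # relative drop that triggers investigation

def check_precision(daily_precision: float) -> None:
    relative_drop = (BASELINE_PRECISION - daily_precision) / BASELINE_PRECISION
    if relative_drop > DROP_THRESHOLD:
        # In production this would page on-call or open a ticket, not print.
        print(f"ALERT: precision {daily_precision:.3f} is {relative_drop:.1%} below baseline")

check_precision(0.90)   # triggers the alert
check_precision(0.95)   # within tolerance
```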
Retraining Strategy
| Strategy | Trigger |
|---|---|
| Scheduled | Daily, weekly, based on predictable patterns |
| Triggered | Drift threshold exceeded |
Feedback Loop
- Collect production labels (clicks, conversions, complaints)
- Retrain with new data
- A/B test retrained model before deployment
Example: Spam Detection
Problem: Design a spam detection system for email.
Clarification:
- Volume: 1 billion emails per day, 100K QPS peak
- Latency: Under 100ms (classify before inbox delivery)
- Constraints: False positives are costly (legitimate emails in spam create user frustration)
- Target: 99%+ precision, 90%+ recall
Metrics:
- Offline: Precision, Recall, F1, AUC-ROC
- Online: User-reported spam rate, false positive complaints
Architecture:
- Real-time classification at email arrival
- Feature store with sender reputation scores
- Daily batch retraining with new labeled data
Features:
| Category | Features |
|---|---|
| Text | Subject and body TF-IDF, suspicious phrases |
| Sender | Reputation score, past spam rate, account age |
| Structure | Link count, attachments, image-to-text ratio |
| Behavior | Sending velocity, recipient patterns |
Model:
- Baseline: Logistic regression with TF-IDF (fast, interpretable)
- Production: Ensemble of XGBoost (tabular features) + neural network (text embeddings)
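A minimal sketch of the ensemble scoring step, with hypothetical probabilities and weighting; the high decision threshold reflects the costly-false-positive constraint:

```python
# Sketch of combining a tabular-model score with a text-model score.
def spam_score(tabular_prob: float, text_prob: float, w_tabular: float = 0.5) -> float:
    return w_tabular * tabular_prob + (1 - w_tabular) * text_prob

def classify(tabular_prob: float, text_prob: float, threshold: float = 0.9) -> bool:
    # A high threshold favors precision over recall.
    return spam_score(tabular_prob, text_prob) >= threshold

print(classify(0.95, 0.97))   # True: both models confident it is spam
print(classify(0.80, 0.40))   # False: kept out of the spam folder
```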
Serving:
- Real-time inference with 50ms target latency
- Cached sender reputation lookups
- Async computation of heavy text features where possible
Monitoring:
- Daily precision and recall tracking by sender type
- Alert if false positive rate exceeds 0.1%
- Feature distribution monitoring for drift
Common Pitfalls
| Pitfall | Issue |
|---|---|
| Premature complexity | Proposing deep learning without justification |
| Data assumptions | Assuming clean, labeled data availability |
| Single solution | Presenting one approach without discussing alternatives |
| Latency oversight | Designing models too slow for production requirements |
| Missing monitoring | No plan for post-launch observability |