Model Evaluation
Model evaluation spans offline testing, shadow deployment, A/B testing, and production monitoring. Strong offline performance does not guarantee production success.
Evaluation Pipeline
A typical pipeline moves from offline metrics on held-out data, through online experiments (shadow mode, interleaving, A/B tests), to continuous monitoring in production.
Offline Metrics
Classification Metrics
| Metric | Formula | Use Case |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes |
| Precision | TP / (TP + FP) | High false positive cost |
| Recall | TP / (TP + FN) | High false negative cost |
| F1 Score | 2 * (P * R) / (P + R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Overall discrimination |
| AUC-PR | Area under PR curve | Imbalanced datasets |
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
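A minimal sketch of computing these metrics with scikit-learn; the labels, hard predictions, and scores below are toy data for illustration only.

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

# Confusion matrix for binary labels {0, 1}: rows are actual, columns are predicted.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)          # needs scores, not hard labels

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.2f}")
```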
Ranking Metrics
| Metric | Description | Formula |
|---|---|---|
| Precision@K | Relevant items in top K | Relevant@K / K |
| Recall@K | Relevant items found in top K | Relevant@K / Total Relevant |
| NDCG@K | Position-weighted relevance | DCG@K / IDCG@K |
| MRR | Position of first relevant item | 1 / rank of first relevant |
| MAP | Mean average precision | Mean of AP across queries |
NDCG Calculation:
DCG (Discounted Cumulative Gain) sums relevance scores with position-based discounting: for each position i (counting from 1), add relevance_i / log2(i + 1). IDCG is the DCG of the ideal ranking (items sorted by descending relevance). NDCG = DCG / IDCG, which normalizes the score to the range [0, 1].
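A small NumPy sketch of this calculation; the graded relevance values are illustrative.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@K: relevance discounted by log2(position + 1), positions counted from 1."""
    rel = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(1, rel.size + 1)
    return float(np.sum(rel / np.log2(positions + 1)))

def ndcg_at_k(relevances, k):
    """NDCG@K = DCG@K / IDCG@K, where IDCG uses the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the items in the order the model ranked them (toy values).
ranked_relevance = [3, 2, 0, 1, 2]
print(ndcg_at_k(ranked_relevance, k=5))   # 1.0 only if the ranking is already ideal
```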
Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | Mean(|y - y_hat|) | Average error magnitude |
| MSE | Mean((y - y_hat)^2) | Penalizes large errors |
| RMSE | sqrt(MSE) | Same units as target |
| MAPE | Mean(|y - y_hat| / |y|) * 100% | Percentage error (undefined when y = 0) |
| R^2 | 1 - SS_res / SS_tot | Variance explained |
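A NumPy sketch computing these metrics for two equal-length arrays; the sample values are illustrative, and MAPE assumes no target equals zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the tabled regression metrics for two equal-length arrays."""
    y, y_hat = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y - y_hat))
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(y - y_hat) / np.abs(y)) * 100   # undefined if any y == 0
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

print(regression_metrics([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))
```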
Recommendation Metrics
| Metric | Description |
|---|---|
| Coverage | Percentage of items that can be recommended |
| Diversity | Difference between recommended items |
| Novelty | Unexpectedness of recommendations |
| Serendipity | Surprising yet relevant recommendations |
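One common way to operationalize coverage and diversity, sketched below with illustrative item IDs and a hypothetical category-based distance function; real systems typically use embedding or content distances.

```python
import itertools

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended = set(itertools.chain.from_iterable(recommendation_lists))
    return len(recommended) / catalog_size

def intra_list_diversity(items, distance):
    """Average pairwise distance between items in a single recommendation list."""
    pairs = list(itertools.combinations(items, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Toy usage with a hypothetical category-based distance.
categories = {"i1": "books", "i2": "books", "i3": "music", "i4": "film"}
dist = lambda a, b: 0.0 if categories[a] == categories[b] else 1.0
print(catalog_coverage([["i1", "i2"], ["i3"]], catalog_size=4))   # 0.75
print(intra_list_diversity(["i1", "i2", "i3"], dist))             # 0.666...
```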
Offline Evaluation Methods
Data Splits
| Split Type | Description | Use Case |
|---|---|---|
| Time-based | Train on data before cutoff, test on data after | Temporal prediction problems, prevents future leakage |
| Stratified | Maintain class distribution in train and test | Imbalanced classification problems |
| User-based | Ensure no user appears in both train and test | Personalization models to prevent memorization |
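A pandas sketch of the time-based and user-based splits; column names are passed in by the caller, so nothing about the schema is assumed beyond a timestamp-like and a user-ID-like column.

```python
import pandas as pd

def time_based_split(df, time_col, cutoff):
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

def user_based_split(df, user_col, test_frac=0.2, seed=42):
    """Hold out a random subset of users entirely, so no user spans both sets."""
    held_out_users = df[user_col].drop_duplicates().sample(frac=test_frac,
                                                           random_state=seed)
    test = df[df[user_col].isin(held_out_users)]
    train = df[~df[user_col].isin(held_out_users)]
    return train, test
```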
Slice Analysis
Evaluate on subgroups, not only aggregate metrics:
| Slice | Definition | Why It Matters |
|---|---|---|
| New users | Account age under 7 days | Cold-start performance |
| Power users | Over 100 monthly actions | High-value user segment |
| Mobile | Platform is mobile | Device-specific behavior |
| By region | Group by geographic region | Regional performance differences |
Compute metrics for each slice and compare to overall performance to identify underperforming segments.
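A pandas/scikit-learn sketch of such a per-slice report; the DataFrame columns are supplied by the caller, and accuracy stands in for whatever metric fits the task.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_report(df, label_col, pred_col, slice_col):
    """Accuracy per slice vs. overall, to surface underperforming segments."""
    overall = accuracy_score(df[label_col], df[pred_col])
    rows = []
    for slice_value, group in df.groupby(slice_col):
        acc = accuracy_score(group[label_col], group[pred_col])
        rows.append({"slice": slice_value, "n": len(group),
                     "accuracy": acc, "delta_vs_overall": acc - overall})
    return pd.DataFrame(rows).sort_values("delta_vs_overall")
```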
Statistical Significance
To determine if one model is significantly better than another:
- Run k-fold cross-validation for both models (e.g., 10 folds)
- Collect the score from each fold for both models
- Apply a paired t-test to compare the two sets of scores
- A p-value below 0.05 indicates a statistically significant difference between models
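A sketch of this procedure on synthetic data with scikit-learn and SciPy; the two model choices are illustrative, and both models are scored on identical folds so the t-test can be paired.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

t_stat, p_value = ttest_rel(scores_a, scores_b)   # paired across the 10 folds
print(f"t={t_stat:.3f}, p={p_value:.4f}")
# Note: overlapping training sets across folds make this test somewhat optimistic.
```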
Online Evaluation
A/B Testing
Process:
- Define hypothesis and primary metric
- Calculate required sample size
- Randomly assign users to control and treatment
- Run the experiment for the pre-computed sample size or duration (do not stop early at the first significant result)
- Analyze results
Sample size calculation:
Required sample size depends on:
- Effect size: The minimum improvement to detect (e.g., 5% relative improvement)
- Power: Probability of detecting a true effect (typically 80%)
- Significance level: Acceptable false positive rate (typically 5%)
- Group ratio: Relative size of control vs treatment (typically 1:1)
Larger effect sizes require smaller samples; smaller effects require larger samples to detect reliably.
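A back-of-the-envelope sketch using the standard two-proportion approximation with a 1:1 group ratio; SciPy supplies only the normal quantiles, and the baseline rate and lift below are illustrative.

```python
from scipy.stats import norm

def samples_per_group(baseline_rate, relative_lift, power=0.80, alpha=0.05):
    """Approximate samples per group for a two-sided test of two proportions (1:1 split)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)   # minimum effect worth detecting
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided significance level
    z_beta = norm.ppf(power)                   # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

# Detecting a 5% relative lift on a 10% baseline conversion rate:
print(samples_per_group(0.10, 0.05))   # roughly 58k users per group
```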
| Consideration | Description |
|---|---|
| Statistical power | Probability of detecting true effect |
| Minimum detectable effect | Smallest meaningful improvement |
| Multiple testing correction | Adjust for multiple comparisons |
| Novelty effects | Short-term behavior may differ from long-term |
Shadow Mode
Run new model in parallel without affecting users:
- Generate prediction using the production model
- Also generate prediction using the shadow (candidate) model
- Log the shadow prediction for offline analysis
- Return only the production prediction to the user
This allows comparing the new model's behavior on live traffic without risk to users.
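A serving-path sketch, assuming hypothetical model objects exposing a `predict` method and using Python's standard logging for the shadow log.

```python
import logging
import time

logger = logging.getLogger("shadow_eval")

def predict_with_shadow(features, production_model, shadow_model):
    """Serve the production prediction; log the shadow prediction for offline analysis."""
    prod_pred = production_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        logger.info("shadow_prediction ts=%s prod=%s shadow=%s",
                    time.time(), prod_pred, shadow_pred)
    except Exception:
        # The shadow path must never break user-facing serving.
        logger.exception("shadow model failed")
    return prod_pred   # only the production output reaches the user
```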
Interleaving
For ranking systems, interleave results from two models:
- Model A results: [A1, A2, A3, A4, A5]
- Model B results: [B1, B2, B3, B4, B5]
- Interleaved: [A1, B1, A2, B2, A3, ...]
- Winner: the model whose contributed items receive more clicks
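A sketch of a simple alternating interleave with click attribution to the contributing model; team-draft interleaving is the more careful production variant, and the item IDs and clicks below are illustrative.

```python
def interleave(results_a, results_b):
    """Alternate items from the two rankings; remember which model contributed each."""
    merged, source, seen = [], {}, set()
    for item_a, item_b in zip(results_a, results_b):
        for item, model in ((item_a, "A"), (item_b, "B")):
            if item not in seen:
                merged.append(item)
                source[item] = model
                seen.add(item)
    return merged, source

def score_clicks(clicked_items, source):
    """The winner is the model whose contributed items received more clicks."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in source:
            wins[source[item]] += 1
    return wins

merged, source = interleave(["A1", "A2", "A3"], ["B1", "A2", "B3"])
print(merged)                                # ['A1', 'B1', 'A2', 'A3', 'B3']
print(score_clicks(["B1", "A2"], source))    # {'A': 1, 'B': 1}
```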
Production Monitoring
Performance Tracking
Track and record key metrics:
| Metric Category | Metrics |
|---|---|
| Model quality | Accuracy, precision, recall |
| Latency | p50 (median) and p99 (tail) latency |
Record metrics continuously to monitoring systems for dashboards and alerting.
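A sketch of per-request recording, with a hypothetical `MetricsClient` standing in for a real monitoring client (StatsD, Prometheus, etc.) and a hypothetical model object.

```python
import time

class MetricsClient:
    """Hypothetical stand-in for a monitoring client (StatsD, Prometheus, etc.)."""
    def gauge(self, name, value):
        print(f"{name}={value}")

metrics = MetricsClient()

def serve_and_record(model, features, label=None):
    """Record latency for every request and correctness whenever a label is available."""
    start = time.perf_counter()
    prediction = model.predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    metrics.gauge("model.latency_ms", latency_ms)   # aggregate to p50/p99 downstream
    if label is not None:
        metrics.gauge("model.correct", int(prediction == label))
    return prediction
```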
Data Drift Detection
For each feature, compare training and production distributions using the Kolmogorov-Smirnov test. If the p-value is below the threshold (e.g., 0.01), the feature has drifted significantly. Track which features are drifting to guide investigation and potential retraining.
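A SciPy sketch of this per-feature check, assuming training and production samples are available as DataFrame columns.

```python
from scipy.stats import ks_2samp

def drifted_features(train_df, prod_df, features, p_threshold=0.01):
    """Flag features whose production distribution differs from training (KS test)."""
    drifted = []
    for feature in features:
        statistic, p_value = ks_2samp(train_df[feature], prod_df[feature])
        if p_value < p_threshold:
            drifted.append((feature, statistic, p_value))
    return drifted
```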
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Accuracy | Baseline - 2% | Baseline - 5% |
| Latency p99 | > 200ms | > 500ms |
| Prediction rate | < 95% | < 90% |
| Feature drift | > 3 features | > 5 features |
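A small sketch of applying the accuracy thresholds above, reading accuracies as fractions and the 2% / 5% as percentage-point drops from baseline.

```python
def accuracy_alert(current, baseline):
    """Map an accuracy reading to an alert level using the thresholds above."""
    if current <= baseline - 0.05:
        return "CRITICAL"
    if current <= baseline - 0.02:
        return "WARNING"
    return "OK"

print(accuracy_alert(current=0.91, baseline=0.94))   # WARNING
```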
Common Pitfalls
| Pitfall | Description |
|---|---|
| Leakage | Using future information during training |
| Selection bias | Test set not representative of production traffic |
| Metric gaming | Optimizing measured metric instead of desired outcome |
| Offline-online gap | Strong offline, poor online performance |
| p-hacking | Running experiments until desired result appears |
Reference
| Topic | Guidance |
|---|---|
| Precision vs recall | False positives costly: optimize precision. False negatives costly: optimize recall. |
| Offline-online gap | Investigate root cause (data distribution, feature freshness, selection bias) before adjusting |
| A/B methodology | Define metrics and sample size upfront. Randomize properly. Wait for significance. Correct for multiple testing. |
| Model degradation | Monitor continuously. Set alert thresholds. Maintain rollback capability. Investigate root cause. |
| Ranking evaluation | Offline: NDCG, MAP. Online: CTR, engagement time, user satisfaction. |