
Model Evaluation

Model evaluation spans offline testing, shadow deployment, A/B testing, and production monitoring. Strong offline performance does not guarantee production success.

Evaluation Pipeline

[Diagram: evaluation pipeline, from offline evaluation through shadow deployment and A/B testing to production monitoring]

Offline Metrics

Classification Metrics

| Metric | Formula | Use Case |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes |
| Precision | TP / (TP + FP) | High false positive cost |
| Recall | TP / (TP + FN) | High false negative cost |
| F1 Score | 2 * (P * R) / (P + R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Overall discrimination |
| AUC-PR | Area under PR curve | Imbalanced datasets |

Confusion Matrix:

|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
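
As a minimal sketch, these metrics can be computed with scikit-learn; `y_true`, `y_prob`, and the 0.5 decision threshold below are illustrative assumptions, not values from this guide:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix,
)

# Illustrative labels and predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.35, 0.7])
y_pred = (y_prob >= 0.5).astype(int)  # threshold scores at 0.5 to get hard labels

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc_roc  :", roc_auc_score(y_true, y_prob))            # needs scores, not labels
print("auc_pr   :", average_precision_score(y_true, y_prob))  # PR-curve summary
# scikit-learn orders the matrix by label value: [[TN, FP], [FN, TP]]
print("confusion:\n", confusion_matrix(y_true, y_pred))
```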

Ranking Metrics

| Metric | Description | Formula |
|---|---|---|
| Precision@K | Relevant items in the top K | Relevant@K / K |
| Recall@K | Relevant items found in the top K | Relevant@K / Total Relevant |
| NDCG@K | Position-weighted relevance | DCG@K / IDCG@K |
| MRR | Reciprocal rank of the first relevant item, averaged over queries | Mean of 1 / rank of first relevant item |
| MAP | Mean average precision | Mean of AP across queries |

NDCG Calculation:

DCG (Discounted Cumulative Gain) sums relevance scores with position-based discounting: for each position i, add relevance_i / log2(i + 2). IDCG is the DCG of the ideal ranking (items sorted by descending relevance). NDCG = DCG / IDCG, normalized to range [0, 1].
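
A short sketch of this calculation in pure Python, using 0-indexed positions as in the formula above; the example relevance values are illustrative:

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k: sum of relevance_i / log2(i + 2) over positions i = 0..k-1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k = DCG@k / IDCG@k, where IDCG uses the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the items in the order the model ranked them.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))
```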

Regression Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| MAE | Mean(\|y - y_hat\|) | Average error magnitude |
| MSE | Mean((y - y_hat)^2) | Penalizes large errors |
| RMSE | sqrt(MSE) | Same units as the target |
| MAPE | Mean(\|y - y_hat\| / \|y\|) | Percentage error (undefined when y = 0) |
| R^2 | 1 - SS_res / SS_tot | Variance explained |
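
The same formulas written out with NumPy; `y_true` and `y_pred` are illustrative values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # illustrative targets
y_pred = np.array([2.8, 5.4, 2.0, 8.0])  # illustrative predictions

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # breaks down when y_true contains 0
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} MAPE={mape:.3%} R^2={r2:.3f}")
```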

Recommendation Metrics

| Metric | Description |
|---|---|
| Coverage | Percentage of items that can be recommended |
| Diversity | Difference between recommended items |
| Novelty | Unexpectedness of recommendations |
| Serendipity | Surprising yet relevant recommendations |
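
A rough sketch of the first two metrics, assuming recommendation lists of item IDs and a hypothetical `item_embeddings` mapping from item ID to vector; diversity is measured here as the mean pairwise cosine distance within a list:

```python
import numpy as np

def catalog_coverage(recommended_lists, catalog_size):
    """Fraction of the catalog that appears in at least one user's recommendations."""
    recommended = {item for rec in recommended_lists for item in rec}
    return len(recommended) / catalog_size

def intra_list_diversity(rec_items, item_embeddings):
    """Mean pairwise cosine distance between items in a single recommendation list."""
    vecs = np.array([item_embeddings[i] for i in rec_items], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T                                   # pairwise cosine similarities
    pair_sims = sims[np.triu_indices(len(rec_items), k=1)]
    return float(np.mean(1.0 - pair_sims))
```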

Offline Evaluation Methods

Data Splits

| Split Type | Description | Use Case |
|---|---|---|
| Time-based | Train on data before a cutoff, test on data after | Temporal prediction problems; prevents future leakage |
| Stratified | Maintain class distribution in train and test | Imbalanced classification problems |
| User-based | Ensure no user appears in both train and test | Personalization models, to prevent memorization |
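
Sketches of the time-based and user-based splits, assuming a pandas DataFrame `df` with `timestamp` and `user_id` columns (the column names are assumptions); the user-based split uses scikit-learn's `GroupShuffleSplit`:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def time_based_split(df: pd.DataFrame, cutoff):
    """Train on rows before the cutoff, test on rows at or after it (no future leakage)."""
    return df[df["timestamp"] < cutoff], df[df["timestamp"] >= cutoff]

def user_based_split(df: pd.DataFrame, test_size=0.2, seed=42):
    """Keep every user entirely in train or entirely in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["user_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```

A stratified split is available directly through scikit-learn's `train_test_split(..., stratify=y)`.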

Slice Analysis

Evaluate on subgroups, not only aggregate metrics:

| Slice | Definition | Why It Matters |
|---|---|---|
| New users | Account age under 7 days | Cold-start performance |
| Power users | Over 100 monthly actions | High-value user segment |
| Mobile | Platform is mobile | Device-specific behavior |
| By region | Group by geographic region | Regional performance differences |

Compute metrics for each slice and compare to overall performance to identify underperforming segments.
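
One way to sketch this with pandas, assuming a DataFrame `df` of per-prediction rows with `label`, `score`, and a slice column (the names are illustrative):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def metrics_by_slice(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """Compute AUC per slice and its gap to the overall AUC."""
    overall_auc = roc_auc_score(df["label"], df["score"])
    rows = []
    for slice_value, group in df.groupby(slice_col):
        # Slices containing a single label class cannot be scored with AUC; skip them.
        if group["label"].nunique() < 2:
            continue
        rows.append({
            slice_col: slice_value,
            "n": len(group),
            "auc": roc_auc_score(group["label"], group["score"]),
        })
    result = pd.DataFrame(rows)
    result["delta_vs_overall"] = result["auc"] - overall_auc
    return result.sort_values("delta_vs_overall")
```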

Statistical Significance

To determine if one model is significantly better than another:

  1. Run k-fold cross-validation for both models (e.g., 10 folds)
  2. Collect the score from each fold for both models
  3. Apply a paired t-test to compare the two sets of scores
  4. A p-value below 0.05 (the conventional threshold) indicates a statistically significant difference between the models; see the sketch below
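
A minimal sketch of the procedure using scikit-learn and SciPy; the dataset, the two candidate models, and the AUC scoring choice are all illustrative assumptions:

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)  # stand-in dataset
cv = KFold(n_splits=10, shuffle=True, random_state=0)        # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="roc_auc")

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)  # paired t-test on per-fold scores
print(f"mean A={scores_a.mean():.4f}  mean B={scores_b.mean():.4f}  p={p_value:.4f}")
```

Per-fold scores are correlated, so treat the resulting p-value as approximate rather than exact.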

Online Evaluation

A/B Testing

Process:

  1. Define hypothesis and primary metric
  2. Calculate required sample size
  3. Randomly assign users to control and treatment
  4. Run the experiment until the planned sample size is reached (avoid stopping early when significance first appears)
  5. Analyze results

Sample size calculation:

Required sample size depends on:

  • Effect size: The minimum improvement to detect (e.g., 5% relative improvement)
  • Power: Probability of detecting a true effect (typically 80%)
  • Significance level: Acceptable false positive rate (typically 5%)
  • Group ratio: Relative size of control vs treatment (typically 1:1)

Larger effect sizes require smaller samples; smaller effects require larger samples to detect reliably.
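
A sketch of the standard two-proportion approximation, assuming a conversion-rate metric; `p_baseline` and the 5% relative lift in the example are illustrative:

```python
import math
from scipy.stats import norm

def sample_size_per_group(p_baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)   # minimum detectable effect in absolute terms
    z_alpha = norm.ppf(1 - alpha / 2)       # significance level (two-sided)
    z_beta = norm.ppf(power)                # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detect a 5% relative lift on a 10% baseline conversion rate at 80% power.
print(sample_size_per_group(0.10, 0.05))
```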

| Consideration | Description |
|---|---|
| Statistical power | Probability of detecting a true effect |
| Minimum detectable effect | Smallest meaningful improvement |
| Multiple testing correction | Adjust for multiple comparisons |
| Novelty effects | Short-term behavior may differ from long-term |

Shadow Mode

Run new model in parallel without affecting users:

  1. Generate prediction using the production model
  2. Also generate prediction using the shadow (candidate) model
  3. Log the shadow prediction for offline analysis
  4. Return only the production prediction to the user

This allows comparing the new model's behavior on live traffic without risk to users.
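
A minimal serving-path sketch; the `request` object, model interfaces, and log format are hypothetical, and in practice the shadow call is often made asynchronously so it cannot add latency:

```python
import logging

logger = logging.getLogger("shadow_eval")

def predict_with_shadow(request, production_model, shadow_model):
    """Return the production prediction; log the shadow prediction for offline comparison."""
    prod_pred = production_model.predict(request.features)

    try:
        shadow_pred = shadow_model.predict(request.features)
        logger.info("shadow request_id=%s prod=%s shadow=%s",
                    request.id, prod_pred, shadow_pred)
    except Exception:
        # A shadow-model failure must never affect the user-facing response.
        logger.exception("shadow model failed for request_id=%s", request.id)

    return prod_pred  # only the production prediction reaches the user
```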

Interleaving

For ranking systems, interleave results from two models:

Model A results: [A1, A2, A3, A4, A5]
Model B results: [B1, B2, B3, B4, B5]
Interleaved: [A1, B1, A2, B2, A3, ...]

Winner = model with more clicked items
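
A sketch of this simple alternating scheme with click attribution; production systems often use team-draft interleaving with a randomized first pick, but this version mirrors the illustration above:

```python
def interleave(results_a, results_b, k=10):
    """Alternate items from A and B, skip duplicates, and record each item's source model."""
    interleaved, attribution, seen = [], {}, set()
    for item_a, item_b in zip(results_a, results_b):
        for item, model in ((item_a, "A"), (item_b, "B")):
            if item not in seen and len(interleaved) < k:
                interleaved.append(item)
                attribution[item] = model
                seen.add(item)
    return interleaved, attribution

def credit_clicks(clicked_items, attribution):
    """Winner = model whose attributed items received more clicks."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in attribution:
            wins[attribution[item]] += 1
    return wins
```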

Production Monitoring

Performance Tracking

Track and record key metrics:

| Metric Category | Metrics |
|---|---|
| Model quality | Accuracy, precision, recall |
| Latency | p50 (median) and p99 (tail) latency |

Record metrics continuously to monitoring systems for dashboards and alerting.
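
A sketch of continuous metric recording using the `prometheus_client` library, assuming a Prometheus-style monitoring stack; the metric names and port are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency")
PREDICTIONS_TOTAL = Counter("model_predictions_total", "Predictions served", ["outcome"])

def serve_prediction(model, features):
    with PREDICTION_LATENCY.time():          # records latency into histogram buckets
        prediction = model.predict(features)
    PREDICTIONS_TOTAL.labels(outcome="ok").inc()
    return prediction

start_http_server(8000)  # expose /metrics for scraping; p50/p99 are computed downstream
```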

Data Drift Detection

For each feature, compare training and production distributions using the Kolmogorov-Smirnov test. If the p-value is below the threshold (e.g., 0.01), the feature has drifted significantly. Track which features are drifting to guide investigation and potential retraining.
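
A sketch using SciPy's two-sample Kolmogorov-Smirnov test; `train_df`, `prod_df`, and the feature columns are assumed pandas inputs:

```python
from scipy.stats import ks_2samp

def detect_drift(train_df, prod_df, feature_cols, p_threshold=0.01):
    """Return the numeric features whose production distribution differs from training."""
    drifted = []
    for col in feature_cols:
        statistic, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        if p_value < p_threshold:
            drifted.append({"feature": col, "ks_statistic": statistic, "p_value": p_value})
    return drifted
```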

Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| Accuracy | Baseline - 2% | Baseline - 5% |
| Latency p99 | > 200 ms | > 500 ms |
| Prediction rate | < 95% | < 90% |
| Feature drift | > 3 features | > 5 features |
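
A small sketch of evaluating these thresholds, assuming a `metrics` dict populated by the monitoring pipeline (the keys and units are illustrative):

```python
def alert_level(metrics, baseline_accuracy):
    """Map current metric values to ok / warning / critical using the thresholds above."""
    critical = (
        metrics["accuracy"] < baseline_accuracy - 0.05
        or metrics["latency_p99_ms"] > 500
        or metrics["prediction_rate"] < 0.90
        or metrics["drifted_features"] > 5
    )
    if critical:
        return "critical"
    warning = (
        metrics["accuracy"] < baseline_accuracy - 0.02
        or metrics["latency_p99_ms"] > 200
        or metrics["prediction_rate"] < 0.95
        or metrics["drifted_features"] > 3
    )
    return "warning" if warning else "ok"
```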

Common Pitfalls

| Pitfall | Description |
|---|---|
| Leakage | Using future information during training |
| Selection bias | Test set not representative of production traffic |
| Metric gaming | Optimizing the measured metric instead of the desired outcome |
| Offline-online gap | Strong offline performance, poor online performance |
| p-hacking | Running experiments until the desired result appears |

Reference

| Topic | Guidance |
|---|---|
| Precision vs recall | If false positives are costly, optimize precision; if false negatives are costly, optimize recall. |
| Offline-online gap | Investigate the root cause (data distribution, feature freshness, selection bias) before adjusting the model. |
| A/B methodology | Define metrics and sample size upfront. Randomize properly. Wait for significance. Correct for multiple testing. |
| Model degradation | Monitor continuously. Set alert thresholds. Maintain rollback capability. Investigate root causes. |
| Ranking evaluation | Offline: NDCG, MAP. Online: CTR, engagement time, user satisfaction. |