Model Evaluation
Model evaluation spans offline testing, shadow deployment, A/B testing, and production monitoring. Strong offline performance does not guarantee production success.
Evaluation Pipeline
A typical pipeline moves from offline metrics on held-out data, through online experiments (shadow mode, interleaving, A/B tests), to continuous monitoring in production.
Offline Metrics
Classification Metrics
| Metric | Formula | Use Case |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes |
| Precision | TP / (TP + FP) | High false positive cost |
| Recall | TP / (TP + FN) | High false negative cost |
| F1 Score | 2 * (P * R) / (P + R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Overall discrimination |
| AUC-PR | Area under PR curve | Imbalanced datasets |
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
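A minimal sketch of computing these metrics with scikit-learn; the labels, hard predictions, and scores below are toy data for illustration only.

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

# Confusion matrix for binary labels {0, 1}: rows are actual, columns are predicted.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)          # needs scores, not hard labels

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.2f}")
```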
Ranking Metrics
| Metric | Description | Formula |
|---|---|---|
| Precision@K | Relevant items in top K | Relevant@K / K |
| Recall@K | Relevant items found in top K | Relevant@K / Total Relevant |
| NDCG@K | Position-weighted relevance | DCG@K / IDCG@K |
| MRR | Position of first relevant item | 1 / rank of first relevant |
| MAP | Mean average precision | Mean of AP across queries |
NDCG Calculation:
DCG (Discounted Cumulative Gain) sums relevance scores with position-based discounting: for each position i (counting from 1), add relevance_i / log2(i + 1). IDCG is the DCG of the ideal ranking (items sorted by descending relevance). NDCG = DCG / IDCG, which normalizes the score to the range [0, 1].
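A small NumPy sketch of this calculation; the graded relevance values are illustrative.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@K: relevance discounted by log2(position + 1), positions counted from 1."""
    rel = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(1, rel.size + 1)
    return float(np.sum(rel / np.log2(positions + 1)))

def ndcg_at_k(relevances, k):
    """NDCG@K = DCG@K / IDCG@K, where IDCG uses the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the items in the order the model ranked them (toy values).
ranked_relevance = [3, 2, 0, 1, 2]
print(ndcg_at_k(ranked_relevance, k=5))   # 1.0 only if the ranking is already ideal
```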
Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | Mean(|y - y_hat|) | Average error magnitude |
| MSE | Mean((y - y_hat)^2) | Penalizes large errors |
| RMSE | sqrt(MSE) | Same units as target |
| MAPE | Mean(|y - y_hat| / |y|) * 100% | Percentage error (undefined when y = 0) |
| R^2 | 1 - SS_res / SS_tot | Variance explained |
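A NumPy sketch computing these metrics for two equal-length arrays; the sample values are illustrative, and MAPE assumes no target equals zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the tabled regression metrics for two equal-length arrays."""
    y, y_hat = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y - y_hat))
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(y - y_hat) / np.abs(y)) * 100   # undefined if any y == 0
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

print(regression_metrics([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))
```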
Recommendation Metrics
| Metric | Description |
|---|---|
| Coverage | Percentage of items that can be recommended |
| Diversity | Difference between recommended items |
| Novelty | Unexpectedness of recommendations |
| Serendipity | Surprising yet relevant recommendations |
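One common way to operationalize coverage and diversity, sketched below with illustrative item IDs and a hypothetical category-based distance function; real systems typically use embedding or content distances.

```python
import itertools

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended = set(itertools.chain.from_iterable(recommendation_lists))
    return len(recommended) / catalog_size

def intra_list_diversity(items, distance):
    """Average pairwise distance between items in a single recommendation list."""
    pairs = list(itertools.combinations(items, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Toy usage with a hypothetical category-based distance.
categories = {"i1": "books", "i2": "books", "i3": "music", "i4": "film"}
dist = lambda a, b: 0.0 if categories[a] == categories[b] else 1.0
print(catalog_coverage([["i1", "i2"], ["i3"]], catalog_size=4))   # 0.75
print(intra_list_diversity(["i1", "i2", "i3"], dist))             # 0.666...
```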
Offline Evaluation Methods
Data Splits
| Split Type | Description | Use Case |
|---|---|---|
| Time-based | Train on data before cutoff, test on data after | Temporal prediction problems, prevents future leakage |
| Stratified | Maintain class distribution in train and test | Imbalanced classification problems |
| User-based | Ensure no user appears in both train and test | Personalization models to prevent memorization |
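A pandas sketch of the time-based and user-based splits; column names are passed in by the caller, so nothing about the schema is assumed beyond a timestamp-like and a user-ID-like column.

```python
import pandas as pd

def time_based_split(df, time_col, cutoff):
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

def user_based_split(df, user_col, test_frac=0.2, seed=42):
    """Hold out a random subset of users entirely, so no user spans both sets."""
    held_out_users = df[user_col].drop_duplicates().sample(frac=test_frac,
                                                           random_state=seed)
    test = df[df[user_col].isin(held_out_users)]
    train = df[~df[user_col].isin(held_out_users)]
    return train, test
```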
Slice Analysis
Evaluate on subgroups, not only aggregate metrics:
| Slice | Definition | Why It Matters |
|---|---|---|
| New users | Account age under 7 days | Cold-start performance |
| Power users | Over 100 monthly actions | High-value user segment |
| Mobile | Platform is mobile | Device-specific behavior |
| By region | Group by geographic region | Regional performance differences |
Compute metrics for each slice and compare to overall performance to identify underperforming segments.
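A pandas/scikit-learn sketch of such a per-slice report; the DataFrame columns are supplied by the caller, and accuracy stands in for whatever metric fits the task.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_report(df, label_col, pred_col, slice_col):
    """Accuracy per slice vs. overall, to surface underperforming segments."""
    overall = accuracy_score(df[label_col], df[pred_col])
    rows = []
    for slice_value, group in df.groupby(slice_col):
        acc = accuracy_score(group[label_col], group[pred_col])
        rows.append({"slice": slice_value, "n": len(group),
                     "accuracy": acc, "delta_vs_overall": acc - overall})
    return pd.DataFrame(rows).sort_values("delta_vs_overall")
```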
Statistical Significance
To determine if one model is significantly better than another:
- Run k-fold cross-validation for both models (e.g., 10 folds)
- Collect the score from each fold for both models
- Apply a paired t-test to compare the two sets of scores
- A p-value below 0.05 indicates a statistically significant difference between models
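A sketch of this procedure on synthetic data with scikit-learn and SciPy; the two model choices are illustrative, and both models are scored on identical folds so the t-test can be paired.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

t_stat, p_value = ttest_rel(scores_a, scores_b)   # paired across the 10 folds
print(f"t={t_stat:.3f}, p={p_value:.4f}")
# Note: overlapping training sets across folds make this test somewhat optimistic.
```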
Online Evaluation
A/B Testing
Process:
- Define hypothesis and primary metric
- Calculate required sample size
- Randomly assign users to control and treatment
- Run the experiment for the pre-computed sample size or duration (do not stop early at the first significant result)
- Analyze results
Sample size calculation:
Required sample size depends on:
- Effect size: The minimum improvement to detect (e.g., 5% relative improvement)
- Power: Probability of detecting a true effect (typically 80%)
- Significance level: Acceptable false positive rate (typically 5%)
- Group ratio: Relative size of control vs treatment (typically 1:1)
Larger effect sizes require smaller samples; smaller effects require larger samples to detect reliably.
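A back-of-the-envelope sketch using the standard two-proportion approximation with a 1:1 group ratio; SciPy supplies only the normal quantiles, and the baseline rate and lift below are illustrative.

```python
from scipy.stats import norm

def samples_per_group(baseline_rate, relative_lift, power=0.80, alpha=0.05):
    """Approximate samples per group for a two-sided test of two proportions (1:1 split)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)   # minimum effect worth detecting
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided significance level
    z_beta = norm.ppf(power)                   # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

# Detecting a 5% relative lift on a 10% baseline conversion rate:
print(samples_per_group(0.10, 0.05))   # roughly 58k users per group
```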
| Consideration | Description |
|---|---|
| Statistical power | Probability of detecting true effect |
| Minimum detectable effect | Smallest meaningful improvement |
| Multiple testing correction | Adjust for multiple comparisons |
| Novelty effects | Short-term behavior may differ from long-term |
Shadow Mode
Run new model in parallel without affecting users:
- Generate prediction using the production model
- Also generate prediction using the shadow (candidate) model
- Log the shadow prediction for offline analysis
- Return only the production prediction to the user
This allows comparing the new model's behavior on live traffic without risk to users.
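A serving-path sketch, assuming hypothetical model objects exposing a `predict` method and using Python's standard logging for the shadow log.

```python
import logging
import time

logger = logging.getLogger("shadow_eval")

def predict_with_shadow(features, production_model, shadow_model):
    """Serve the production prediction; log the shadow prediction for offline analysis."""
    prod_pred = production_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        logger.info("shadow_prediction ts=%s prod=%s shadow=%s",
                    time.time(), prod_pred, shadow_pred)
    except Exception:
        # The shadow path must never break user-facing serving.
        logger.exception("shadow model failed")
    return prod_pred   # only the production output reaches the user
```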
Interleaving
For ranking systems, interleave results from two models:
- Model A results: [A1, A2, A3, A4, A5]
- Model B results: [B1, B2, B3, B4, B5]
- Interleaved: [A1, B1, A2, B2, A3, ...]
- Winner: the model whose contributed items receive more clicks
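A sketch of a simple alternating interleave with click attribution to the contributing model; team-draft interleaving is the more careful production variant, and the item IDs and clicks below are illustrative.

```python
def interleave(results_a, results_b):
    """Alternate items from the two rankings; remember which model contributed each."""
    merged, source, seen = [], {}, set()
    for item_a, item_b in zip(results_a, results_b):
        for item, model in ((item_a, "A"), (item_b, "B")):
            if item not in seen:
                merged.append(item)
                source[item] = model
                seen.add(item)
    return merged, source

def score_clicks(clicked_items, source):
    """The winner is the model whose contributed items received more clicks."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in source:
            wins[source[item]] += 1
    return wins

merged, source = interleave(["A1", "A2", "A3"], ["B1", "A2", "B3"])
print(merged)                                # ['A1', 'B1', 'A2', 'A3', 'B3']
print(score_clicks(["B1", "A2"], source))    # {'A': 1, 'B': 1}
```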
Production Monitoring
Performance Tracking
Track and record key metrics:
| Metric Category | Metrics |
|---|---|
| Model quality | Accuracy, precision, recall |
| Latency | p50 (median) and p99 (tail) latency |
Record metrics continuously to monitoring systems for dashboards and alerting.
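A sketch of per-request recording, with a hypothetical `MetricsClient` standing in for a real monitoring client (StatsD, Prometheus, etc.) and a hypothetical model object.

```python
import time

class MetricsClient:
    """Hypothetical stand-in for a monitoring client (StatsD, Prometheus, etc.)."""
    def gauge(self, name, value):
        print(f"{name}={value}")

metrics = MetricsClient()

def serve_and_record(model, features, label=None):
    """Record latency for every request and correctness whenever a label is available."""
    start = time.perf_counter()
    prediction = model.predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    metrics.gauge("model.latency_ms", latency_ms)   # aggregate to p50/p99 downstream
    if label is not None:
        metrics.gauge("model.correct", int(prediction == label))
    return prediction
```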
Data Drift Detection
For each feature, compare training and production distributions using the Kolmogorov-Smirnov test. If the p-value is below the threshold (e.g., 0.01), the feature has drifted significantly. Track which features are drifting to guide investigation and potential retraining.
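A SciPy sketch of this per-feature check, assuming training and production samples are available as DataFrame columns.

```python
from scipy.stats import ks_2samp

def drifted_features(train_df, prod_df, features, p_threshold=0.01):
    """Flag features whose production distribution differs from training (KS test)."""
    drifted = []
    for feature in features:
        statistic, p_value = ks_2samp(train_df[feature], prod_df[feature])
        if p_value < p_threshold:
            drifted.append((feature, statistic, p_value))
    return drifted
```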
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Accuracy | Baseline - 2% | Baseline - 5% |
| Latency p99 | > 200ms | > 500ms |
| Prediction rate | < 95% | < 90% |
| Feature drift | > 3 features | > 5 features |
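A small sketch of applying the accuracy thresholds above, reading accuracies as fractions and the 2% / 5% as percentage-point drops from baseline.

```python
def accuracy_alert(current, baseline):
    """Map an accuracy reading to an alert level using the thresholds above."""
    if current <= baseline - 0.05:
        return "CRITICAL"
    if current <= baseline - 0.02:
        return "WARNING"
    return "OK"

print(accuracy_alert(current=0.91, baseline=0.94))   # WARNING
```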
Common Pitfalls
| Pitfall | Description |
|---|---|
| Leakage | Using future information during training |
| Selection bias | Test set not representative of production traffic |
| Metric gaming | Optimizing measured metric instead of desired outcome |
| Offline-online gap | Strong offline, poor online performance |
| p-hacking | Running experiments until desired result appears |
Reference
| Topic | Guidance |
|---|---|
| Precision vs recall | False positives costly: optimize precision. False negatives costly: optimize recall. |
| Offline-online gap | Investigate root cause (data distribution, feature freshness, selection bias) before adjusting |
| A/B methodology | Define metrics and sample size upfront. Randomize properly. Wait for significance. Correct for multiple testing. |
| Model degradation | Monitor continuously. Set alert thresholds. Maintain rollback capability. Investigate root cause. |
| Ranking evaluation | Offline: NDCG, MAP. Online: CTR, engagement time, user satisfaction. |