Model Deployment

Model deployment involves transitioning trained models from development environments to production systems where they serve predictions to users. This process requires infrastructure for serving, monitoring, and maintaining models at scale.

Deployment Patterns

Batch Prediction

Batch prediction computes predictions on a schedule and stores results for later retrieval. Users receive precomputed predictions via lookup.

Characteristic        Description
Infrastructure        Batch job scheduler
Resource usage        Predictable, scheduled
Model complexity      No latency constraint
Prediction freshness  Depends on batch frequency
Storage requirements  Scales with prediction volume

Appropriate use cases:

  • Daily or weekly recommendations
  • High prediction volume (millions per day)
  • Complex models with high inference cost
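
The scheduled job itself can be small: load the model, score every known entity, and write the results where the serving layer can look them up. Below is a minimal sketch; the load_model and load_entities helpers and the in-memory results store are placeholders for whatever model registry, warehouse, and key-value store are actually in use.

```python
# Minimal sketch of a nightly batch-scoring job. load_model, load_entities,
# and the results store are stand-ins for real infrastructure.
import json
from datetime import datetime, timezone

def load_model():
    # Placeholder: in practice, load the trained model artifact from storage.
    return lambda features: 0.5 * features["recency"] + 0.5 * features["frequency"]

def load_entities():
    # Placeholder: in practice, read entity features from the warehouse.
    return {
        "user_1": {"recency": 0.2, "frequency": 0.9},
        "user_2": {"recency": 0.7, "frequency": 0.1},
    }

def run_batch_job(store: dict) -> None:
    model = load_model()
    scored_at = datetime.now(timezone.utc).isoformat()
    for entity_id, features in load_entities().items():
        prediction = model(features)
        # Store the precomputed prediction for later lookup by the serving layer.
        store[f"prediction:{entity_id}"] = json.dumps(
            {"score": prediction, "scored_at": scored_at}
        )

if __name__ == "__main__":
    results_store = {}  # stand-in for Redis, DynamoDB, or a database table
    run_batch_job(results_store)
    print(results_store["prediction:user_1"])
```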

Online Prediction

Online prediction computes predictions at request time. This pattern provides fresh predictions but requires low-latency infrastructure.

Characteristic   Description
Latency          Milliseconds required
Scaling          Must handle peak traffic
Feature serving  Requires real-time feature retrieval
Context          Can incorporate request-time signals

Appropriate use cases:

  • Fraud detection
  • Search ranking
  • Real-time personalization

Hybrid Approach

Hybrid deployment combines batch and online prediction:

  • Batch: Precompute predictions for known entities
  • Online: Handle new entities and incorporate real-time context
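
A minimal sketch of the hybrid lookup, assuming precomputed scores keyed by entity ID and a hypothetical score_online fallback for entities the batch job has not seen:

```python
# Sketch of a hybrid lookup: serve the precomputed batch score when one
# exists, otherwise compute a prediction online for new entities.
def score_online(entity_id: str, context: dict) -> float:
    # Placeholder for request-time inference on a loaded model.
    return 0.1 if context.get("is_new_session") else 0.3

def get_prediction(entity_id: str, context: dict, batch_store: dict) -> float:
    cached = batch_store.get(f"prediction:{entity_id}")
    if cached is not None:
        return cached                             # known entity: precomputed score
    return score_online(entity_id, context)       # new entity: compute online

print(get_prediction("user_42", {"is_new_session": True}, {"prediction:user_1": 0.8}))
```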

Model Serving Infrastructure

Serving Frameworks

Framework                Supported Models     Characteristics
TensorFlow Serving       TensorFlow           Mature, optimized for TF
TorchServe               PyTorch              Native PyTorch support
Triton Inference Server  Multiple frameworks  GPU optimization, batching
BentoML                  Multiple frameworks  Simple deployment interface
MLflow                   Multiple frameworks  Integrated experiment tracking

Serving Architecture

A typical prediction endpoint follows these steps:

  1. Receive request: Accept the prediction request with entity identifiers
  2. Retrieve features: Fetch pre-computed features from the feature store for the given entity
  3. Generate prediction: Pass features through the loaded model to produce a prediction
  4. Log for monitoring: Record the request, features, and prediction for observability and debugging
  5. Return response: Send the prediction back to the caller
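
A compact sketch of those five steps, with the feature store, model, and logger replaced by in-memory stand-ins:

```python
# Minimal prediction endpoint following the five steps above.
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction")

FEATURE_STORE = {"user:123:recency": 0.4, "user:123:frequency": 0.7}  # stand-in

def model(features: dict) -> float:
    # Placeholder for a loaded model artifact.
    return 0.6 * features["recency"] + 0.4 * features["frequency"]

def handle_request(request: dict) -> dict:
    request_id = str(uuid.uuid4())
    entity_id = request["entity_id"]                        # 1. receive request
    features = {                                            # 2. retrieve features
        name: FEATURE_STORE[f"user:{entity_id}:{name}"]
        for name in ("recency", "frequency")
    }
    prediction = model(features)                            # 3. generate prediction
    logger.info(json.dumps({                                # 4. log for monitoring
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
    }))
    return {"request_id": request_id, "prediction": prediction}  # 5. return response

print(handle_request({"entity_id": "123"}))
```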

Scaling Strategies

Horizontal scaling:

  • Multiple model server replicas
  • Load balancer distribution
  • Auto-scaling based on traffic metrics

Model optimization:

  • Quantization (FP32 → INT8)
  • Pruning
  • Knowledge distillation
  • Compilation (TensorRT, ONNX)
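
As one concrete example of these techniques, the sketch below applies post-training dynamic quantization to a PyTorch model, converting Linear-layer weights from FP32 to INT8. This assumes PyTorch specifically; other frameworks have analogous tooling (for example TensorRT or ONNX Runtime for compilation).

```python
# Sketch of post-training dynamic quantization with PyTorch (FP32 -> INT8
# weights for Linear layers). The model here is a toy stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only the Linear layers
)

with torch.no_grad():
    print(quantized(torch.randn(1, 128)))
```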

Caching:

  • Cache frequent predictions
  • Cache feature lookups
  • Cache embedding computations
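
A minimal sketch of a prediction cache with a time-to-live, so repeated identical requests skip inference; the key scheme and TTL value are illustrative choices:

```python
# Small TTL cache in front of the model, keyed by entity (or entity plus
# feature hash), so repeated identical requests are served from memory.
import time

class PredictionCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, float]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: str, compute) -> float:
        now = time.time()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]                       # cache hit
        value = compute()                         # cache miss: run the model
        self._entries[key] = (now + self.ttl, value)
        return value

cache = PredictionCache(ttl_seconds=30)
score = cache.get_or_compute("user:123", lambda: 0.87)       # lambda stands in for inference
print(score, cache.get_or_compute("user:123", lambda: 0.0))  # second call is a hit
```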

Deployment Strategies

Canary Deployment

Canary deployment routes a small percentage of traffic to a new model version while monitoring performance, widening the share only as the new version proves stable. An example rollout schedule:

Week 1: 5% traffic to new model
Week 2: 25% traffic
Week 3: 50% traffic
Week 4: 100% traffic
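
One way to implement the traffic split is deterministic bucketing: hash each user into one of 100 buckets and send the first N buckets to the new model, so widening the rollout is a one-number change. A sketch:

```python
# Deterministic canary routing: the same user always lands in the same
# bucket, and the canary percentage decides which buckets see the new model.
import hashlib

def route_to_canary(user_id: str, canary_percent: int) -> bool:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent  # True -> new model, False -> current model

# Roughly 5% of users are routed to the canary at canary_percent=5.
print(sum(route_to_canary(f"user_{i}", 5) for i in range(10_000)) / 10_000)
```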

Shadow Deployment

Shadow deployment runs a new model in parallel with production without serving its results. Predictions are logged for comparison.

Process:

  1. Generate prediction using the production model
  2. Asynchronously run the shadow model on the same request
  3. Log the shadow prediction for offline comparison
  4. Return only the production model result to the user

This allows validating the new model on real traffic without affecting users.
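
A minimal sketch of that flow, with placeholder models and the shadow call dispatched to a background thread so it cannot add latency to the user-facing path:

```python
# Shadow pattern: serve the production model, run the shadow model on the
# same request asynchronously, and log its output for offline comparison.
import json
import logging
import threading

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

def production_model(features: dict) -> float:
    return 0.72  # placeholder

def shadow_model(features: dict) -> float:
    return 0.69  # placeholder

def _run_shadow(request_id: str, features: dict) -> None:
    prediction = shadow_model(features)
    logger.info(json.dumps({"request_id": request_id, "shadow_prediction": prediction}))

def predict(request_id: str, features: dict) -> float:
    result = production_model(features)              # 1. production prediction
    threading.Thread(                                # 2-3. shadow runs async and is logged
        target=_run_shadow, args=(request_id, features), daemon=True
    ).start()
    return result                                    # 4. only the production result is returned

print(predict("req-1", {"recency": 0.4}))
```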

Blue-Green Deployment

Blue-green deployment maintains two complete environments, one serving live traffic and one holding the new version. Traffic switches between them in a single step, which also makes reverting immediate.


A/B Testing

A/B testing splits traffic between model versions for statistical comparison.


Requirements:

  • Random user assignment
  • Predetermined sample size
  • Statistical significance threshold
  • Sufficient test duration
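
The offline analysis step is typically a standard significance test. The sketch below runs a two-sided two-proportion z-test on conversion counts; the counts are made-up illustrative numbers.

```python
# Two-proportion z-test comparing conversion rates between the control
# model (A) and the treatment model (B).
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}, significant={p < 0.05}")
```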

Feature Serving

Online Feature Store

The online feature store provides low-latency access to pre-computed features. Features are stored in a key-value store (such as Redis) with keys structured as entity_type:entity_id:feature_name. At request time, the store retrieves multiple feature values in a single batch operation and returns them as a dictionary for model input.
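
A sketch of that lookup, assuming the redis-py client and the key scheme described above; values are stored as strings and parsed back to floats:

```python
# Batch retrieval of precomputed features from Redis using one MGET call.
import redis

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_online_features(entity_type: str, entity_id: str, feature_names: list[str]) -> dict:
    keys = [f"{entity_type}:{entity_id}:{name}" for name in feature_names]
    values = client.mget(keys)  # single batch read for all requested features
    return {
        name: float(value) if value is not None else None
        for name, value in zip(feature_names, values)
    }

# Example usage (requires a running Redis instance):
# features = get_online_features("user", "123", ["recency", "frequency"])
```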

Request-Time Feature Computation

Some features must be computed at request time because they depend on the current context:

Feature       Computation                           Source
hour_of_day   Extract hour from request timestamp   Request metadata
day_of_week   Extract weekday (0-6) from timestamp  Request metadata
is_mobile     Parse user agent string               Request headers
query_length  Count words in query                  Request payload
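
A sketch of computing these four features from an incoming request; the exact request shape (timestamp, headers, query fields) is an assumption:

```python
# Request-time feature computation from request metadata, headers, and payload.
from datetime import datetime, timezone

def request_time_features(request: dict) -> dict:
    ts = datetime.fromtimestamp(request["timestamp"], tz=timezone.utc)
    user_agent = request.get("headers", {}).get("User-Agent", "")
    return {
        "hour_of_day": ts.hour,                                 # request metadata
        "day_of_week": ts.weekday(),                            # 0 = Monday ... 6 = Sunday
        "is_mobile": "Mobile" in user_agent,                    # crude user-agent parse
        "query_length": len(request.get("query", "").split()),  # words in the query
    }

print(request_time_features({
    "timestamp": 1_700_000_000,
    "headers": {"User-Agent": "Mozilla/5.0 (iPhone; Mobile)"},
    "query": "wireless noise cancelling headphones",
}))
```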

Production Monitoring

Metrics Categories

Category  Metrics
Model     Prediction distribution, confidence scores
System    Latency (p50, p99), throughput, error rate
Business  Click-through rate, conversion, revenue
Data      Feature drift, missing feature rate

Alerting Configuration

Alert Name          Condition                                               Severity
model_latency_high  p99 latency exceeds 200ms                               Warning
prediction_drift    Mean prediction shifts by more than 0.1 from baseline   Critical
error_rate_high     Error rate exceeds 1%                                   Critical
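
As an illustration, the prediction_drift condition can be evaluated as a simple comparison of the recent prediction mean against a stored baseline; the threshold mirrors the table above.

```python
# Evaluate the prediction_drift condition: flag a critical alert when the
# mean of recent predictions shifts more than 0.1 from the baseline mean.
def check_prediction_drift(recent_predictions: list[float],
                           baseline_mean: float,
                           threshold: float = 0.1) -> dict:
    current_mean = sum(recent_predictions) / len(recent_predictions)
    shift = abs(current_mean - baseline_mean)
    drifted = shift > threshold
    return {
        "alert": "prediction_drift" if drifted else None,
        "severity": "critical" if drifted else None,
        "shift": round(shift, 4),
    }

print(check_prediction_drift([0.61, 0.58, 0.65, 0.70], baseline_mean=0.48))
```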

Request Logging

For each prediction request, log the following with a unique request ID:

  1. Request details: The incoming request parameters and metadata
  2. Features used: The feature values retrieved and computed for this request
  3. Prediction output: The model's prediction result

This enables debugging individual predictions, auditing model behavior, and correlating predictions with downstream outcomes.

Model Versioning

Version Management

Each deployed model is tracked under an explicit version identifier, together with the features, configuration, and data schema it was trained with, so the serving layer can load and switch between versions.

Rollback Implementation

A model manager should support:

  1. Version loading: Load model artifacts for a specific version into memory
  2. Version switching: Set the active model version for serving
  3. Rollback: Quickly revert to a previous version if issues are detected
  4. Multi-version support: Keep multiple versions loaded in memory for instant switching

The rollback process simply redirects traffic to the previous version without requiring a new deployment.
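
A minimal sketch of such a manager, with artifact loading reduced to a placeholder:

```python
# Model manager supporting the four capabilities above: load versions,
# switch the active version, keep multiple versions resident, and roll back.
class ModelManager:
    def __init__(self):
        self._models: dict[str, object] = {}    # version -> loaded model
        self._history: list[str] = []            # previously active versions
        self.active_version: str | None = None

    def load_version(self, version: str) -> None:
        # Placeholder load: real code would read a versioned artifact into memory.
        self._models[version] = f"model-artifact-{version}"

    def set_active(self, version: str) -> None:
        if version not in self._models:
            raise ValueError(f"version {version} is not loaded")
        if self.active_version is not None:
            self._history.append(self.active_version)
        self.active_version = version

    def rollback(self) -> str:
        # Revert to the previously active version; no redeploy required.
        previous = self._history.pop()
        self.active_version = previous
        return previous

manager = ModelManager()
manager.load_version("v1.2.0")
manager.load_version("v1.3.0")
manager.set_active("v1.2.0")
manager.set_active("v1.3.0")
print(manager.rollback())  # -> "v1.2.0"
```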

Deployment Considerations

Consideration           Approach
Zero-downtime updates   Blue-green or canary deployment
A/B testing             Random assignment, predetermined sample size, statistical significance
Feature consistency     Shared feature computation between training and serving
Performance monitoring  Track predictions, latency, errors; alert on drift
Rollback capability     Maintain previous versions; use traffic routing for switches

Summary

Component              Requirement
Deployment automation  Reproducible, version-controlled deployments
Versioning             Models, features, configurations, data schemas
Monitoring             Proactive alerting on performance degradation
Rollback               Previous versions available for immediate switch
Staging environment    Production-equivalent testing environment