Model Deployment

Model deployment involves transitioning trained models from development environments to production systems where they serve predictions to users. This process requires infrastructure for serving, monitoring, and maintaining models at scale.

Deployment Patterns

Batch Prediction

Batch prediction computes predictions on a schedule and stores results for later retrieval. Users receive precomputed predictions via lookup.

Characteristic        Description
Infrastructure        Batch job scheduler
Resource usage        Predictable, scheduled
Model complexity      No latency constraint
Prediction freshness  Depends on batch frequency
Storage requirements  Scales with prediction volume

Appropriate use cases:

  • Daily or weekly recommendations
  • High prediction volume (millions per day)
  • Complex models with high inference cost
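
The scheduled job itself can be small: load the model, score every known entity, and write the results where the serving layer can look them up. Below is a minimal sketch; the load_model and load_entities helpers and the in-memory results store are placeholders for whatever model registry, warehouse, and key-value store are actually in use.

```python
# Minimal sketch of a nightly batch-scoring job. load_model, load_entities,
# and the results store are stand-ins for real infrastructure.
import json
from datetime import datetime, timezone

def load_model():
    # Placeholder: in practice, load the trained model artifact from storage.
    return lambda features: 0.5 * features["recency"] + 0.5 * features["frequency"]

def load_entities():
    # Placeholder: in practice, read entity features from the warehouse.
    return {
        "user_1": {"recency": 0.2, "frequency": 0.9},
        "user_2": {"recency": 0.7, "frequency": 0.1},
    }

def run_batch_job(store: dict) -> None:
    model = load_model()
    scored_at = datetime.now(timezone.utc).isoformat()
    for entity_id, features in load_entities().items():
        prediction = model(features)
        # Store the precomputed prediction for later lookup by the serving layer.
        store[f"prediction:{entity_id}"] = json.dumps(
            {"score": prediction, "scored_at": scored_at}
        )

if __name__ == "__main__":
    results_store = {}  # stand-in for Redis, DynamoDB, or a database table
    run_batch_job(results_store)
    print(results_store["prediction:user_1"])
```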

Online Prediction

Online prediction computes predictions at request time. This pattern provides fresh predictions but requires low-latency infrastructure.

Characteristic   Description
Latency          Milliseconds required
Scaling          Must handle peak traffic
Feature serving  Requires real-time feature retrieval
Context          Can incorporate request-time signals

Appropriate use cases:

  • Fraud detection
  • Search ranking
  • Real-time personalization

Hybrid Approach

Hybrid deployment combines batch and online prediction:

  • Batch: Precompute predictions for known entities
  • Online: Handle new entities and incorporate real-time context
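
A minimal sketch of the hybrid lookup, assuming precomputed scores keyed by entity ID and a hypothetical score_online fallback for entities the batch job has not seen:

```python
# Sketch of a hybrid lookup: serve the precomputed batch score when one
# exists, otherwise compute a prediction online for new entities.
def score_online(entity_id: str, context: dict) -> float:
    # Placeholder for request-time inference on a loaded model.
    return 0.1 if context.get("is_new_session") else 0.3

def get_prediction(entity_id: str, context: dict, batch_store: dict) -> float:
    cached = batch_store.get(f"prediction:{entity_id}")
    if cached is not None:
        return cached                             # known entity: precomputed score
    return score_online(entity_id, context)       # new entity: compute online

print(get_prediction("user_42", {"is_new_session": True}, {"prediction:user_1": 0.8}))
```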

Model Serving Infrastructure

Serving Frameworks

Framework                Supported Models     Characteristics
TensorFlow Serving       TensorFlow           Mature, optimized for TF
TorchServe               PyTorch              Native PyTorch support
Triton Inference Server  Multiple frameworks  GPU optimization, batching
BentoML                  Multiple frameworks  Simple deployment interface
MLflow                   Multiple frameworks  Integrated experiment tracking

Serving Architecture

A typical prediction endpoint follows these steps:

  1. Receive request: Accept the prediction request with entity identifiers
  2. Retrieve features: Fetch pre-computed features from the feature store for the given entity
  3. Generate prediction: Pass features through the loaded model to produce a prediction
  4. Log for monitoring: Record the request, features, and prediction for observability and debugging
  5. Return response: Send the prediction back to the caller
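
A compact sketch of those five steps, with the feature store, model, and logger replaced by in-memory stand-ins:

```python
# Minimal prediction endpoint following the five steps above.
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction")

FEATURE_STORE = {"user:123:recency": 0.4, "user:123:frequency": 0.7}  # stand-in

def model(features: dict) -> float:
    # Placeholder for a loaded model artifact.
    return 0.6 * features["recency"] + 0.4 * features["frequency"]

def handle_request(request: dict) -> dict:
    request_id = str(uuid.uuid4())
    entity_id = request["entity_id"]                        # 1. receive request
    features = {                                            # 2. retrieve features
        name: FEATURE_STORE[f"user:{entity_id}:{name}"]
        for name in ("recency", "frequency")
    }
    prediction = model(features)                            # 3. generate prediction
    logger.info(json.dumps({                                # 4. log for monitoring
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
    }))
    return {"request_id": request_id, "prediction": prediction}  # 5. return response

print(handle_request({"entity_id": "123"}))
```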

Scaling Strategies

Horizontal scaling:

  • Multiple model server replicas
  • Load balancer distribution
  • Auto-scaling based on traffic metrics

Model optimization:

  • Quantization (FP32 → INT8)
  • Pruning
  • Knowledge distillation
  • Compilation (TensorRT, ONNX)
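
As one concrete example of these techniques, the sketch below applies post-training dynamic quantization to a PyTorch model, converting Linear-layer weights from FP32 to INT8. This assumes PyTorch specifically; other frameworks have analogous tooling (for example TensorRT or ONNX Runtime for compilation).

```python
# Sketch of post-training dynamic quantization with PyTorch (FP32 -> INT8
# weights for Linear layers). The model here is a toy stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only the Linear layers
)

with torch.no_grad():
    print(quantized(torch.randn(1, 128)))
```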

Caching:

  • Cache frequent predictions
  • Cache feature lookups
  • Cache embedding computations
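
A minimal sketch of a prediction cache with a time-to-live, so repeated identical requests skip inference; the key scheme and TTL value are illustrative choices:

```python
# Small TTL cache in front of the model, keyed by entity (or entity plus
# feature hash), so repeated identical requests are served from memory.
import time

class PredictionCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, float]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: str, compute) -> float:
        now = time.time()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]                       # cache hit
        value = compute()                         # cache miss: run the model
        self._entries[key] = (now + self.ttl, value)
        return value

cache = PredictionCache(ttl_seconds=30)
score = cache.get_or_compute("user:123", lambda: 0.87)       # lambda stands in for inference
print(score, cache.get_or_compute("user:123", lambda: 0.0))  # second call is a hit
```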

Deployment Strategies

Canary Deployment

Canary deployment routes a small percentage of traffic to a new model version while monitoring performance, widening the share only as the new version proves stable. An example rollout schedule:

Week 1: 5% traffic to new model
Week 2: 25% traffic
Week 3: 50% traffic
Week 4: 100% traffic
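
One way to implement the traffic split is deterministic bucketing: hash each user into one of 100 buckets and send the first N buckets to the new model, so widening the rollout is a one-number change. A sketch:

```python
# Deterministic canary routing: the same user always lands in the same
# bucket, and the canary percentage decides which buckets see the new model.
import hashlib

def route_to_canary(user_id: str, canary_percent: int) -> bool:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent  # True -> new model, False -> current model

# Roughly 5% of users are routed to the canary at canary_percent=5.
print(sum(route_to_canary(f"user_{i}", 5) for i in range(10_000)) / 10_000)
```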

Shadow Deployment

Shadow deployment runs a new model in parallel with production without serving its results. Predictions are logged for comparison.

Process:

  1. Generate prediction using the production model
  2. Asynchronously run the shadow model on the same request
  3. Log the shadow prediction for offline comparison
  4. Return only the production model result to the user

This allows validating the new model on real traffic without affecting users.
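
A minimal sketch of that flow, with placeholder models and the shadow call dispatched to a background thread so it cannot add latency to the user-facing path:

```python
# Shadow pattern: serve the production model, run the shadow model on the
# same request asynchronously, and log its output for offline comparison.
import json
import logging
import threading

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

def production_model(features: dict) -> float:
    return 0.72  # placeholder

def shadow_model(features: dict) -> float:
    return 0.69  # placeholder

def _run_shadow(request_id: str, features: dict) -> None:
    prediction = shadow_model(features)
    logger.info(json.dumps({"request_id": request_id, "shadow_prediction": prediction}))

def predict(request_id: str, features: dict) -> float:
    result = production_model(features)              # 1. production prediction
    threading.Thread(                                # 2-3. shadow runs async and is logged
        target=_run_shadow, args=(request_id, features), daemon=True
    ).start()
    return result                                    # 4. only the production result is returned

print(predict("req-1", {"recency": 0.4}))
```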

Blue-Green Deployment

Blue-green deployment maintains two complete environments, one serving live traffic and one holding the new version. Traffic switches between them in a single step, which also makes reverting immediate.


A/B Testing

A/B testing splits traffic between model versions for statistical comparison.


Requirements:

  • Random user assignment
  • Predetermined sample size
  • Statistical significance threshold
  • Sufficient test duration
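
The offline analysis step is typically a standard significance test. The sketch below runs a two-sided two-proportion z-test on conversion counts; the counts are made-up illustrative numbers.

```python
# Two-proportion z-test comparing conversion rates between the control
# model (A) and the treatment model (B).
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}, significant={p < 0.05}")
```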

Feature Serving

Online Feature Store

The online feature store provides low-latency access to pre-computed features. Features are stored in a key-value store (such as Redis) with keys structured as entity_type:entity_id:feature_name. At request time, the store retrieves multiple feature values in a single batch operation and returns them as a dictionary for model input.
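
A sketch of that lookup, assuming the redis-py client and the key scheme described above; values are stored as strings and parsed back to floats:

```python
# Batch retrieval of precomputed features from Redis using one MGET call.
import redis

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_online_features(entity_type: str, entity_id: str, feature_names: list[str]) -> dict:
    keys = [f"{entity_type}:{entity_id}:{name}" for name in feature_names]
    values = client.mget(keys)  # single batch read for all requested features
    return {
        name: float(value) if value is not None else None
        for name, value in zip(feature_names, values)
    }

# Example usage (requires a running Redis instance):
# features = get_online_features("user", "123", ["recency", "frequency"])
```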

Request-Time Feature Computation

Some features must be computed at request time because they depend on the current context:

Feature       Computation                           Source
hour_of_day   Extract hour from request timestamp   Request metadata
day_of_week   Extract weekday (0-6) from timestamp  Request metadata
is_mobile     Parse user agent string               Request headers
query_length  Count words in query                  Request payload
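
A sketch of computing these four features from an incoming request; the exact request shape (timestamp, headers, query fields) is an assumption:

```python
# Request-time feature computation from request metadata, headers, and payload.
from datetime import datetime, timezone

def request_time_features(request: dict) -> dict:
    ts = datetime.fromtimestamp(request["timestamp"], tz=timezone.utc)
    user_agent = request.get("headers", {}).get("User-Agent", "")
    return {
        "hour_of_day": ts.hour,                                 # request metadata
        "day_of_week": ts.weekday(),                            # 0 = Monday ... 6 = Sunday
        "is_mobile": "Mobile" in user_agent,                    # crude user-agent parse
        "query_length": len(request.get("query", "").split()),  # words in the query
    }

print(request_time_features({
    "timestamp": 1_700_000_000,
    "headers": {"User-Agent": "Mozilla/5.0 (iPhone; Mobile)"},
    "query": "wireless noise cancelling headphones",
}))
```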

Production Monitoring

Metrics Categories

Category  Metrics
Model     Prediction distribution, confidence scores
System    Latency (p50, p99), throughput, error rate
Business  Click-through rate, conversion, revenue
Data      Feature drift, missing feature rate

Alerting Configuration

Alert Name          Condition                                               Severity
model_latency_high  p99 latency exceeds 200ms                               Warning
prediction_drift    Mean prediction shifts by more than 0.1 from baseline   Critical
error_rate_high     Error rate exceeds 1%                                   Critical
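
As an illustration, the prediction_drift condition can be evaluated as a simple comparison of the recent prediction mean against a stored baseline; the threshold mirrors the table above.

```python
# Evaluate the prediction_drift condition: flag a critical alert when the
# mean of recent predictions shifts more than 0.1 from the baseline mean.
def check_prediction_drift(recent_predictions: list[float],
                           baseline_mean: float,
                           threshold: float = 0.1) -> dict:
    current_mean = sum(recent_predictions) / len(recent_predictions)
    shift = abs(current_mean - baseline_mean)
    drifted = shift > threshold
    return {
        "alert": "prediction_drift" if drifted else None,
        "severity": "critical" if drifted else None,
        "shift": round(shift, 4),
    }

print(check_prediction_drift([0.61, 0.58, 0.65, 0.70], baseline_mean=0.48))
```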

Request Logging

For each prediction request, log the following with a unique request ID:

  1. Request details: The incoming request parameters and metadata
  2. Features used: The feature values retrieved and computed for this request
  3. Prediction output: The model's prediction result

This enables debugging individual predictions, auditing model behavior, and correlating predictions with downstream outcomes.

Model Versioning

Version Management

Each deployed model is tracked under an explicit version identifier, together with the features, configuration, and data schema it was trained with, so the serving layer can load and switch between versions.

Rollback Implementation

A model manager should support:

  1. Version loading: Load model artifacts for a specific version into memory
  2. Version switching: Set the active model version for serving
  3. Rollback: Quickly revert to a previous version if issues are detected
  4. Multi-version support: Keep multiple versions loaded in memory for instant switching

The rollback process simply redirects traffic to the previous version without requiring a new deployment.
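
A minimal sketch of such a manager, with artifact loading reduced to a placeholder:

```python
# Model manager supporting the four capabilities above: load versions,
# switch the active version, keep multiple versions resident, and roll back.
class ModelManager:
    def __init__(self):
        self._models: dict[str, object] = {}    # version -> loaded model
        self._history: list[str] = []            # previously active versions
        self.active_version: str | None = None

    def load_version(self, version: str) -> None:
        # Placeholder load: real code would read a versioned artifact into memory.
        self._models[version] = f"model-artifact-{version}"

    def set_active(self, version: str) -> None:
        if version not in self._models:
            raise ValueError(f"version {version} is not loaded")
        if self.active_version is not None:
            self._history.append(self.active_version)
        self.active_version = version

    def rollback(self) -> str:
        # Revert to the previously active version; no redeploy required.
        previous = self._history.pop()
        self.active_version = previous
        return previous

manager = ModelManager()
manager.load_version("v1.2.0")
manager.load_version("v1.3.0")
manager.set_active("v1.2.0")
manager.set_active("v1.3.0")
print(manager.rollback())  # -> "v1.2.0"
```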

Deployment Considerations

Consideration           Approach
Zero-downtime updates   Blue-green or canary deployment
A/B testing             Random assignment, predetermined sample size, statistical significance
Feature consistency     Shared feature computation between training and serving
Performance monitoring  Track predictions, latency, errors; alert on drift
Rollback capability     Maintain previous versions; use traffic routing for switches

Summary

Component              Requirement
Deployment automation  Reproducible, version-controlled deployments
Versioning             Models, features, configurations, data schemas
Monitoring             Proactive alerting on performance degradation
Rollback               Previous versions available for immediate switch
Staging environment    Production-equivalent testing environment