Model Deployment
Model deployment involves transitioning trained models from development environments to production systems where they serve predictions to users. This process requires infrastructure for serving, monitoring, and maintaining models at scale.
Deployment Patterns
Batch Prediction
Batch prediction computes predictions on a schedule and stores results for later retrieval. Users receive precomputed predictions via lookup.
| Characteristic | Description |
|---|---|
| Infrastructure | Batch job scheduler |
| Resource usage | Predictable, scheduled |
| Model complexity | Can be high; no request-time latency constraint |
| Prediction freshness | Depends on batch frequency |
| Storage requirements | Scales with prediction volume |
Appropriate use cases:
- Daily or weekly recommendations
- High prediction volume (millions per day)
- Complex models with high inference cost
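The sketch below shows one way such a job might look, assuming a scikit-learn-style model with a `predict` method and a pandas DataFrame of entities; the loader helpers and output path are illustrative, not a specific pipeline's API.

```python
import pandas as pd

def run_batch_predictions(model, entities: pd.DataFrame, output_path: str) -> None:
    """Score every known entity and persist results for later lookup."""
    feature_columns = [c for c in entities.columns if c != "entity_id"]
    # One pass over all entities; per-prediction latency is irrelevant here,
    # so an expensive model is acceptable.
    entities["prediction"] = model.predict(entities[feature_columns])
    # Downstream services read this table instead of calling the model.
    entities[["entity_id", "prediction"]].to_parquet(output_path, index=False)

# A scheduler (cron, Airflow, etc.) would invoke this on the batch cadence:
# run_batch_predictions(load_model("v3"), load_entities(), "predictions.parquet")
```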
Online Prediction
Online prediction computes predictions at request time. This pattern provides fresh predictions but requires low-latency infrastructure.
| Characteristic | Description |
|---|---|
| Latency | Must respond within milliseconds |
| Scaling | Must handle peak traffic |
| Feature serving | Requires real-time feature retrieval |
| Context | Can incorporate request-time signals |
Appropriate use cases:
- Fraud detection
- Search ranking
- Real-time personalization
Hybrid Approach
Hybrid deployment combines batch and online prediction:
- Batch: Precompute predictions for known entities
- Online: Handle new entities and incorporate real-time context
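A minimal sketch of the fallback logic, with a plain dict standing in for the batch prediction store and a hypothetical `online_model` object:

```python
from typing import Optional

# Stand-in for the batch store; in production this would be a key-value
# store populated by the scheduled batch job.
batch_predictions: dict[str, float] = {}

def get_prediction(entity_id: str, request_context: dict, online_model) -> float:
    # Fast path: the entity was scored by the batch job.
    cached: Optional[float] = batch_predictions.get(entity_id)
    if cached is not None:
        return cached
    # Slow path: a new entity, scored online so request-time context
    # (device, time of day, query) can be incorporated.
    features = {"entity_id": entity_id, **request_context}
    return online_model.predict(features)
```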
Model Serving Infrastructure
Serving Frameworks
| Framework | Supported Models | Characteristics |
|---|---|---|
| TensorFlow Serving | TensorFlow | Mature, optimized for TF |
| TorchServe | PyTorch | Native PyTorch support |
| Triton Inference Server | Multiple frameworks | GPU optimization, batching |
| BentoML | Multiple frameworks | Simple deployment interface |
| MLflow | Multiple frameworks | Integrated experiment tracking |
Serving Architecture
A typical prediction endpoint follows these steps:
- Receive request: Accept the prediction request with entity identifiers
- Retrieve features: Fetch pre-computed features from the feature store for the given entity
- Generate prediction: Pass features through the loaded model to produce a prediction
- Log for monitoring: Record the request, features, and prediction for observability and debugging
- Return response: Send the prediction back to the caller
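A sketch of those steps as a Flask endpoint; `feature_store` and `model` are assumed stand-ins for the actual store client and loaded model, not any particular framework's API.

```python
import logging
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger("prediction_service")

# Assumed to be initialized at service startup:
# feature_store = ...; model = ...

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                       # receive request
    request_id = str(uuid.uuid4())
    entity_id = payload["entity_id"]

    features = feature_store.get_features(entity_id)   # retrieve features
    prediction = model.predict([features])[0]          # generate prediction

    logger.info(                                       # log for monitoring
        "request_id=%s entity_id=%s features=%s prediction=%s",
        request_id, entity_id, features, prediction,
    )
    return jsonify({"request_id": request_id,          # return response
                    "prediction": float(prediction)})
```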
Scaling Strategies
Horizontal scaling:
- Multiple model server replicas
- Load balancer distribution
- Auto-scaling based on traffic metrics
Model optimization:
- Quantization (FP32 → INT8)
- Pruning
- Knowledge distillation
- Compilation (TensorRT, ONNX Runtime)
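As one concrete instance of these techniques, here is PyTorch dynamic quantization applied to a toy model; the architecture is a placeholder for a real trained network.

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Dynamic quantization stores Linear weights as INT8 and quantizes
# activations on the fly, typically shrinking the model ~4x and
# speeding up CPU inference at a small accuracy cost.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```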
Caching:
- Cache frequent predictions
- Cache feature lookups
- Cache embedding computations
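For the feature-lookup case, an in-process memoization sketch; `fetch_features` is a hypothetical store call, and across replicas a shared cache (e.g. Redis with a TTL) plays this role instead.

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_feature_lookup(entity_id: str) -> tuple:
    # Memoizes the store round-trip per entity within this process.
    # A tuple is returned so callers cannot mutate the cached value.
    # Note: lru_cache has no TTL, so cached features go stale; bound
    # freshness explicitly in production.
    return tuple(sorted(fetch_features(entity_id).items()))
```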
Deployment Strategies
Canary Deployment
Canary deployment routes a small percentage of traffic to a new model version while monitoring performance.
- Week 1: 5% of traffic to the new model
- Week 2: 25% of traffic
- Week 3: 50% of traffic
- Week 4: 100% of traffic
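A sketch of the traffic split; `production_model` and `new_model` are hypothetical loaded models, and real systems often hash a user ID instead of drawing randomly so each user sees a consistent version.

```python
import random

CANARY_FRACTION = 0.05  # raised week by week as the new model proves out

def route_request(features, production_model, new_model):
    # Send a small, adjustable share of traffic to the candidate model
    # while its metrics are monitored against production.
    if random.random() < CANARY_FRACTION:
        return new_model.predict(features)
    return production_model.predict(features)
```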
Shadow Deployment
Shadow deployment runs a new model in parallel with production without serving its results. Predictions are logged for comparison.
Process:
- Generate prediction using the production model
- Asynchronously run the shadow model on the same request
- Log the shadow prediction for offline comparison
- Return only the production model result to the user
This allows validating the new model on real traffic without affecting users.
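A sketch of that process; the model objects are assumed stand-ins, and the shadow call runs on a thread pool so it never delays the user-facing response.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")
shadow_pool = ThreadPoolExecutor(max_workers=4)

def _log_shadow(shadow_model, features):
    # Off the request path: failures or slowness here never reach users.
    logger.info("shadow_prediction=%s features=%s",
                shadow_model.predict(features), features)

def predict_with_shadow(features, production_model, shadow_model):
    result = production_model.predict(features)              # serve production
    shadow_pool.submit(_log_shadow, shadow_model, features)  # async shadow run
    return result                                            # users see production only
```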
Blue-Green Deployment
Blue-green deployment maintains two complete production environments. Traffic switches from one to the other all at once, and the now-idle environment stays ready for an instant rollback.
A/B Testing
A/B testing splits traffic between model versions for statistical comparison.
Requirements:
- Random user assignment
- Predetermined sample size
- Statistical significance threshold
- Sufficient test duration
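Random assignment is usually implemented as a deterministic hash so a user always lands in the same arm; a sketch, with the experiment name as a salt so assignments stay independent across tests:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    # Hash of (experiment, user) mapped to [0, 1); stable per user,
    # uncorrelated across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"
```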
Feature Serving
Online Feature Store
The online feature store provides low-latency access to pre-computed features. Features are stored in a key-value store (such as Redis) with keys structured as entity_type:entity_id:feature_name. At request time, the store retrieves multiple feature values in a single batch operation and returns them as a dictionary for model input.
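A sketch of that lookup with redis-py, using the key scheme described above; serialization is left out (a real store would encode typed values, not raw strings), and the example key in the comment is hypothetical.

```python
import redis

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_online_features(entity_type: str, entity_id: str,
                        feature_names: list[str]) -> dict:
    # Build keys like "user:12345:avg_session_length" and fetch them
    # all in a single batched round-trip.
    keys = [f"{entity_type}:{entity_id}:{name}" for name in feature_names]
    values = client.mget(keys)
    # Missing features come back as None; callers decide on defaults.
    return dict(zip(feature_names, values))
```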
Request-Time Feature Computation
Some features must be computed at request time because they depend on the current context:
| Feature | Computation | Source |
|---|---|---|
| hour_of_day | Extract hour from request timestamp | Request metadata |
| day_of_week | Extract weekday (0-6) from timestamp | Request metadata |
| is_mobile | Parse user agent string | Request headers |
| query_length | Count words in query | Request payload |
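These derivations are cheap enough to run inline in the request handler; a sketch (the user-agent check is deliberately crude, real services use a parsing library):

```python
from datetime import datetime, timezone

def request_time_features(timestamp: datetime, user_agent: str,
                          query: str) -> dict:
    return {
        "hour_of_day": timestamp.hour,          # from request metadata
        "day_of_week": timestamp.weekday(),     # Monday = 0 ... Sunday = 6
        "is_mobile": "Mobile" in user_agent,    # naive user-agent parse
        "query_length": len(query.split()),     # word count of the query
    }

# Example:
# request_time_features(datetime.now(timezone.utc),
#                       "Mozilla/5.0 (iPhone; Mobile)", "red running shoes")
```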
Production Monitoring
Metrics Categories
| Category | Metrics |
|---|---|
| Model | Prediction distribution, confidence scores |
| System | Latency (p50, p99), throughput, error rate |
| Business | Click-through rate, conversion, revenue |
| Data | Feature drift, missing feature rate |
Alerting Configuration
| Alert Name | Condition | Severity |
|---|---|---|
| model_latency_high | p99 latency exceeds 200ms | Warning |
| prediction_drift | Mean prediction shifts by more than 0.1 from baseline | Critical |
| error_rate_high | Error rate exceeds 1% | Critical |
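The drift condition above reduces to a windowed comparison; a minimal sketch, assuming the caller supplies the recent and baseline means:

```python
def prediction_drift_alert(recent_mean: float, baseline_mean: float,
                           threshold: float = 0.1) -> bool:
    # Fires when the mean prediction over a recent window moves more
    # than `threshold` away from the baseline established at deploy time.
    return abs(recent_mean - baseline_mean) > threshold
```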
Request Logging
For each prediction request, log the following with a unique request ID:
- Request details: The incoming request parameters and metadata
- Features used: The feature values retrieved and computed for this request
- Prediction output: The model's prediction result
This enables debugging individual predictions, auditing model behavior, and correlating predictions with downstream outcomes.
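A sketch of that structured log record, one JSON line per request keyed by a generated request ID:

```python
import json
import logging
import uuid

logger = logging.getLogger("prediction_audit")

def log_prediction(request_payload: dict, features: dict,
                   prediction: float) -> str:
    request_id = str(uuid.uuid4())
    # One structured record per prediction; the request_id also goes back
    # to the caller so downstream outcomes can be joined to this record.
    logger.info(json.dumps({
        "request_id": request_id,
        "request": request_payload,
        "features": features,
        "prediction": prediction,
    }))
    return request_id
```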
Model Versioning
Version Management
Rollback Implementation
A model manager should support:
- Version loading: Load model artifacts for a specific version into memory
- Version switching: Set the active model version for serving
- Rollback: Quickly revert to a previous version if issues are detected
- Multi-version support: Keep multiple versions loaded in memory for instant switching
The rollback process simply redirects traffic to the previous version without requiring a new deployment.
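A sketch of such a manager; the model objects are assumed to expose a `predict` method, and loading is reduced to a dictionary insert for brevity.

```python
class ModelManager:
    """Holds multiple model versions in memory and routes to the active one."""

    def __init__(self):
        self._models = {}    # version -> loaded model object
        self._active = None
        self._previous = None

    def load_version(self, version: str, model) -> None:
        self._models[version] = model                          # version loading

    def activate(self, version: str) -> None:
        if version not in self._models:
            raise KeyError(f"version {version} is not loaded")
        self._previous, self._active = self._active, version   # version switching

    def rollback(self) -> None:
        if self._previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.activate(self._previous)                          # instant revert

    def predict(self, features):
        return self._models[self._active].predict(features)
```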
Deployment Considerations
| Consideration | Approach |
|---|---|
| Zero-downtime updates | Blue-green or canary deployment |
| A/B testing | Random assignment, predetermined sample size, statistical significance |
| Feature consistency | Shared feature computation between training and serving |
| Performance monitoring | Track predictions, latency, errors; alert on drift |
| Rollback capability | Maintain previous versions; use traffic routing for switches |
Summary
| Component | Requirement |
|---|---|
| Deployment automation | Reproducible, version-controlled deployments |
| Versioning | Models, features, configurations, data schemas |
| Monitoring | Proactive alerting on performance degradation |
| Rollback | Previous versions available for immediate switch |
| Staging environment | Production-equivalent testing environment |