
Model Selection

Model selection requires balancing accuracy, latency, interpretability, and development time. This document covers the decision process for choosing appropriate models.

Selection Process

Baseline First

Start with simple models before adding complexity:

| Stage | Purpose |
| --- | --- |
| Baseline | Establish performance floor |
| Simple model | Determine if complexity is needed |
| Complex model | Improve performance if justified |
| Ensemble | Final accuracy gains if warranted |

A logistic regression baseline trained in hours gives an early read on problem difficulty and on how much headroom remains for more complex models.
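
As a concrete starting point, a minimal scikit-learn baseline might look like the sketch below; the synthetic data and hyperparameters are placeholders, not recommendations:

```python
# Minimal baseline sketch (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# The baseline AUC is the performance floor any more complex model must beat.
val_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
print(f"Baseline AUC: {val_auc:.3f}")
```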

Problem Type Mapping

| Problem Type | Baseline | Production Model |
| --- | --- | --- |
| Binary Classification | Logistic Regression | Gradient Boosting, Neural Network |
| Multi-class Classification | Naive Bayes | Gradient Boosting, Transformer |
| Regression | Linear Regression | Gradient Boosting, Neural Network |
| Ranking | Pointwise Logistic Regression | Learning to Rank, Neural Ranker |
| Recommendation | Popularity, Collaborative Filtering | Two-Tower, Transformer |
| Sequence | Markov Model | LSTM, Transformer |
| Computer Vision | HOG + SVM | CNN, Vision Transformer |
| NLP | TF-IDF + Logistic Regression | BERT, fine-tuned LLM |

Model Trade-offs

| Trade-off | Consideration |
| --- | --- |
| Accuracy vs Latency | Complex models improve accuracy but increase inference time |
| Complexity vs Interpretability | Deep networks sacrifice explainability |
| Training Cost vs Inference Cost | Large models cost more both to train and to serve |
| Data Requirements vs Generalization | Deep learning requires substantial data |

Model Families

Linear Models

Models: Logistic Regression, Linear Regression, SVM

| Characteristic | Description |
| --- | --- |
| Training speed | Milliseconds to minutes |
| Inference speed | Sub-millisecond |
| Interpretability | High (coefficient inspection) |
| Non-linear patterns | Requires manual feature engineering |
| Data requirements | Works with limited data |

| Use Case | Description |
| --- | --- |
| Regulated industries | Interpretability requirements |
| Latency under 1 ms | Strict inference constraints |
| Small datasets | Limited training data |
| Baselines | Appropriate as starting point |
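
To illustrate the coefficient-inspection point above, a minimal sketch, assuming scikit-learn and a standard benchmark dataset; the top-5 cutoff is arbitrary:

```python
# Coefficient inspection for a fitted logistic regression.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# With standardized inputs, coefficient magnitude is a rough
# measure of each feature's influence on the log-odds.
coefs = model.named_steps["logisticregression"].coef_[0]
top = np.argsort(np.abs(coefs))[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:30s} {coefs[i]:+.3f}")
```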

Tree-Based Models

Models: Decision Trees, Random Forest, Gradient Boosting (XGBoost, LightGBM)

| Characteristic | Description |
| --- | --- |
| Non-linear patterns | Automatic handling |
| Feature importance | Built-in |
| Mixed feature types | Minimal preprocessing required |
| Outlier handling | Robust (split-based) |
| Incremental updates | Generally requires full retraining |
| High-cardinality categoricals | Requires special handling |

| Use Case | Description |
| --- | --- |
| Tabular data | Default choice |
| Feature importance | Explainability requirements |
| Medium datasets | Millions of rows |
| Production ranking/classification | Proven reliability |
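
A hedged sketch of the tabular workflow, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM; the synthetic data and hyperparameters are illustrative:

```python
# Gradient boosting on tabular data with built-in feature importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Built-in importances support the explainability use case above.
print(model.feature_importances_[:5])
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```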

Neural Networks

Models: MLP, CNN, RNN, Transformer

| Characteristic | Description |
| --- | --- |
| Unstructured data | State-of-the-art for images, text, audio, video |
| Feature learning | Automatic representation learning |
| Data requirements | Large datasets for training from scratch |
| Computational cost | GPU requirements |
| Interpretability | Limited without additional techniques |
| Transfer learning | Pretrained models available |

| Use Case | Description |
| --- | --- |
| Images, text, audio, video | Domains where neural networks dominate |
| Large datasets | Millions of labeled examples |
| Accuracy priority | Cost is secondary |
| Pretrained availability | Fine-tuning opportunities |
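
A sketch of the transfer-learning path, assuming PyTorch and torchvision are available; the class count and optimizer settings are placeholders:

```python
# Fine-tuning a pretrained CNN instead of training from scratch.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone; only the new head will train,
# which cuts both the data and compute requirements substantially.
for param in model.parameters():
    param.requires_grad = False

num_classes = 10  # placeholder for your task
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...standard training loop over your labeled images goes here.
```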

Selection Criteria

Data Characteristics

| Factor | Implication |
| --- | --- |
| Small data size | Simpler models preferred |
| Large data size | Complex models viable |
| High dimensionality | Regularization required |
| Tabular features | Tree models preferred |
| Unstructured data | Neural networks required |
| Noisy labels | Robust models, data augmentation |

Production Requirements

| Requirement | Model Implication |
| --- | --- |
| Latency < 10 ms | Linear models, optimized trees |
| Latency < 100 ms | Most models viable |
| Interpretability required | Linear models, decision trees |
| Frequent retraining | Fast training, incremental learning |
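
A rough way to check a candidate model against a latency budget, sketched below; the model, input shape, and iteration count are illustrative:

```python
# Quick single-example latency check against a production budget.
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(
    np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
)
x = np.random.rand(1, 20)

model.predict_proba(x)  # warm up before timing
start = time.perf_counter()
for _ in range(1000):
    model.predict_proba(x)
mean_ms = (time.perf_counter() - start) / 1000 * 1e3
print(f"Mean single-example latency: {mean_ms:.3f} ms")
```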

Resource Constraints

| Constraint | Consideration |
| --- | --- |
| Training compute | GPU availability, time budget |
| Serving compute | Model size, optimization requirements |
| Team expertise | Familiarity with model types |
| Timeline | Implementation and iteration time |

Architecture Patterns

Two-Stage Ranking

Used in recommendations and search: a fast candidate-generation stage retrieves a few hundred items from a large corpus, then a more expensive ranking model scores and orders that shortlist.

Ensemble Methods

| Method | Description |
| --- | --- |
| Bagging | Train models on different data subsets (Random Forest) |
| Boosting | Train sequentially to correct errors (XGBoost) |
| Stacking | Use model predictions as meta-model features |
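
As one concrete example, stacking with scikit-learn's StackingClassifier; the base models and the meta-model here are arbitrary choices:

```python
# Stacking: base-model predictions become features for a meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5_000, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),  # meta-model over base predictions
    cv=5,  # out-of-fold predictions avoid leaking training labels
)
stack.fit(X, y)
```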

Multi-Task Learning

Share representations across related tasks: a shared encoder feeds multiple task-specific output heads, so related tasks reuse features and regularize one another.
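
A minimal shared-bottom sketch in PyTorch; the layer sizes and the two task heads are hypothetical:

```python
# Shared-bottom multi-task network: one encoder, several task heads.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim: int = 64, hidden: int = 128):
        super().__init__()
        # Shared encoder: learns representations used by every task.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific heads (hypothetical tasks A and B).
        self.task_a_head = nn.Linear(hidden, 1)
        self.task_b_head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor):
        h = self.shared(x)
        return self.task_a_head(h), self.task_b_head(h)

model = MultiTaskModel()
out_a, out_b = model(torch.randn(32, 64))
```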

Model Comparison

Evaluation Protocol

| Method | Description | Use Case |
| --- | --- | --- |
| Hold-out validation | Split data into train (70%), validation (15%), test (15%) | Standard evaluation with sufficient data |
| Cross-validation | Rotate through k folds, train on k−1, validate on 1 | Small datasets where every sample matters |
| Time-based split | Train on data before cutoff date, test on data after | Temporal data, to prevent future information leakage |
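
For the time-based split, a minimal pandas sketch; the column names and cutoff date are placeholders:

```python
# Time-based split: train strictly before the cutoff, test strictly
# after, so no future information leaks into training.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
})

cutoff = pd.Timestamp("2024-03-01")
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]
```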

Experiment Tracking

Run systematic experiments across model types:

  1. Define experiment configurations for each model (logistic regression, random forest, XGBoost, neural network) with their hyperparameters
  2. Train each model on the training data
  3. Evaluate on validation data to collect metrics
  4. Record all experiments with their configurations and results for comparison

This enables structured comparison across algorithms and hyperparameter settings.
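
A minimal sketch of that loop, using a plain dictionary in place of a dedicated tracking tool; the model configurations and the choice of metric are illustrative:

```python
# Systematic comparison loop: configure, train, evaluate, record.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

configs = {  # step 1: one configuration per model type
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbdt": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in configs.items():
    model.fit(X_train, y_train)                   # step 2: train
    scores = model.predict_proba(X_val)[:, 1]     # step 3: evaluate
    results[name] = roc_auc_score(y_val, scores)  # step 4: record

print(results)
```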

Analysis Factors

| Factor | Consideration |
| --- | --- |
| Primary metric | Performance on target metric |
| Variance | Stability across runs/folds |
| Inference latency | Meeting production requirements |
| Model size | Storage and memory constraints |
| Training time | Iteration speed |

Reference

| Topic | Guidance |
| --- | --- |
| Gradient boosting vs neural networks | Tabular data: gradient boosting. Unstructured data or large datasets: neural networks. Latency under 10 ms: avoid large neural networks. |
| Ensemble vs single model | Use ensembles when the accuracy gain justifies the deployment complexity |
| Accuracy-latency trade-off | Profile latency early. Consider distillation, quantization, or simpler models. |
| Hyperparameter tuning | Grid search for small spaces, Bayesian optimization for large spaces, early stopping to reduce compute |
| Sufficiency determination | A model is sufficient when it beats the baseline meaningfully, meets latency constraints, and further improvements show diminishing returns |
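
For the grid-search case, a minimal scikit-learn sketch; the parameter grid values are illustrative:

```python
# Grid search over a small hyperparameter space with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5_000, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```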