Model selection requires balancing accuracy, latency, interpretability, and development time. This document covers the decision process for choosing appropriate models.
Start with simple models before adding complexity:
| Stage | Purpose |
|---|---|
| Baseline | Establish performance floor |
| Simple model | Determine if complexity is needed |
| Complex model | Improve performance if justified |
| Ensemble | Final accuracy gains if warranted |
A logistic regression baseline trained in hours reveals how difficult the problem is and how much headroom remains for more complex models. The table below pairs common problem types with a reasonable baseline and typical production models.
| Problem Type | Baseline | Production Model |
|---|---|---|
| Binary Classification | Logistic Regression | Gradient Boosting, Neural Network |
| Multi-class Classification | Naive Bayes | Gradient Boosting, Transformer |
| Regression | Linear Regression | Gradient Boosting, Neural Network |
| Ranking | Pointwise Logistic Regression | Learning to Rank, Neural Ranker |
| Recommendation | Popularity, Collaborative Filtering | Two-Tower, Transformer |
| Sequence | Markov Model | LSTM, Transformer |
| Computer Vision | HOG + SVM | CNN, Vision Transformer |
| NLP | TF-IDF + Logistic Regression | BERT, fine-tuned LLM |
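For binary classification, a baseline can be a few lines of scikit-learn. This sketch uses synthetic data as a stand-in for real features; the dataset and hyperparameters are illustrative assumptions, not part of this document:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real features and labels (illustrative only).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# The baseline: fast to train, easy to inspect via coefficients.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_tr, y_tr)
print("baseline AUC:", roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1]))
```

Any candidate production model should beat this number by enough to justify its added cost.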
Moving toward more complex models involves trade-offs:
| Trade-off | Consideration |
|---|---|
| Accuracy vs Latency | Complex models improve accuracy but increase inference time |
| Complexity vs Interpretability | Deep networks sacrifice explainability |
| Training Cost vs Inference Cost | Large models are expensive to train and add cost to every inference request |
| Data Requirements vs Generalization | Deep learning requires substantial data |
Linear models: Logistic Regression, Linear Regression, SVM
| Characteristic | Description |
|---|---|
| Training speed | Milliseconds to minutes |
| Inference speed | Sub-millisecond |
| Interpretability | High (coefficient inspection) |
| Non-linear patterns | Requires manual feature engineering |
| Data requirements | Works with limited data |
| Use Case | Description |
|---|---|
| Regulated industries | Interpretability requirements |
| Latency under 1ms | Strict inference constraints |
| Small datasets | Limited training data |
| Baselines | Appropriate as a starting point |
Tree-based models: Decision Trees, Random Forest, Gradient Boosting (XGBoost, LightGBM)
| Characteristic | Description |
|---|---|
| Non-linear patterns | Automatic handling |
| Feature importance | Built-in |
| Mixed feature types | Minimal preprocessing required |
| Outlier handling | Robust, since splits depend on thresholds rather than magnitudes |
| Incremental updates | Generally requires full retraining |
| High-cardinality categoricals | Requires special handling |
| Use Case | Description |
|---|---|
| Tabular data | Default choice |
| Feature importance | Explainability requirements |
| Medium datasets | Millions of rows |
| Production ranking/classification | Proven reliability |
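As an illustration of the tabular default, here is a minimal gradient-boosting sketch with scikit-learn (in production, XGBoost or LightGBM would be typical; the data and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
gbm.fit(X_tr, y_tr)
print("val accuracy:", accuracy_score(y_val, gbm.predict(X_val)))

# Built-in feature importances support the explainability use case above.
top = np.argsort(gbm.feature_importances_)[::-1][:5]
print("top features:", top)
```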
Neural networks: MLP, CNN, RNN, Transformer
| Characteristic | Description |
|---|---|
| Unstructured data | State-of-the-art for images, text, audio, video |
| Feature learning | Automatic representation learning |
| Data requirements | Large datasets for training from scratch |
| Computational cost | GPU requirements |
| Interpretability | Limited without additional techniques |
| Transfer learning | Pretrained models available |
| Use Case | Description |
|---|---|
| Images, text, audio, video | Domains where neural networks dominate |
| Large datasets | Millions of labeled examples |
| Accuracy priority | Cost is secondary |
| Pretrained availability | Fine-tuning opportunities |
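When pretrained models are available, fine-tuning only a new head is often the cheapest path. A minimal PyTorch/torchvision sketch (assumes torchvision >= 0.13; the class count and learning rate are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and freeze its features.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 10-class task.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
# Train only the head first; optionally unfreeze and fine-tune later.
```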
Data characteristics narrow the choice:
| Factor | Implication |
|---|---|
| Data size small | Simpler models preferred |
| Data size large | Complex models viable |
| High dimensionality | Regularization required |
| Tabular features | Tree models preferred |
| Unstructured data | Neural networks required |
| Noisy labels | Robust models, data augmentation |
Deployment requirements constrain candidates further:
| Requirement | Model Implication |
|---|---|
| Latency < 10ms | Linear models, optimized trees |
| Latency < 100ms | Most models viable |
| Interpretability required | Linear models, decision trees |
| Frequent retraining | Fast training, incremental learning |
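Latency budgets are easy to check before committing to a model. A rough measurement sketch, assuming a scikit-learn-style `predict` interface and a 2-D feature array:

```python
import time
import numpy as np

def p99_latency_ms(model, X, n_trials=1000):
    """Crude single-example inference timing; profile properly before launch."""
    times = []
    for i in range(n_trials):
        x = X[i % len(X)][None, :]          # one example per call
        start = time.perf_counter()
        model.predict(x)
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(times, 99))  # tail latency, not the mean
```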
Resource constraints matter as much as accuracy:
| Constraint | Consideration |
|---|---|
| Training compute | GPU availability, time budget |
| Serving compute | Model size, optimization requirements |
| Team expertise | Familiarity with model types |
| Timeline | Implementation and iteration time |
Ensemble methods combine multiple models for accuracy gains and are widely used in recommendation and search systems:
| Method | Description |
|---|---|
| Bagging | Train models on different data subsets (Random Forest) |
| Boosting | Train sequentially to correct errors (XGBoost) |
| Stacking | Use model predictions as meta-model features |
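A minimal stacking sketch with scikit-learn's `StackingClassifier` (the base learners and the 5-fold setting are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, random_state=0)

# Out-of-fold predictions from the base learners become features
# for the logistic-regression meta-model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
```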
Multi-task learning shares representations across related tasks:
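A minimal PyTorch sketch of a shared encoder with one head per task; the task names, layer sizes, and loss weights are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder feeding two task-specific heads (illustrative)."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.click_head = nn.Linear(hidden, 1)     # assumed task A: click
        self.purchase_head = nn.Linear(hidden, 1)  # assumed task B: purchase

    def forward(self, x):
        z = self.encoder(x)
        return self.click_head(z), self.purchase_head(z)

model = MultiTaskModel(n_features=32)
x = torch.randn(8, 32)
y_click = torch.randint(0, 2, (8, 1)).float()
y_buy = torch.randint(0, 2, (8, 1)).float()

# Joint loss: weighted sum of per-task losses (weights are assumptions).
click_logit, buy_logit = model(x)
bce = nn.BCEWithLogitsLoss()
loss = 0.7 * bce(click_logit, y_click) + 0.3 * bce(buy_logit, y_buy)
loss.backward()
```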
Choose a validation strategy that matches the data:
| Method | Description | Use Case |
|---|---|---|
| Hold-out validation | Split data into train (70%), validation (15%), test (15%) | Standard evaluation with sufficient data |
| Cross-validation | Rotate through k folds, train on k-1, validate on 1 | Small datasets where every sample matters |
| Time-based split | Train on data before cutoff date, test on data after | Temporal data to prevent future information leakage |
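Time-based splitting is the easiest of the three to get wrong silently. A small pandas sketch, with an assumed event log and cutoff date:

```python
import pandas as pd

# Illustrative event log; in practice this is your real data.
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=365, freq="D"),
    "feature": range(365),
    "label": [i % 2 for i in range(365)],
})

cutoff = pd.Timestamp("2023-10-01")    # assumed cutoff date
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]   # nothing after the cutoff leaks into training
```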
Run systematic experiments across model types:
- Define experiment configurations for each model (logistic regression, random forest, XGBoost, neural network) with their hyperparameters
- Train each model on the training data
- Evaluate on validation data to collect metrics
- Record all experiments with their configurations and results for comparison
This enables structured comparison across algorithms and hyperparameter settings.
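A minimal sketch of such an experiment loop (the models, configurations, and metric are assumptions; real setups usually log to an experiment tracker):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# One entry per experiment configuration.
experiments = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_100": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_300": RandomForestClassifier(n_estimators=300, random_state=0),
}

results = []
for name, model in experiments.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    results.append({"config": name, "val_auc": round(auc, 4)})

# Rank configurations by the validation metric.
for row in sorted(results, key=lambda r: -r["val_auc"]):
    print(row)
```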
When comparing candidates, weigh more than the headline metric:
| Factor | Consideration |
|---|---|
| Primary metric | Performance on target metric |
| Variance | Stability across runs/folds |
| Inference latency | Meeting production requirements |
| Model size | Storage and memory constraints |
| Training time | Iteration speed |
Practical guidance for recurring decisions:
| Topic | Guidance |
|---|---|
| Gradient boosting vs neural networks | Tabular data: gradient boosting. Unstructured data or large datasets: neural networks. Latency under 10ms: avoid large neural networks. |
| Ensemble vs single model | Use ensembles when accuracy gain justifies deployment complexity |
| Accuracy-latency trade-off | Profile latency early. Consider distillation, quantization, or simpler models. |
| Hyperparameter tuning | Grid search for small spaces, Bayesian optimization for large spaces, early stopping to reduce compute |
| Sufficiency determination | A model is sufficient when it beats the baseline meaningfully, meets latency constraints, and further improvements show diminishing returns |
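For a small, explicit search space, grid search is straightforward. A scikit-learn sketch (the grid and scoring choice are assumptions; larger spaces would call for Bayesian optimization instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2_000, random_state=0)

# n_iter_no_change enables early stopping to cap training cost.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0, n_iter_no_change=10),
    param_grid={
        "n_estimators": [100, 200],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    cv=3,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```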