Design an Ad Click Prediction System
Design a machine learning system to predict whether a user will click on an ad (the core of online advertising systems at Google, Meta, and Amazon).
Requirements
Functional:
- Predict P(click) for user-ad-context combinations
- Support millions of ads
- Handle cold-start (new ads, new users)
- Update model with fresh data frequently
Non-functional:
- Latency < 10ms at p99
- Handle 1M+ predictions per second
- Model updates within hours of new data
- High calibration (predicted P(click) matches actual rate)
Metrics
Model Metrics
| Metric | Description | Target |
|---|---|---|
| AUC-ROC | Ranking quality | > 0.75 |
| Log Loss | Calibration quality | < 0.4 |
| Normalized Entropy | Log loss relative to always predicting the average CTR | < 0.8 |
| Calibration Error | Gap between average predicted and observed CTR | < 0.02 |
Business Metrics
| Metric | Description |
|---|---|
| Revenue per 1000 impressions (RPM) | Direct business impact |
| Click-through rate (CTR) | User engagement |
| Conversion rate | Post-click actions |
| Advertiser ROI | Are advertisers getting value? |
Calibration is critical. If the model predicts 5% CTR, roughly 5% of those impressions should actually be clicked. Miscalibration breaks the auction economics.
Architecture
Feature Engineering
Feature engineering is where CTR models differentiate themselves; production ad CTR models often have thousands of features.
User Features
| Feature | Type | Description |
|---|---|---|
| User ID embedding | Embedding | Learned user representation |
| Demographics | Categorical | Age bucket, gender, income |
| Interests | Multi-hot | Interest categories from behavior |
| Historical CTR | Numerical | User's past click rate |
| Recency | Numerical | Days since last visit |
| Session depth | Numerical | Pages viewed this session |
| Device | Categorical | Mobile, desktop, tablet |
Ad Features
| Feature | Type | Description |
|---|---|---|
| Ad ID embedding | Embedding | Learned ad representation |
| Advertiser ID | Categorical | Brand/advertiser |
| Creative type | Categorical | Image, video, text |
| Ad category | Categorical | Product category |
| Historical CTR | Numerical | Ad's past click rate |
| Ad age | Numerical | Days since ad created |
| Landing page quality | Numerical | Page load time, relevance |
Context Features
| Feature | Type | Description |
|---|---|---|
| Page category | Categorical | News, shopping, social |
| Ad position | Numerical | Slot position on page |
| Time of day | Cyclical | Hour, encoded as sin/cos |
| Day of week | Categorical | Weekend vs weekday |
| Device features | Categorical | OS, browser, screen size |
| Geo | Categorical | Country, region, city |
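As a concrete example of the cyclical time encoding in the table above, a minimal sketch (the function name is ours):

```python
import numpy as np

def encode_hour(hour: int) -> tuple[float, float]:
    """Map hour-of-day onto the unit circle so 23:00 and 00:00 end up close."""
    angle = 2 * np.pi * hour / 24
    return float(np.sin(angle)), float(np.cos(angle))
```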
Cross Features
Cross features capture interactions between feature categories:
| Feature | Type | Description |
|---|---|---|
| User x Ad category | Embedding | User preference for ad categories |
| User x Advertiser | Embedding | User-brand affinity |
| Device x Ad format | Interaction | Does this ad work on mobile? |
| Hour x Ad category | Interaction | Time-sensitive categories |
| User x Position | Numerical | User's position sensitivity |
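Cross features are usually realized as hashed combinations of the raw values. A minimal sketch, with an illustrative bucket count and function name:

```python
import hashlib

def cross_bucket(user_segment: str, ad_category: str,
                 num_buckets: int = 1_000_000) -> int:
    """Hash a user x ad-category cross into a fixed embedding bucket.

    hashlib keeps the mapping deterministic across processes, unlike
    Python's built-in hash(), which is salted per run.
    """
    key = f"{user_segment}|{ad_category}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % num_buckets
```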
Model Architecture
Wide & Deep Model
Wide: Memorizes specific combinations ("users who clicked this ad before")
Deep: Generalizes to new combinations ("users similar to past clickers")
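A minimal PyTorch sketch of the idea; all layer sizes, feature counts, and input choices are illustrative assumptions, not a production configuration:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Minimal Wide & Deep sketch; all dimensions are illustrative."""

    def __init__(self, n_wide_buckets=1_000_000, n_ids=100_000, emb_dim=16):
        super().__init__()
        # Wide: sparse linear layer over hashed cross-feature buckets
        # (memorizes specific combinations).
        self.wide = nn.EmbeddingBag(n_wide_buckets, 1, mode="sum")
        # Deep: embeddings + MLP (generalizes to new combinations).
        # One shared ID table here for brevity; real systems use one per feature.
        self.embed = nn.Embedding(n_ids, emb_dim)
        self.deep = nn.Sequential(
            nn.Linear(emb_dim * 3, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cross_buckets, cat_ids):
        # cross_buckets: (batch, n_crosses) hashed cross-feature indices
        # cat_ids:       (batch, 3) e.g. user, ad, advertiser IDs
        wide_logit = self.wide(cross_buckets)               # (batch, 1)
        deep_in = self.embed(cat_ids).flatten(start_dim=1)  # (batch, 3*emb_dim)
        return torch.sigmoid(wide_logit + self.deep(deep_in))

model = WideAndDeep()
p_click = model(torch.randint(0, 1_000_000, (4, 8)),
                torch.randint(0, 100_000, (4, 3)))
```

The wide EmbeddingBag acts as a sparse linear model over the hashed crosses from the previous section, while the deep tower generalizes through shared embeddings.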
Feature Interaction Models
| Model | Interaction Handling | Pros | Cons |
|---|---|---|---|
| Logistic Regression | Manual feature crosses | Fast, interpretable | Manual effort |
| Wide & Deep | Wide for memorization, deep for generalization | Good balance | Large model |
| DeepFM | FM layer for 2nd order interactions | Auto feature crosses | Limited to 2nd order |
| DCN | Cross network for explicit interactions | Higher-order crosses | More parameters |
| DIN | Attention over user history | History-aware | Complex serving |
Recommendation: Start with Wide & Deep or DCN. Add attention mechanisms (DIN) if user history is rich.
Embedding Design
Most features are categorical with high cardinality:
| Feature | Cardinality | Embedding Dim |
|---|---|---|
| User ID | 1 billion | 64 |
| Ad ID | 10 million | 32 |
| Advertiser ID | 1 million | 16 |
| Page ID | 10 million | 16 |
| Category | 10,000 | 8 |
Hashing trick: for extreme cardinality (e.g., user IDs), hash features into a fixed number of buckets and accept some collisions.
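A minimal sketch of the hashing trick; the bucket count is an illustrative choice:

```python
import hashlib

NUM_BUCKETS = 1 << 22  # ~4M buckets for ~1B user IDs; collisions are accepted

def user_bucket(user_id: str) -> int:
    """Deterministically map a raw user ID to an embedding-table row."""
    digest = hashlib.sha1(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS
```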
Training
Sample Selection
Not all impressions are equal for training:
Negative downsampling: Clicks are rare (1-2%). Downsample negatives to balance classes. Apply importance weighting to correct for this.
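A common correction (popularized in published CTR work) maps the model's output back to the true probability scale: if negatives are kept at rate w, a prediction p is recalibrated as q = p / (p + (1 - p) / w). A minimal sketch, with a function name of our choosing:

```python
def correct_downsampled(p: float, w: float) -> float:
    """Re-calibrate a prediction from a model trained with negatives
    kept at rate w (0 < w <= 1) back to the true probability scale."""
    return p / (p + (1.0 - p) / w)
```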
Training Pipeline
Log features at serving time, join them with click labels after an attribution window, and feed the joined impressions to both batch and incremental training.
Handling Data Freshness
Ad data has a short shelf life: yesterday's training data is already stale.
| Approach | Freshness | Trade-off |
|---|---|---|
| Batch training (daily) | 24 hours | Simple but stale |
| Incremental updates | Hours | Better freshness |
| Online learning | Minutes | Freshest but complex |
Recommendation: Batch training as baseline, incremental updates every few hours, online learning for time-sensitive features.
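A minimal sketch of the incremental-update pattern using scikit-learn's partial_fit on synthetic stand-in data; a real pipeline would stream the newest hours of logged impressions instead:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")  # logistic regression trained by SGD
classes = np.array([0, 1])

for _ in range(3):  # each iteration = one incremental refresh
    X = np.random.randn(1_000, 20)                   # stand-in features
    y = (np.random.rand(1_000) < 0.02).astype(int)   # ~2% positive labels
    model.partial_fit(X, y, classes=classes)         # warm-started update
```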
Serving
Latency Budget
Total: 10ms for CTR prediction (within the larger 100ms auction budget)
| Stage | Budget |
|---|---|
| Feature lookup | 2ms |
| Feature engineering | 2ms |
| Model inference | 4ms |
| Overhead | 2ms |
Inference Optimization
Key optimizations:
- Pre-compute and cache user features
- Batch multiple ad predictions together
- Quantize model to INT8 (sketched below)
- Prune low-value features
- Use specialized inference frameworks (TensorRT, ONNX)
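The INT8 step, for example, can be done post-training with PyTorch's dynamic quantization; the toy model below is a stand-in for a trained CTR network:

```python
import torch
import torch.nn as nn

# Post-training dynamic INT8 quantization of the Linear layers.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                                dtype=torch.qint8)
```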
Feature Store
Features must be consistent between training and serving: store pre-computed features and serve exactly the same values the model saw in training, avoiding training/serving skew.
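One lightweight way to enforce this is to share a single feature-assembly function between the offline training job and the online server; a minimal sketch with hypothetical feature names:

```python
# Imported by both the training pipeline and the serving path,
# so the two cannot drift apart.
def assemble_features(user: dict, ad: dict, context: dict) -> dict:
    return {
        "user_hist_ctr": user.get("hist_ctr", 0.0),
        "ad_hist_ctr": ad.get("hist_ctr", 0.0),
        "position": context.get("position", 1),
    }
```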
Calibration
For ads, calibration matters even more than ranking accuracy: bids are scaled by predicted CTR, so systematic over- or under-prediction directly distorts auction prices.
Calibration Techniques
| Technique | Description |
|---|---|
| Platt scaling | Fit logistic regression on validation set |
| Isotonic regression | Non-parametric calibration |
| Temperature scaling | Divide logits by learned temperature |
| Histogram binning | Map predictions to observed rates |
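For example, isotonic regression from the table above, fit on held-out predictions (stand-in data):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.random.rand(10_000) * 0.1                    # stand-in raw scores
labels = (np.random.rand(10_000) < scores).astype(int)   # stand-in clicks

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)        # monotone map from score to observed rate
calibrated = calibrator.predict(scores)
```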
Calibration Monitoring
If predicted 2% CTR actually clicks at 4%, the model is under-predicting. Advertisers overpay.
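A minimal sketch of bucketed calibration monitoring (the function name is ours):

```python
import numpy as np

def calibration_report(preds, labels, n_buckets=10):
    """Compare mean predicted vs observed CTR per prediction-score bucket."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    edges = np.quantile(preds, np.linspace(0, 1, n_buckets + 1))
    buckets = np.clip(np.searchsorted(edges, preds, side="right") - 1,
                      0, n_buckets - 1)
    for b in range(n_buckets):
        mask = buckets == b
        if mask.any():
            print(f"bucket {b}: predicted={preds[mask].mean():.4f} "
                  f"observed={labels[mask].mean():.4f}")
```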
Feedback Loop and Cold Start
New Ad Cold Start
New ads have no click history. Solutions:
| Approach | Description |
|---|---|
| Content-based | Use ad text/image features |
| Explore-exploit | Show to small traffic to gather data |
| Similar ad features | Transfer from similar ads |
| Advertiser prior | Use advertiser's historical CTR |
Feedback Loop Risks
Ads predicted to have low CTR get shown less, confirming the prediction. Break this with:
- Exploration budget (show random ads sometimes)
- Upper confidence bound (UCB) ranking
- Thompson sampling (sketched below)
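A minimal sketch of Thompson sampling over per-ad Beta posteriors on CTR; the counts are stand-ins, and the zero-history ad still gets a chance to win:

```python
import numpy as np

rng = np.random.default_rng(0)
clicks = np.array([12, 3, 0])        # per-ad click counts
impressions = np.array([800, 150, 0])

# Sample a plausible CTR per ad from its posterior (Beta(1,1) prior),
# then show the ad whose sampled CTR is highest.
sampled_ctr = rng.beta(1 + clicks, 1 + impressions - clicks)
chosen_ad = int(np.argmax(sampled_ctr))
```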
Monitoring
Key Alerts
| Metric | Alert Condition |
|---|---|
| CTR | Drop > 5% from baseline |
| Log loss | Increase > 0.05 |
| Calibration | Error > 0.02 |
| Revenue | Drop > 3% |
| Latency p99 | > 10ms |
A/B Testing
Every model change must be tested:
- Primary metric: Revenue (advertisers pay for clicks)
- Secondary: CTR, user experience metrics
- Guardrails: Latency, error rate
Run tests long enough to reach statistical significance, and watch for long-term effects.
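A minimal sketch of a two-proportion z-test on CTR using statsmodels; the counts are stand-ins:

```python
from statsmodels.stats.proportion import proportions_ztest

clicks = [2_100, 2_250]           # control, treatment click counts
impressions = [100_000, 100_000]  # control, treatment impression counts
stat, pvalue = proportions_ztest(clicks, impressions)
print(f"z={stat:.2f}, p={pvalue:.4f}")
```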
Reference
| Topic | Description |
|---|---|
| Click fraud | Separate fraud detection model. Filter fraudulent clicks from training data. |
| Position bias | Train with position as a feature, then fix it to a default position at serving time; or model position's effect separately and remove it. |
| Fairness across advertisers | Calibrate per advertiser segment. Monitor win rates by advertiser size. |
| Viewability | Predict P(click \| viewed) rather than P(click \| served); impressions the user never actually saw should not count as negatives. |
| Accuracy vs latency | More features help but slow inference. Trade-off based on ad value. |
| Freshness vs stability | Frequent updates vs model stability. Incremental updates balance both. |