
Design a Search Ranking System

Design a machine learning system to rank search results for e-commerce, job search, content discovery, or any search application.

Requirements

Functional:

  • Rank search results by relevance
  • Support text, structured, and hybrid queries
  • Personalize based on user history
  • Handle query understanding (spelling, synonyms)
  • Return results in milliseconds

Non-functional:

  • Latency < 100ms end-to-end
  • Handle 100K+ QPS
  • Update index within minutes of content changes
  • Support A/B testing of ranking models

Metrics

Offline Metrics

| Metric | Description | Target |
|---|---|---|
| NDCG@10 | Normalized discounted cumulative gain at position 10 | > 0.65 |
| MRR | Mean reciprocal rank of the first relevant result | > 0.5 |
| Precision@5 | Relevant results in top 5 / 5 | > 0.6 |
| Recall@100 | Relevant results in top 100 / total relevant | > 0.9 |

Online Metrics

| Metric | Description |
|---|---|
| Click-through rate | Clicks / impressions |
| Successful search rate | Searches with a click within the top 3 results |
| Time to first click | How quickly users find what they want |
| Reformulation rate | How often users modify their query |
| Session success | Did the user complete their task? |

NDCG is the standard offline metric. Online metrics (especially session success) are what matter in production.
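
As a concrete reference, here is a minimal NDCG@k computation over graded relevance labels, assuming the common exponential-gain formulation (gain = 2^rel - 1); evaluation pipelines vary in the exact gain and discount they use.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: graded relevance discounted by log2(rank + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranked list normalized by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of returned results, in ranked order (0 = irrelevant, 3 = perfect).
print(ndcg_at_k([3, 2, 0, 1, 2], k=10))
```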

Architecture

(Diagram: end-to-end system architecture.)

Query Understanding

Before ranking, understand what the user wants.

Query Processing Pipeline

| Step | Description | Example |
|---|---|---|
| Tokenization | Split into terms | "blue shoes" -> ["blue", "shoes"] |
| Normalization | Lowercase, remove punctuation | "iPhone!" -> "iphone" |
| Spell correction | Fix typos | "laptpo" -> "laptop" |
| Stemming | Reduce to root form | "running" -> "run" |
| Synonym expansion | Add related terms | "laptop" -> "laptop OR notebook" |
| Query embedding | Dense vector representation | "laptop" -> [0.23, -0.15, ...] |
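
A minimal sketch of this pipeline (the spell-correction and synonym tables are illustrative stand-ins for learned components; stemming and embedding steps are omitted):

```python
import re

# Illustrative lookup tables; real systems learn these from query logs.
SPELL_FIXES = {"laptpo": "laptop"}
SYNONYMS = {"laptop": ["notebook"]}

def process_query(query: str) -> dict:
    # Normalization: lowercase and strip punctuation.
    normalized = re.sub(r"[^\w\s]", "", query.lower())
    # Tokenization: split into terms.
    terms = normalized.split()
    # Spell correction: fix known typos.
    terms = [SPELL_FIXES.get(t, t) for t in terms]
    # Synonym expansion: add related terms to improve recall.
    expanded = terms + [s for t in terms for s in SYNONYMS.get(t, [])]
    return {"terms": terms, "expanded_terms": expanded}

print(process_query("Laptpo!"))  # {'terms': ['laptop'], 'expanded_terms': ['laptop', 'notebook']}
```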

Intent Classification

Different intents require different ranking strategies:

| Intent | Query Example | Ranking Priority |
|---|---|---|
| Navigational | "amazon" | Exact match to destination |
| Informational | "how to cook rice" | Content quality, depth |
| Transactional | "buy iphone 15" | Price, availability, reviews |
| Local | "pizza near me" | Distance, ratings, hours |

Classify intent with a lightweight model. Use the classification to adjust feature weights.
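
One plausible implementation of that lightweight classifier, assuming a labeled set of queries is available, is character n-gram TF-IDF features with logistic regression:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; real systems label queries at much larger scale.
queries = ["amazon", "facebook login", "how to cook rice", "what is bm25",
           "buy iphone 15", "cheap running shoes", "pizza near me", "plumber near me"]
intents = ["navigational", "navigational", "informational", "informational",
           "transactional", "transactional", "local", "local"]

# Character n-grams are robust to typos and very short queries.
intent_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
intent_clf.fit(queries, intents)

print(intent_clf.predict(["order iphone 15 pro"]))  # likely 'transactional'
```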

Candidate Retrieval

Running an expensive ML model over millions of documents at query time is not feasible. First, narrow the corpus down to a manageable set of candidates.

Dual Retrieval Strategy


Lexical (BM25): Finds exact keyword matches. Fast, interpretable. Misses semantic similarity.

Semantic (Vector): Finds conceptually similar documents. Catches synonyms and related concepts. Can miss exact matches.

Use both. BM25 catches the obvious matches. Vector retrieval finds semantic matches BM25 misses.
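
One common way to merge the two candidate lists is reciprocal rank fusion (RRF); this sketch assumes each retriever returns document IDs in best-first order.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=1000):
    """Merge ranked candidate lists: each doc scores sum(1 / (k + rank)) across lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25_hits = ["d3", "d1", "d7"]      # lexical candidates, best first
vector_hits = ["d1", "d9", "d3"]    # semantic candidates, best first
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['d1', 'd3', ...]
```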

Vector Index Design

For semantic retrieval, use approximate nearest neighbor (ANN) search:

| Algorithm | Index Build | Query Time | Recall | Memory |
|---|---|---|---|---|
| Flat (brute force) | O(n) | O(n) | 100% | Low |
| IVF | O(n log n) | O(sqrt(n)) | 95%+ | Medium |
| HNSW | O(n log n) | O(log n) | 98%+ | High |
| Product quantization | O(n) | O(n / compression factor) | 90%+ | Very low |

Recommendation: HNSW for quality-sensitive applications. IVF+PQ for massive scale where memory matters.
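
A minimal HNSW sketch using FAISS, assuming document embeddings are already computed; vectors are L2-normalized so that L2 nearest neighbors match cosine nearest neighbors.

```python
import faiss
import numpy as np

dim = 128
doc_embeddings = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(doc_embeddings)  # on unit vectors, L2 ranking matches cosine ranking

# HNSW graph with 32 links per node.
index = faiss.IndexHNSWFlat(dim, 32)
index.hnsw.efConstruction = 200  # build-time accuracy/speed trade-off
index.add(doc_embeddings)

query_embedding = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_embedding)
index.hnsw.efSearch = 64         # query-time accuracy/speed trade-off
distances, doc_ids = index.search(query_embedding, 100)  # top-100 candidates
```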

Feature Engineering

Query-Document Features

| Feature | Type | Description |
|---|---|---|
| BM25 score | Numerical | Classic term-matching relevance |
| Query-doc embedding similarity | Numerical | Cosine similarity of query and document embeddings |
| Title match | Binary/Numerical | Query terms appearing in the title |
| Exact phrase match | Binary | Entire query appears as a phrase in the document |
| Query coverage | Numerical | % of query terms present in the document |
| Term proximity | Numerical | How close the query terms appear to each other |

Document Quality Features

| Feature | Type | Description |
|---|---|---|
| Click popularity | Numerical | Historical clicks on this document |
| Dwell time | Numerical | Average time spent on the document |
| Freshness | Numerical | Time since last update |
| Authority score | Numerical | PageRank, domain authority |
| Content quality | Numerical | Spam score, grammar, depth |
| Rating/reviews | Numerical | User ratings, if applicable |

User Personalization Features

| Feature | Type | Description |
|---|---|---|
| User history similarity | Numerical | Similarity to the user's past clicks |
| Category affinity | Numerical | User preference for the document's category |
| Price sensitivity | Numerical | User's typical price range |
| Brand preference | Embedding | User's preferred brands |
| Recency of interest | Numerical | Recently clicked similar items |

Contextual Features

| Feature | Type | Description |
|---|---|---|
| Time of day | Categorical | Morning, afternoon, evening |
| Device | Categorical | Mobile vs. desktop |
| Location | Categorical | City, country |
| Session context | Embedding | Previous queries in the session |
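
At serving time these feature groups are assembled into one vector per (query, document, user, context) tuple. A hedged sketch, where every field name and feature source is illustrative:

```python
import numpy as np

def build_feature_vector(query, doc, user, context):
    """Assemble query-document, document-quality, personalization, and context features.
    All field names are illustrative; real systems read them from feature stores."""
    return {
        # Query-document features
        "bm25_score": doc["bm25_score"],
        "embedding_similarity": float(np.dot(query["embedding"], doc["embedding"])),
        "title_match": float(all(t in doc["title"].lower() for t in query["terms"])),
        # Document quality features
        "click_popularity": doc["historical_ctr"],
        "freshness_days": context["now_days"] - doc["last_updated_days"],
        # Personalization features
        "category_affinity": user["category_affinity"].get(doc["category"], 0.0),
        # Contextual features
        "is_mobile": float(context["device"] == "mobile"),
    }
```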

Model Architecture

Learning to Rank Approaches

| Approach | Description | Pros | Cons |
|---|---|---|---|
| Pointwise | Predict a relevance score per document | Simple | Ignores relative ordering |
| Pairwise | Predict which of two documents is more relevant | Better ordering | O(n^2) pairs |
| Listwise | Optimize the ranking metric directly | Best performance | Complex training |

Recommendation: Start with LambdaMART (pairwise, gradient boosted trees). Move to listwise neural models for better performance.
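
LambdaMART is available off the shelf in LightGBM; a minimal sketch assuming a query-document feature matrix, graded relevance labels, and per-query group sizes:

```python
import lightgbm as lgb
import numpy as np

# X: (n_samples, n_features) query-document feature matrix (random data as a placeholder).
# y: graded relevance labels (0-4), one per query-document pair.
# group: number of candidate documents per query, in the same order as X.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 5, size=1000)
group = [10] * 100  # 100 queries, 10 candidates each

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    n_estimators=300,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# Score one query's candidates and sort descending to produce the ranking.
scores = ranker.predict(X[:10])
ranking = np.argsort(-scores)
```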

Two-Tower Architecture for Retrieval


Train with contrastive learning: clicked documents are positive, non-clicked are negative. Document embeddings are pre-computed; only query embedding is computed at runtime.
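
A sketch of that contrastive objective with in-batch negatives (PyTorch; the tower networks themselves are omitted): the clicked document for each query is the positive, and the other documents in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """query_emb, doc_emb: (batch, dim) outputs of the query and document towers.
    Row i of doc_emb is the clicked (positive) document for query i; the other
    rows in the batch act as negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Similarity matrix: entry (i, j) = cosine similarity of query i and document j.
    logits = query_emb @ doc_emb.T / temperature
    # The positive for query i is document i, i.e. the diagonal.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```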

Multi-Stage Ranking Model


Each stage is more expensive but sees fewer candidates.

Training Data

Click Data Challenges

Click data is biased:

  • Position bias: Higher positions get more clicks regardless of relevance
  • Presentation bias: Thumbnails, snippets affect clicks
  • Trust bias: Users click highly ranked results because they trust the ranking

Debiasing Techniques

| Technique | Description |
|---|---|
| Position weighting | Weight clicks by 1/position |
| Inverse propensity scoring | Model the probability that a result was seen |
| Randomized experiments | Occasionally show randomized orderings |
| Click models | Model a click as P(relevant) x P(examined) |
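
A hedged sketch of inverse propensity weighting: estimate per-position examination probabilities offline (e.g. from the randomization experiments above), then up-weight each click by the inverse propensity when building training labels. The propensity values below are illustrative.

```python
# Examination propensities per position, estimated offline (illustrative values).
PROPENSITY = {1: 0.95, 2: 0.75, 3: 0.55, 4: 0.40, 5: 0.30}

def ipw_label_weight(clicked: bool, position: int) -> float:
    """Training-example weight: clicks are up-weighted by 1 / P(examined at position),
    so relevant documents shown low in the ranking are not under-counted."""
    if not clicked:
        return 0.0  # or treat as a weighted negative, depending on the click model
    propensity = PROPENSITY.get(position, 0.2)  # floor for deep positions
    return 1.0 / propensity

print(ipw_label_weight(clicked=True, position=3))  # ~1.82
```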

Training Data Pipeline

(Diagram: training data pipeline.)

Serving

Latency Budget

Total: 100ms

| Stage | Budget | Notes |
|---|---|---|
| Query understanding | 10ms | Cached spell corrections |
| Retrieval | 20ms | Parallel BM25 + ANN |
| L1 ranking | 5ms | Simple model |
| L2 ranking | 30ms | Main ML model |
| L3 re-ranking | 5ms | Business rules |
| Response building | 10ms | Fetch snippets, images |
| Network/overhead | 20ms | Buffer |

Caching Strategy

  • Query cache: Exact query matches for popular queries (sketched below)
  • Document cache: Pre-computed document features
  • Embedding cache: Document embeddings for semantic search
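
A minimal sketch of the query cache layer, with a TTL dictionary standing in for what would typically be Redis or Memcached:

```python
import time

class QueryCache:
    """Illustrative TTL cache for popular-query results."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, query: str):
        entry = self.store.get(query)
        if entry and time.time() - entry["at"] < self.ttl:
            return entry["results"]
        return None  # miss: fall through to retrieval and ranking

    def put(self, query: str, results):
        self.store[query] = {"results": results, "at": time.time()}
```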

Monitoring

Quality Metrics

| Metric | Alert Threshold |
|---|---|
| Relevance score distribution | Shift > 1 standard deviation |
| Zero-result rate | > 2% |
| Click-through rate | Drop > 10% |
| Latency p99 | > 100ms |

Online Experiments

Every ranking change should be A/B tested:

  1. Define primary metric (successful search rate)
  2. Define guardrails (latency, revenue)
  3. Run the experiment until it reaches statistical significance (see the sketch below)
  4. Monitor for long-term effects
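
For a proportion metric such as successful search rate, significance in step 3 can be checked with a two-proportion z-test; a sketch using statsmodels with illustrative counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Successful searches and total searches in control vs. treatment (illustrative counts).
successes = [41_200, 41_950]
totals = [100_000, 100_000]

z_stat, p_value = proportions_ztest(count=successes, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # ship only if p is below the chosen threshold
```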

Reference

| Topic | Description |
|---|---|
| Cold start for new documents | Rely heavily on content features for new documents. Explore/exploit: show new documents occasionally to gather data. |
| Position bias handling | Inverse propensity weighting during training. Randomization experiments to measure true relevance. |
| Personalization vs. privacy | On-device personalization, federated learning, or aggregate cohorts instead of individual profiles. |
| Real-time inventory | Filter retrieval by availability. Include availability as a ranking feature. Re-rank frequently for fast-changing inventory. |
| Relevance vs. engagement | Clickbait ranks well but frustrates users. Optimize for session success, not just clicks. |
| Personalization vs. discovery | Show users what they like vs. expose them to new things. Balance with diversity constraints. |