Design a Search Ranking System
Design a machine learning system to rank search results for e-commerce, job search, content discovery, or any search application.
Requirements
Functional:
- Rank search results by relevance
- Support text, structured, and hybrid queries
- Personalize based on user history
- Handle query understanding (spelling, synonyms)
- Return results in milliseconds
Non-functional:
- Latency < 100ms end-to-end
- Handle 100K+ QPS
- Update index within minutes of content changes
- Support A/B testing of ranking models
Metrics
Offline Metrics
| Metric | Description | Target |
|---|---|---|
| NDCG@10 | Normalized discounted cumulative gain at position 10 | > 0.65 |
| MRR | Mean reciprocal rank of first relevant result | > 0.5 |
| Precision@5 | Relevant results in top 5 / 5 | > 0.6 |
| Recall@100 | Relevant results in top 100 / Total relevant | > 0.9 |
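For reference, a minimal sketch of how NDCG@k and MRR are computed. The graded relevance labels in the example are hypothetical, and linear gain is used (some systems use 2^rel - 1 instead):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranked list divided by DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def mrr(ranked_relevant_flags):
    """Reciprocal rank of the first relevant result (0 if none is relevant)."""
    for i, is_relevant in enumerate(ranked_relevant_flags, start=1):
        if is_relevant:
            return 1.0 / i
    return 0.0

# Hypothetical graded labels for one query's ranked results (3 = perfect, 0 = irrelevant).
print(ndcg_at_k([3, 2, 0, 1, 0], k=10))   # ~0.985
print(mrr([False, True, False]))           # 0.5
```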
Online Metrics
| Metric | Description |
|---|---|
| Click-through rate | Clicks / Impressions |
| Successful search rate | Fraction of searches with a click in the top 3 results |
| Time to first click | How quickly users find what they want |
| Reformulation rate | How often users modify their query |
| Session success | Did the user complete their task? |
NDCG is the standard offline metric. Online metrics (especially session success) are what matter in production.
Architecture
Query Understanding
Before ranking, understand what the user wants.
Query Processing Pipeline
| Step | Description | Example |
|---|---|---|
| Tokenization | Split into terms | "blue shoes" -> ["blue", "shoes"] |
| Normalization | Lowercase, remove punctuation | "iPhone!" -> "iphone" |
| Spell correction | Fix typos | "laptpo" -> "laptop" |
| Stemming | Reduce to root | "running" -> "run" |
| Synonym expansion | Add related terms | "laptop" -> "laptop OR notebook" |
| Query embedding | Dense vector representation | "laptop" -> [0.23, -0.15, ...] |
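A minimal sketch of such a pipeline. The spell-correction and synonym tables are hypothetical stand-ins for dedicated services, and stemming and query embedding are only noted in comments:

```python
import re

# Hypothetical lookup tables; production systems use dedicated spell-correction
# and synonym services rather than static dictionaries.
SPELL_FIXES = {"laptpo": "laptop"}
SYNONYMS = {"laptop": ["notebook"]}

def process_query(raw_query: str) -> dict:
    # Normalization: lowercase and strip punctuation.
    text = re.sub(r"[^\w\s]", "", raw_query.lower())
    # Tokenization: whitespace split (real systems use language-aware tokenizers).
    tokens = text.split()
    # Spell correction via lookup.
    tokens = [SPELL_FIXES.get(t, t) for t in tokens]
    # Stemming and query embedding are omitted here; a real pipeline would apply
    # a stemmer (e.g. Porter) and call an embedding model for the dense vector.
    # Synonym expansion: each term is ORed with its synonyms at retrieval time.
    expanded = [[t] + SYNONYMS.get(t, []) for t in tokens]
    return {"tokens": tokens, "expanded": expanded}

print(process_query("Laptpo!"))
# {'tokens': ['laptop'], 'expanded': [['laptop', 'notebook']]}
```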
Intent Classification
Different intents require different ranking strategies:
| Intent | Query Example | Ranking Priority |
|---|---|---|
| Navigational | "amazon" | Exact match to destination |
| Informational | "how to cook rice" | Content quality, depth |
| Transactional | "buy iphone 15" | Price, availability, reviews |
| Local | "pizza near me" | Distance, ratings, hours |
Classify intent with a lightweight model. Use the classification to adjust feature weights.
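One possible implementation of that lightweight classifier is a linear model over character n-grams; the handful of labeled queries below is hypothetical and far smaller than a real training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; a real system would use many thousands of labeled queries.
queries = ["amazon", "facebook login", "how to cook rice", "what is bm25",
           "buy iphone 15", "cheap running shoes", "pizza near me", "plumber in austin"]
labels = ["navigational", "navigational", "informational", "informational",
          "transactional", "transactional", "local", "local"]

# Character n-grams are robust to typos and very short queries.
intent_model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
intent_model.fit(queries, labels)

print(intent_model.predict(["buy laptop"]))  # likely ['transactional'] for this toy model
```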
Candidate Retrieval
Running an expensive ML model over millions of documents per query is infeasible. First, narrow the corpus down to a manageable set of candidates.
Dual Retrieval Strategy
Lexical (BM25): Finds exact keyword matches. Fast, interpretable. Misses semantic similarity.
Semantic (Vector): Finds conceptually similar documents. Catches synonyms and related concepts. Can miss exact matches.
Use both. BM25 catches the obvious matches. Vector retrieval finds semantic matches BM25 misses.
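A sketch of merging the two candidate sets with reciprocal rank fusion (one common fusion choice, not the only option); `bm25_search` and `ann_search` are hypothetical retrieval clients that return ranked lists of (doc_id, score) pairs:

```python
def hybrid_retrieve(query, query_vector, bm25_search, ann_search, k=500):
    """Reciprocal rank fusion (RRF) of lexical and semantic candidates.

    bm25_search / ann_search are hypothetical retrieval clients; fusing on rank
    avoids having to calibrate BM25 scores against cosine similarities.
    """
    lexical = bm25_search(query, k)          # [(doc_id, bm25_score), ...]
    semantic = ann_search(query_vector, k)   # [(doc_id, cosine_sim), ...]

    fused = {}
    for results in (lexical, semantic):
        for rank, (doc_id, _score) in enumerate(results, start=1):
            # The conventional RRF constant of 60 softens the gap between ranks 1 and 2.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (60 + rank)

    # Pass the fused top candidates on to the ranking stages.
    return sorted(fused, key=fused.get, reverse=True)[:1000]
```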
Vector Index Design
For semantic retrieval, use approximate nearest neighbor (ANN) search:
| Algorithm | Index Build | Query Time | Recall | Memory |
|---|---|---|---|---|
| Flat (brute force) | O(n) | O(n) | 100% | High (full, uncompressed vectors) |
| IVF | O(n log n) | O(sqrt(n)) | 95%+ | Medium |
| HNSW | O(n log n) | O(log n) | 98%+ | High |
| Product Quantization | O(n) | O(n) with cheap table-lookup distances | 90%+ | Very Low |
Recommendation: HNSW for quality-sensitive applications. IVF+PQ for massive scale where memory matters.
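As an illustration, building and querying an HNSW index with the hnswlib library (one common implementation); the embeddings are random placeholders and the M / ef parameters are illustrative rather than tuned:

```python
import numpy as np
import hnswlib

dim = 128
doc_vectors = np.random.rand(10_000, dim).astype(np.float32)  # placeholder document embeddings

# Build an HNSW index over document embeddings using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(doc_vectors), M=16, ef_construction=200)
index.add_items(doc_vectors, ids=np.arange(len(doc_vectors)))

# ef controls the recall/latency trade-off at query time.
index.set_ef(100)

query_vector = np.random.rand(1, dim).astype(np.float32)
doc_ids, distances = index.knn_query(query_vector, k=500)
```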
Feature Engineering
Query-Document Features
| Feature | Type | Description |
|---|---|---|
| BM25 score | Numerical | Classic term-matching relevance |
| Query-doc embedding similarity | Numerical | Cosine similarity of embeddings |
| Title match | Binary/Numerical | Query terms in title |
| Exact phrase match | Binary | Entire query as phrase in document |
| Query coverage | Numerical | % of query terms in document |
| Term proximity | Numerical | How close are query terms? |
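A sketch of computing a few of these query-document features for one pair; the document fields and the `bm25_score` input are assumed to come from the index and the retrieval stage:

```python
import numpy as np

def query_document_features(query_terms, doc, query_emb, doc_emb, bm25_score):
    """Illustrative features; doc is assumed to expose title/body terms and raw body text."""
    title_terms = set(doc["title_terms"])
    body_terms = set(doc["body_terms"])
    q = set(query_terms)

    # Cosine similarity between query and document embeddings.
    cosine = float(np.dot(query_emb, doc_emb) /
                   (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb) + 1e-9))

    return {
        "bm25": bm25_score,
        "embedding_similarity": cosine,
        "title_match_ratio": len(q & title_terms) / max(len(q), 1),
        "query_coverage": len(q & (title_terms | body_terms)) / max(len(q), 1),
        "exact_phrase_match": float(" ".join(query_terms) in doc["body_text"]),
    }
```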
Document Quality Features
| Feature | Type | Description |
|---|---|---|
| Click popularity | Numerical | Historical clicks on this document |
| Dwell time | Numerical | Average time spent on document |
| Freshness | Numerical | Time since last update |
| Authority score | Numerical | PageRank, domain authority |
| Content quality | Numerical | Spam score, grammar, depth |
| Rating/reviews | Numerical | User ratings if applicable |
User Personalization Features
| Feature | Type | Description |
|---|---|---|
| User history similarity | Numerical | Similarity to user's past clicks |
| Category affinity | Numerical | User preference for document's category |
| Price sensitivity | Numerical | User's typical price range |
| Brand preference | Embedding | User's preferred brands |
| Recency of interest | Numerical | Recently clicked similar items |
Contextual Features
| Feature | Type | Description |
|---|---|---|
| Time of day | Categorical | Morning, afternoon, evening |
| Device | Categorical | Mobile vs desktop |
| Location | Categorical | City, country |
| Session context | Embedding | Previous queries in session |
Model Architecture
Learning to Rank Approaches
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Pointwise | Predict relevance score per document | Simple | Ignores relative ordering |
| Pairwise | Predict which doc is more relevant | Better ordering | O(n^2) pairs |
| Listwise | Optimize for ranking metric directly | Best performance | Complex training |
Recommendation: Start with LambdaMART (gradient-boosted trees trained with pairwise lambda gradients weighted by the change in NDCG). Move to listwise neural models for further gains.
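LambdaMART is available off the shelf, for example via LightGBM's ranker; the feature matrix, graded labels, and per-query group sizes below are placeholders:

```python
import numpy as np
import lightgbm as lgb

# Placeholder training data: 1000 query-document pairs, 20 features,
# graded relevance labels 0-3, grouped into queries of 10 results each.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 4, size=1000)
group_sizes = [10] * 100  # documents per query, in row order

ranker = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaMART: lambda gradients on boosted trees
    n_estimators=300,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group_sizes)

# At serving time, score one query's candidates and sort descending.
scores = ranker.predict(X[:10])
ranking = np.argsort(-scores)
```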
Two-Tower Architecture for Retrieval
Two encoder towers map the query and the document independently into a shared embedding space; relevance is the dot product (or cosine similarity) of the two vectors. Train with contrastive learning: clicked documents are positives, non-clicked documents (or other documents in the batch) are negatives. Document embeddings are pre-computed offline and stored in the ANN index; only the query embedding is computed at request time.
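A minimal PyTorch sketch of the two towers and an in-batch contrastive loss; the bag-of-embeddings towers are simplified stand-ins for production encoders (often transformers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Maps a bag of token ids to a normalized embedding."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, token_ids, offsets):
        x = self.proj(self.embed(token_ids, offsets))
        return F.normalize(x, dim=-1)

query_tower = Tower(vocab_size=50_000)
doc_tower = Tower(vocab_size=200_000)

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Clicked (query, doc) pairs sit on the diagonal as positives;
    the other documents in the batch serve as negatives."""
    logits = query_emb @ doc_emb.T / temperature          # [batch, batch]
    targets = torch.arange(query_emb.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: two (query, clicked-doc) pairs encoded as bags of token ids.
q_ids, q_offsets = torch.tensor([1, 2, 3, 4]), torch.tensor([0, 2])
d_ids, d_offsets = torch.tensor([10, 11, 12]), torch.tensor([0, 1])
loss = in_batch_contrastive_loss(query_tower(q_ids, q_offsets),
                                 doc_tower(d_ids, d_offsets))
```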
Multi-Stage Ranking Model
Each stage is more expensive per document but sees fewer candidates: retrieval returns thousands of documents, a cheap L1 model prunes them to a few hundred, the main L2 model scores the shortlist with the full feature set, and an L3 re-ranker applies business rules and diversity constraints to the final page (these are the same L1/L2/L3 stages that appear in the latency budget below).
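A sketch of the cascade as a single serving function; the retrieval client, L1/L2 models, and business-rule step are hypothetical components injected by the serving layer:

```python
def rank(query, context, retrieve, l1_model, l2_model, business_rules):
    """Multi-stage ranking cascade; all injected components are hypothetical."""
    candidates = retrieve(query)  # thousands of candidates from hybrid retrieval

    # L1: cheap pruning model over a handful of features, keep the top few hundred.
    shortlist = sorted(candidates,
                       key=lambda d: l1_model.score(query, d),
                       reverse=True)[:300]

    # L2: main learning-to-rank model over the full feature set.
    ranked = sorted(shortlist,
                    key=lambda d: l2_model.score(query, d, context),
                    reverse=True)[:50]

    # L3: business rules (diversity, availability, promotions) on the final page.
    return business_rules(ranked, context)
```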
Training Data
Click Data Challenges
Click data is biased:
- Position bias: Higher positions get more clicks regardless of relevance
- Presentation bias: Thumbnails, snippets affect clicks
- Trust bias: Users click highly ranked results because they trust the engine's ordering, even when a lower result is more relevant
Debiasing Techniques
| Technique | Description |
|---|---|
| Position weighting | Weight clicks by 1/position |
| Inverse propensity scoring | Model probability of seeing the result |
| Randomized experiments | Occasionally show random orderings |
| Click models | Model the click as P(relevant) x P(examined) |
Training Data Pipeline
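A sketch of turning raw impression logs into debiased training examples; the position-propensity table is assumed to come from a randomization experiment, and `extract_features` is a hypothetical feature join:

```python
# Hypothetical examination propensities P(examined | position); in practice these
# are estimated from randomization experiments or a click model.
POSITION_PROPENSITY = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}

def build_training_examples(impression_log, extract_features):
    """impression_log: iterable of dicts with query_id, doc_id, position, clicked."""
    examples = []
    for row in impression_log:
        propensity = POSITION_PROPENSITY.get(row["position"], 0.1)
        examples.append({
            "query_id": row["query_id"],
            "doc_id": row["doc_id"],
            "label": 1 if row["clicked"] else 0,
            # Inverse propensity weight: clicks at low positions are up-weighted;
            # non-clicks keep unit weight in this simple sketch.
            "weight": 1.0 / propensity if row["clicked"] else 1.0,
            "features": extract_features(row),  # hypothetical join against feature stores
        })
    return examples
```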
Serving
Latency Budget
Total: 100ms
| Stage | Budget | Notes |
|---|---|---|
| Query understanding | 10ms | Cached spell corrections |
| Retrieval | 20ms | Parallel BM25 + ANN |
| L1 ranking | 5ms | Simple model |
| L2 ranking | 30ms | Main ML model |
| L3 re-ranking | 5ms | Business rules |
| Response building | 10ms | Fetch snippets, images |
| Network/overhead | 20ms | Buffer |
Caching Strategy
- Query cache: Exact query matches (popular queries)
- Document cache: Pre-computed document features
- Embedding cache: Document embeddings for semantic search
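A sketch of the query cache as a small in-process LRU, assuming a hypothetical `ranker` callable for the full pipeline; a production system would more likely use a shared cache (e.g., Redis) with a short TTL and would bypass the cache for personalized results:

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for fully ranked results of popular, non-personalized queries."""
    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)          # mark as recently used
        return self._store[key]

    def put(self, key, results):
        self._store[key] = results
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)   # evict the least recently used entry

def cached_search(query_text, cache, ranker):
    key = query_text.strip().lower()
    results = cache.get(key)
    if results is None:
        results = ranker(key)                 # full retrieval + ranking pipeline
        cache.put(key, results)
    return results
```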
Monitoring
Quality Metrics
| Metric | Alert Threshold |
|---|---|
| Relevance score distribution | Shift > 1 std |
| Zero-result rate | > 2% |
| Click-through rate | Drop > 10% |
| Latency p99 | > 100ms |
Online Experiments
Every ranking change should be A/B tested:
- Define a primary metric (e.g., successful search rate)
- Define guardrail metrics (latency, revenue)
- Run the experiment until it reaches statistical significance (a simple significance check is sketched below)
- Monitor for long-term effects
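For the significance step, a simple two-proportion z-test on successful search rate (the counts below are hypothetical; real experimentation platforms also handle sequential testing, variance reduction, and multiple comparisons):

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Z-test for a difference in success rates between control (A) and treatment (B)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical successful-search counts: control vs. new ranking model.
z, p = two_proportion_z_test(successes_a=51_200, n_a=100_000,
                             successes_b=52_000, n_b=100_000)
print(f"z={z:.2f}, p={p:.4f}")
```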
Reference
| Topic | Description |
|---|---|
| Cold-start for new documents | Use content features heavily for new documents. Explore-exploit: show new documents occasionally to gather data. |
| Position bias handling | Inverse propensity weighting during training. Randomization experiments to measure true relevance. |
| Personalization vs privacy | On-device personalization, federated learning, or aggregate cohorts instead of individual profiles. |
| Real-time inventory | Filter retrieval by availability. Include availability as a ranking feature. Re-rank frequently for fast-changing inventory. |
| Relevance vs engagement | Clickbait ranks well but frustrates users. Optimize for session success, not just clicks. |
| Personalization vs discovery | Show what users like vs. expose them to new things. Balance with diversity constraints. |