Design a Search Engine
Build a search engine that can index billions of web pages and return relevant results in under 200ms.
Related Concepts: Database Scaling, Load Balancing, CDN
Step 1: Requirements and Scope
Functional Requirements
| Requirement | Description |
|---|---|
| Web crawling | Discover and fetch web pages |
| Indexing | Process and store page content |
| Query processing | Parse and understand search queries |
| Ranking | Return relevant results ordered by quality |
| Snippet generation | Show previews of matching content |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | < 200ms p99 | Users abandon slow searches |
| Freshness | Hours for news, days for general content | Stale results frustrate users |
| Availability | 99.99% | Search is mission-critical |
| Scale | 100B+ pages indexed | Cover the web |
Scale Estimation
- Web pages: 100 billion indexed pages
- Average page size: 100KB (after HTML parsing)
- Queries per day: 10 billion (115K QPS)
- Index size: Compressed inverted index ~10PB
- Crawl rate: 1 billion pages/day to stay fresh
Step 2: High-Level Architecture
At a high level, the write path (crawling and indexing, Steps 3-4) builds the inverted index, and the read path (query processing, ranking, and serving, Steps 5-7) answers searches against it. Step 8 covers keeping both paths fresh.
Step 3: Web Crawling
The crawler discovers and fetches billions of pages efficiently.
URL Frontier Design
The frontier is a sophisticated scheduler, not a simple queue.
Politeness: One request per domain every few seconds to avoid overloading sites.
Priority: Important pages (news sites, popular domains) get crawled more frequently.
Deduplication: Use a Bloom filter for quick checks, then verify against the full URL database.
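A minimal sketch of the frontier's politeness and dedup logic, assuming per-domain queues and a fixed crawl delay (`CRAWL_DELAY`, the Bloom filter sizing, and all names here are illustrative, not from the original):

```python
import hashlib
import time
from collections import deque

CRAWL_DELAY = 2.0  # assumed politeness interval per domain, in seconds

class BloomFilter:
    """Tiny Bloom filter: fast 'probably seen' checks before hitting the URL DB."""
    def __init__(self, size_bits=1 << 24, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

class Frontier:
    """Per-domain queues plus a politeness timer; new URLs pass a Bloom check first."""
    def __init__(self):
        self.seen = BloomFilter()
        self.queues = {}      # domain -> deque of URLs
        self.next_fetch = {}  # domain -> earliest allowed fetch time

    def enqueue(self, domain, url):
        if self.seen.might_contain(url):
            return  # probably a duplicate; a real system verifies against the URL DB
        self.seen.add(url)
        self.queues.setdefault(domain, deque()).append(url)

    def next_url(self):
        now = time.monotonic()
        for domain, queue in self.queues.items():
            if queue and self.next_fetch.get(domain, 0) <= now:
                self.next_fetch[domain] = now + CRAWL_DELAY
                return queue.popleft()
        return None  # nothing fetchable yet; caller retries later
```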
Crawling at Scale
| Challenge | Solution |
|---|---|
| DNS bottleneck | Local DNS cache, batch lookups |
| robots.txt | Cache per domain, refresh weekly |
| Duplicates | URL normalization, content hashing |
| Spider traps | Max depth, max pages per domain |
| Freshness | Re-crawl frequency based on change rate |
Step 4: Indexing
The index enables fast lookups instead of scanning 100 billion pages per query.
Inverted Index
For each word, store a list of documents containing it. Search becomes:
- Look up each query term
- Intersect the posting lists
- Rank the results
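A toy, in-memory version of this flow, assuming whitespace tokenization (plain dicts stand in for the distributed index):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns term -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query):
    lists = [index.get(t, []) for t in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # intersect the rarest term first to shrink work early
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

docs = {1: "quick brown fox", 2: "brown dog", 3: "quick brown dog"}
index = build_index(docs)
print(search(index, "quick brown"))  # [1, 3]
```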
Posting List Structure
Each entry in a posting list contains:
- doc_id: The unique document identifier (such as 12345)
- term_frequency: How many times the term appears in this document (such as 3)
- positions: The word positions where the term occurs (such as positions 15, 89, and 234)
- field: Which part of the document contains the term (title, body, or url)
Positions enable phrase queries (finding "quick brown fox" as a consecutive phrase). Field information enables field boosting (title matches rank higher than body matches).
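A sketch of a position-aware posting entry and the adjacency check behind phrase queries; the fields mirror the list above, but the concrete layout is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    term_frequency: int
    positions: list[int]  # word offsets of the term within the document
    field: str            # "title", "body", or "url"

def phrase_match(postings_by_term):
    """postings_by_term: one Posting per query term (in query order),
    all for the same document. True if the terms occur consecutively."""
    first = postings_by_term[0]
    for start in first.positions:
        # term k must sit at position start + k for every later term
        if all(start + k in p.positions for k, p in enumerate(postings_by_term)):
            return True
    return False

# "quick brown fox" matches because positions 15, 16, 17 line up
quick = Posting(12345, 3, [15, 89, 234], "body")
brown = Posting(12345, 1, [16], "body")
fox   = Posting(12345, 2, [17, 90], "body")
print(phrase_match([quick, brown, fox]))  # True
```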
Index Partitioning
100 billion documents require partitioning across machines.
| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Document partitioned | Each shard has all terms for a subset of docs | Balanced load; adding documents is simple | Every query fans out to all shards |
| Term partitioned | Each shard has all docs for a subset of terms | Query only hits relevant shards | Popular terms cause hotspots |
Recommendation: Document-partitioned. Simpler and handles popularity skew better.
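A minimal sketch of document-partitioned routing, assuming a fixed shard count (`NUM_SHARDS` is illustrative) and reusing the toy `search` from Step 4:

```python
NUM_SHARDS = 64  # illustrative; real deployments size shards to fit memory and disk

def shard_for(doc_id: int) -> int:
    """A document lives on exactly one shard, together with all of its terms."""
    return doc_id % NUM_SHARDS

def scatter(query: str, shard_indexes: list) -> list:
    """No shard can be skipped: each holds a disjoint slice of the documents."""
    return [search(index, query) for index in shard_indexes]
```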
Step 5: Query Processing
Query Understanding
Parse the query before hitting the index:
Spell correction: "restuarant" to "restaurant" using edit distance and query logs.
Query expansion: Add synonyms, handle acronyms (NYC to New York City).
Intent detection: "weather" likely wants the current forecast, not the Wikipedia article.
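Spell correction can be sketched as: generate candidates within a small edit distance of the query term, then pick the one with the highest query-log frequency (the `QUERY_LOG` counts below are made up for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

QUERY_LOG = {"restaurant": 90_000, "restart": 12_000, "resultant": 300}  # illustrative

def correct(term, max_dist=2):
    candidates = [(w, n) for w, n in QUERY_LOG.items()
                  if edit_distance(term, w) <= max_dist]
    return max(candidates, key=lambda wn: wn[1])[0] if candidates else term

print(correct("restuarant"))  # restaurant
```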
Query Execution
Each shard returns its top-K results. The query processor merges these and performs final ranking.
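Merging per-shard top-K lists is a k-way merge over (score, doc_id) pairs; a compact sketch using `heapq.nlargest`:

```python
import heapq

def merge_topk(shard_results, k=10):
    """shard_results: one [(score, doc_id), ...] list per shard.
    Returns the global top-k hits by score."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

shard_a = [(0.92, 101), (0.40, 103)]
shard_b = [(0.88, 202), (0.75, 205)]
print(merge_topk([shard_a, shard_b], k=3))
# [(0.92, 101), (0.88, 202), (0.75, 205)]
```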
Step 6: Ranking
Relevance Scoring
TF-IDF: Basic relevance metric.
- Term Frequency: How often does this word appear in the document?
- Inverse Document Frequency: How rare is this word across all documents?
The TF-IDF score is tf × log(N / df), where tf is the term's frequency in the document, N is the total number of documents, and df is the number of documents containing the term. Rare terms contribute more to relevance than common terms.
BM25: Modern improvement over TF-IDF. Handles document length normalization and term frequency saturation.
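Both scores in a few lines; the BM25 parameters k1=1.2 and b=0.75 are the conventional defaults, an assumption rather than anything specified above:

```python
import math

def tf_idf(tf, df, total_docs):
    """tf: term count in the doc; df: number of docs containing the term."""
    return tf * math.log(total_docs / df)

def bm25(tf, df, total_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25: term frequency saturates via k1; long docs are normalized via b."""
    idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A rare term (df=100) outweighs a common one (df=50M) at equal tf
N = 100_000_000_000
print(tf_idf(3, 100, N), tf_idf(3, 50_000_000, N))
```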
PageRank and Link Analysis
PageRank: Pages linked to by important pages are themselves important. Run iteratively until scores converge.
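A power-iteration sketch of PageRank over an adjacency dict; the damping factor 0.85 is the conventional choice, and dangling pages are handled only crudely here:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns page -> score."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for src, dsts in links.items():
            if not dsts:
                continue  # dangling page: its mass is dropped in this sketch
            share = damping * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))  # "c" ends up highest: both "a" and "b" link to it
```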
Two-Phase Ranking
Expensive ML models cannot run on millions of candidates. Filter first with cheap signals, then rank carefully.
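The shape of the pipeline, with trivial stand-ins for both passes (`cheap_score` approximating BM25 and `expensive_score` standing in for a learned ranker; both are illustrative, not the real scorers):

```python
def cheap_score(query, doc):
    """Stand-in for BM25: count query-term occurrences in the document."""
    words = doc.lower().split()
    return sum(words.count(t) for t in query.lower().split())

def expensive_score(query, doc):
    """Stand-in for an ML ranker; in practice a model over many signals."""
    return cheap_score(query, doc) + 0.1 * len(set(doc.lower().split()))

def rank(query, candidates, first_pass_size=1000, k=10):
    # Phase 1: cheap lexical score trims millions of candidates to a shortlist
    shortlist = sorted(candidates, key=lambda d: cheap_score(query, d),
                       reverse=True)[:first_pass_size]
    # Phase 2: the expensive model reorders only that shortlist
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d),
                      reverse=True)
    return reranked[:k]
```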
Ranking Signals
| Signal Type | Examples |
|---|---|
| Query-document | BM25, title match, URL match |
| Document quality | PageRank, domain authority, spam score |
| Freshness | Last modified, crawl date |
| User engagement | Click-through rate, dwell time |
| Personalization | Location, search history, language |
Step 7: Serving Infrastructure
Tiered Caching
Result cache: Store complete results for popular queries. "weather" and "facebook" are asked millions of times.
Posting list cache: Keep posting lists for common terms in memory.
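A result cache is essentially an LRU map keyed on the normalized query; a minimal sketch with `OrderedDict` (capacity and normalization are assumptions):

```python
from collections import OrderedDict

class ResultCache:
    """LRU cache for the full result pages of popular queries."""
    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    @staticmethod
    def _key(query):
        return " ".join(query.lower().split())  # normalize case and whitespace

    def get(self, query):
        key = self._key(query)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, query, results):
        key = self._key(query)
        self.entries[key] = results
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```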
Geographic Distribution
Users in Tokyo should not wait for a round trip to California.
Full index replicas in each major region. Route users to nearest data center.
Step 8: Freshness
Some content, such as breaking news, must be searchable within seconds of publication, while most pages can tolerate hours or days of index latency.
Tiered Crawling
| Tier | Content Type | Crawl Frequency | Index Latency |
|---|---|---|---|
| Real-time | Breaking news, social | Continuous | Seconds |
| Frequent | News sites, popular blogs | Hourly | Minutes |
| Normal | Most web pages | Daily-weekly | Hours |
| Archive | Old/static content | Monthly | Days |
Incremental Indexing
Do not rebuild the entire index when one page changes. Apply incremental updates:
- Crawl changed page
- Diff against previous version
- Update posting lists for changed terms
- Merge small updates into main index periodically
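A sketch of the diff-and-update step: compare old and new term sets and touch only the posting lists that changed (the `index` here is the toy dict-of-sorted-lists from Step 4):

```python
def update_document(index, doc_id, old_text, new_text):
    """Apply an incremental update for one changed page."""
    old_terms = set(old_text.lower().split())
    new_terms = set(new_text.lower().split())
    for term in old_terms - new_terms:      # term removed from the page
        postings = index.get(term, [])
        if doc_id in postings:
            postings.remove(doc_id)
    for term in new_terms - old_terms:      # term added to the page
        postings = index.setdefault(term, [])
        if doc_id not in postings:
            postings.append(doc_id)
            postings.sort()                 # posting lists stay sorted for merging
    # terms present in both versions are untouched; small updates like this
    # are periodically merged into the main index
```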