Design a Search Engine
Build a search engine that can index billions of web pages and return relevant results in under 200ms.
Related Concepts: Database Scaling, Load Balancing, CDN
Step 1: Requirements and Scope
Functional Requirements
| Requirement | Description |
|---|---|
| Web crawling | Discover and fetch web pages |
| Indexing | Process and store page content |
| Query processing | Parse and understand search queries |
| Ranking | Return relevant results ordered by quality |
| Snippet generation | Show previews of matching content |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | < 200ms p99 | Users abandon slow searches |
| Freshness | Hours for news, days for general content | Stale results frustrate users |
| Availability | 99.99% | Search is mission-critical |
| Scale | 100B+ pages indexed | Cover the web |
Scale Estimation
- Web pages: 100 billion indexed pages
- Average page size: 100KB (after HTML parsing)
- Queries per day: 10 billion (115K QPS)
- Index size: Compressed inverted index ~10PB
- Crawl rate: 1 billion pages/day to stay fresh
Step 2: High-Level Architecture
At a high level, the write path (crawling and indexing, Steps 3-4) builds the inverted index, and the read path (query processing, ranking, and serving, Steps 5-7) answers searches against it. Step 8 covers keeping both paths fresh.
Step 3: Web Crawling
The crawler discovers and fetches billions of pages efficiently.
URL Frontier Design
The frontier is a sophisticated scheduler, not a simple queue.
Politeness: One request per domain every few seconds to avoid overloading sites.
Priority: Important pages (news sites, popular domains) get crawled more frequently.
Deduplication: Use a Bloom filter for quick checks, then verify against the full URL database.
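A minimal sketch of the frontier's politeness and dedup logic, assuming per-domain queues and a fixed crawl delay (`CRAWL_DELAY`, the Bloom filter sizing, and all names here are illustrative, not from the original):

```python
import hashlib
import time
from collections import deque

CRAWL_DELAY = 2.0  # assumed politeness interval per domain, in seconds

class BloomFilter:
    """Tiny Bloom filter: fast 'probably seen' checks before hitting the URL DB."""
    def __init__(self, size_bits=1 << 24, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

class Frontier:
    """Per-domain queues plus a politeness timer; new URLs pass a Bloom check first."""
    def __init__(self):
        self.seen = BloomFilter()
        self.queues = {}      # domain -> deque of URLs
        self.next_fetch = {}  # domain -> earliest allowed fetch time

    def enqueue(self, domain, url):
        if self.seen.might_contain(url):
            return  # probably a duplicate; a real system verifies against the URL DB
        self.seen.add(url)
        self.queues.setdefault(domain, deque()).append(url)

    def next_url(self):
        now = time.monotonic()
        for domain, queue in self.queues.items():
            if queue and self.next_fetch.get(domain, 0) <= now:
                self.next_fetch[domain] = now + CRAWL_DELAY
                return queue.popleft()
        return None  # nothing fetchable yet; caller retries later
```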
Crawling at Scale
| Challenge | Solution |
|---|---|
| DNS bottleneck | Local DNS cache, batch lookups |
| robots.txt | Cache per domain, refresh weekly |
| Duplicates | URL normalization, content hashing |
| Spider traps | Max depth, max pages per domain |
| Freshness | Re-crawl frequency based on change rate |
Step 4: Indexing
The index enables fast lookups instead of scanning 100 billion pages per query.
Inverted Index
For each word, store a list of documents containing it. Search becomes:
- Look up each query term
- Intersect the posting lists
- Rank the results
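A toy, in-memory version of this flow, assuming whitespace tokenization (plain dicts stand in for the distributed index):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns term -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query):
    lists = [index.get(t, []) for t in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # intersect the rarest term first to shrink work early
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

docs = {1: "quick brown fox", 2: "brown dog", 3: "quick brown dog"}
index = build_index(docs)
print(search(index, "quick brown"))  # [1, 3]
```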
Posting List Structure
Each entry in a posting list contains:
- doc_id: The unique document identifier (such as 12345)
- term_frequency: How many times the term appears in this document (such as 3)
- positions: The word positions where the term occurs (such as positions 15, 89, and 234)
- field: Which part of the document contains the term (title, body, or url)
Positions enable phrase queries (finding "quick brown fox" as a consecutive phrase). Field information enables field boosting (title matches rank higher than body matches).
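A sketch of a position-aware posting entry and the adjacency check behind phrase queries; the fields mirror the list above, but the concrete layout is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    term_frequency: int
    positions: list[int]  # word offsets of the term within the document
    field: str            # "title", "body", or "url"

def phrase_match(postings_by_term):
    """postings_by_term: one Posting per query term (in query order),
    all for the same document. True if the terms occur consecutively."""
    first = postings_by_term[0]
    for start in first.positions:
        # term k must sit at position start + k for every later term
        if all(start + k in p.positions for k, p in enumerate(postings_by_term)):
            return True
    return False

# "quick brown fox" matches because positions 15, 16, 17 line up
quick = Posting(12345, 3, [15, 89, 234], "body")
brown = Posting(12345, 1, [16], "body")
fox   = Posting(12345, 2, [17, 90], "body")
print(phrase_match([quick, brown, fox]))  # True
```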
Index Partitioning
100 billion documents require partitioning across machines.
| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Document partitioned | Each shard has all terms for a subset of docs | Balanced load; adding documents is simple | Every query fans out to all shards |
| Term partitioned | Each shard has all docs for a subset of terms | Query only hits relevant shards | Popular terms cause hotspots |
Recommendation: Document-partitioned. Simpler and handles popularity skew better.
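A minimal sketch of document-partitioned routing, assuming a fixed shard count (`NUM_SHARDS` is illustrative) and reusing the toy `search` from Step 4:

```python
NUM_SHARDS = 64  # illustrative; real deployments size shards to fit memory and disk

def shard_for(doc_id: int) -> int:
    """A document lives on exactly one shard, together with all of its terms."""
    return doc_id % NUM_SHARDS

def scatter(query: str, shard_indexes: list) -> list:
    """No shard can be skipped: each holds a disjoint slice of the documents."""
    return [search(index, query) for index in shard_indexes]
```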
Step 5: Query Processing
Query Understanding
Parse the query before hitting the index:
Spell correction: "restuarant" to "restaurant" using edit distance and query logs.
Query expansion: Add synonyms, handle acronyms (NYC to New York City).
Intent detection: "weather" likely wants the current forecast, not the Wikipedia article.
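Spell correction can be sketched as: generate candidates within a small edit distance of the query term, then pick the one with the highest query-log frequency (the `QUERY_LOG` counts below are made up for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

QUERY_LOG = {"restaurant": 90_000, "restart": 12_000, "resultant": 300}  # illustrative

def correct(term, max_dist=2):
    candidates = [(w, n) for w, n in QUERY_LOG.items()
                  if edit_distance(term, w) <= max_dist]
    return max(candidates, key=lambda wn: wn[1])[0] if candidates else term

print(correct("restuarant"))  # restaurant
```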
Query Execution
Each shard returns its top-K results. The query processor merges these and performs final ranking.
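Merging per-shard top-K lists is a k-way merge over (score, doc_id) pairs; a compact sketch using `heapq.nlargest`:

```python
import heapq

def merge_topk(shard_results, k=10):
    """shard_results: one [(score, doc_id), ...] list per shard.
    Returns the global top-k hits by score."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

shard_a = [(0.92, 101), (0.40, 103)]
shard_b = [(0.88, 202), (0.75, 205)]
print(merge_topk([shard_a, shard_b], k=3))
# [(0.92, 101), (0.88, 202), (0.75, 205)]
```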
Step 6: Ranking
Relevance Scoring
TF-IDF: Basic relevance metric.
- Term Frequency: How often does this word appear in the document?
- Inverse Document Frequency: How rare is this word across all documents?
The TF-IDF score is tf × log(N / df), where tf is the term's frequency in the document, N is the total number of documents, and df is the number of documents containing the term. Rare terms contribute more to relevance than common terms.
BM25: Modern improvement over TF-IDF. Handles document length normalization and term frequency saturation.
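Both scores in a few lines; the BM25 parameters k1=1.2 and b=0.75 are the conventional defaults, an assumption rather than anything specified above:

```python
import math

def tf_idf(tf, df, total_docs):
    """tf: term count in the doc; df: number of docs containing the term."""
    return tf * math.log(total_docs / df)

def bm25(tf, df, total_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25: term frequency saturates via k1; long docs are normalized via b."""
    idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A rare term (df=100) outweighs a common one (df=50M) at equal tf
N = 100_000_000_000
print(tf_idf(3, 100, N), tf_idf(3, 50_000_000, N))
```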
PageRank and Link Analysis
PageRank: Pages linked to by important pages are themselves important. Run iteratively until scores converge.
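A power-iteration sketch of PageRank over an adjacency dict; the damping factor 0.85 is the conventional choice, and dangling pages are handled only crudely here:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns page -> score."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for src, dsts in links.items():
            if not dsts:
                continue  # dangling page: its mass is dropped in this sketch
            share = damping * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))  # "c" ends up highest: both "a" and "b" link to it
```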
Two-Phase Ranking
Expensive ML models cannot run on millions of candidates. Filter first with cheap signals, then rank carefully.
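The shape of the pipeline, with trivial stand-ins for both passes (`cheap_score` approximating BM25 and `expensive_score` standing in for a learned ranker; both are illustrative, not the real scorers):

```python
def cheap_score(query, doc):
    """Stand-in for BM25: count query-term occurrences in the document."""
    words = doc.lower().split()
    return sum(words.count(t) for t in query.lower().split())

def expensive_score(query, doc):
    """Stand-in for an ML ranker; in practice a model over many signals."""
    return cheap_score(query, doc) + 0.1 * len(set(doc.lower().split()))

def rank(query, candidates, first_pass_size=1000, k=10):
    # Phase 1: cheap lexical score trims millions of candidates to a shortlist
    shortlist = sorted(candidates, key=lambda d: cheap_score(query, d),
                       reverse=True)[:first_pass_size]
    # Phase 2: the expensive model reorders only that shortlist
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d),
                      reverse=True)
    return reranked[:k]
```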
Ranking Signals
| Signal Type | Examples |
|---|---|
| Query-document | BM25, title match, URL match |
| Document quality | PageRank, domain authority, spam score |
| Freshness | Last modified, crawl date |
| User engagement | Click-through rate, dwell time |
| Personalization | Location, search history, language |
Step 7: Serving Infrastructure
Tiered Caching
Result cache: Store complete results for popular queries. "weather" and "facebook" are asked millions of times.
Posting list cache: Keep posting lists for common terms in memory.
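A result cache is essentially an LRU map keyed on the normalized query; a minimal sketch with `OrderedDict` (capacity and normalization are assumptions):

```python
from collections import OrderedDict

class ResultCache:
    """LRU cache for the full result pages of popular queries."""
    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    @staticmethod
    def _key(query):
        return " ".join(query.lower().split())  # normalize case and whitespace

    def get(self, query):
        key = self._key(query)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, query, results):
        key = self._key(query)
        self.entries[key] = results
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```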
Geographic Distribution
Users in Tokyo should not wait for a round trip to California.
Full index replicas in each major region. Route users to nearest data center.
Step 8: Freshness
Some content, such as breaking news, must be searchable within seconds of publication, while most pages can tolerate hours or days of index latency.
Tiered Crawling
| Tier | Content Type | Crawl Frequency | Index Latency |
|---|---|---|---|
| Real-time | Breaking news, social | Continuous | Seconds |
| Frequent | News sites, popular blogs | Hourly | Minutes |
| Normal | Most web pages | Daily-weekly | Hours |
| Archive | Old/static content | Monthly | Days |
Incremental Indexing
Do not rebuild the entire index when one page changes. Apply incremental updates:
- Crawl changed page
- Diff against previous version
- Update posting lists for changed terms
- Merge small updates into main index periodically
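A sketch of the diff-and-update step: compare old and new term sets and touch only the posting lists that changed (the `index` here is the toy dict-of-sorted-lists from Step 4):

```python
def update_document(index, doc_id, old_text, new_text):
    """Apply an incremental update for one changed page."""
    old_terms = set(old_text.lower().split())
    new_terms = set(new_text.lower().split())
    for term in old_terms - new_terms:      # term removed from the page
        postings = index.get(term, [])
        if doc_id in postings:
            postings.remove(doc_id)
    for term in new_terms - old_terms:      # term added to the page
        postings = index.setdefault(term, [])
        if doc_id not in postings:
            postings.append(doc_id)
            postings.sort()                 # posting lists stay sorted for merging
    # terms present in both versions are untouched; small updates like this
    # are periodically merged into the main index
```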