
Design a Search Engine

Build a search engine that can index billions of web pages and return relevant results in under 200ms.

Related Concepts: Database Scaling, Load Balancing, CDN

Step 1: Requirements and Scope

Functional Requirements

| Requirement | Description |
|---|---|
| Web crawling | Discover and fetch web pages |
| Indexing | Process and store page content |
| Query processing | Parse and understand search queries |
| Ranking | Return relevant results ordered by quality |
| Snippet generation | Show previews of matching content |

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Latency | < 200ms p99 | Users abandon slow searches |
| Freshness | Hours for news, days for general content | Stale results frustrate users |
| Availability | 99.99% | Search is mission-critical |
| Scale | 100B+ pages indexed | Cover the web |

Scale Estimation

  • Web pages: 100 billion indexed pages
  • Average page size: 100KB (after HTML parsing)
  • Queries per day: 10 billion (115K QPS)
  • Index size: Compressed inverted index ~10PB
  • Crawl rate: 1 billion pages/day to stay fresh
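A quick back-of-envelope check of these figures (a rough sketch; the constants are taken directly from the estimates above):

```python
# Rough arithmetic behind the scale estimates above.
PAGES = 100e9                  # 100 billion indexed pages
AVG_PAGE_BYTES = 100 * 1024    # ~100KB of parsed content per page
QUERIES_PER_DAY = 10e9
CRAWLED_PER_DAY = 1e9

qps = QUERIES_PER_DAY / 86_400             # ~115,700 queries/second on average
crawl_rate = CRAWLED_PER_DAY / 86_400      # ~11,600 pages fetched/second
corpus_pb = PAGES * AVG_PAGE_BYTES / 1e15  # ~10 PB of parsed page content

print(f"Average QPS: {qps:,.0f}")
print(f"Crawl rate: {crawl_rate:,.0f} pages/sec")
print(f"Parsed corpus: {corpus_pb:.0f} PB")
```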

Step 2: High-Level Architecture


Step 3: Web Crawling

The crawler discovers and fetches billions of pages efficiently.

URL Frontier Design

The frontier is a sophisticated scheduler, not a simple queue.


Politeness: One request per domain every few seconds to avoid overloading sites.

Priority: Important pages (news sites, popular domains) get crawled more frequently.

Deduplication: Use a Bloom filter for quick checks, then verify against the full URL database.
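A minimal frontier sketch that enforces per-domain politeness and deduplicates URLs. The two-second delay is an assumed value, a plain set stands in for the Bloom filter plus URL database, and priority scheduling is omitted:

```python
import time
import heapq
from collections import deque
from urllib.parse import urlparse

POLITENESS_DELAY = 2.0  # assumed seconds between requests to one domain

class URLFrontier:
    def __init__(self):
        self.seen = set()            # stand-in for Bloom filter + full URL database
        self.domain_queues = {}      # domain -> deque of URLs waiting to be crawled
        self.ready = []              # min-heap of (next_allowed_time, domain)

    def add(self, url: str) -> None:
        if url in self.seen:         # deduplication check
            return
        self.seen.add(url)
        domain = urlparse(url).netloc
        if domain not in self.domain_queues:
            self.domain_queues[domain] = deque()
            heapq.heappush(self.ready, (time.time(), domain))
        self.domain_queues[domain].append(url)

    def next_url(self) -> str | None:
        """Return a URL whose domain may politely be crawled now, or None."""
        while self.ready:
            next_time, domain = self.ready[0]
            if next_time > time.time():
                return None          # nothing is polite to crawl yet
            heapq.heappop(self.ready)
            queue = self.domain_queues[domain]
            if not queue:
                del self.domain_queues[domain]   # drained; re-added on next URL
                continue
            url = queue.popleft()
            # Re-schedule the domain only after the politeness delay.
            heapq.heappush(self.ready, (time.time() + POLITENESS_DELAY, domain))
            return url
        return None
```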

Crawling at Scale

| Challenge | Solution |
|---|---|
| DNS bottleneck | Local DNS cache, batch lookups |
| robots.txt | Cache per domain, refresh weekly |
| Duplicates | URL normalization, content hashing |
| Spider traps | Max depth, max pages per domain |
| Freshness | Re-crawl frequency based on change rate |

Step 4: Indexing

The index enables fast lookups instead of scanning 100 billion pages per query.

Inverted Index


For each word, store a list of documents containing it. Search becomes:

  1. Look up each query term
  2. Intersect the posting lists
  3. Rank the results
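A small in-memory sketch of those three steps; ranking here is just summed term frequency, standing in for the real scoring covered in Step 6:

```python
from collections import defaultdict

# term -> {doc_id: term_frequency}
index: dict[str, dict[int, int]] = defaultdict(dict)

def add_document(doc_id: int, text: str) -> None:
    for word in text.lower().split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

def search(query: str, k: int = 10) -> list[int]:
    terms = query.lower().split()
    postings = [index.get(t, {}) for t in terms]               # 1. look up each term
    if not postings or any(not p for p in postings):
        return []
    doc_ids = set.intersection(*(set(p) for p in postings))    # 2. intersect posting lists
    # 3. rank: here simply by summed term frequency
    return sorted(doc_ids, key=lambda d: -sum(p[d] for p in postings))[:k]

add_document(1, "the quick brown fox")
add_document(2, "the lazy brown dog")
print(search("brown fox"))   # -> [1]
```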

Posting List Structure

Each entry in a posting list contains:

  • doc_id: The unique document identifier (such as 12345)
  • term_frequency: How many times the term appears in this document (such as 3)
  • positions: The word positions where the term occurs (such as positions 15, 89, and 234)
  • field: Which part of the document contains the term (title, body, or url)

Positions enable phrase queries (finding "quick brown fox" as a consecutive phrase). Field information enables field boosting (title matches rank higher than body matches).
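A sketch of one posting entry and a position-based phrase check; the field names follow the list above, and the sample positions are invented:

```python
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    term_frequency: int
    positions: list[int]   # word offsets of the term within the document
    field: str             # "title", "body", or "url"

def phrase_match(postings_by_term: list[Posting]) -> bool:
    """True if the terms occur consecutively in a document (same doc assumed)."""
    starts = postings_by_term[0].positions
    # Each later term must appear exactly i positions after some start position.
    return any(
        all(start + i in set(p.positions) for i, p in enumerate(postings_by_term))
        for start in starts
    )

quick = Posting(12345, 1, [15], "body")
brown = Posting(12345, 1, [16, 89], "body")
fox   = Posting(12345, 3, [17, 90, 234], "body")
print(phrase_match([quick, brown, fox]))   # True: positions 15, 16, 17
```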

Index Partitioning

100 billion documents require partitioning across machines.

| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Document partitioned | Each shard holds all terms for a subset of docs | Simple; shards index and score independently; load spreads evenly | Every query fans out to all shards |
| Term partitioned | Each shard holds all docs for a subset of terms | A query only hits the shards owning its terms | Popular terms cause hotspots; multi-term queries need cross-shard intersection |

Recommendation: Document-partitioned. Simpler and handles popularity skew better.
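A tiny sketch of document partitioning; hash-mod shard assignment and the shard count are assumptions for illustration:

```python
NUM_SHARDS = 1024   # assumed shard count

def shard_for_document(doc_id: int) -> int:
    # Document-partitioned: all of a document's posting data lives on one shard,
    # so posting-list intersection for that document stays local to the shard.
    return hash(doc_id) % NUM_SHARDS

def shards_for_query(terms: list[str]) -> range:
    # Any shard may hold matching documents, so the query hits every shard.
    return range(NUM_SHARDS)
```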


Step 5: Query Processing

Query Understanding

Parse the query before hitting the index:


Spell correction: "restuarant" to "restaurant" using edit distance and query logs.

Query expansion: Add synonyms, handle acronyms (NYC to New York City).

Intent detection: "weather" likely wants current weather, not Wikipedia article.
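A toy sketch of the spell-correction and expansion steps; the vocabulary and synonym table are invented stand-ins for what a real system mines from query logs:

```python
import difflib

VOCAB = {"restaurant", "weather", "new york city"}   # from query logs (illustrative)
SYNONYMS = {"nyc": "new york city"}                  # acronym/synonym table (illustrative)

def understand(query: str) -> list[str]:
    terms = []
    for word in query.lower().split():
        word = SYNONYMS.get(word, word)              # query expansion
        if word not in VOCAB:
            # Spell correction: closest known term by edit-distance-style similarity.
            match = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
            word = match[0] if match else word
        terms.append(word)
    return terms

print(understand("restuarant nyc"))   # ['restaurant', 'new york city']
```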

Query Execution


Each shard returns its top-K results. The query processor merges these and performs final ranking.
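A sketch of that merge: each shard contributes its top-K (score, doc_id) pairs and the query processor keeps the global best; the sample scores are made up:

```python
import heapq

def merge_shard_results(shard_results: list[list[tuple[float, int]]], k: int = 10):
    """Merge per-shard top-K lists of (score, doc_id) into a global top-K."""
    return heapq.nlargest(k, (hit for shard in shard_results for hit in shard))

shard_a = [(9.2, 101), (7.5, 102)]
shard_b = [(8.8, 205), (3.1, 206)]
print(merge_shard_results([shard_a, shard_b], k=3))
# [(9.2, 101), (8.8, 205), (7.5, 102)]
```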

Step 6: Ranking

Relevance Scoring

TF-IDF: Basic relevance metric.

  • Term Frequency: How often does this word appear in the document?
  • Inverse Document Frequency: How rare is this word across all documents?

The TF-IDF score multiplies the term frequency by the logarithm of the total document count divided by the number of documents containing that term. Rare terms contribute more to relevance than common terms.

BM25: Modern improvement over TF-IDF. Handles document length normalization and term frequency saturation.
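A sketch of both scores for a single query term; k1 = 1.2 and b = 0.75 are the conventional BM25 defaults, not values mandated by this design:

```python
import math

def tf_idf(tf: int, total_docs: int, docs_with_term: int) -> float:
    # Term frequency times log(total documents / documents containing the term).
    return tf * math.log(total_docs / docs_with_term)

def bm25_term(tf: int, doc_len: int, avg_doc_len: float,
              total_docs: int, docs_with_term: int,
              k1: float = 1.2, b: float = 0.75) -> float:
    # BM25: saturates term frequency and normalizes for document length.
    idf = math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# A rare term in a short document outscores a common term in a long one.
print(bm25_term(tf=3, doc_len=120, avg_doc_len=400, total_docs=1_000_000, docs_with_term=50))
print(bm25_term(tf=3, doc_len=900, avg_doc_len=400, total_docs=1_000_000, docs_with_term=200_000))
```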


PageRank: Pages linked to by important pages are themselves important. Run iteratively until scores converge.
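A minimal power-iteration sketch of PageRank; the 0.85 damping factor is the conventional choice and the four-page link graph is made up:

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iterations: int = 50):
    """links maps each page to the pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue   # dangling pages ignored in this sketch
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))   # "c" ends up with the highest score
```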

Two-Phase Ranking


Expensive ML models cannot run on millions of candidates. Filter first with cheap signals, then rank carefully.
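A sketch of the two phases; score_bm25 and ml_model are hypothetical placeholders for the cheap scorer and the expensive model, and the 1,000-candidate cutoff is an assumed value:

```python
def rank(query_terms: list[str], candidate_doc_ids: list[int],
         score_bm25, ml_model, phase1_cutoff: int = 1000, k: int = 10) -> list[int]:
    # Phase 1: cheap scoring over all candidates, keep only the best few hundred.
    cheap = sorted(candidate_doc_ids,
                   key=lambda d: score_bm25(query_terms, d),
                   reverse=True)[:phase1_cutoff]
    # Phase 2: the expensive ML model runs only on the survivors.
    expensive = sorted(cheap,
                       key=lambda d: ml_model.predict(query_terms, d),
                       reverse=True)
    return expensive[:k]
```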

Ranking Signals

| Signal Type | Examples |
|---|---|
| Query-document | BM25, title match, URL match |
| Document quality | PageRank, domain authority, spam score |
| Freshness | Last modified, crawl date |
| User engagement | Click-through rate, dwell time |
| Personalization | Location, search history, language |

Step 7: Serving Infrastructure

Tiered Caching


Result cache: Store complete results for popular queries. "weather" and "facebook" are asked millions of times.
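A sketch of a result cache with LRU eviction and a TTL so cached results do not go stale; the capacity and 60-second TTL are assumed values:

```python
import time
from collections import OrderedDict

class ResultCache:
    def __init__(self, capacity: int = 100_000, ttl_seconds: float = 60.0):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.entries: OrderedDict[str, tuple[float, list[int]]] = OrderedDict()

    def get(self, query: str):
        hit = self.entries.get(query)
        if hit is None:
            return None
        stored_at, results = hit
        if time.time() - stored_at > self.ttl:     # expired: fall through to the index
            del self.entries[query]
            return None
        self.entries.move_to_end(query)            # mark as recently used
        return results

    def put(self, query: str, results: list[int]) -> None:
        self.entries[query] = (time.time(), results)
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used
```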

Posting list cache: Keep posting lists for common terms in memory.

Geographic Distribution

Users in Tokyo should not wait for a round trip to California.


Full index replicas in each major region. Route users to nearest data center.

Step 8: Freshness

Not all content ages at the same rate: breaking news must be searchable within seconds, while static pages can wait days between crawls.

Tiered Crawling

| Tier | Content Type | Crawl Frequency | Index Latency |
|---|---|---|---|
| Real-time | Breaking news, social | Continuous | Seconds |
| Frequent | News sites, popular blogs | Hourly | Minutes |
| Normal | Most web pages | Daily-weekly | Hours |
| Archive | Old/static content | Monthly | Days |

Incremental Indexing

Do not rebuild the entire index when one page changes. Apply incremental updates:

  1. Crawl changed page
  2. Diff against previous version
  3. Update posting lists for changed terms
  4. Merge small updates into main index periodically
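A sketch of steps 2 and 3: diff the page's old and new term sets and touch only the posting lists that changed (posting lists are simplified to doc-id sets here):

```python
from collections import defaultdict

index: dict[str, set[int]] = defaultdict(set)   # term -> doc_ids (simplified postings)
indexed_terms: dict[int, set[str]] = {}         # doc_id -> terms currently indexed

def incremental_update(doc_id: int, new_text: str) -> None:
    new_terms = set(new_text.lower().split())
    old_terms = indexed_terms.get(doc_id, set())
    for term in old_terms - new_terms:           # term removed from the page
        index[term].discard(doc_id)
    for term in new_terms - old_terms:           # term added to the page
        index[term].add(doc_id)
    indexed_terms[doc_id] = new_terms            # unchanged terms are left untouched

incremental_update(7, "breaking news about search engines")
incremental_update(7, "updated news about search engines")   # only 'breaking'/'updated' change
```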

Production Examples

| System | Notable Design Choice |
|---|---|
| Google | Document-partitioned index, two-phase ranking, massive caching |
| Elasticsearch | Lucene-based, near real-time indexing, distributed by default |
| Apache Solr | Enterprise search, faceting, highlighting built-in |
| Bing | Heavy ML ranking, entity understanding |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
|---|---|---|
| Index partitioning | Document vs term partitioned | Document partitioned |
| Initial ranking | BM25, TF-IDF | BM25 (better length normalization) |
| Final ranking | Rules, ML model | Two-phase with ML |
| Caching | Result, posting, document | All three tiers |
| Freshness | Batch rebuild, incremental | Incremental with tiered crawling |