Design a Web Crawler
Design a web crawler that systematically browses the internet to index web pages for a search engine like Google or Bing.
Related Concepts: URL Frontier (Priority Queue), BFS vs. DFS Crawling, Politeness (robots.txt), URL Deduplication (Bloom Filter), Content Hashing (SimHash), Distributed Workers, DNS Caching, Checkpointing
Step 1: Requirements and Scope
Functional Requirements
- Given seed URLs, discover and download web pages
- Extract URLs from downloaded pages to find new pages
- Store page content for search engine indexing
- Handle different content types (HTML, images, PDFs)
- Respect robots.txt and crawl-delay directives
- Detect and handle duplicate content
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Scalability | 1 billion pages/month | Internet scale |
| Politeness | Max 1 req/sec per domain | Prevent overloading websites |
| Robustness | Handle any website | Malformed HTML, timeouts, loops |
| Freshness | Recrawl changed pages | Search results must be current |
| Extensibility | Add content types easily | PDFs, images, videos |
Scale Estimation
- 1 billion pages per month
- ~400 pages per second average
- Average page size: 500 KB
- Raw storage: 500 TB/month
- Compressed storage: ~100 TB/month
- Bandwidth: ~200 MB/s (1.6 Gbps)
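These figures follow from straightforward arithmetic. The sketch below reproduces them; the 5:1 compression ratio is an assumption implied by the 100 TB figure above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600              # ~2.6 million seconds
pages_per_month = 1_000_000_000
avg_page_kb = 500

pages_per_second = pages_per_month / SECONDS_PER_MONTH            # ~386 -> "~400"
raw_storage_tb = pages_per_month * avg_page_kb / 1_000_000_000    # 500 TB/month
compressed_tb = raw_storage_tb / 5                                # ~100 TB at an assumed 5:1 ratio
bandwidth_mb_s = pages_per_second * avg_page_kb / 1000            # ~193 MB/s -> "~200 MB/s" (~1.6 Gbps)

print(f"{pages_per_second:.0f} pages/s, {raw_storage_tb:.0f} TB raw, "
      f"{compressed_tb:.0f} TB compressed, {bandwidth_mb_s:.0f} MB/s")
```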
Step 2: High-Level Architecture
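At a high level, each worker repeatedly pulls a URL from the frontier, fetches and parses the page, filters duplicate content, stores the page for indexing, and feeds newly discovered links back into the frontier. Below is a minimal sketch of that loop, with illustrative placeholder names (`frontier`, `fetcher`, `parser`, `url_seen`, `content_store`) standing in for the subsystems detailed in the following steps.

```python
def crawl_loop(frontier, fetcher, parser, url_seen, content_store):
    """One worker's main loop; the components are hypothetical interfaces."""
    while True:
        url = frontier.next_url()             # respects priority and politeness (Step 3)
        page = fetcher.fetch(url)             # DNS lookup, robots.txt check, download
        if page is None:
            continue                          # timeout, non-200 status, disallowed, etc.
        if content_store.is_duplicate(page):  # exact or near-duplicate content (Step 5)
            continue
        content_store.save(url, page)         # hand off to the indexing pipeline
        for link in parser.extract_links(page):
            normalized = parser.normalize(link)       # URL normalization (Step 4)
            if not url_seen.contains(normalized):     # Bloom-filter style seen test
                url_seen.add(normalized)
                frontier.enqueue(normalized)
```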
Step 3: URL Frontier Design
The frontier is the scheduling component that determines which URLs to crawl and when.
Two-Layer Architecture
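In the classic two-layer design (for example, the Mercator crawler), priority-ordered front queues feed per-domain back queues that enforce politeness. The sketch below is a deliberately simplified, single-process interpretation: one priority heap in front, per-domain FIFOs gated by a politeness timer behind it.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class TwoLayerFrontier:
    """Front layer: a priority queue ordering URLs by score (higher first).
    Back layer: per-domain FIFOs, each gated by a politeness delay.
    A production frontier is disk-backed, sharded, and far more careful."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.front = []                          # (-priority, seq, url)
        self.back = defaultdict(deque)           # domain -> URLs awaiting fetch
        self.next_allowed = defaultdict(float)   # domain -> earliest polite fetch time
        self.seq = 0                             # tie-breaker for equal priorities

    def enqueue(self, url, priority):
        heapq.heappush(self.front, (-priority, self.seq, url))
        self.seq += 1

    def next_url(self):
        # Drain the front layer into per-domain back queues in priority order,
        # then serve a URL from any domain whose politeness window has opened.
        while self.front:
            _, _, url = heapq.heappop(self.front)
            self.back[urlsplit(url).netloc].append(url)
        now = time.time()
        for domain, queue in self.back.items():
            if queue and self.next_allowed[domain] <= now:
                self.next_allowed[domain] = now + self.delay
                return queue.popleft()
        return None                              # nothing is polite to fetch yet
```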
Priority Scoring
| Factor | Weight | Description |
|---|---|---|
| PageRank | 40% | Link popularity |
| Freshness | 25% | How often page changes |
| Domain authority | 20% | Trust of the domain |
| Depth from seed | 15% | How many hops from seed URL |
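A sketch of how these weights might combine into a single score; it assumes PageRank, freshness, and domain authority are already normalized to [0, 1], and normalizes depth against the 16-level cap used later for trap protection.

```python
def priority_score(pagerank, freshness, domain_authority, depth_from_seed,
                   max_depth=16):
    """Weighted sum of the factors above; higher scores are crawled sooner."""
    depth_score = 1.0 - min(depth_from_seed, max_depth) / max_depth
    return (0.40 * pagerank +
            0.25 * freshness +
            0.20 * domain_authority +
            0.15 * depth_score)
```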
Politeness Rules
| Rule | Implementation | Purpose |
|---|---|---|
| robots.txt | Parse and cache per domain | Respect site preferences |
| crawl-delay | Honor specified delay | Prevent server overload |
| Rate limiting | Max 1 req/sec per domain | Default politeness |
| Concurrent limit | Max 1 connection per host | Prevent connection exhaustion |
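A minimal per-domain politeness gate combining these rules, using Python's standard `urllib.robotparser`; error handling, cache expiry, and per-host connection limits are omitted.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

class PolitenessGate:
    """Caches robots.txt per domain, honors crawl-delay, and falls back to a
    default 1 req/sec limit. Sketch only; assumes robots.txt is served over HTTPS."""

    def __init__(self, user_agent="MyCrawler", default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.robots = {}        # domain -> parsed robots.txt (cached per domain)
        self.last_fetch = {}    # domain -> timestamp of the last request

    def _rules(self, domain):
        if domain not in self.robots:
            rp = robotparser.RobotFileParser(f"https://{domain}/robots.txt")
            rp.read()
            self.robots[domain] = rp
        return self.robots[domain]

    def allowed(self, url):
        return self._rules(urlsplit(url).netloc).can_fetch(self.user_agent, url)

    def wait_time(self, url):
        domain = urlsplit(url).netloc
        delay = self._rules(domain).crawl_delay(self.user_agent) or self.default_delay
        elapsed = time.time() - self.last_fetch.get(domain, 0.0)
        return max(0.0, delay - elapsed)

    def record_fetch(self, url):
        self.last_fetch[urlsplit(url).netloc] = time.time()
```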
Step 4: URL Deduplication
Bloom Filter Approach
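A compact in-memory Bloom filter for the URL-seen test, sized from the expected URL count and target false-positive rate; double hashing stands in for k independent hash functions.

```python
import hashlib
import math

class BloomFilter:
    """Bit-array Bloom filter: `url in bf` may rarely report a false positive,
    but never a false negative, so a URL added once is never re-crawled."""

    def __init__(self, expected_items, fp_rate=0.01):
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, url):
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```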
URL Normalization
| Before | After | Rule |
|---|---|---|
| HTTP://Example.COM/Page | example.com/page | Lowercase |
| example.com/page/ | example.com/page | Remove trailing slash |
| example.com/page#section | example.com/page | Remove fragment |
| example.com/page?b=2&a=1 | example.com/page?a=1&b=2 | Sort params |
| example.com/./page | example.com/page | Resolve paths |
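One way to implement these rules with the standard library. Note that only the scheme and host are case-insensitive by specification, so this sketch leaves the path's case alone even though the table's first example lowercases it.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    """Canonicalize a URL so that equivalent forms compare equal.
    Sketch: default ports, percent-encoding case, etc. are not handled."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                        # lowercase scheme
    host = parts.netloc.lower()                          # lowercase host
    path = posixpath.normpath(parts.path) if parts.path else "/"   # resolve ./ and ../
    if path != "/":
        path = path.rstrip("/")                          # remove trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))    # sort query parameters
    return urlunsplit((scheme, host, path, query, ""))   # drop the #fragment
```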
Bloom Filter Sizing
| URLs | False Positive Rate | Size |
|---|---|---|
| 1 billion | 1% | 1.2 GB |
| 1 billion | 0.1% | 1.8 GB |
| 10 billion | 1% | 12 GB |
A false positive means a never-seen URL is mistakenly treated as already crawled and skipped. At a 1% rate this small loss of coverage is an acceptable price for keeping the entire seen-URL set in memory.
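The sizes in the table follow from the standard Bloom filter formula m = -n·ln(p) / (ln 2)² bits, converted here to decimal gigabytes.

```python
import math

def bloom_size_gb(n_urls, fp_rate):
    """Bits required for n items at the given false-positive rate, in GB."""
    bits = -n_urls * math.log(fp_rate) / math.log(2) ** 2
    return bits / 8 / 1e9

print(bloom_size_gb(1e9, 0.01))    # ~1.2 GB
print(bloom_size_gb(1e9, 0.001))   # ~1.8 GB
print(bloom_size_gb(1e10, 0.01))   # ~12 GB
```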
Step 5: Content Deduplication
Different URLs can serve identical or near-identical content.
Detection Methods
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| MD5/SHA Hash | Exact content hash | Fast, accurate for exact duplicates | Misses near-duplicates |
| SimHash | Locality-sensitive hash | Finds similar pages | More computation |
| MinHash | Jaccard similarity estimate | Good for text similarity | Requires shingles |
| DOM fingerprint | Hash of DOM structure | Detects template pages | Content-agnostic |
SimHash for Near-Duplicate Detection
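A bare-bones SimHash sketch: every token votes on each of 64 bit positions, and the signs of the vote totals form the fingerprint. Real implementations weight tokens (for example by frequency) and hash shingles rather than single words; treating fingerprints within about 3 bits of each other as near-duplicates is a common but here assumed threshold.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: similar documents yield fingerprints that differ
    in only a few bit positions (small Hamming distance)."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(vector):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Example: pages whose fingerprints differ in <= 3 of 64 bits are treated as near-duplicates.
```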
Step 6: Handling Crawler Traps
Types of Traps
| Trap Type | Example | Detection | Mitigation |
|---|---|---|---|
| Infinite calendars | /calendar/2024/01/01, /2024/01/02, ... | URL pattern matching | Limit per pattern |
| Session IDs | /page?sid=abc123, /page?sid=def456 | Parameter detection | Remove session params |
| Soft 404s | 200 OK with "Page not found" content | Content analysis | Detect error patterns |
| Infinitely deep | /a/b/c/d/e/f/g/... | Path depth tracking | Max depth limit |
| URL parameters | /search?q=a, /search?q=b, ... | Parameter explosion | Limit unique params |
Protection Strategies
| Strategy | Limit | Rationale |
|---|---|---|
| Max URL length | 2048 chars | Reasonable URL size |
| Max path depth | 16 levels | Most real content is shallower |
| Max URLs per domain | 1 million | Prevent single-domain explosion |
| Max URLs per pattern | 1000 | Detect repetitive patterns |
| Timeout per page | 30 seconds | Prevent indefinite waits |
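A sketch of enforcing these limits before a URL is admitted to the frontier. The digit-collapsing pattern key is one simple, assumed heuristic for spotting repetitive URL families such as infinite calendars.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048
MAX_PATH_DEPTH = 16
MAX_URLS_PER_DOMAIN = 1_000_000
MAX_URLS_PER_PATTERN = 1_000

domain_counts = Counter()
pattern_counts = Counter()

def url_pattern(url):
    """Collapse digit runs so /calendar/2024/01/01 and /calendar/2024/01/02
    share one pattern key."""
    parts = urlsplit(url)
    return parts.netloc + re.sub(r"\d+", "N", parts.path)

def should_crawl(url):
    parts = urlsplit(url)
    depth = len([seg for seg in parts.path.split("/") if seg])
    pattern = url_pattern(url)
    if len(url) > MAX_URL_LENGTH or depth > MAX_PATH_DEPTH:
        return False
    if domain_counts[parts.netloc] >= MAX_URLS_PER_DOMAIN:
        return False
    if pattern_counts[pattern] >= MAX_URLS_PER_PATTERN:
        return False
    domain_counts[parts.netloc] += 1
    pattern_counts[pattern] += 1
    return True
```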
Step 7: Distributed Architecture
Worker Distribution
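Routing every URL of a domain to a fixed worker can be as simple as hashing the domain name, as sketched below. A plain modulo reassigns most domains whenever the worker count changes, so production systems typically use consistent or rendezvous hashing instead.

```python
import hashlib

def worker_for_domain(domain, num_workers):
    """Stable domain -> worker assignment: all URLs from one domain land on
    the same worker, keeping politeness state and robots.txt caches local."""
    digest = hashlib.sha256(domain.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```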
Domain Sharding Benefits
| Benefit | Explanation |
|---|---|
| Politeness | Each domain handled by one worker |
| Caching | robots.txt cached locally |
| Connection reuse | Keep-alive per domain |
| Rate limiting | Simple local counters |