Design a Web Crawler
Design a web crawler that systematically browses the internet to index web pages for a search engine like Google or Bing.
Related Concepts: URL Frontier (Priority Queue), BFS vs. DFS Crawling, Politeness (robots.txt), URL Deduplication (Bloom Filter), Content Hashing (SimHash), Distributed Workers, DNS Caching, Checkpointing
Step 1: Requirements and Scope
Functional Requirements
- Given seed URLs, discover and download web pages
- Extract URLs from downloaded pages to find new pages
- Store page content for search engine indexing
- Handle different content types (HTML, images, PDFs)
- Respect robots.txt and crawl-delay directives
- Detect and handle duplicate content
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Scalability | 1 billion pages/month | Internet scale |
| Politeness | Max 1 req/sec per domain | Prevent overloading websites |
| Robustness | Handle any website | Malformed HTML, timeouts, loops |
| Freshness | Recrawl changed pages | Search results must be current |
| Extensibility | Add content types easily | PDFs, images, videos |
Scale Estimation
- 1 billion pages per month
- ~400 pages per second average
- Average page size: 500 KB
- Raw storage: 500 TB/month
- Compressed storage: ~100 TB/month
- Bandwidth: ~200 MB/s (1.6 Gbps)
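These figures follow from straightforward arithmetic. The sketch below reproduces them; the 5:1 compression ratio is an assumption implied by the 100 TB figure above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600              # ~2.6 million seconds
pages_per_month = 1_000_000_000
avg_page_kb = 500

pages_per_second = pages_per_month / SECONDS_PER_MONTH            # ~386 -> "~400"
raw_storage_tb = pages_per_month * avg_page_kb / 1_000_000_000    # 500 TB/month
compressed_tb = raw_storage_tb / 5                                # ~100 TB at an assumed 5:1 ratio
bandwidth_mb_s = pages_per_second * avg_page_kb / 1000            # ~193 MB/s -> "~200 MB/s" (~1.6 Gbps)

print(f"{pages_per_second:.0f} pages/s, {raw_storage_tb:.0f} TB raw, "
      f"{compressed_tb:.0f} TB compressed, {bandwidth_mb_s:.0f} MB/s")
```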
Step 2: High-Level Architecture
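At a high level, each worker repeatedly pulls a URL from the frontier, fetches and parses the page, filters duplicate content, stores the page for indexing, and feeds newly discovered links back into the frontier. Below is a minimal sketch of that loop, with illustrative placeholder names (`frontier`, `fetcher`, `parser`, `url_seen`, `content_store`) standing in for the subsystems detailed in the following steps.

```python
def crawl_loop(frontier, fetcher, parser, url_seen, content_store):
    """One worker's main loop; the components are hypothetical interfaces."""
    while True:
        url = frontier.next_url()             # respects priority and politeness (Step 3)
        page = fetcher.fetch(url)             # DNS lookup, robots.txt check, download
        if page is None:
            continue                          # timeout, non-200 status, disallowed, etc.
        if content_store.is_duplicate(page):  # exact or near-duplicate content (Step 5)
            continue
        content_store.save(url, page)         # hand off to the indexing pipeline
        for link in parser.extract_links(page):
            normalized = parser.normalize(link)       # URL normalization (Step 4)
            if not url_seen.contains(normalized):     # Bloom-filter style seen test
                url_seen.add(normalized)
                frontier.enqueue(normalized)
```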
Step 3: URL Frontier Design
The frontier is the scheduling component that determines which URLs to crawl and when.
Two-Layer Architecture
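In the classic two-layer design (for example, the Mercator crawler), priority-ordered front queues feed per-domain back queues that enforce politeness. The sketch below is a deliberately simplified, single-process interpretation: one priority heap in front, per-domain FIFOs gated by a politeness timer behind it.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class TwoLayerFrontier:
    """Front layer: a priority queue ordering URLs by score (higher first).
    Back layer: per-domain FIFOs, each gated by a politeness delay.
    A production frontier is disk-backed, sharded, and far more careful."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.front = []                          # (-priority, seq, url)
        self.back = defaultdict(deque)           # domain -> URLs awaiting fetch
        self.next_allowed = defaultdict(float)   # domain -> earliest polite fetch time
        self.seq = 0                             # tie-breaker for equal priorities

    def enqueue(self, url, priority):
        heapq.heappush(self.front, (-priority, self.seq, url))
        self.seq += 1

    def next_url(self):
        # Drain the front layer into per-domain back queues in priority order,
        # then serve a URL from any domain whose politeness window has opened.
        while self.front:
            _, _, url = heapq.heappop(self.front)
            self.back[urlsplit(url).netloc].append(url)
        now = time.time()
        for domain, queue in self.back.items():
            if queue and self.next_allowed[domain] <= now:
                self.next_allowed[domain] = now + self.delay
                return queue.popleft()
        return None                              # nothing is polite to fetch yet
```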
Priority Scoring
| Factor | Weight | Description |
|---|---|---|
| PageRank | 40% | Link popularity |
| Freshness | 25% | How often page changes |
| Domain authority | 20% | Trust of the domain |
| Depth from seed | 15% | How many hops from seed URL |
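A sketch of how these weights might combine into a single score; it assumes PageRank, freshness, and domain authority are already normalized to [0, 1], and normalizes depth against the 16-level cap used later for trap protection.

```python
def priority_score(pagerank, freshness, domain_authority, depth_from_seed,
                   max_depth=16):
    """Weighted sum of the factors above; higher scores are crawled sooner."""
    depth_score = 1.0 - min(depth_from_seed, max_depth) / max_depth
    return (0.40 * pagerank +
            0.25 * freshness +
            0.20 * domain_authority +
            0.15 * depth_score)
```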
Politeness Rules
| Rule | Implementation | Purpose |
|---|---|---|
| robots.txt | Parse and cache per domain | Respect site preferences |
| crawl-delay | Honor specified delay | Prevent server overload |
| Rate limiting | Max 1 req/sec per domain | Default politeness |
| Concurrent limit | Max 1 connection per host | Prevent connection exhaustion |
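A minimal per-domain politeness gate combining these rules, using Python's standard `urllib.robotparser`; error handling, cache expiry, and per-host connection limits are omitted.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

class PolitenessGate:
    """Caches robots.txt per domain, honors crawl-delay, and falls back to a
    default 1 req/sec limit. Sketch only; assumes robots.txt is served over HTTPS."""

    def __init__(self, user_agent="MyCrawler", default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.robots = {}        # domain -> parsed robots.txt (cached per domain)
        self.last_fetch = {}    # domain -> timestamp of the last request

    def _rules(self, domain):
        if domain not in self.robots:
            rp = robotparser.RobotFileParser(f"https://{domain}/robots.txt")
            rp.read()
            self.robots[domain] = rp
        return self.robots[domain]

    def allowed(self, url):
        return self._rules(urlsplit(url).netloc).can_fetch(self.user_agent, url)

    def wait_time(self, url):
        domain = urlsplit(url).netloc
        delay = self._rules(domain).crawl_delay(self.user_agent) or self.default_delay
        elapsed = time.time() - self.last_fetch.get(domain, 0.0)
        return max(0.0, delay - elapsed)

    def record_fetch(self, url):
        self.last_fetch[urlsplit(url).netloc] = time.time()
```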
Step 4: URL Deduplication
Bloom Filter Approach
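A compact in-memory Bloom filter for the URL-seen test, sized from the expected URL count and target false-positive rate; double hashing stands in for k independent hash functions.

```python
import hashlib
import math

class BloomFilter:
    """Bit-array Bloom filter: `url in bf` may rarely report a false positive,
    but never a false negative, so a URL added once is never re-crawled."""

    def __init__(self, expected_items, fp_rate=0.01):
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, url):
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```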
URL Normalization
| Before | After | Rule |
|---|---|---|
| HTTP://Example.COM/Page | example.com/page | Lowercase |
| example.com/page/ | example.com/page | Remove trailing slash |
| example.com/page#section | example.com/page | Remove fragment |
| example.com/page?b=2&a=1 | example.com/page?a=1&b=2 | Sort params |
| example.com/./page | example.com/page | Resolve paths |
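One way to implement these rules with the standard library. Note that only the scheme and host are case-insensitive by specification, so this sketch leaves the path's case alone even though the table's first example lowercases it.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    """Canonicalize a URL so that equivalent forms compare equal.
    Sketch: default ports, percent-encoding case, etc. are not handled."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                        # lowercase scheme
    host = parts.netloc.lower()                          # lowercase host
    path = posixpath.normpath(parts.path) if parts.path else "/"   # resolve ./ and ../
    if path != "/":
        path = path.rstrip("/")                          # remove trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))    # sort query parameters
    return urlunsplit((scheme, host, path, query, ""))   # drop the #fragment
```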
Bloom Filter Sizing
| URLs | False Positive Rate | Size |
|---|---|---|
| 1 billion | 1% | 1.2 GB |
| 1 billion | 0.1% | 1.8 GB |
| 10 billion | 1% | 12 GB |
A false positive means a never-seen URL is mistakenly treated as already crawled and skipped. At a 1% rate this small loss of coverage is an acceptable price for keeping the entire seen-URL set in memory.
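The sizes in the table follow from the standard Bloom filter formula m = -n·ln(p) / (ln 2)² bits, converted here to decimal gigabytes.

```python
import math

def bloom_size_gb(n_urls, fp_rate):
    """Bits required for n items at the given false-positive rate, in GB."""
    bits = -n_urls * math.log(fp_rate) / math.log(2) ** 2
    return bits / 8 / 1e9

print(bloom_size_gb(1e9, 0.01))    # ~1.2 GB
print(bloom_size_gb(1e9, 0.001))   # ~1.8 GB
print(bloom_size_gb(1e10, 0.01))   # ~12 GB
```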
Step 5: Content Deduplication
Different URLs can serve identical or near-identical content.
Detection Methods
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| MD5/SHA Hash | Exact content hash | Fast, accurate for exact duplicates | Misses near-duplicates |
| SimHash | Locality-sensitive hash | Finds similar pages | More computation |
| MinHash | Jaccard similarity estimate | Good for text similarity | Requires shingles |
| DOM fingerprint | Hash of DOM structure | Detects template pages | Content-agnostic |
SimHash for Near-Duplicate Detection
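A bare-bones SimHash sketch: every token votes on each of 64 bit positions, and the signs of the vote totals form the fingerprint. Real implementations weight tokens (for example by frequency) and hash shingles rather than single words; treating fingerprints within about 3 bits of each other as near-duplicates is a common but here assumed threshold.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: similar documents yield fingerprints that differ
    in only a few bit positions (small Hamming distance)."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(vector):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Example: pages whose fingerprints differ in <= 3 of 64 bits are treated as near-duplicates.
```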
Step 6: Handling Crawler Traps
Types of Traps
| Trap Type | Example | Detection | Mitigation |
|---|---|---|---|
| Infinite calendars | /calendar/2024/01/01, /2024/01/02, ... | URL pattern matching | Limit per pattern |
| Session IDs | /page?sid=abc123, /page?sid=def456 | Parameter detection | Remove session params |
| Soft 404s | 200 OK with "Page not found" content | Content analysis | Detect error patterns |
| Infinitely deep | /a/b/c/d/e/f/g/... | Path depth tracking | Max depth limit |
| URL parameters | /search?q=a, /search?q=b, ... | Parameter explosion | Limit unique params |
Protection Strategies
| Strategy | Limit | Rationale |
|---|---|---|
| Max URL length | 2048 chars | Reasonable URL size |
| Max path depth | 16 levels | Most real content is shallower |
| Max URLs per domain | 1 million | Prevent single-domain explosion |
| Max URLs per pattern | 1000 | Detect repetitive patterns |
| Timeout per page | 30 seconds | Prevent indefinite waits |
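A sketch of enforcing these limits before a URL is admitted to the frontier. The digit-collapsing pattern key is one simple, assumed heuristic for spotting repetitive URL families such as infinite calendars.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048
MAX_PATH_DEPTH = 16
MAX_URLS_PER_DOMAIN = 1_000_000
MAX_URLS_PER_PATTERN = 1_000

domain_counts = Counter()
pattern_counts = Counter()

def url_pattern(url):
    """Collapse digit runs so /calendar/2024/01/01 and /calendar/2024/01/02
    share one pattern key."""
    parts = urlsplit(url)
    return parts.netloc + re.sub(r"\d+", "N", parts.path)

def should_crawl(url):
    parts = urlsplit(url)
    depth = len([seg for seg in parts.path.split("/") if seg])
    pattern = url_pattern(url)
    if len(url) > MAX_URL_LENGTH or depth > MAX_PATH_DEPTH:
        return False
    if domain_counts[parts.netloc] >= MAX_URLS_PER_DOMAIN:
        return False
    if pattern_counts[pattern] >= MAX_URLS_PER_PATTERN:
        return False
    domain_counts[parts.netloc] += 1
    pattern_counts[pattern] += 1
    return True
```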
Step 7: Distributed Architecture
Worker Distribution
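Routing every URL of a domain to a fixed worker can be as simple as hashing the domain name, as sketched below. A plain modulo reassigns most domains whenever the worker count changes, so production systems typically use consistent or rendezvous hashing instead.

```python
import hashlib

def worker_for_domain(domain, num_workers):
    """Stable domain -> worker assignment: all URLs from one domain land on
    the same worker, keeping politeness state and robots.txt caches local."""
    digest = hashlib.sha256(domain.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```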
Domain Sharding Benefits
| Benefit | Explanation |
|---|---|
| Politeness | Each domain handled by one worker |
| Caching | robots.txt cached locally |
| Connection reuse | Keep-alive per domain |
| Rate limiting | Simple local counters |