Design a web crawler that systematically browses the internet to index web pages for a search engine like Google or Bing.
Related Concepts: URL Frontier (Priority Queue), BFS vs. DFS Crawling, Politeness (robots.txt), URL Deduplication (Bloom Filter), Content Hashing (SimHash), Distributed Workers, DNS Caching, Checkpointing
- Given seed URLs, discover and download web pages
- Extract URLs from downloaded pages to find new pages
- Store page content for search engine indexing
- Handle different content types (HTML, images, PDFs)
- Respect robots.txt and crawl-delay directives
- Detect and handle duplicate content
| Requirement | Target | Rationale |
|---|---|---|
| Scalability | 1 billion pages/month | Internet scale |
| Politeness | Max 1 req/sec per domain | Prevent overloading websites |
| Robustness | Handle any website | Malformed HTML, timeouts, loops |
| Freshness | Recrawl changed pages | Search results must be current |
| Extensibility | Add content types easily | PDFs, images, videos |
- 1 billion pages per month
- ~400 pages per second average
- Average page size: 500 KB
- Raw storage: 500 TB/month
- Compressed storage: ~100 TB/month
- Bandwidth: ~200 MB/s (1.6 Gbps)
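The estimates above can be sanity-checked with a quick back-of-envelope calculation (assuming a 30-day month and the 500 KB average page size stated above):

```python
# Back-of-envelope check of the capacity targets (assumed values from above).
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600          # ~2.6M seconds in a 30-day month
AVG_PAGE_KB = 500

pages_per_sec = PAGES_PER_MONTH / SECONDS_PER_MONTH       # ~386, rounded to ~400
raw_tb_per_month = PAGES_PER_MONTH * AVG_PAGE_KB / 1e9    # KB -> TB
bandwidth_mb_s = pages_per_sec * AVG_PAGE_KB / 1000       # KB/s -> MB/s

print(f"{pages_per_sec:.0f} pages/s, {raw_tb_per_month:.0f} TB/month, "
      f"{bandwidth_mb_s:.0f} MB/s")
```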

The frontier is the scheduling component that determines which URLs to crawl and when.
| Factor | Weight | Description |
|---|---|---|
| PageRank | 40% | Link popularity |
| Freshness | 25% | How often page changes |
| Domain authority | 20% | Trust of the domain |
| Depth from seed | 15% | How many hops from seed URL |
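A priority function combining these weighted factors might be sketched like this (the weights come from the table above; the inputs and the `1 / (1 + depth)` depth score are illustrative assumptions, each normalized to [0, 1]):

```python
def priority_score(pagerank, freshness, authority, depth_score):
    """Weighted URL priority using the factor weights from the table.
    All inputs are assumed normalized to [0, 1]; depth_score should be
    higher for URLs closer to the seed, e.g. 1 / (1 + depth)."""
    return (0.40 * pagerank + 0.25 * freshness
            + 0.20 * authority + 0.15 * depth_score)

# Example: a popular, frequently-changing page two hops from the seed
score = priority_score(0.9, 0.8, 0.7, 1 / (1 + 2))
```

Higher-scoring URLs would be dequeued from the frontier first.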
| Rule | Implementation | Purpose |
|---|---|---|
| robots.txt | Parse and cache per domain | Respect site preferences |
| crawl-delay | Honor specified delay | Prevent server overload |
| Rate limiting | Max 1 req/sec per domain | Default politeness |
| Concurrent limit | Max 1 connection per host | Prevent connection exhaustion |
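One way to sketch the per-domain rate limiting described above is a small bookkeeping class that tracks the last request time per domain (a simplified, single-process sketch; `crawl-delay` overrides the 1 req/sec default where a site specifies one):

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Minimal politeness sketch: enforce a minimum delay between
    requests to the same domain (1 s default, per-domain overrides
    populated from each site's crawl-delay directive)."""
    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.delay = {}                       # domain -> crawl-delay override
        self.last_request = defaultdict(float)

    def wait_time(self, domain, now=None):
        """Seconds to wait before the next request to `domain` is polite."""
        now = time.monotonic() if now is None else now
        delay = self.delay.get(domain, self.default_delay)
        return max(0.0, self.last_request[domain] + delay - now)

    def record(self, domain, now=None):
        """Call after each request so the next one is spaced out."""
        self.last_request[domain] = time.monotonic() if now is None else now
```

A distributed crawler would keep these counters in the worker that owns the domain, so they stay simple local state.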
| Before | After | Rule |
|---|---|---|
| HTTP://Example.COM/Page | example.com/page | Lowercase |
| example.com/page/ | example.com/page | Remove trailing slash |
| example.com/page#section | example.com/page | Remove fragment |
| example.com/page?b=2&a=1 | example.com/page?a=1&b=2 | Sort params |
| example.com/./page | example.com/page | Resolve paths |
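The rules above can be combined into a single normalization function using the standard library (a sketch, not an exhaustive canonicalizer; unlike the table's shorthand, it keeps the scheme in the output):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    """Apply the normalization rules from the table above (sketch)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()                                   # lowercase host
    path = posixpath.normpath(parts.path) if parts.path else "/"  # resolve ./ ../
    path = path.rstrip("/") or "/"                                # trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))             # sort params
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))  # no fragment

print(normalize("HTTP://Example.COM/./page?b=2&a=1#section"))
# → http://example.com/page?a=1&b=2
```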
| URLs | False Positive Rate | Size |
|---|---|---|
| 1 billion | 1% | 1.2 GB |
| 1 billion | 0.1% | 1.8 GB |
| 10 billion | 1% | 12 GB |
A false positive means the crawler wrongly skips a URL it has never actually seen, so a small fraction of pages is lost, but the memory savings justify this trade-off.
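The sizes in the table follow from the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n elements at false-positive rate p, which is easy to verify:

```python
import math

def bloom_size_gb(n, p):
    """Optimal Bloom filter size in GB: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n * math.log(p) / (math.log(2) ** 2)
    return bits / 8 / 1e9   # bits -> bytes -> GB

print(f"{bloom_size_gb(1e9, 0.01):.1f} GB")   # → 1.2 GB (1B URLs, 1% FP)
```

Plugging in the other rows of the table gives ~1.8 GB for 0.1% and ~12 GB for 10 billion URLs at 1%.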
Step 5: Content Deduplication
Different URLs can serve identical or near-identical content.
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| MD5/SHA Hash | Exact content hash | Fast, accurate for exact duplicates | Misses near-duplicates |
| SimHash | Locality-sensitive hash | Finds similar pages | More computation |
| MinHash | Jaccard similarity estimate | Good for text similarity | Requires shingles |
| DOM fingerprint | Hash of DOM structure | Detects template pages | Content-agnostic |
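A minimal SimHash sketch illustrates the idea behind near-duplicate detection: hash each token, let each token vote on every bit position, and compare fingerprints by Hamming distance (this toy version splits on whitespace; a real implementation would use word shingles and weights):

```python
import hashlib

def simhash(text, bits=64):
    """Toy SimHash: hash each token, sum per-bit votes, keep the sign."""
    votes = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate texts produce fingerprints that differ in only a few bits,
# so pages within a small Hamming distance can be flagged as duplicates.
a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
```

Exact hashing (MD5/SHA) would treat these two texts as completely unrelated; SimHash is what lets the crawler catch the near-duplicate case.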
| Trap Type | Example | Detection | Mitigation |
|---|---|---|---|
| Infinite calendars | /calendar/2024/01/01, /2024/01/02, ... | URL pattern matching | Limit per pattern |
| Session IDs | /page?sid=abc123, /page?sid=def456 | Parameter detection | Remove session params |
| Soft 404s | 200 OK with "Page not found" content | Content analysis | Detect error patterns |
| Infinitely deep | /a/b/c/d/e/f/g/... | Path depth tracking | Max depth limit |
| URL parameters | /search?q=a, /search?q=b, ... | Parameter explosion | Limit unique params |
| Strategy | Limit | Rationale |
|---|---|---|
| Max URL length | 2048 chars | Reasonable URL size |
| Max path depth | 16 levels | Most real content is shallower |
| Max URLs per domain | 1 million | Prevent single-domain explosion |
| Max URLs per pattern | 1000 | Detect repetitive patterns |
| Timeout per page | 30 seconds | Prevent indefinite waits |
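The cheapest of these limits can be applied as a stateless pre-filter before a URL ever enters the frontier (a sketch; the per-domain and per-pattern counters from the table would need shared state across workers):

```python
from urllib.parse import urlsplit

MAX_URL_LEN = 2048   # limits from the table above
MAX_DEPTH = 16

def looks_like_trap(url):
    """Stateless trap pre-filter: reject over-long URLs and
    suspiciously deep paths before they reach the frontier."""
    if len(url) > MAX_URL_LEN:
        return True
    path = urlsplit(url).path
    depth = len([seg for seg in path.split("/") if seg])
    return depth > MAX_DEPTH

print(looks_like_trap("http://example.com/" + "a/" * 20))  # → True
```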
Domain Sharding Benefits
| Benefit | Explanation |
|---|---|
| Politeness | Each domain handled by one worker |
| Caching | robots.txt cached locally |
| Connection reuse | Keep-alive per domain |
| Rate limiting | Simple local counters |
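Domain sharding reduces to a stable hash of the domain name (a sketch using simple modulo; production systems often prefer consistent hashing so that adding a worker reassigns only a fraction of domains):

```python
import hashlib

def worker_for(domain, num_workers):
    """Stable domain -> worker assignment so a single worker owns all
    of a domain's URLs, robots.txt cache, and rate-limit counters."""
    h = int.from_bytes(hashlib.md5(domain.encode()).digest()[:8], "big")
    return h % num_workers
```

Because Python's `hash()` is randomized per process, a cryptographic hash (or any process-independent hash) is needed so every node in the cluster agrees on the assignment.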
DNS lookups (10-200ms each) become a bottleneck at scale.
| Strategy | Latency Reduction | Implementation |
|---|---|---|
| Local DNS cache | 90%+ cache hit | TTL-based caching |
| Prefetching | Hide latency | Resolve URLs in frontier ahead of time |
| Custom DNS servers | Lower latency | Deploy close to crawlers |
| Batch resolution | Amortize overhead | Resolve multiple domains together |
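A local DNS cache can be sketched as a thin wrapper around the resolver (simplified: it uses one fixed TTL rather than honoring per-record TTLs, which would require a real resolver library):

```python
import socket
import time

class DNSCache:
    """TTL-based local DNS cache (sketch with a single fixed TTL)."""
    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self._cache = {}   # domain -> (ip, expires_at)

    def resolve(self, domain):
        entry = self._cache.get(domain)
        if entry and entry[1] > time.monotonic():
            return entry[0]                    # cache hit: no network round-trip
        ip = socket.gethostbyname(domain)      # blocking lookup on miss
        self._cache[domain] = (ip, time.monotonic() + self.ttl)
        return ip
```

With a 90%+ hit rate, the 10-200 ms lookup cost is paid on only a small fraction of requests.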
| Strategy | How It Works | Best For |
|---|---|---|
| Uniform | Recrawl all pages on fixed schedule | Simple, comprehensive |
| Adaptive | Crawl frequently-changing pages more often | News sites, dynamic content |
| Priority-based | Important pages first | Limited resources |
| On-demand | Triggered by signals (social, notifications) | Breaking news |
| Page Type | Typical Change Frequency | Recrawl Interval |
|---|---|---|
| News homepage | Minutes | 5-15 minutes |
| Blog posts | Days | 1-7 days |
| Product pages | Weekly | 1-4 weeks |
| Static pages | Rarely | Monthly |
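An adaptive scheduler can converge on these intervals automatically: shrink the interval when a recrawl finds the page changed, back off when it did not (a sketch; the halving/1.5x factors and the 5-minute floor are illustrative choices, not fixed constants):

```python
def next_interval(current, changed, min_i=300.0, max_i=30 * 86400.0):
    """Adaptive recrawl sketch: halve the interval when the page changed
    since the last crawl, back off (x1.5) when it did not; clamp to
    [5 minutes, 30 days]."""
    interval = current / 2 if changed else current * 1.5
    return max(min_i, min(max_i, interval))

# A page that keeps changing converges toward the 5-minute floor
i = 3600.0
for _ in range(5):
    i = next_interval(i, changed=True)
print(i)  # → 300.0
```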
Content Storage
| Storage | Use Case | Characteristics |
|---|---|---|
| S3/GCS | Raw page content | Cheap, durable, slow access |
| HDFS | Batch processing | Good for MapReduce, Spark |
| Elasticsearch | Searchable content | Indexed, fast queries |
| Field | Type | Purpose |
|---|---|---|
| url_hash | BINARY(16) | Primary key (MD5 of URL) |
| url | TEXT | Original URL |
| content_hash | BINARY(16) | Deduplication |
| last_crawled | TIMESTAMP | Freshness tracking |
| next_crawl | TIMESTAMP | Scheduling |
| status_code | INT | Last response code |
| change_count | INT | Historical changes |
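The schema above might be prototyped like this (shown against SQLite for a self-contained example, so BINARY and TIMESTAMP are approximated by BLOB and epoch-seconds INTEGER columns):

```python
import sqlite3

# Sketch of the crawl-metadata table; "crawl_metadata" is an assumed name.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE crawl_metadata (
        url_hash     BLOB PRIMARY KEY,   -- MD5 of normalized URL
        url          TEXT NOT NULL,      -- original URL
        content_hash BLOB,               -- for content deduplication
        last_crawled INTEGER,            -- epoch seconds, freshness tracking
        next_crawl   INTEGER,            -- epoch seconds, scheduling
        status_code  INTEGER,            -- last HTTP response code
        change_count INTEGER DEFAULT 0   -- historical change count
    )
""")
```

Queries on `next_crawl` drive the scheduler, so that column would carry an index in practice.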
| Crawler | Scale | Notable Features |
|---|---|---|
| Googlebot | Billions/day | ML-based scheduling |
| Bingbot | Billions/day | Strong on fresh content |
| Common Crawl | 3B pages/month | Open dataset, research-focused |
| Scrapy | Open source | Python-based, extensible |
| Apache Nutch | Open source | Hadoop-based, distributed |
| Decision | Options | Recommendation |
|---|---|---|
| URL deduplication | Database, Bloom filter | Bloom filter for memory efficiency |
| Politeness | Global rate limit, Per-domain queues | Per-domain queues with crawl-delay |
| Content dedup | Hash, SimHash, MinHash | SimHash for near-duplicates |
| Distribution | Centralized, Domain sharding | Domain sharding for simplicity |
| Freshness | Fixed, Adaptive | Adaptive based on change frequency |
| Storage | Database, Object store | Object store (S3) with metadata DB |