Design a Web Crawler

Design a web crawler that systematically browses the internet to index web pages for a search engine like Google or Bing.

Related Concepts: URL Frontier (Priority Queue), BFS vs. DFS Crawling, Politeness (robots.txt), URL Deduplication (Bloom Filter), Content Hashing (SimHash), Distributed Workers, DNS Caching, Checkpointing

Step 1: Requirements and Scope

Functional Requirements

  • Given seed URLs, discover and download web pages
  • Extract URLs from downloaded pages to find new pages
  • Store page content for search engine indexing
  • Handle different content types (HTML, images, PDFs)
  • Respect robots.txt and crawl-delay directives
  • Detect and handle duplicate content

Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Scalability | 1 billion pages/month | Internet scale |
| Politeness | Max 1 req/sec per domain | Prevent overloading websites |
| Robustness | Handle any website | Malformed HTML, timeouts, loops |
| Freshness | Recrawl changed pages | Search results must be current |
| Extensibility | Add content types easily | PDFs, images, videos |

Scale Estimation

  • 1 billion pages per month
  • ~400 pages per second average
  • Average page size: 500 KB
  • Raw storage: 500 TB/month
  • Compressed storage: ~100 TB/month
  • Bandwidth: ~200 MB/s (1.6 Gbps)
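
These figures follow from simple arithmetic; a quick back-of-envelope check in Python (the numbers above are rounded):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600                        # ~2.6 million seconds

pages_per_month = 1_000_000_000
pages_per_second = pages_per_month / SECONDS_PER_MONTH    # ~386, rounded to ~400

avg_page_kb = 500
raw_storage_tb = pages_per_month * avg_page_kb / 1e9      # 500 TB/month raw
compressed_tb = raw_storage_tb * 0.2                      # ~5:1 compression -> ~100 TB

bandwidth_mb_s = pages_per_second * avg_page_kb / 1000    # ~193 MB/s (~1.6 Gbps)
```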

Step 2: High-Level Architecture

Seed URLs enter the URL frontier; downloader workers fetch pages (after DNS resolution and robots.txt checks), a parser extracts content and links, deduplication filters drop already-seen URLs and duplicate content, page content is written to storage, and newly discovered URLs flow back into the frontier.
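
In outline, each worker runs a fetch-parse-dedup-store loop. Below is a minimal sketch of that loop; the component objects (frontier, fetcher, parser, the two dedup stores, storage) are hypothetical interfaces standing in for the subsystems designed in the following steps:

```python
def crawl_loop(frontier, fetcher, parser, url_seen, content_seen, storage):
    """One worker's main loop: fetch, dedup, store, and feed new URLs back."""
    while not frontier.empty():
        url = frontier.next_url()                  # respects priority + politeness
        page = fetcher.download(url)               # honors robots.txt, timeouts
        if page is None:
            continue                               # fetch failed; retry handled by frontier
        if content_seen.is_duplicate(page.body):
            continue                               # near-duplicate content, skip storage
        storage.save(url, page.body)
        for link in parser.extract_links(page):
            normalized = parser.normalize(link)
            if not url_seen.contains(normalized):  # Bloom-filter membership check
                url_seen.add(normalized)
                frontier.enqueue(normalized)
```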

Step 3: URL Frontier Design

The frontier is the scheduling component that determines which URLs to crawl and when.

Two-Layer Architecture

Front queues order URLs by priority; back queues enforce per-domain politeness, with one back queue per host so that requests to the same domain are naturally serialized.

Priority Scoring

| Factor | Weight | Description |
| --- | --- | --- |
| PageRank | 40% | Link popularity |
| Freshness | 25% | How often page changes |
| Domain authority | 20% | Trust of the domain |
| Depth from seed | 15% | How many hops from seed URL |
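
One way to combine these factors is a plain weighted sum; a minimal sketch, assuming each signal has already been normalized to the range [0, 1] upstream:

```python
WEIGHTS = {
    "pagerank": 0.40,
    "freshness": 0.25,
    "domain_authority": 0.20,
    "seed_depth": 0.15,
}

def priority_score(signals: dict[str, float]) -> float:
    """Weighted sum of normalized signals; higher scores are crawled sooner."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

# Example: popular, frequently changing page close to a seed
score = priority_score({"pagerank": 0.9, "freshness": 0.8,
                        "domain_authority": 0.7, "seed_depth": 1.0})
```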

Politeness Rules

| Rule | Implementation | Purpose |
| --- | --- | --- |
| robots.txt | Parse and cache per domain | Respect site preferences |
| crawl-delay | Honor specified delay | Prevent server overload |
| Rate limiting | Max 1 req/sec per domain | Default politeness |
| Concurrent limit | Max 1 connection per host | Prevent connection exhaustion |
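
The per-domain rate limit can be enforced by tracking the earliest allowed fetch time for each host. A minimal sketch; the crawl_delay argument is assumed to come from the cached robots.txt:

```python
import time
from collections import defaultdict

class PolitenessGate:
    """Enforces a minimum delay between requests to the same domain."""

    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay          # max 1 req/sec per domain
        self.next_allowed = defaultdict(float)      # domain -> earliest fetch time

    def wait_turn(self, domain: str, crawl_delay: float | None = None):
        delay = crawl_delay or self.default_delay   # robots.txt crawl-delay wins
        now = time.monotonic()
        if now < self.next_allowed[domain]:
            time.sleep(self.next_allowed[domain] - now)
        self.next_allowed[domain] = time.monotonic() + delay
```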

Step 4: URL Deduplication

Bloom Filter Approach

Each extracted URL is normalized, then tested against an in-memory Bloom filter of previously seen URLs; only URLs the filter reports as unseen are added to the frontier.

URL Normalization

| Before | After | Rule |
| --- | --- | --- |
| HTTP://Example.COM/Page | example.com/page | Lowercase |
| example.com/page/ | example.com/page | Remove trailing slash |
| example.com/page#section | example.com/page | Remove fragment |
| example.com/page?b=2&a=1 | example.com/page?a=1&b=2 | Sort params |
| example.com/./page | example.com/page | Resolve paths |
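
These rules map directly onto a normalization routine; a sketch using only the Python standard library (note: it lowercases scheme and host only, while the table above lowercases the full URL for illustration, since paths are case-sensitive in general):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
import posixpath

def normalize_url(url: str) -> str:
    """Apply the normalization rules so equivalent URLs compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()                                    # lowercase host
    path = posixpath.normpath(parts.path) if parts.path else "/"   # resolve ./ and ../
    path = path.rstrip("/") or "/"                                 # drop trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))              # sort params
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))  # drop fragment
```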

Bloom Filter Sizing

| URLs | False Positive Rate | Size |
| --- | --- | --- |
| 1 billion | 1% | 1.2 GB |
| 1 billion | 0.1% | 1.8 GB |
| 10 billion | 1% | 12 GB |

A false positive causes the crawler to treat a genuinely new URL as already seen and skip it; at a 1% rate this loses only a small fraction of pages, and the memory savings justify the trade-off.
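
The sizes above follow from the standard Bloom filter sizing formula m = -n ln(p) / (ln 2)^2 bits, with the optimal number of hash functions k = (m/n) ln 2. A quick check:

```python
import math

def bloom_size_gb(n: int, p: float) -> float:
    """Optimal Bloom filter size in GB for n items at false-positive rate p."""
    bits = -n * math.log(p) / (math.log(2) ** 2)
    return bits / 8 / 1e9                          # bits -> bytes -> GB

print(bloom_size_gb(1_000_000_000, 0.01))          # ~1.2 GB
print(bloom_size_gb(1_000_000_000, 0.001))         # ~1.8 GB
print(bloom_size_gb(10_000_000_000, 0.01))         # ~12 GB
```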

Step 5: Content Deduplication

Different URLs can serve identical or near-identical content.

Detection Methods

| Method | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| MD5/SHA hash | Exact content hash | Fast, accurate for exact duplicates | Misses near-duplicates |
| SimHash | Locality-sensitive hash | Finds similar pages | More computation |
| MinHash | Jaccard similarity estimate | Good for text similarity | Requires shingles |
| DOM fingerprint | Hash of DOM structure | Detects template pages | Content-agnostic |

SimHash for Near-Duplicate Detection

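SimHash maps each page to a 64-bit fingerprint such that similar pages differ in only a few bits; pages within a small Hamming distance (commonly ~3 bits) are treated as near-duplicates. A minimal sketch; production systems typically hash weighted shingles rather than raw whitespace-split tokens:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Similar documents yield fingerprints with small Hamming distance."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1   # vote per bit position
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Two pages are near-duplicates if hamming_distance(fp1, fp2) <= 3
```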

Step 6: Handling Crawler Traps

Types of Traps

| Trap Type | Example | Detection | Mitigation |
| --- | --- | --- | --- |
| Infinite calendars | /calendar/2024/01/01, /2024/01/02, ... | URL pattern matching | Limit per pattern |
| Session IDs | /page?sid=abc123, /page?sid=def456 | Parameter detection | Remove session params |
| Soft 404s | 200 OK with "Page not found" content | Content analysis | Detect error patterns |
| Infinitely deep paths | /a/b/c/d/e/f/g/... | Path depth tracking | Max depth limit |
| URL parameters | /search?q=a, /search?q=b, ... | Parameter explosion | Limit unique params |

Protection Strategies

| Strategy | Limit | Rationale |
| --- | --- | --- |
| Max URL length | 2048 chars | Reasonable URL size |
| Max path depth | 16 levels | Most real content is shallower |
| Max URLs per domain | 1 million | Prevent single-domain explosion |
| Max URLs per pattern | 1000 | Detect repetitive patterns |
| Timeout per page | 30 seconds | Prevent indefinite waits |
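
These limits compose into an admission check applied before a URL enters the frontier. A minimal sketch; the per-domain counter shown here would live in the frontier's shared state, and per-pattern counting is omitted:

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048
MAX_PATH_DEPTH = 16
MAX_URLS_PER_DOMAIN = 1_000_000

domain_counts: Counter = Counter()

def admit_url(url: str) -> bool:
    """Reject URLs that look like crawler traps before they reach the frontier."""
    if len(url) > MAX_URL_LENGTH:
        return False
    parts = urlsplit(url)
    depth = len([seg for seg in parts.path.split("/") if seg])
    if depth > MAX_PATH_DEPTH:
        return False
    if domain_counts[parts.netloc] >= MAX_URLS_PER_DOMAIN:
        return False
    domain_counts[parts.netloc] += 1
    return True
```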

Step 7: Distributed Architecture

Worker Distribution

URLs are partitioned across crawler workers by hashing their domain, so every URL for a given host is handled by the same worker.

Domain Sharding Benefits

| Benefit | Explanation |
| --- | --- |
| Politeness | Each domain handled by one worker |
| Caching | robots.txt cached locally |
| Connection reuse | Keep-alive per domain |
| Rate limiting | Simple local counters |
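
The routing rule is a stable hash of the domain modulo the worker count; a minimal sketch (a production system would more likely use consistent hashing so that adding workers remaps few domains):

```python
import hashlib
from urllib.parse import urlsplit

def worker_for_url(url: str, num_workers: int) -> int:
    """All URLs for a domain map to the same worker, keeping politeness local."""
    domain = urlsplit(url).netloc.lower()
    digest = hashlib.md5(domain.encode()).digest()   # stable across processes
    return int.from_bytes(digest[:8], "big") % num_workers

assert worker_for_url("https://example.com/a", 64) == \
       worker_for_url("https://example.com/b", 64)
```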

Step 8: DNS Resolution

DNS lookups (10-200 ms each) become a bottleneck at scale.

Optimization Strategies

| Strategy | Latency Reduction | Implementation |
| --- | --- | --- |
| Local DNS cache | 90%+ cache hit | TTL-based caching |
| Prefetching | Hide latency | Resolve URLs in frontier ahead of time |
| Custom DNS servers | Lower latency | Deploy close to crawlers |
| Batch resolution | Amortize overhead | Resolve multiple domains together |
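
A local TTL-based cache is the simplest of these; a minimal sketch, assuming a fixed TTL rather than the record's actual TTL (which a real resolver library would expose):

```python
import socket
import time

class DNSCache:
    """Caches resolved addresses for a fixed TTL to avoid repeated lookups."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.cache: dict[str, tuple[str, float]] = {}   # host -> (ip, expiry)

    def resolve(self, host: str) -> str:
        entry = self.cache.get(host)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]                             # cache hit
        ip = socket.gethostbyname(host)                 # blocking lookup on miss
        self.cache[host] = (ip, now + self.ttl)
        return ip
```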

Step 9: Freshness and Recrawling

Recrawl Strategies

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Uniform | Recrawl all pages on a fixed schedule | Simple, comprehensive |
| Adaptive | Crawl frequently changing pages more often | News sites, dynamic content |
| Priority-based | Important pages first | Limited resources |
| On-demand | Triggered by signals (social, notifications) | Breaking news |

Estimating Change Frequency

Change frequency is estimated from the history of observed changes across successive crawls (the change_count field in the metadata store), and the recrawl interval is set accordingly:

| Page Type | Typical Change Frequency | Recrawl Interval |
| --- | --- | --- |
| News homepage | Minutes | 5-15 minutes |
| Blog posts | Days | 1-7 days |
| Product pages | Weekly | 1-4 weeks |
| Static pages | Rarely | Monthly |
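
Adaptive recrawling can be approximated with multiplicative interval adjustment: shrink the interval when a change is observed, grow it when the page is unchanged. A minimal sketch; the factors and bounds are illustrative:

```python
MIN_INTERVAL = 5 * 60           # 5 minutes (news homepages)
MAX_INTERVAL = 30 * 24 * 3600   # ~1 month (static pages)

def next_interval(current: float, changed: bool) -> float:
    """Recrawl sooner if the page changed since the last visit, later if not."""
    factor = 0.5 if changed else 1.5
    return min(max(current * factor, MIN_INTERVAL), MAX_INTERVAL)

# Usage: interval = next_interval(interval, content_hash != previous_hash)
```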

Step 10: Storage Design

Content Storage

| Storage | Use Case | Characteristics |
| --- | --- | --- |
| S3/GCS | Raw page content | Cheap, durable, slow access |
| HDFS | Batch processing | Good for MapReduce, Spark |
| Elasticsearch | Searchable content | Indexed, fast queries |

Metadata Storage

| Field | Type | Purpose |
| --- | --- | --- |
| url_hash | BINARY(16) | Primary key (MD5 of URL) |
| url | TEXT | Original URL |
| content_hash | BINARY(16) | Deduplication |
| last_crawled | TIMESTAMP | Freshness tracking |
| next_crawl | TIMESTAMP | Scheduling |
| status_code | INT | Last response code |
| change_count | INT | Historical changes |

Real-World Systems

| Crawler | Scale | Notable Features |
| --- | --- | --- |
| Googlebot | Billions/day | ML-based scheduling |
| Bingbot | Billions/day | Strong on fresh content |
| Common Crawl | 3B pages/month | Open dataset, research-focused |
| Scrapy | Open source | Python-based, extensible |
| Apache Nutch | Open source | Hadoop-based, distributed |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
| --- | --- | --- |
| URL deduplication | Database, Bloom filter | Bloom filter for memory efficiency |
| Politeness | Global rate limit, per-domain queues | Per-domain queues with crawl-delay |
| Content dedup | Hash, SimHash, MinHash | SimHash for near-duplicates |
| Distribution | Centralized, domain sharding | Domain sharding for simplicity |
| Freshness | Fixed, adaptive | Adaptive based on change frequency |
| Storage | Database, object store | Object store (S3) with metadata DB |