Skip to main content

Design Twitter Search

Design a search system that allows users to search for tweets in real-time.

Related Concepts: Inverted Index, Full-Text Search, Real-Time Indexing, Elasticsearch/Lucene, Ranking Algorithm, Pagination (Cursor-Based), Sharding by Time, Result Diversity

Step 1: Requirements and Scope

Functional Requirements

  • Search tweets by keywords
  • Return results ranked by relevance and recency
  • Support filters (date range, from user, has media, etc.)
  • Handle hashtag and mention searches
  • Support phrase and boolean queries
  • Real-time indexing (new tweets searchable within seconds)

Non-Functional Requirements

RequirementTargetRationale
LatencyUnder 200ms p99Users expect instant results
FreshnessUnder 10 secondsBreaking news must be searchable
Availability99.99%Core feature
Scale500K queries/secondViral events spike traffic

Scale Estimation

  • 500 million tweets per day (6,000/second)
  • Average tweet: 280 chars + metadata = ~1KB
  • 30 days of searchable tweets = 15 billion tweets = 15 TB
  • 500,000 search queries per second at peak

Step 2: High-Level Design

Loading diagram...

Step 3: Core Design Decisions

Decision 1: Index Architecture

ApproachHow It WorksProsCons
Single Elasticsearch clusterOne cluster handles everythingSimpleScaling limits
Time-partitioned shardsEach shard covers a time windowEfficient queries, easy archivalCross-shard queries
Per-user shardingUser's tweets on one shardFast "from:user" queriesUnbalanced (celebrities)

Recommendation: Time-partitioned with "hot" shard for recent tweets.

Loading diagram...

Decision 2: Inverted Index Structure

For tweet: "I love coffee #MondayMotivation"

TermPosting List
"love"[tweet_123, tweet_456, tweet_789, ...]
"coffee"[tweet_123, tweet_234, ...]
"#mondaymotivation"[tweet_123, tweet_567, ...]
@username[tweet_890, ...]

Each posting entry contains:

  • Tweet ID - Unique identifier
  • Term frequency - How often term appears
  • Position in tweet - For phrase queries
  • Timestamp - For recency ranking

Decision 3: Ranking Algorithm

Ranking combines relevance and recency. The final score is a weighted sum of four components:

  • Relevance: TF-IDF or BM25 score measuring how well the tweet matches query terms
  • Recency: A decay function that decreases score based on tweet age
  • Engagement: Normalized count of likes, retweets, and replies
  • Authority: The author's follower count and verification status
FactorWeightRationale
Relevance0.4Must match query terms
Recency0.3Fresh content is important
Engagement0.2Social proof of quality
Authority0.1Verified/popular users

Decision 4: Real-Time Indexing

To achieve sub-10-second freshness:

Loading diagram...

Latency: Tweet created to searchable = 2-5 seconds

Step 4: Query Processing

Query Parsing

When a user types a query like coffee shop from:starbucks since:2024-01-01, the search system parses it into structured components:

  • Must clauses: Required matches, such as text containing "coffee shop"
  • Filter clauses: Exact constraints that narrow results without affecting relevance score, such as author equals "starbucks" and timestamp on or after January 1, 2024
  • Should clauses: Optional matches that boost relevance if present, such as hashtags containing "coffee"

Search Flow

Loading diagram...

Step 5: Handling Edge Cases

Viral Events (Spike Traffic)

When a major event happens (Super Bowl, breaking news):

  • Search volume can spike 100x
  • Same queries repeated millions of times

Solution: Query caching

Loading diagram...

Deleted/Hidden Tweets

Tweets get deleted, accounts get suspended:

ApproachLatencyComplexity
Re-index (remove from index)MinutesSimple
Bloom filter at query timeMillisecondsMedium
Mark as deleted, filter in resultsImmediateLow

Recommendation: Mark-and-filter for speed, periodic re-index for cleanup.

Step 6: Typeahead/Autocomplete

For search suggestions as users type:

Loading diagram...

Latency target: Under 50ms

Real-World Systems

SystemNotable Design Choice
Twitter/XReal-time indexing, engagement-based ranking
ElasticsearchInverted index, BM25 scoring
AlgoliaEdge search, typo tolerance
MeilisearchRust-based, instant search

Summary

DecisionRecommendationRationale
Index structureTime-partitioned shardsEfficient queries, easy archival
FreshnessReal-time with buffered writes2-5 second indexing latency
RankingBM25 + recency decay + engagementBalances relevance and freshness
Hot trafficQuery caching (30s TTL)Handles viral events
TypeaheadParallel sources + personalizationFast, relevant suggestions