Design Twitter Search
Design a search system that allows users to search for tweets in real-time.
Related Concepts: Inverted Index, Full-Text Search, Real-Time Indexing, Elasticsearch/Lucene, Ranking Algorithm, Pagination (Cursor-Based), Sharding by Time, Result Diversity
Step 1: Requirements and Scope
Functional Requirements
- Search tweets by keywords
- Return results ranked by relevance and recency
- Support filters (date range, from user, has media, etc.)
- Handle hashtag and mention searches
- Support phrase and boolean queries
- Real-time indexing (new tweets searchable within seconds)
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | Under 200ms p99 | Users expect instant results |
| Freshness | Under 10 seconds | Breaking news must be searchable |
| Availability | 99.99% | Core feature |
| Scale | 500K queries/second | Viral events spike traffic |
Scale Estimation
- 500 million tweets per day (6,000/second)
- Average tweet: 280 chars + metadata = ~1KB
- 30 days of searchable tweets = 15 billion tweets = 15 TB
- 500,000 search queries per second at peak
Step 2: High-Level Design
Step 3: Core Design Decisions
Decision 1: Index Architecture
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Single Elasticsearch cluster | One cluster handles everything | Simple | Scaling limits |
| Time-partitioned shards | Each shard covers a time window | Efficient queries, easy archival | Cross-shard queries |
| Per-user sharding | User's tweets on one shard | Fast "from:user" queries | Unbalanced (celebrities) |
Recommendation: Time-partitioned with "hot" shard for recent tweets.
Decision 2: Inverted Index Structure
For tweet: "I love coffee #MondayMotivation"
| Term | Posting List |
|---|---|
| "love" | [tweet_123, tweet_456, tweet_789, ...] |
| "coffee" | [tweet_123, tweet_234, ...] |
| "#mondaymotivation" | [tweet_123, tweet_567, ...] |
| @username | [tweet_890, ...] |
Each posting entry contains:
- Tweet ID - Unique identifier
- Term frequency - How often term appears
- Position in tweet - For phrase queries
- Timestamp - For recency ranking
Decision 3: Ranking Algorithm
Ranking combines relevance and recency. The final score is a weighted sum of four components:
- Relevance: TF-IDF or BM25 score measuring how well the tweet matches query terms
- Recency: A decay function that decreases score based on tweet age
- Engagement: Normalized count of likes, retweets, and replies
- Authority: The author's follower count and verification status
| Factor | Weight | Rationale |
|---|---|---|
| Relevance | 0.4 | Must match query terms |
| Recency | 0.3 | Fresh content is important |
| Engagement | 0.2 | Social proof of quality |
| Authority | 0.1 | Verified/popular users |
Decision 4: Real-Time Indexing
To achieve sub-10-second freshness:
Latency: Tweet created to searchable = 2-5 seconds
Step 4: Query Processing
Query Parsing
When a user types a query like coffee shop from:starbucks since:2024-01-01, the search system parses it into structured components:
- Must clauses: Required matches, such as text containing "coffee shop"
- Filter clauses: Exact constraints that narrow results without affecting relevance score, such as author equals "starbucks" and timestamp on or after January 1, 2024
- Should clauses: Optional matches that boost relevance if present, such as hashtags containing "coffee"
Search Flow
Step 5: Handling Edge Cases
Viral Events (Spike Traffic)
When a major event happens (Super Bowl, breaking news):
- Search volume can spike 100x
- Same queries repeated millions of times
Solution: Query caching
Deleted/Hidden Tweets
Tweets get deleted, accounts get suspended:
| Approach | Latency | Complexity |
|---|---|---|
| Re-index (remove from index) | Minutes | Simple |
| Bloom filter at query time | Milliseconds | Medium |
| Mark as deleted, filter in results | Immediate | Low |
Recommendation: Mark-and-filter for speed, periodic re-index for cleanup.
Step 6: Typeahead/Autocomplete
For search suggestions as users type:
Latency target: Under 50ms
Real-World Systems
| System | Notable Design Choice |
|---|---|
| Twitter/X | Real-time indexing, engagement-based ranking |
| Elasticsearch | Inverted index, BM25 scoring |
| Algolia | Edge search, typo tolerance |
| Meilisearch | Rust-based, instant search |
Summary
| Decision | Recommendation | Rationale |
|---|---|---|
| Index structure | Time-partitioned shards | Efficient queries, easy archival |
| Freshness | Real-time with buffered writes | 2-5 second indexing latency |
| Ranking | BM25 + recency decay + engagement | Balances relevance and freshness |
| Hot traffic | Query caching (30s TTL) | Handles viral events |
| Typeahead | Parallel sources + personalization | Fast, relevant suggestions |