Design a Key-Value Store
Design a distributed key-value store that supports get(key) and put(key, value) operations at massive scale.
Related Concepts: Consistent Hashing, Replication, CAP Theorem, Quorum Consensus, Vector Clocks, Gossip Protocol, LSM Trees, Merkle Trees, Anti-Entropy, Sloppy Quorum
Step 1: Requirements and Scope
Functional Requirements
- put(key, value) - Store a key-value pair
- get(key) - Retrieve value by key
- Keys and values are strings
- Values can be large (on the order of 10 KB)
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Availability | 99.99% uptime | Users expect always-on access |
| Latency | < 10ms p99 | Real-time applications need fast responses |
| Scalability | Petabytes of data | Data grows continuously |
| Durability | No data loss | Data is the business |
| Consistency | Tunable | Different use cases need different guarantees |
Scale Estimation
- 10 billion key-value pairs
- Average key: 100 bytes, average value: 10 KB
- Storage: 10B x 10KB = 100 TB
- Read:write ratio: 10:1
- 100K requests/second
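A quick back-of-envelope check of these figures (a sketch in Python; the 3x replication factor is taken from the replication discussion below):

```python
# Back-of-envelope sizing from the figures above.
KEYS = 10_000_000_000            # 10 billion key-value pairs
AVG_KEY_BYTES = 100
AVG_VALUE_BYTES = 10_000         # 10 KB

raw_bytes = KEYS * (AVG_KEY_BYTES + AVG_VALUE_BYTES)
print(f"Raw data: ~{raw_bytes / 1e12:.0f} TB")    # ~101 TB (keys are negligible)

REPLICATION_FACTOR = 3           # assumed; matches the replication decision below
print(f"With 3x replication: ~{raw_bytes * REPLICATION_FACTOR / 1e12:.0f} TB")

TOTAL_QPS = 100_000
reads = TOTAL_QPS * 10 / 11      # 10:1 read:write ratio
print(f"Reads/s: ~{reads:,.0f}, writes/s: ~{TOTAL_QPS - reads:,.0f}")
```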
Step 2: High-Level Design
Key Components:
- Coordinator nodes: Route requests to the correct storage nodes
- Storage nodes: Store the actual key-value data
- Partitions: Data is split across nodes using consistent hashing
Step 3: Core Design Decisions
Decision 1: CAP Theorem - CP vs AP
The CAP theorem states that a distributed system can guarantee at most two of three properties; when a network partition occurs, you must choose between consistency and availability:
- Consistency: All clients see the same data
- Availability: System always responds
- Partition tolerance: System works despite network failures
| System Type | Guarantees | Sacrifices | Use Cases |
|---|---|---|---|
| CP System | Consistency + Partition Tolerance | Availability during partitions | Banking, inventory, booking systems |
| AP System | Availability + Partition Tolerance | Strong consistency | Social media, caching, session storage |
Real-world examples:
- CP: Google Spanner, HBase, MongoDB (default)
- AP: DynamoDB, Cassandra, Riak
Recommendation for most use cases: AP with eventual consistency, because:
- Network partitions are rare but happen
- Brief inconsistency is often acceptable
- Availability impacts user experience directly
Decision 2: Data Partitioning Strategy
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range-based | Keys A-M on Node 1, N-Z on Node 2 | Good for range queries | Hot spots if keys are sequential |
| Hash-based | hash(key) % N selects a node | Even distribution | Adding nodes requires full reshuffle |
| Consistent hashing | Keys map to ring positions | Minimal reshuffling on node changes | More complex to implement |
Recommendation: Consistent hashing with virtual nodes
- When adding or removing a node, only about K/N keys need to move on average (K = total keys, N = nodes)
- Virtual nodes ensure even distribution despite physical node differences
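A minimal sketch of a consistent-hash ring with virtual nodes, assuming MD5 as the ring hash and an illustrative `vnodes_per_node` of 100 (both are choices for this example, not requirements):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes_per_node=100):
        self.vnodes = vnodes_per_node
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        # Each physical node owns many ring positions (virtual nodes),
        # which evens out the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get_nodes(self, key: str, n: int = 1):
        """Walk clockwise from the key's position and collect n distinct
        physical nodes: the first is the primary, the rest hold replicas."""
        idx = bisect.bisect(self._ring, (_hash(key),))
        distinct = len({node for _, node in self._ring})
        result = []
        while len(result) < min(n, distinct):
            _, node = self._ring[idx % len(self._ring)]
            if node not in result:
                result.append(node)
            idx += 1
        return result

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.get_nodes("user:42", n=3))   # primary + 2 replica candidates
```

The same `get_nodes(key, n=3)` walk also yields a natural preference list for placing the N replicas discussed next.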
Decision 3: Replication Strategy
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Single leader | All writes to one node, replicate to followers | Simple, consistent | Leader is bottleneck/SPOF |
| Multi-leader | Multiple nodes accept writes | Better availability | Conflict resolution needed |
| Leaderless | Any replica accepts reads/writes | Highest availability | Requires quorum consensus |
Recommendation for high availability: Leaderless with quorum consensus
Replication factor (N): Typically 3 replicas
- Data is not lost even if 2 of the 3 replicas fail
- Good balance of durability vs storage cost
- Place replicas in different racks/data centers
Decision 4: Consistency Level (Quorum)
With N replicas per key (N = 3 in the table below), define W (the number of replicas that must acknowledge a write) and R (the number of replicas consulted on a read):
| Configuration | W | R | Behavior | Use Case |
|---|---|---|---|---|
| Strong consistency | 2 | 2 | W + R > N guarantees overlap | Financial data |
| Fast writes | 1 | 3 | Quick writes, slower reads | Write-heavy workloads |
| Fast reads | 3 | 1 | Slow writes, quick reads | Read-heavy workloads |
| Eventual consistency | 1 | 1 | Fastest, may read stale | Caching, sessions |
The math: If W + R > N, at least one node in the read quorum has the latest write.
Typical production config: N=3, W=2, R=2 (strong consistency with reasonable latency)
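A sketch of how a coordinator could enforce these quorums; the `Replica` and `Coordinator` classes and the last-write-wins timestamps are simplifications for illustration, not a production protocol:

```python
import time

N, W, R = 3, 2, 2        # replicas, write quorum, read quorum (W + R > N)

class Replica:
    """Stand-in for one storage node: key -> (timestamp, value)."""
    def __init__(self):
        self.data = {}

    def put(self, key, value, ts):
        self.data[key] = (ts, value)
        return True                       # ack

    def get(self, key):
        return self.data.get(key)

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        ts = time.time()                  # LWW timestamp, for this sketch only
        # In reality the writes go out in parallel and the coordinator
        # returns success as soon as W acknowledgements arrive.
        acks = sum(1 for r in self.replicas if r.put(key, value, ts))
        return acks >= W

    def get(self, key):
        responses = []
        for r in self.replicas:           # in reality: query until R respond
            resp = r.get(key)
            if resp is not None:
                responses.append(resp)
            if len(responses) >= R:
                break
        if not responses:
            return None
        # Because W + R > N, at least one of the R responses carries the
        # latest successful write; return the freshest version.
        return max(responses)[1]

coord = Coordinator([Replica() for _ in range(N)])
coord.put("cart:7", "3 items")
print(coord.get("cart:7"))                # 3 items
```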
Decision 5: Conflict Resolution
When concurrent writes happen to different replicas:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Last-Write-Wins (LWW) | Highest timestamp wins | Simple | Data loss, clock sync issues |
| Vector clocks | Track causality per server | No data loss, detects conflicts | Complex, client resolves conflicts |
| CRDTs | Data structures that auto-merge | No conflicts | Limited data types |
Recommendation:
- LWW for simplicity when data loss is acceptable (caching)
- Vector clocks when data integrity matters (shopping carts)
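A minimal vector clock sketch showing how concurrent versions are detected (the server ids are illustrative; in Dynamo-style stores the client resolves the conflict and writes back a merged value):

```python
class VectorClock:
    """Maps server id -> counter; used to order versions or detect conflicts."""
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, server_id):
        self.counters[server_id] = self.counters.get(server_id, 0) + 1

    def descends_from(self, other):
        """True if self is equal to or newer than other on every server."""
        return all(self.counters.get(s, 0) >= c for s, c in other.counters.items())

    def concurrent_with(self, other):
        """Neither clock descends from the other: a genuine conflict."""
        return not self.descends_from(other) and not other.descends_from(self)

    def merge(self, other):
        """Pairwise max; used after the client reconciles a conflict."""
        merged = dict(self.counters)
        for s, c in other.counters.items():
            merged[s] = max(merged.get(s, 0), c)
        return VectorClock(merged)

# Two clients updated the same key through different coordinators:
v1 = VectorClock({"server-A": 2, "server-B": 1})
v2 = VectorClock({"server-A": 1, "server-B": 2})
print(v1.concurrent_with(v2))         # True -> conflict, client must reconcile
print(v1.merge(v2).counters)          # {'server-A': 2, 'server-B': 2}
```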
Decision 6: Storage Engine
| Engine | Optimized For | How It Works | Examples |
|---|---|---|---|
| LSM Tree | Writes | Sequential writes to memtable, flush to SSTable | Cassandra, RocksDB, LevelDB |
| B-Tree | Reads | In-place updates, indexed | MySQL, PostgreSQL |
LSM Tree write path:
- Write to in-memory buffer (memtable)
- When full, flush to immutable sorted file (SSTable)
- Periodically compact/merge SSTables
LSM Tree read path:
- Check memtable
- Check SSTables (newest first)
- Use bloom filters to skip SSTables that definitely do not contain the key
Recommendation: LSM Tree for write-heavy workloads (most KV stores)
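A toy LSM engine illustrating both paths above; real engines add a write-ahead log, compaction, and bloom filters, and the `MEMTABLE_LIMIT` here is deliberately tiny:

```python
import bisect

MEMTABLE_LIMIT = 4   # flush threshold; real engines use tens of MB

class ToyLSM:
    def __init__(self):
        self.memtable = {}      # in-memory write buffer
        self.sstables = []      # immutable sorted runs, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        """Write the memtable out as an immutable sorted run (SSTable)."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # 1. Check the memtable first (most recent data).
        if key in self.memtable:
            return self.memtable[key]
        # 2. Check SSTables newest-first; binary search within each sorted run.
        for sstable in reversed(self.sstables):
            idx = bisect.bisect_left(sstable, (key,))
            if idx < len(sstable) and sstable[idx][0] == key:
                return sstable[idx][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"), db.get("k9"))   # v3 v9
```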
Step 4: Failure Handling
Failure Detection: Gossip Protocol
Each node:
- Maintains membership list with heartbeat counters
- Periodically sends heartbeats to random nodes
- Marks a node as suspected if its heartbeat counter stops increasing for a threshold period
- Confirms failure when multiple nodes agree
Advantages over heartbeat to central monitor:
- Decentralized = no single point of failure
- Scales better than all-to-all communication
- Tolerant to individual node failures
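A sketch of the heartbeat-counter logic one node might run; the 5-second failure timeout and single random peer per round are illustrative parameters:

```python
import random
import time

FAIL_TIMEOUT = 5.0   # seconds without a heartbeat increase before suspecting a node

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                  # other GossipNode objects
        self.heartbeat = 0
        # membership table: node_id -> (latest heartbeat seen, local time of update)
        self.members = {node_id: (0, time.time())}

    def tick(self):
        """One gossip round: bump own heartbeat, push the table to a random peer."""
        self.heartbeat += 1
        self.members[self.node_id] = (self.heartbeat, time.time())
        if self.peers:
            random.choice(self.peers).receive(self.members)

    def receive(self, remote_table):
        """Merge a peer's table, keeping the highest heartbeat seen per node."""
        now = time.time()
        for node_id, (hb, _) in remote_table.items():
            local_hb, _ = self.members.get(node_id, (-1, 0))
            if hb > local_hb:
                self.members[node_id] = (hb, now)

    def suspected_failures(self):
        """Nodes whose heartbeat has not advanced within the timeout."""
        now = time.time()
        return [n for n, (_, last) in self.members.items()
                if n != self.node_id and now - last > FAIL_TIMEOUT]
```

A suspected node is only confirmed as failed once other nodes report the same observation, which keeps one slow link from triggering false positives.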
Temporary Failures: Sloppy Quorum + Hinted Handoff
When a replica is temporarily down:
- Write to another available node instead
- That node stores a "hint" recording the intended destination
- When the original node recovers, the hinted writes are replayed to it
Benefit: Maintains availability during temporary failures without sacrificing durability.
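A sketch of the hint-and-replay flow, with invented `StorageNode` and helper names for illustration:

```python
class StorageNode:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}          # regular key-value data
        self.hints = []         # (intended_node_name, key, value) held for a down peer

    def put(self, key, value):
        self.data[key] = value

def write_with_hinted_handoff(key, value, preferred, fallback):
    """Try the preferred replica; if it is down, park the write on a fallback
    node along with a hint naming the intended destination."""
    if preferred.alive:
        preferred.put(key, value)
    else:
        fallback.put(key, value)                      # sloppy quorum: still durable
        fallback.hints.append((preferred.name, key, value))

def replay_hints(fallback, recovered):
    """When the original node comes back, forward the hinted writes to it."""
    remaining = []
    for target_name, key, value in fallback.hints:
        if target_name == recovered.name and recovered.alive:
            recovered.put(key, value)
        else:
            remaining.append((target_name, key, value))
    fallback.hints = remaining

node_a, node_d = StorageNode("A"), StorageNode("D")
node_a.alive = False                                  # A is temporarily down
write_with_hinted_handoff("k1", "v1", preferred=node_a, fallback=node_d)
node_a.alive = True                                   # A recovers
replay_hints(node_d, node_a)
print(node_a.data)                                    # {'k1': 'v1'}
```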
Permanent Failures: Anti-Entropy with Merkle Trees
To detect and repair divergent replicas:
- Build hash tree (Merkle tree) of data on each replica
- Compare tree roots - if equal, data matches
- If different, traverse tree to find divergent branches
- Sync only the different data
Benefit: Minimizes data transfer - only sync what is different.
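A sketch of this comparison using fixed key-range buckets as the leaves (the bucket layout and SHA-256 are choices made for the example):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(buckets):
    """buckets: the replica's key ranges. Returns the tree as a list of
    levels, leaf hashes first and the root hash last."""
    level = [_h(repr(sorted(b.items())).encode()) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [_h(level[i] + (level[i + 1] if i + 1 < len(level) else level[i]))
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def diff_buckets(tree_a, tree_b, level=None, idx=0):
    """Walk both trees from the root, descending only into branches whose
    hashes differ; returns the indices of divergent leaf buckets."""
    if level is None:
        level = len(tree_a) - 1                     # start at the root
    if tree_a[level][idx] == tree_b[level][idx]:
        return []                                   # subtrees match -> skip
    if level == 0:
        return [idx]                                # a leaf bucket that differs
    out = diff_buckets(tree_a, tree_b, level - 1, 2 * idx)
    if 2 * idx + 1 < len(tree_a[level - 1]):
        out += diff_buckets(tree_a, tree_b, level - 1, 2 * idx + 1)
    return out

# Two replicas holding the same 4 key-range buckets; bucket 2 has diverged.
replica_a = [{"a": 1}, {"f": 2}, {"m": 3}, {"t": 4}]
replica_b = [{"a": 1}, {"f": 2}, {"m": 99}, {"t": 4}]
print(diff_buckets(build_merkle(replica_a), build_merkle(replica_b)))   # [2]
```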
Data Center Outages
- Replicate across multiple data centers
- Use geo-aware consistent hashing (replicas in different DCs)
- Accept higher latency for cross-DC writes
Step 5: Performance Optimizations
| Optimization | What It Does | Impact |
|---|---|---|
| Bloom filters | Probabilistic "maybe in set" check | Reduces disk reads by 90%+ |
| Compression | Compress SSTables (LZ4, Snappy) | 2-4x storage reduction |
| Row cache | Cache hot rows in memory | Sub-ms reads for popular keys |
| Key cache | Cache SSTable index entries | Faster SSTable lookups |
| Read repair | Fix stale replicas during reads | Passive consistency maintenance |
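A minimal bloom filter sketch showing why a negative answer lets the read path skip an SSTable entirely (the bit-array size and hash count are illustrative; real engines size them from a target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)      # one byte per bit, for simplicity

    def _positions(self, key: str):
        # Derive several bit positions per key from independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means 'maybe', so check the SSTable."""
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:1", "user:2", "user:3"):
    bf.add(k)
print(bf.might_contain("user:2"))   # True
print(bf.might_contain("user:99"))  # almost certainly False -> skip this SSTable
```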
Summary: Techniques by Goal
| Goal | Technique |
|---|---|
| Store big data | Consistent hashing for partitioning |
| High availability for reads | Replication |
| High availability for writes | Leaderless replication, sloppy quorum |
| Tunable consistency | Configurable W, R, N |
| Handle temporary failures | Sloppy quorum, hinted handoff |
| Handle permanent failures | Merkle trees, anti-entropy repair |
| Handle data center outage | Cross-DC replication |
| Detect failures | Gossip protocol |
| Fast writes | LSM tree storage engine |
| Fast reads | Bloom filters, caching |
| Resolve conflicts | Vector clocks or LWW |
Real-World Systems
| System | Type | Consistency Model | Notable Features |
|---|---|---|---|
| DynamoDB | AP | Eventual (tunable) | Fully managed, auto-scaling |
| Cassandra | AP | Eventual (tunable) | Wide-column, CQL query language |
| Redis Cluster | AP | Eventual | In-memory, sub-ms latency |
| etcd | CP | Strong (Raft) | Coordination, config storage |
| CockroachDB | CP | Strong | SQL interface, distributed |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| CAP choice | CP, AP | AP for most use cases |
| Partitioning | Range, Hash, Consistent hash | Consistent hashing with virtual nodes |
| Replication | Single leader, Multi-leader, Leaderless | Leaderless with quorum |
| Consistency | Strong, Eventual | Tunable (N=3, W=2, R=2 default) |
| Conflict resolution | LWW, Vector clocks, CRDTs | Depends on use case |
| Storage engine | B-Tree, LSM Tree | LSM Tree for write-heavy |