Design a Key-Value Store
Design a distributed key-value store that supports get(key) and put(key, value) operations at massive scale.
Related Concepts: Consistent Hashing, Replication, CAP Theorem, Quorum Consensus, Vector Clocks, Gossip Protocol, LSM Trees, Merkle Trees, Anti-Entropy, Sloppy Quorum
Step 1: Requirements and Scope
Functional Requirements
- put(key, value) - Store a key-value pair
- get(key) - Retrieve value by key
- Keys and values are strings
- Values can be large (on the order of 10 KB)
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Availability | 99.99% uptime | Users expect always-on access |
| Latency | < 10ms p99 | Real-time applications need fast responses |
| Scalability | Petabytes of data | Data grows continuously |
| Durability | No data loss | Data is the business |
| Consistency | Tunable | Different use cases need different guarantees |
Scale Estimation
- 10 billion key-value pairs
- Average key: 100 bytes, average value: 10 KB
- Storage: 10B x 10KB = 100 TB
- Read:write ratio: 10:1
- 100K requests/second
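A quick back-of-envelope check of these figures (a sketch in Python; the 3x replication factor is taken from the replication discussion below):

```python
# Back-of-envelope sizing from the figures above.
KEYS = 10_000_000_000            # 10 billion key-value pairs
AVG_KEY_BYTES = 100
AVG_VALUE_BYTES = 10_000         # 10 KB

raw_bytes = KEYS * (AVG_KEY_BYTES + AVG_VALUE_BYTES)
print(f"Raw data: ~{raw_bytes / 1e12:.0f} TB")    # ~101 TB (keys are negligible)

REPLICATION_FACTOR = 3           # assumed; matches the replication decision below
print(f"With 3x replication: ~{raw_bytes * REPLICATION_FACTOR / 1e12:.0f} TB")

TOTAL_QPS = 100_000
reads = TOTAL_QPS * 10 / 11      # 10:1 read:write ratio
print(f"Reads/s: ~{reads:,.0f}, writes/s: ~{TOTAL_QPS - reads:,.0f}")
```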
Step 2: High-Level Design
Key Components:
- Coordinator nodes: Route requests to the correct storage nodes
- Storage nodes: Store the actual key-value data
- Partitions: Data is split across nodes using consistent hashing
Step 3: Core Design Decisions
Decision 1: CAP Theorem - CP vs AP
The CAP theorem states that a distributed system can guarantee at most two of three properties; when a network partition occurs, you must choose between consistency and availability:
- Consistency: All clients see the same data
- Availability: System always responds
- Partition tolerance: System works despite network failures
| System Type | Guarantees | Sacrifices | Use Cases |
|---|---|---|---|
| CP System | Consistency + Partition Tolerance | Availability during partitions | Banking, inventory, booking systems |
| AP System | Availability + Partition Tolerance | Strong consistency | Social media, caching, session storage |
Real-world examples:
- CP: Google Spanner, HBase, MongoDB (default)
- AP: DynamoDB, Cassandra, Riak
Recommendation for most use cases: AP with eventual consistency, because:
- Network partitions are rare but happen
- Brief inconsistency is often acceptable
- Availability impacts user experience directly
Decision 2: Data Partitioning Strategy
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range-based | Keys A-M on Node 1, N-Z on Node 2 | Good for range queries | Hot spots if keys are sequential |
| Hash-based | hash(key) % N selects a node | Even distribution | Adding nodes requires full reshuffle |
| Consistent hashing | Keys map to ring positions | Minimal reshuffling on node changes | More complex to implement |
Recommendation: Consistent hashing with virtual nodes
- When adding or removing a node, only about K/N keys need to move on average (K = total keys, N = nodes)
- Virtual nodes ensure even distribution despite physical node differences
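A minimal sketch of a consistent-hash ring with virtual nodes, assuming MD5 as the ring hash and an illustrative `vnodes_per_node` of 100 (both are choices for this example, not requirements):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes_per_node=100):
        self.vnodes = vnodes_per_node
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        # Each physical node owns many ring positions (virtual nodes),
        # which evens out the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get_nodes(self, key: str, n: int = 1):
        """Walk clockwise from the key's position and collect n distinct
        physical nodes: the first is the primary, the rest hold replicas."""
        idx = bisect.bisect(self._ring, (_hash(key),))
        distinct = len({node for _, node in self._ring})
        result = []
        while len(result) < min(n, distinct):
            _, node = self._ring[idx % len(self._ring)]
            if node not in result:
                result.append(node)
            idx += 1
        return result

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.get_nodes("user:42", n=3))   # primary + 2 replica candidates
```

The same `get_nodes(key, n=3)` walk also yields a natural preference list for placing the N replicas discussed next.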
Decision 3: Replication Strategy
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Single leader | All writes to one node, replicate to followers | Simple, consistent | Leader is bottleneck/SPOF |
| Multi-leader | Multiple nodes accept writes | Better availability | Conflict resolution needed |
| Leaderless | Any replica accepts reads/writes | Highest availability | Requires quorum consensus |
Recommendation for high availability: Leaderless with quorum consensus
Replication factor (N): Typically 3 replicas
- Data is not lost even if 2 of the 3 replicas fail
- Good balance of durability vs storage cost
- Place replicas in different racks/data centers
Decision 4: Consistency Level (Quorum)
With N replicas per key (N = 3 in the table below), define W (the number of replicas that must acknowledge a write) and R (the number of replicas consulted on a read):
| Configuration | W | R | Behavior | Use Case |
|---|---|---|---|---|
| Strong consistency | 2 | 2 | W + R > N guarantees overlap | Financial data |
| Fast writes | 1 | 3 | Quick writes, slower reads | Write-heavy workloads |
| Fast reads | 3 | 1 | Slow writes, quick reads | Read-heavy workloads |
| Eventual consistency | 1 | 1 | Fastest, may read stale | Caching, sessions |
The math: If W + R > N, at least one node in the read quorum has the latest write.
Typical production config: N=3, W=2, R=2 (strong consistency with reasonable latency)
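A sketch of how a coordinator could enforce these quorums; the `Replica` and `Coordinator` classes and the last-write-wins timestamps are simplifications for illustration, not a production protocol:

```python
import time

N, W, R = 3, 2, 2        # replicas, write quorum, read quorum (W + R > N)

class Replica:
    """Stand-in for one storage node: key -> (timestamp, value)."""
    def __init__(self):
        self.data = {}

    def put(self, key, value, ts):
        self.data[key] = (ts, value)
        return True                       # ack

    def get(self, key):
        return self.data.get(key)

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        ts = time.time()                  # LWW timestamp, for this sketch only
        # In reality the writes go out in parallel and the coordinator
        # returns success as soon as W acknowledgements arrive.
        acks = sum(1 for r in self.replicas if r.put(key, value, ts))
        return acks >= W

    def get(self, key):
        responses = []
        for r in self.replicas:           # in reality: query until R respond
            resp = r.get(key)
            if resp is not None:
                responses.append(resp)
            if len(responses) >= R:
                break
        if not responses:
            return None
        # Because W + R > N, at least one of the R responses carries the
        # latest successful write; return the freshest version.
        return max(responses)[1]

coord = Coordinator([Replica() for _ in range(N)])
coord.put("cart:7", "3 items")
print(coord.get("cart:7"))                # 3 items
```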
Decision 5: Conflict Resolution
When concurrent writes happen to different replicas:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Last-Write-Wins (LWW) | Highest timestamp wins | Simple | Data loss, clock sync issues |
| Vector clocks | Track causality per server | No data loss, detects conflicts | Complex, client resolves conflicts |
| CRDTs | Data structures that auto-merge | No conflicts | Limited data types |
Recommendation:
- LWW for simplicity when data loss is acceptable (caching)
- Vector clocks when data integrity matters (shopping carts)
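A minimal vector clock sketch showing how concurrent versions are detected (the server ids are illustrative; in Dynamo-style stores the client resolves the conflict and writes back a merged value):

```python
class VectorClock:
    """Maps server id -> counter; used to order versions or detect conflicts."""
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, server_id):
        self.counters[server_id] = self.counters.get(server_id, 0) + 1

    def descends_from(self, other):
        """True if self is equal to or newer than other on every server."""
        return all(self.counters.get(s, 0) >= c for s, c in other.counters.items())

    def concurrent_with(self, other):
        """Neither clock descends from the other: a genuine conflict."""
        return not self.descends_from(other) and not other.descends_from(self)

    def merge(self, other):
        """Pairwise max; used after the client reconciles a conflict."""
        merged = dict(self.counters)
        for s, c in other.counters.items():
            merged[s] = max(merged.get(s, 0), c)
        return VectorClock(merged)

# Two clients updated the same key through different coordinators:
v1 = VectorClock({"server-A": 2, "server-B": 1})
v2 = VectorClock({"server-A": 1, "server-B": 2})
print(v1.concurrent_with(v2))         # True -> conflict, client must reconcile
print(v1.merge(v2).counters)          # {'server-A': 2, 'server-B': 2}
```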
Decision 6: Storage Engine
| Engine | Optimized For | How It Works | Examples |
|---|---|---|---|
| LSM Tree | Writes | Sequential writes to memtable, flush to SSTable | Cassandra, RocksDB, LevelDB |
| B-Tree | Reads | In-place updates, indexed | MySQL, PostgreSQL |
LSM Tree write path:
- Write to in-memory buffer (memtable)
- When full, flush to immutable sorted file (SSTable)
- Periodically compact/merge SSTables
LSM Tree read path:
- Check memtable
- Check SSTables (newest first)
- Use bloom filters to skip SSTables that definitely do not contain the key
Recommendation: LSM Tree for write-heavy workloads (most KV stores)
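A toy LSM engine illustrating both paths above; real engines add a write-ahead log, compaction, and bloom filters, and the `MEMTABLE_LIMIT` here is deliberately tiny:

```python
import bisect

MEMTABLE_LIMIT = 4   # flush threshold; real engines use tens of MB

class ToyLSM:
    def __init__(self):
        self.memtable = {}      # in-memory write buffer
        self.sstables = []      # immutable sorted runs, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        """Write the memtable out as an immutable sorted run (SSTable)."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # 1. Check the memtable first (most recent data).
        if key in self.memtable:
            return self.memtable[key]
        # 2. Check SSTables newest-first; binary search within each sorted run.
        for sstable in reversed(self.sstables):
            idx = bisect.bisect_left(sstable, (key,))
            if idx < len(sstable) and sstable[idx][0] == key:
                return sstable[idx][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"), db.get("k9"))   # v3 v9
```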
Step 4: Failure Handling
Failure Detection: Gossip Protocol
Each node:
- Maintains membership list with heartbeat counters
- Periodically sends heartbeats to random nodes
- Marks a node as suspected if its heartbeat counter stops increasing for a threshold period
- Confirms failure when multiple nodes agree
Advantages over heartbeat to central monitor:
- Decentralized = no single point of failure
- Scales better than all-to-all communication
- Tolerant to individual node failures
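A sketch of the heartbeat-counter logic one node might run; the 5-second failure timeout and single random peer per round are illustrative parameters:

```python
import random
import time

FAIL_TIMEOUT = 5.0   # seconds without a heartbeat increase before suspecting a node

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                  # other GossipNode objects
        self.heartbeat = 0
        # membership table: node_id -> (latest heartbeat seen, local time of update)
        self.members = {node_id: (0, time.time())}

    def tick(self):
        """One gossip round: bump own heartbeat, push the table to a random peer."""
        self.heartbeat += 1
        self.members[self.node_id] = (self.heartbeat, time.time())
        if self.peers:
            random.choice(self.peers).receive(self.members)

    def receive(self, remote_table):
        """Merge a peer's table, keeping the highest heartbeat seen per node."""
        now = time.time()
        for node_id, (hb, _) in remote_table.items():
            local_hb, _ = self.members.get(node_id, (-1, 0))
            if hb > local_hb:
                self.members[node_id] = (hb, now)

    def suspected_failures(self):
        """Nodes whose heartbeat has not advanced within the timeout."""
        now = time.time()
        return [n for n, (_, last) in self.members.items()
                if n != self.node_id and now - last > FAIL_TIMEOUT]
```

A suspected node is only confirmed as failed once other nodes report the same observation, which keeps one slow link from triggering false positives.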
Temporary Failures: Sloppy Quorum + Hinted Handoff
When a replica is temporarily down:
- Write to another available node instead
- That node stores a "hint" recording the intended destination
- When the original node recovers, the hinted writes are replayed to it
Benefit: Maintains availability during temporary failures without sacrificing durability.
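A sketch of the hint-and-replay flow, with invented `StorageNode` and helper names for illustration:

```python
class StorageNode:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}          # regular key-value data
        self.hints = []         # (intended_node_name, key, value) held for a down peer

    def put(self, key, value):
        self.data[key] = value

def write_with_hinted_handoff(key, value, preferred, fallback):
    """Try the preferred replica; if it is down, park the write on a fallback
    node along with a hint naming the intended destination."""
    if preferred.alive:
        preferred.put(key, value)
    else:
        fallback.put(key, value)                      # sloppy quorum: still durable
        fallback.hints.append((preferred.name, key, value))

def replay_hints(fallback, recovered):
    """When the original node comes back, forward the hinted writes to it."""
    remaining = []
    for target_name, key, value in fallback.hints:
        if target_name == recovered.name and recovered.alive:
            recovered.put(key, value)
        else:
            remaining.append((target_name, key, value))
    fallback.hints = remaining

node_a, node_d = StorageNode("A"), StorageNode("D")
node_a.alive = False                                  # A is temporarily down
write_with_hinted_handoff("k1", "v1", preferred=node_a, fallback=node_d)
node_a.alive = True                                   # A recovers
replay_hints(node_d, node_a)
print(node_a.data)                                    # {'k1': 'v1'}
```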
Permanent Failures: Anti-Entropy with Merkle Trees
To detect and repair divergent replicas:
- Build hash tree (Merkle tree) of data on each replica
- Compare tree roots - if equal, data matches
- If different, traverse tree to find divergent branches
- Sync only the different data
Benefit: Minimizes data transfer - only sync what is different.
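A sketch of this comparison using fixed key-range buckets as the leaves (the bucket layout and SHA-256 are choices made for the example):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(buckets):
    """buckets: the replica's key ranges. Returns the tree as a list of
    levels, leaf hashes first and the root hash last."""
    level = [_h(repr(sorted(b.items())).encode()) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [_h(level[i] + (level[i + 1] if i + 1 < len(level) else level[i]))
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def diff_buckets(tree_a, tree_b, level=None, idx=0):
    """Walk both trees from the root, descending only into branches whose
    hashes differ; returns the indices of divergent leaf buckets."""
    if level is None:
        level = len(tree_a) - 1                     # start at the root
    if tree_a[level][idx] == tree_b[level][idx]:
        return []                                   # subtrees match -> skip
    if level == 0:
        return [idx]                                # a leaf bucket that differs
    out = diff_buckets(tree_a, tree_b, level - 1, 2 * idx)
    if 2 * idx + 1 < len(tree_a[level - 1]):
        out += diff_buckets(tree_a, tree_b, level - 1, 2 * idx + 1)
    return out

# Two replicas holding the same 4 key-range buckets; bucket 2 has diverged.
replica_a = [{"a": 1}, {"f": 2}, {"m": 3}, {"t": 4}]
replica_b = [{"a": 1}, {"f": 2}, {"m": 99}, {"t": 4}]
print(diff_buckets(build_merkle(replica_a), build_merkle(replica_b)))   # [2]
```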
Data Center Outages
- Replicate across multiple data centers
- Use geo-aware consistent hashing (replicas in different DCs)
- Accept higher latency for cross-DC writes
Step 5: Performance Optimizations
| Optimization | What It Does | Impact |
|---|---|---|
| Bloom filters | Probabilistic "maybe in set" check | Reduces disk reads by 90%+ |
| Compression | Compress SSTables (LZ4, Snappy) | 2-4x storage reduction |
| Row cache | Cache hot rows in memory | Sub-ms reads for popular keys |
| Key cache | Cache SSTable index entries | Faster SSTable lookups |
| Read repair | Fix stale replicas during reads | Passive consistency maintenance |
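A minimal bloom filter sketch showing why a negative answer lets the read path skip an SSTable entirely (the bit-array size and hash count are illustrative; real engines size them from a target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)      # one byte per bit, for simplicity

    def _positions(self, key: str):
        # Derive several bit positions per key from independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means 'maybe', so check the SSTable."""
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:1", "user:2", "user:3"):
    bf.add(k)
print(bf.might_contain("user:2"))   # True
print(bf.might_contain("user:99"))  # almost certainly False -> skip this SSTable
```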
Summary: Techniques by Goal
| Goal | Technique |
|---|---|
| Store big data | Consistent hashing for partitioning |
| High availability for reads | Replication |
| High availability for writes | Leaderless replication, sloppy quorum |
| Tunable consistency | Configurable W, R, N |
| Handle temporary failures | Sloppy quorum, hinted handoff |
| Handle permanent failures | Merkle trees, anti-entropy repair |
| Handle data center outage | Cross-DC replication |
| Detect failures | Gossip protocol |
| Fast writes | LSM tree storage engine |
| Fast reads | Bloom filters, caching |
| Resolve conflicts | Vector clocks or LWW |
Real-World Systems
| System | Type | Consistency Model | Notable Features |
|---|---|---|---|
| DynamoDB | AP | Eventual (tunable) | Fully managed, auto-scaling |
| Cassandra | AP | Eventual (tunable) | Wide-column, CQL query language |
| Redis Cluster | AP | Eventual | In-memory, sub-ms latency |
| etcd | CP | Strong (Raft) | Coordination, config storage |
| CockroachDB | CP | Strong | SQL interface, distributed |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| CAP choice | CP, AP | AP for most use cases |
| Partitioning | Range, Hash, Consistent hash | Consistent hashing with virtual nodes |
| Replication | Single leader, Multi-leader, Leaderless | Leaderless with quorum |
| Consistency | Strong, Eventual | Tunable (N=3, W=2, R=2 default) |
| Conflict resolution | LWW, Vector clocks, CRDTs | Depends on use case |
| Storage engine | B-Tree, LSM Tree | LSM Tree for write-heavy |