
Design a Key-Value Store

Design a distributed key-value store that supports get(key) and put(key, value) operations at massive scale.

Related Concepts: Consistent Hashing, Replication, CAP Theorem, Quorum Consensus, Vector Clocks, Gossip Protocol, LSM Trees, Merkle Trees, Anti-Entropy, Sloppy Quorum

Step 1: Requirements and Scope

Functional Requirements

  • put(key, value) - Store a key-value pair
  • get(key) - Retrieve value by key
  • Keys and values are strings
  • Values can be large (around 10 KB on average)
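
The functional surface reduces to two calls. A minimal interface sketch (a hypothetical API, not any specific product's SDK):

```python
from typing import Optional

class KeyValueStore:
    """Hypothetical client interface for the store's two core operations."""

    def put(self, key: str, value: str) -> None:
        """Store or overwrite the value associated with key."""
        raise NotImplementedError

    def get(self, key: str) -> Optional[str]:
        """Return the value for key, or None if the key does not exist."""
        raise NotImplementedError
```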

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Availability | 99.99% uptime | Users expect always-on access |
| Latency | < 10 ms p99 | Real-time applications need fast responses |
| Scalability | Petabytes of data | Data grows continuously |
| Durability | No data loss | Data is the business |
| Consistency | Tunable | Different use cases need different guarantees |

Scale Estimation

  • 10 billion key-value pairs
  • Average key: 100 bytes, average value: 10 KB
  • Storage: 10B x 10KB = 100 TB
  • Read:write ratio: 10:1
  • 100K requests/second
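
A quick back-of-envelope check of the figures above (all inputs are the assumptions listed in this section, using decimal units where 1 TB = 10^12 bytes):

```python
# Back-of-envelope check of the scale estimates above.
pairs = 10 * 10**9            # 10 billion key-value pairs
value_bytes = 10 * 10**3      # ~10 KB average value
key_bytes = 100               # ~100 bytes average key

raw_tb = pairs * (value_bytes + key_bytes) / 10**12
total_rps = 100_000
write_rps = total_rps / 11    # 10:1 read:write ratio
read_rps = total_rps - write_rps

print(f"raw data: ~{raw_tb:.0f} TB before replication")    # ~100 TB
print(f"~{read_rps:,.0f} reads/s, ~{write_rps:,.0f} writes/s")
# With N=3 replication, raw storage roughly triples (~300 TB).
```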

Step 2: High-Level Design


Key Components:

  • Coordinator nodes: Route requests to the correct storage nodes
  • Storage nodes: Store the actual key-value data
  • Partitions: Data is split across nodes using consistent hashing

Step 3: Core Design Decisions

Decision 1: CAP Theorem - CP vs AP

The CAP theorem states that a distributed system cannot provide all three of the following guarantees at once; during a network partition, you must choose between consistency and availability:

  • Consistency: All clients see the same data
  • Availability: System always responds
  • Partition tolerance: System works despite network failures

| System Type | Guarantees | Sacrifices | Use Cases |
|---|---|---|---|
| CP system | Consistency + Partition Tolerance | Availability during partitions | Banking, inventory, booking systems |
| AP system | Availability + Partition Tolerance | Strong consistency | Social media, caching, session storage |

Real-world examples:

  • CP: Google Spanner, HBase, MongoDB (default)
  • AP: DynamoDB, Cassandra, Riak

Recommendation for most use cases: AP with eventual consistency, because:

  • Network partitions are rare but happen
  • Brief inconsistency is often acceptable
  • Availability impacts user experience directly

Decision 2: Data Partitioning Strategy

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range-based | Keys A-M on Node 1, N-Z on Node 2 | Good for range queries | Hot spots if keys are sequential |
| Hash-based | hash(key) % N selects a node | Even distribution | Adding nodes requires a full reshuffle |
| Consistent hashing | Keys map to positions on a ring | Minimal reshuffling on node changes | More complex to implement |

Recommendation: Consistent hashing with virtual nodes

  • When adding/removing nodes, only K/N keys need to move (K=total keys, N=nodes)
  • Virtual nodes ensure even distribution despite physical node differences
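
A minimal sketch of a hash ring with virtual nodes. The node names, hash function, and number of virtual nodes per physical node are illustrative assumptions, not a reference implementation:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 used here for simplicity)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = {}        # ring position -> physical node
        self._positions = []   # sorted ring positions
        for node in nodes:
            self.add_node(node, vnodes_per_node)

    def add_node(self, node, vnodes_per_node=100):
        # Each physical node owns many virtual positions for an even load spread.
        for i in range(vnodes_per_node):
            pos = _hash(f"{node}#vn{i}")
            self._ring[pos] = node
            bisect.insort(self._positions, pos)

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position to the first virtual node.
        pos = _hash(key)
        idx = bisect.bisect(self._positions, pos) % len(self._positions)
        return self._ring[self._positions[idx]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))   # e.g. "node-b"
```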

Decision 3: Replication Strategy

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Single leader | All writes to one node, replicated to followers | Simple, consistent | Leader is a bottleneck/SPOF |
| Multi-leader | Multiple nodes accept writes | Better availability | Conflict resolution needed |
| Leaderless | Any replica accepts reads/writes | Highest availability | Requires quorum consensus |

Recommendation for high availability: Leaderless with quorum consensus

Replication factor (N): Typically 3 replicas

  • Survives 2 node failures
  • Good balance of durability vs storage cost
  • Place replicas in different racks/data centers
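
Building on the ring sketch above, the coordinator can pick the N distinct physical nodes that follow the key's position (its replication "preference list"). A hypothetical helper, assuming the ConsistentHashRing, _hash, and `ring` instance from the previous sketch:

```python
import bisect  # plus ConsistentHashRing, _hash, and `ring` from the previous sketch

def preference_list(ring, key: str, n: int = 3) -> list:
    """Return the first n distinct physical nodes clockwise from the key's position."""
    pos = _hash(key)
    start = bisect.bisect(ring._positions, pos)
    replicas = []
    for i in range(len(ring._positions)):
        vnode_pos = ring._positions[(start + i) % len(ring._positions)]
        node = ring._ring[vnode_pos]
        if node not in replicas:
            replicas.append(node)   # skip virtual nodes owned by already-chosen nodes
        if len(replicas) == n:
            break
    return replicas

print(preference_list(ring, "user:42", n=3))   # e.g. ['node-b', 'node-c', 'node-a']
```

In practice each replica would also be placed in a different rack or data center, as noted above.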

Decision 4: Consistency Level (Quorum)

Define W (write quorum) and R (read quorum):

| Configuration | W | R | Behavior | Use Case |
|---|---|---|---|---|
| Strong consistency | 2 | 2 | W + R > N guarantees overlap | Financial data |
| Fast writes | 1 | 3 | Quick writes, slower reads | Write-heavy workloads |
| Fast reads | 3 | 1 | Slow writes, quick reads | Read-heavy workloads |
| Eventual consistency | 1 | 1 | Fastest, may read stale data | Caching, sessions |

The math: If W + R > N, at least one node in the read quorum has the latest write.

Typical production config: N=3, W=2, R=2 (strong consistency with reasonable latency)
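
A simplified coordinator sketch showing how W and R are enforced. Replica transport, timeouts, and parallel fan-out are omitted, and the replica objects (with `store` and `fetch` methods) and the timestamp-based versioning are illustrative assumptions:

```python
import time

class QuorumCoordinator:
    def __init__(self, replicas, n=3, w=2, r=2):
        # W + R > N ensures every read quorum overlaps the latest write quorum.
        assert w + r > n, "W + R must exceed N for quorum overlap"
        self.replicas, self.n, self.w, self.r = replicas, n, w, r

    def put(self, key, value):
        version = time.time_ns()            # stand-in for a real version/vector clock
        acks = 0
        for replica in self.replicas[: self.n]:
            if replica.store(key, value, version):
                acks += 1
            if acks >= self.w:
                return True                 # success once W replicas acknowledge
        return False                        # fewer than W acks: the write fails

    def get(self, key):
        responses = []                      # list of (value, version) tuples
        for replica in self.replicas[: self.n]:
            resp = replica.fetch(key)
            if resp is not None:
                responses.append(resp)
            if len(responses) >= self.r:
                break
        if len(responses) < self.r:
            return None
        # Return the newest of the R responses; the overlap guarantees it
        # includes the latest successful write.
        return max(responses, key=lambda vv: vv[1])[0]
```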

Decision 5: Conflict Resolution

When concurrent writes happen to different replicas:

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Last-Write-Wins (LWW) | Highest timestamp wins | Simple | Data loss, clock sync issues |
| Vector clocks | Track causality per server | No data loss, detects conflicts | Complex, client resolves conflicts |
| CRDTs | Data structures that auto-merge | No conflicts | Limited data types |

Recommendation:

  • LWW for simplicity when data loss is acceptable (caching)
  • Vector clocks when data integrity matters (shopping carts)
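
A small sketch of vector clock comparison, enough to tell "happened-before" from a true conflict. The clock representation (a dict of node ID to counter) and the node names are illustrative assumptions:

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has (a >= b for every node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict: concurrent writes, client must reconcile"

# Example: two writes that each saw a different replica's update first.
clock_a = {"node1": 2, "node2": 1}
clock_b = {"node1": 1, "node2": 2}
print(compare(clock_a, clock_b))   # conflict: concurrent writes, client must reconcile
```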

Decision 6: Storage Engine

| Engine | Optimized For | How It Works | Examples |
|---|---|---|---|
| LSM Tree | Writes | Sequential writes to memtable, flush to SSTables | Cassandra, RocksDB, LevelDB |
| B-Tree | Reads | In-place updates, indexed | MySQL, PostgreSQL |

LSM Tree write path:

  1. Write to in-memory buffer (memtable)
  2. When full, flush to immutable sorted file (SSTable)
  3. Periodically compact/merge SSTables

LSM Tree read path:

  1. Check memtable
  2. Check SSTables (newest first)
  3. Use bloom filters to skip SSTables that definitely do not contain the key

Recommendation: LSM Tree for write-heavy workloads (most KV stores)
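
A toy sketch of the write and read paths above. A real engine adds a write-ahead log, compaction, tombstones for deletes, and bloom filters; the flush threshold here is arbitrary:

```python
class ToyLSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}              # in-memory, mutable buffer
        self.sstables = []              # immutable sorted "files", newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable as a sorted, immutable SSTable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                    # 1. check memtable
            return self.memtable[key]
        for sstable in reversed(self.sstables):     # 2. newest SSTable first
            if key in sstable:                      # a real engine consults a bloom
                return sstable[key]                 #    filter before touching disk
        return None
```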

Step 4: Failure Handling

Failure Detection: Gossip Protocol

Each node:

  1. Maintains a membership list with a heartbeat counter for every node
  2. Periodically increments its own counter and gossips its membership list to a few random nodes
  3. Marks a peer as suspected if that peer's counter has not advanced within a threshold period
  4. Confirms the failure once multiple nodes agree the peer is unreachable

Advantages over heartbeating to a central monitor:

  • Decentralized = no single point of failure
  • Scales better than all-to-all communication
  • Tolerant to individual node failures
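
A condensed sketch of the gossip-based failure detector described above. The gossip fan-out, interval handling, and suspicion threshold are illustrative assumptions:

```python
import random
import time

class GossipNode:
    SUSPECT_AFTER = 5.0   # seconds without counter progress (illustrative threshold)

    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers    # other GossipNode instances
        # node_id -> (heartbeat counter, local time we last saw it advance)
        self.membership = {node_id: (0, time.time())}

    def tick(self):
        """Increment own heartbeat and gossip the membership list to a few random peers."""
        counter, _ = self.membership[self.node_id]
        self.membership[self.node_id] = (counter + 1, time.time())
        for peer in random.sample(self.peers, k=min(2, len(self.peers))):
            peer.receive(dict(self.membership))

    def receive(self, remote_membership):
        """Merge a peer's view: keep the highest counter seen for each node."""
        now = time.time()
        for node, (counter, _) in remote_membership.items():
            local = self.membership.get(node)
            if local is None or counter > local[0]:
                self.membership[node] = (counter, now)

    def suspected(self):
        """Nodes whose heartbeat counter has not advanced recently."""
        now = time.time()
        return [n for n, (_, seen) in self.membership.items()
                if n != self.node_id and now - seen > self.SUSPECT_AFTER]
```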

Temporary Failures: Sloppy Quorum + Hinted Handoff

When a replica is temporarily down:

  1. Write to another available node instead
  2. That node stores a "hint" about intended destination
  3. When the original node recovers, the hinted writes are replayed (handed off) to it

Benefit: Maintains availability during temporary failures without sacrificing durability.
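
A hypothetical sketch of how a stand-in node might queue hints and replay them once the intended replica comes back; the `store` method on the recovered replica is an assumption for illustration:

```python
from collections import defaultdict

class StandInNode:
    def __init__(self):
        self.data = {}
        # hints: intended replica id -> list of (key, value) writes it missed
        self.hints = defaultdict(list)

    def write_with_hint(self, key, value, intended_replica):
        """Accept a write on behalf of a temporarily unreachable replica."""
        self.data[key] = value
        self.hints[intended_replica].append((key, value))

    def handoff(self, replica_id, replica):
        """When the replica recovers, replay its hinted writes and drop the hints."""
        for key, value in self.hints.pop(replica_id, []):
            replica.store(key, value)   # hypothetical replica API
```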

Permanent Failures: Anti-Entropy with Merkle Trees

To detect and repair divergent replicas:

  1. Build hash tree (Merkle tree) of data on each replica
  2. Compare tree roots - if equal, data matches
  3. If different, traverse tree to find divergent branches
  4. Sync only the different data

Benefit: Minimizes data transfer - only sync what is different.
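
A compact sketch of Merkle-tree comparison over fixed key ranges. The bucketing scheme, hash choices, and two-level tree depth are assumptions made for illustration:

```python
import hashlib

def bucket_hashes(data: dict, buckets: int = 8) -> list:
    """Hash each key range ("bucket") so replicas can be compared range by range."""
    digests = [hashlib.sha256() for _ in range(buckets)]
    for key in sorted(data):
        b = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets  # stable bucket choice
        digests[b].update(f"{key}={data[key]}".encode())
    return [d.hexdigest() for d in digests]

def root_hash(leaf_hashes: list) -> str:
    """Pairwise-hash the leaves up to a single root (use a power-of-two bucket count)."""
    level = leaf_hashes
    while len(level) > 1:
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]

def buckets_to_sync(replica_a: dict, replica_b: dict, buckets: int = 8) -> list:
    """Compare roots first; only if they differ, return the divergent buckets."""
    a, b = bucket_hashes(replica_a, buckets), bucket_hashes(replica_b, buckets)
    if root_hash(a) == root_hash(b):
        return []                      # replicas already match, nothing to transfer
    return [i for i in range(buckets) if a[i] != b[i]]
```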

Data Center Outages

  • Replicate across multiple data centers
  • Use geo-aware consistent hashing (replicas in different DCs)
  • Accept higher latency for cross-DC writes

Step 5: Performance Optimizations

| Optimization | What It Does | Impact |
|---|---|---|
| Bloom filters | Probabilistic "maybe in set" check | Reduces disk reads by 90%+ |
| Compression | Compress SSTables (LZ4, Snappy) | 2-4x storage reduction |
| Row cache | Cache hot rows in memory | Sub-ms reads for popular keys |
| Key cache | Cache SSTable index entries | Faster SSTable lookups |
| Read repair | Fix stale replicas during reads | Passive consistency maintenance |
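
A minimal bloom filter sketch to show why the "maybe in set" check is cheap; the bit-array size and number of hash functions are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # integer used as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means "maybe present"
        # (false positives possible, false negatives impossible).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # almost certainly False
```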

Summary: Techniques by Goal

| Goal | Technique |
|---|---|
| Store big data | Consistent hashing for partitioning |
| High availability for reads | Replication |
| High availability for writes | Leaderless replication, sloppy quorum |
| Tunable consistency | Configurable W, R, N |
| Handle temporary failures | Sloppy quorum, hinted handoff |
| Handle permanent failures | Merkle trees, anti-entropy repair |
| Handle data center outages | Cross-DC replication |
| Detect failures | Gossip protocol |
| Fast writes | LSM tree storage engine |
| Fast reads | Bloom filters, caching |
| Resolve conflicts | Vector clocks or LWW |

Real-World Systems

| System | Type | Consistency Model | Notable Features |
|---|---|---|---|
| DynamoDB | AP | Eventual (tunable) | Fully managed, auto-scaling |
| Cassandra | AP | Eventual (tunable) | Wide-column, CQL query language |
| Redis Cluster | AP | Eventual | In-memory, sub-ms latency |
| etcd | CP | Strong (Raft) | Coordination, config storage |
| CockroachDB | CP | Strong | SQL interface, distributed |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
|---|---|---|
| CAP choice | CP, AP | AP for most use cases |
| Partitioning | Range, hash, consistent hashing | Consistent hashing with virtual nodes |
| Replication | Single leader, multi-leader, leaderless | Leaderless with quorum |
| Consistency | Strong, eventual | Tunable (N=3, W=2, R=2 default) |
| Conflict resolution | LWW, vector clocks, CRDTs | Depends on use case |
| Storage engine | B-Tree, LSM Tree | LSM Tree for write-heavy |