
Design a Distributed Cache

Build a distributed caching system like Redis or Memcached that can store billions of key-value pairs across multiple nodes with sub-millisecond latency.

Related Concepts: Consistent Hashing, Replication, Caching

Step 1: Requirements and Scope

Functional Requirements

| Requirement | Description |
| --- | --- |
| Basic operations | GET, SET, DELETE with TTL support |
| Data types | Strings, lists, sets, hashes |
| Atomic operations | INCREMENT, DECREMENT, APPEND |
| Expiration | Time-based key expiration |
| Persistence | Optional disk persistence |

Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Latency | < 1ms p99 | Caches are on the critical path; a slow cache defeats the purpose |
| Throughput | 1M+ ops/sec per node | Need to handle bursty traffic |
| Availability | 99.99% | Cache failures cause database stampedes |
| Scalability | Linear horizontal scaling | Add nodes to add capacity |

Scale Estimation

Design for a large-scale deployment:

  • Storage: 100TB total data across cluster
  • Operations: 10M requests per second
  • Key size: Average 100 bytes
  • Value size: Average 1KB
  • Keys: ~100 billion keys

Per node (assuming 100 nodes):

  • 1TB RAM per node
  • 100K ops/sec per node
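
A quick back-of-the-envelope check of those per-node figures (a rough sketch; the 100-node split and the average sizes are the assumptions listed above):

```python
# Rough sizing check using the estimates above.
total_storage_bytes = 100 * 10**12      # 100 TB across the cluster
total_ops_per_sec = 10_000_000          # 10M requests/sec
entry_bytes = 100 + 1_000               # avg key (100 B) + avg value (1 KB)
nodes = 100                             # assumed node count

total_keys = total_storage_bytes / entry_bytes            # ~90-100 billion keys
ram_per_node_tb = total_storage_bytes / nodes / 10**12    # 1 TB per node
ops_per_node = total_ops_per_sec // nodes                 # 100K ops/sec per node

print(f"~{total_keys / 10**9:.0f}B keys, {ram_per_node_tb:.0f} TB RAM/node, "
      f"{ops_per_node:,} ops/sec/node")
```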

Step 2: High-Level Design


Key Components:

  • Client Library: Handles routing, connection pooling, and failover
  • Cache Nodes: Store data in memory with optional persistence
  • Configuration Store: Tracks cluster membership and configuration
  • Replicas: Provide redundancy for each primary node

Step 3: Data Partitioning

Data must be spread across nodes. There are three main options.

Option 1: Modulo Hashing

The hash of the key modulo the number of nodes determines which node stores the data. Simple but problematic when nodes change. Adding or removing a node reshuffles almost everything.
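A small sketch of why this hurts: with a uniform hash, growing a cluster from 4 to 5 nodes remaps roughly 80% of keys (the node counts and key format here are illustrative).

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Modulo placement: hash the key, then take it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 nodes")  # ~80%
```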

Option 2: Consistent Hashing


Each node owns a range on the hash ring. When a node joins or leaves, only adjacent keys move.

Option 3: Virtual Nodes

An improvement on consistent hashing: each physical node gets multiple positions on the ring (virtual nodes), which spreads data more evenly and accommodates heterogeneous hardware.

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Modulo | Simple | Full reshuffle on change | Fixed cluster size |
| Consistent hashing | Minimal reshuffling | Can have hotspots | Dynamic clusters |
| Virtual nodes | Even distribution | More memory for the ring | Production systems |

Recommendation: Virtual nodes. Redis Cluster applies the same idea with 16,384 hash slots distributed across the nodes of the cluster.
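
For illustration, a minimal hash ring with virtual nodes (the node names and virtual-node count are made up; MD5 is used here only to spread keys evenly, not for security):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (a minimal sketch)."""

    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = []  # sorted list of (hash, physical node)
        for node in nodes:
            for v in range(vnodes_per_node):
                self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42"))  # deterministic placement
```

Adding a node only inserts that node's virtual positions into the ring, so only the keys falling into those slices move.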

Step 4: Memory Management

The cache will fill up. What happens then?

Eviction Policies

| Policy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| LRU | Evict the least recently accessed key | Good for recency-based access | Not scan-resistant (a full table scan pollutes the cache) |
| LFU | Evict the least frequently accessed key | Good for popularity-based access | Slow to adapt to changing patterns |
| Random | Evict a random key | O(1), no metadata overhead | May evict hot keys |
| TTL | Evict keys nearest to expiration | Good when TTLs are meaningful | Requires a TTL on every key |

Recommendation: LRU with sampling. Instead of tracking exact LRU order (expensive), sample a few random keys and evict the oldest. Redis does this with a configurable sample size (maxmemory-samples).
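
A minimal sketch of sampled LRU, not Redis's actual implementation: access times are tracked per key, and eviction picks the stalest key out of a small random sample.

```python
import random
import time

class SampledLRUCache:
    """Approximate LRU: sample a few keys and evict the least recently used one."""

    def __init__(self, max_entries: int, sample_size: int = 5):
        self.max_entries = max_entries
        self.sample_size = sample_size
        self._data = {}          # key -> value
        self._last_access = {}   # key -> last access timestamp

    def get(self, key):
        if key in self._data:
            self._last_access[key] = time.monotonic()
            return self._data[key]
        return None

    def set(self, key, value):
        if key not in self._data and len(self._data) >= self.max_entries:
            self._evict_one()
        self._data[key] = value
        self._last_access[key] = time.monotonic()

    def _evict_one(self):
        # Sample up to `sample_size` random keys and drop the stalest one.
        sample = random.sample(list(self._data), min(self.sample_size, len(self._data)))
        victim = min(sample, key=self._last_access.__getitem__)
        del self._data[victim], self._last_access[victim]
```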

Memory Fragmentation

Long-running caches suffer from fragmentation. The memory allocator has free space, but it is in chunks too small to use.

Solutions:

  • jemalloc: Memory allocator designed for long-running processes
  • Slab allocation: Pre-allocate fixed-size chunks (the Memcached approach; see the sketch after this list)
  • Defragmentation: Periodically move values to consolidate free space
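
To make the slab idea concrete, here is a toy allocator that rounds each value up to a fixed size class and reuses freed chunks of the same class; the size classes are illustrative, not Memcached's actual ones.

```python
class SlabAllocator:
    """Toy slab allocator: values go into the smallest size class that fits."""

    SIZE_CLASSES = [64, 128, 256, 512, 1024, 4096]  # illustrative chunk sizes (bytes)

    def __init__(self):
        # One free list of reusable chunks per size class.
        self._free = {size: [] for size in self.SIZE_CLASSES}

    def _class_for(self, nbytes: int) -> int:
        for size in self.SIZE_CLASSES:
            if nbytes <= size:
                return size
        raise ValueError("value too large for any slab class")

    def allocate(self, nbytes: int) -> bytearray:
        size = self._class_for(nbytes)
        # Reuse a freed chunk of the same class if one exists; otherwise allocate.
        return self._free[size].pop() if self._free[size] else bytearray(size)

    def free(self, chunk: bytearray) -> None:
        # Chunks keep their class size, so they slot back into the right free list.
        self._free[len(chunk)].append(chunk)
```

Because every chunk in a class has the same size, a freed chunk can always be reused by the next value of that class, which is what avoids external fragmentation.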

Step 5: Replication and Failover

A cache going down should not cause a meltdown.

Replication Strategies

| Strategy | Latency | Durability | Use Case |
| --- | --- | --- | --- |
| No replication | Lowest | None | Ephemeral cache |
| Async replication | Low | May lose recent writes | Most caches |
| Sync replication | Higher | Strong | Cache used as the primary data store |

Recommendation: Async replication with at least one replica. Accept that a few seconds of writes might be lost on failover. That is acceptable for a cache.
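
A minimal sketch of the asynchronous pattern: the primary acknowledges a write as soon as it lands locally, and a background thread ships it to the replica later. The send_to_replica callable is a stand-in for the real replication transport.

```python
import queue
import threading

class AsyncReplicatingCache:
    """Primary acknowledges writes immediately; a background thread replicates them."""

    def __init__(self, send_to_replica):
        self._data = {}
        self._replication_log = queue.Queue()
        self._send = send_to_replica  # callable that pushes (key, value) to the replica
        threading.Thread(target=self._replicate_forever, daemon=True).start()

    def set(self, key, value):
        self._data[key] = value                   # acknowledge locally right away
        self._replication_log.put((key, value))   # replicate in the background

    def get(self, key):
        return self._data.get(key)

    def _replicate_forever(self):
        while True:
            key, value = self._replication_log.get()
            self._send(key, value)  # writes still queued here are lost on failover
```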

Failover Process


Failover should be automatic:

  1. Health checks detect primary failure
  2. Coordinator promotes replica
  3. Clients get updated routing
  4. Traffic shifts to new primary

Step 6: Persistence Options

A pure in-memory cache loses everything on restart. Sometimes persistence is needed.

Approaches

| Approach | How It Works | Recovery Time | Performance Impact |
| --- | --- | --- | --- |
| None | Memory only | Cold start | None |
| Snapshots (RDB) | Periodic full dump | Minutes | Fork + write |
| Append-Only File (AOF) | Log every write | Seconds | fsync overhead |
| Hybrid | Snapshot + AOF since the last snapshot | Seconds | Best of both |

Redis supports this hybrid mode: when AOF is enabled, rewrites store an RDB-format snapshot followed by recent changes as append-only entries.

Snapshotting Trick: Copy-on-Write

How do you snapshot without stopping writes? Fork the process.


The child process sees a consistent snapshot via copy-on-write while the parent continues serving requests; only pages modified after the fork get physically copied.
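
A minimal fork-based snapshot sketch for a POSIX system; the file path and JSON encoding are illustrative, not how Redis writes RDB files.

```python
import json
import os

data = {"user:1": "alice", "user:2": "bob"}  # the in-memory cache contents

def snapshot(path: str) -> None:
    pid = os.fork()
    if pid == 0:
        # Child: sees the state as of fork() via copy-on-write pages.
        with open(path, "w") as f:
            json.dump(data, f)
        os._exit(0)
    # Parent: returns immediately and keeps serving requests.

snapshot("/tmp/cache-snapshot.json")
data["user:3"] = "carol"   # happens after the fork, so it is not in the snapshot
os.wait()                  # reap the child once it finishes writing
```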

Step 7: Hot Key Problem

What happens when one key gets much more traffic than others? That node becomes a bottleneck.

Detection

Track request rates per key. Keys receiving roughly 100x the average per-key rate are considered hot.
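
One way to sketch the detection: count hits per key over a short window and flag keys whose count is far above the per-key average. The window length is an assumption; the 100x threshold comes from the text above.

```python
import time
from collections import Counter

class HotKeyDetector:
    """Counts per-key hits over a window and flags keys far above the average rate."""

    def __init__(self, window_seconds: float = 10.0, hot_factor: float = 100.0):
        self.window_seconds = window_seconds
        self.hot_factor = hot_factor
        self._counts = Counter()
        self._window_start = time.monotonic()

    def record(self, key: str) -> None:
        if time.monotonic() - self._window_start > self.window_seconds:
            self._counts.clear()                  # start a fresh window
            self._window_start = time.monotonic()
        self._counts[key] += 1

    def hot_keys(self):
        if not self._counts:
            return []
        average = sum(self._counts.values()) / len(self._counts)
        return [k for k, c in self._counts.items() if c >= self.hot_factor * average]
```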

Solutions

| Solution | How It Works | Trade-offs |
| --- | --- | --- |
| Local caching | Client caches hot keys locally | Staleness risk |
| Key replication | Store hot keys on multiple nodes | Consistency complexity |
| Key splitting | Split popular_key into popular_key_1, popular_key_2, ... | Application changes |

Practical approach: Client-side local cache with short TTL (100ms). Most hot key problems are read-heavy, and slight staleness is acceptable.
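
A minimal client-side sketch of that idea: values are served locally for up to 100ms before falling back to the cluster. The fetch_from_cluster callable stands in for the real client library call.

```python
import time

class LocalHotKeyCache:
    """Tiny client-side cache: serve keys locally for up to `ttl` seconds."""

    def __init__(self, fetch_from_cluster, ttl: float = 0.1):
        self._fetch = fetch_from_cluster  # callable that hits the distributed cache
        self._ttl = ttl
        self._entries = {}                # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]               # fresh enough: skip the network hop
        value = self._fetch(key)
        self._entries[key] = (value, time.monotonic() + self._ttl)
        return value
```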

Step 8: Client-Side Concerns

Connection Pooling

Do not open a new connection per request; maintain a pool. By Little's law, the number of in-flight requests is roughly the request rate multiplied by the average latency; double that for headroom. At 10,000 requests per second with 1ms average latency, that is about 10 concurrent requests, so a pool of 20-40 connections.
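
Writing that calculation out, with the numbers from the example above:

```python
# Pool sizing via Little's law: concurrency ≈ arrival rate × average latency.
requests_per_second = 10_000
avg_latency_seconds = 0.001                               # 1 ms

in_flight = requests_per_second * avg_latency_seconds     # ≈ 10 concurrent requests
pool_size = int(in_flight * 2)                            # double for headroom -> 20
print(pool_size)
```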

Request Pipelining


Pipelining batches requests to amortize network round-trips. Can improve throughput 5-10x.
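
As a concrete sketch, here is pipelining with the redis-py client against a server assumed to be running locally on the default port; all the SET commands are buffered and flushed in a single round-trip when execute() is called.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Without pipelining: one network round-trip per command.
# With pipelining: commands are buffered client-side and sent together.
pipe = r.pipeline(transaction=False)
for i in range(1000):
    pipe.set(f"key:{i}", i)
results = pipe.execute()  # one round-trip, one list of replies
```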

Real-World Systems

| System | Notable Design Choice |
| --- | --- |
| Redis | Single-threaded event loop, hash slots (virtual nodes), async replication |
| Memcached | Multi-threaded, slab allocator, no persistence |
| AWS ElastiCache | Managed Redis/Memcached, automatic failover |
| Twemproxy | Proxy layer for sharding Memcached/Redis |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
| --- | --- | --- |
| Partitioning | Modulo, consistent hashing, virtual nodes | Virtual nodes for even distribution |
| Eviction | LRU, LFU, random, TTL | Sampled LRU |
| Replication | None, async, sync | Async with 1+ replica |
| Persistence | None, snapshots, AOF | Hybrid (snapshots + AOF) |
| Hot keys | Local cache, replication, splitting | Client-side local cache |