
Design a Distributed Cache

Build a distributed caching system like Redis or Memcached that can store billions of key-value pairs across multiple nodes with sub-millisecond latency.

Related Concepts: Consistent Hashing, Replication, Caching

Step 1: Requirements and Scope

Functional Requirements

| Requirement | Description |
| --- | --- |
| Basic operations | GET, SET, DELETE with TTL support |
| Data types | Strings, lists, sets, hashes |
| Atomic operations | INCREMENT, DECREMENT, APPEND |
| Expiration | Time-based key expiration |
| Persistence | Optional disk persistence |

Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Latency | < 1ms p99 | Caches are on the critical path; a slow cache defeats the purpose |
| Throughput | 1M+ ops/sec per node | Need to handle bursty traffic |
| Availability | 99.99% | Cache failures cause database stampedes |
| Scalability | Linear horizontal scaling | Add nodes to add capacity |

Scale Estimation

Design for a large-scale deployment:

  • Storage: 100TB total data across cluster
  • Operations: 10M requests per second
  • Key size: Average 100 bytes
  • Value size: Average 1KB
  • Keys: ~100 billion keys

Per node (assuming 100 nodes):

  • 1TB RAM per node
  • 100K ops/sec per node
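
A quick back-of-the-envelope check of those per-node figures (a rough sketch; the 100-node split and the average sizes are the assumptions listed above):

```python
# Rough sizing check using the estimates above.
total_storage_bytes = 100 * 10**12      # 100 TB across the cluster
total_ops_per_sec = 10_000_000          # 10M requests/sec
entry_bytes = 100 + 1_000               # avg key (100 B) + avg value (1 KB)
nodes = 100                             # assumed node count

total_keys = total_storage_bytes / entry_bytes            # ~90-100 billion keys
ram_per_node_tb = total_storage_bytes / nodes / 10**12    # 1 TB per node
ops_per_node = total_ops_per_sec // nodes                 # 100K ops/sec per node

print(f"~{total_keys / 10**9:.0f}B keys, {ram_per_node_tb:.0f} TB RAM/node, "
      f"{ops_per_node:,} ops/sec/node")
```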

Step 2: High-Level Design


Key Components:

  • Client Library: Handles routing, connection pooling, and failover
  • Cache Nodes: Store data in memory with optional persistence
  • Configuration Store: Tracks cluster membership and configuration
  • Replicas: Provide redundancy for each primary node

Step 3: Data Partitioning

Data must be spread across nodes. There are three main options.

Option 1: Modulo Hashing

The hash of the key modulo the number of nodes determines which node stores the data. Simple but problematic when nodes change. Adding or removing a node reshuffles almost everything.
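A small sketch of why this hurts: with a uniform hash, growing a cluster from 4 to 5 nodes remaps roughly 80% of keys (the node counts and key format here are illustrative).

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Modulo placement: hash the key, then take it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 nodes")  # ~80%
```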

Option 2: Consistent Hashing


Each node owns a range on the hash ring. When a node joins or leaves, only adjacent keys move.

Option 3: Virtual Nodes

An improvement on consistent hashing: each physical node gets multiple positions on the ring (virtual nodes), which spreads data more evenly and accommodates heterogeneous hardware.

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Modulo | Simple | Full reshuffle on change | Fixed cluster size |
| Consistent hashing | Minimal reshuffling | Can have hotspots | Dynamic clusters |
| Virtual nodes | Even distribution | More memory for the ring | Production systems |

Recommendation: Virtual nodes. Redis Cluster applies the same idea with 16,384 hash slots distributed across the nodes of the cluster.
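
For illustration, a minimal hash ring with virtual nodes (the node names and virtual-node count are made up; MD5 is used here only to spread keys evenly, not for security):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (a minimal sketch)."""

    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = []  # sorted list of (hash, physical node)
        for node in nodes:
            for v in range(vnodes_per_node):
                self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42"))  # deterministic placement
```

Adding a node only inserts that node's virtual positions into the ring, so only the keys falling into those slices move.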

Step 4: Memory Management

The cache will fill up. What happens then?

Eviction Policies

| Policy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| LRU | Evict the least recently accessed key | Good for recency-based access | Not scan-resistant (a full table scan pollutes the cache) |
| LFU | Evict the least frequently accessed key | Good for popularity-based access | Slow to adapt to changing patterns |
| Random | Evict a random key | O(1), no metadata overhead | May evict hot keys |
| TTL | Evict keys nearest to expiration | Good when TTLs are meaningful | Requires a TTL on every key |

Recommendation: LRU with sampling. Instead of tracking exact LRU order (expensive), sample a few random keys and evict the oldest. Redis does this with a configurable sample size (maxmemory-samples).
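
A minimal sketch of sampled LRU, not Redis's actual implementation: access times are tracked per key, and eviction picks the stalest key out of a small random sample.

```python
import random
import time

class SampledLRUCache:
    """Approximate LRU: sample a few keys and evict the least recently used one."""

    def __init__(self, max_entries: int, sample_size: int = 5):
        self.max_entries = max_entries
        self.sample_size = sample_size
        self._data = {}          # key -> value
        self._last_access = {}   # key -> last access timestamp

    def get(self, key):
        if key in self._data:
            self._last_access[key] = time.monotonic()
            return self._data[key]
        return None

    def set(self, key, value):
        if key not in self._data and len(self._data) >= self.max_entries:
            self._evict_one()
        self._data[key] = value
        self._last_access[key] = time.monotonic()

    def _evict_one(self):
        # Sample up to `sample_size` random keys and drop the stalest one.
        sample = random.sample(list(self._data), min(self.sample_size, len(self._data)))
        victim = min(sample, key=self._last_access.__getitem__)
        del self._data[victim], self._last_access[victim]
```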

Memory Fragmentation

Long-running caches suffer from fragmentation. The memory allocator has free space, but it is in chunks too small to use.

Solutions:

  • jemalloc: Memory allocator designed for long-running processes
  • Slab allocation: Pre-allocate fixed-size chunks (the Memcached approach; see the sketch after this list)
  • Defragmentation: Periodically move values to consolidate free space
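
To make the slab idea concrete, here is a toy allocator that rounds each value up to a fixed size class and reuses freed chunks of the same class; the size classes are illustrative, not Memcached's actual ones.

```python
class SlabAllocator:
    """Toy slab allocator: values go into the smallest size class that fits."""

    SIZE_CLASSES = [64, 128, 256, 512, 1024, 4096]  # illustrative chunk sizes (bytes)

    def __init__(self):
        # One free list of reusable chunks per size class.
        self._free = {size: [] for size in self.SIZE_CLASSES}

    def _class_for(self, nbytes: int) -> int:
        for size in self.SIZE_CLASSES:
            if nbytes <= size:
                return size
        raise ValueError("value too large for any slab class")

    def allocate(self, nbytes: int) -> bytearray:
        size = self._class_for(nbytes)
        # Reuse a freed chunk of the same class if one exists; otherwise allocate.
        return self._free[size].pop() if self._free[size] else bytearray(size)

    def free(self, chunk: bytearray) -> None:
        # Chunks keep their class size, so they slot back into the right free list.
        self._free[len(chunk)].append(chunk)
```

Because every chunk in a class has the same size, a freed chunk can always be reused by the next value of that class, which is what avoids external fragmentation.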

Step 5: Replication and Failover

A cache going down should not cause a meltdown.

Replication Strategies

| Strategy | Latency | Durability | Use Case |
| --- | --- | --- | --- |
| No replication | Lowest | None | Ephemeral cache |
| Async replication | Low | May lose recent writes | Most caches |
| Sync replication | Higher | Strong | Cache used as the primary data store |

Recommendation: Async replication with at least one replica. Accept that a few seconds of writes might be lost on failover. That is acceptable for a cache.
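
A minimal sketch of the asynchronous pattern: the primary acknowledges a write as soon as it lands locally, and a background thread ships it to the replica later. The send_to_replica callable is a stand-in for the real replication transport.

```python
import queue
import threading

class AsyncReplicatingCache:
    """Primary acknowledges writes immediately; a background thread replicates them."""

    def __init__(self, send_to_replica):
        self._data = {}
        self._replication_log = queue.Queue()
        self._send = send_to_replica  # callable that pushes (key, value) to the replica
        threading.Thread(target=self._replicate_forever, daemon=True).start()

    def set(self, key, value):
        self._data[key] = value                   # acknowledge locally right away
        self._replication_log.put((key, value))   # replicate in the background

    def get(self, key):
        return self._data.get(key)

    def _replicate_forever(self):
        while True:
            key, value = self._replication_log.get()
            self._send(key, value)  # writes still queued here are lost on failover
```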

Failover Process


Failover should be automatic:

  1. Health checks detect primary failure
  2. Coordinator promotes replica
  3. Clients get updated routing
  4. Traffic shifts to new primary

Step 6: Persistence Options

A pure in-memory cache loses everything on restart. Sometimes persistence is needed.

Approaches

| Approach | How It Works | Recovery Time | Performance Impact |
| --- | --- | --- | --- |
| None | Memory only | Cold start | None |
| Snapshots (RDB) | Periodic full dump | Minutes | Fork + write |
| Append-Only File (AOF) | Log every write | Seconds | fsync overhead |
| Hybrid | Snapshot + AOF since the last snapshot | Seconds | Best of both |

Redis supports this hybrid mode: when AOF is enabled, rewrites store an RDB-format snapshot followed by recent changes as append-only entries.

Snapshotting Trick: Copy-on-Write

How do you snapshot without stopping writes? Fork the process.


The child process sees a consistent snapshot via copy-on-write while the parent continues serving requests; only pages modified after the fork get physically copied.
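
A minimal fork-based snapshot sketch for a POSIX system; the file path and JSON encoding are illustrative, not how Redis writes RDB files.

```python
import json
import os

data = {"user:1": "alice", "user:2": "bob"}  # the in-memory cache contents

def snapshot(path: str) -> None:
    pid = os.fork()
    if pid == 0:
        # Child: sees the state as of fork() via copy-on-write pages.
        with open(path, "w") as f:
            json.dump(data, f)
        os._exit(0)
    # Parent: returns immediately and keeps serving requests.

snapshot("/tmp/cache-snapshot.json")
data["user:3"] = "carol"   # happens after the fork, so it is not in the snapshot
os.wait()                  # reap the child once it finishes writing
```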

Step 7: Hot Key Problem

What happens when one key gets much more traffic than others? That node becomes a bottleneck.

Detection

Track request rates per key. Keys receiving roughly 100x the average per-key rate are considered hot.
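
One way to sketch the detection: count hits per key over a short window and flag keys whose count is far above the per-key average. The window length is an assumption; the 100x threshold comes from the text above.

```python
import time
from collections import Counter

class HotKeyDetector:
    """Counts per-key hits over a window and flags keys far above the average rate."""

    def __init__(self, window_seconds: float = 10.0, hot_factor: float = 100.0):
        self.window_seconds = window_seconds
        self.hot_factor = hot_factor
        self._counts = Counter()
        self._window_start = time.monotonic()

    def record(self, key: str) -> None:
        if time.monotonic() - self._window_start > self.window_seconds:
            self._counts.clear()                  # start a fresh window
            self._window_start = time.monotonic()
        self._counts[key] += 1

    def hot_keys(self):
        if not self._counts:
            return []
        average = sum(self._counts.values()) / len(self._counts)
        return [k for k, c in self._counts.items() if c >= self.hot_factor * average]
```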

Solutions

| Solution | How It Works | Trade-offs |
| --- | --- | --- |
| Local caching | Client caches hot keys locally | Staleness risk |
| Key replication | Store hot keys on multiple nodes | Consistency complexity |
| Key splitting | Split popular_key into popular_key_1, popular_key_2, ... | Application changes |

Practical approach: Client-side local cache with short TTL (100ms). Most hot key problems are read-heavy, and slight staleness is acceptable.
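
A minimal client-side sketch of that idea: values are served locally for up to 100ms before falling back to the cluster. The fetch_from_cluster callable stands in for the real client library call.

```python
import time

class LocalHotKeyCache:
    """Tiny client-side cache: serve keys locally for up to `ttl` seconds."""

    def __init__(self, fetch_from_cluster, ttl: float = 0.1):
        self._fetch = fetch_from_cluster  # callable that hits the distributed cache
        self._ttl = ttl
        self._entries = {}                # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]               # fresh enough: skip the network hop
        value = self._fetch(key)
        self._entries[key] = (value, time.monotonic() + self._ttl)
        return value
```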

Step 8: Client-Side Concerns

Connection Pooling

Do not open a new connection per request; maintain a pool. By Little's law, the number of in-flight requests is roughly the request rate multiplied by the average latency; double that for headroom. At 10,000 requests per second with 1ms average latency, that is about 10 concurrent requests, so a pool of 20-40 connections.
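
Writing that calculation out, with the numbers from the example above:

```python
# Pool sizing via Little's law: concurrency ≈ arrival rate × average latency.
requests_per_second = 10_000
avg_latency_seconds = 0.001                               # 1 ms

in_flight = requests_per_second * avg_latency_seconds     # ≈ 10 concurrent requests
pool_size = int(in_flight * 2)                            # double for headroom -> 20
print(pool_size)
```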

Request Pipelining


Pipelining batches requests to amortize network round-trips. Can improve throughput 5-10x.
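
As a concrete sketch, here is pipelining with the redis-py client against a server assumed to be running locally on the default port; all the SET commands are buffered and flushed in a single round-trip when execute() is called.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Without pipelining: one network round-trip per command.
# With pipelining: commands are buffered client-side and sent together.
pipe = r.pipeline(transaction=False)
for i in range(1000):
    pipe.set(f"key:{i}", i)
results = pipe.execute()  # one round-trip, one list of replies
```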

Real-World Systems

| System | Notable Design Choice |
| --- | --- |
| Redis | Single-threaded event loop, hash slots (virtual nodes), async replication |
| Memcached | Multi-threaded, slab allocator, no persistence |
| AWS ElastiCache | Managed Redis/Memcached, automatic failover |
| Twemproxy | Proxy layer for sharding Memcached/Redis |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
| --- | --- | --- |
| Partitioning | Modulo, consistent hashing, virtual nodes | Virtual nodes for even distribution |
| Eviction | LRU, LFU, random, TTL | Sampled LRU |
| Replication | None, async, sync | Async with 1+ replica |
| Persistence | None, snapshots, AOF | Hybrid (snapshots + AOF) |
| Hot keys | Local cache, replication, splitting | Client-side local cache |