Distributed Systems
A distributed system runs on multiple machines that communicate over a network. This section covers the challenges of distributed computing, the CAP theorem, consistency models, consensus algorithms, and distributed data patterns.
Reasons for Distribution
| Reason | Description |
|---|---|
| Scale | Multiple machines provide more capacity than a single machine |
| Reliability | Multiple machines provide redundancy; one failure does not cause total failure |
| Latency | Servers in multiple regions reduce network latency for geographically distributed users |
| Cost | Many commodity machines can be more cost-effective than one high-end machine |
Challenges
Network Unreliability
Networks experience message loss, duplication, reordering, and variable latency. Network partitions can isolate portions of the system.
Clock Unreliability
Machine clocks drift and are not perfectly synchronized. NTP provides approximate synchronization but not perfect agreement. Determining the order of events across machines based on timestamps is unreliable.
Partial Failures
In distributed systems, some components can fail while others continue operating. Distinguishing between a slow component and a failed component is difficult.
CAP Theorem
The CAP theorem states that a distributed system cannot guarantee all three of the following properties simultaneously:
| Property | Description |
|---|---|
| Consistency | Every read returns the most recent write |
| Availability | Every request to a non-failing node receives a non-error response |
| Partition Tolerance | The system continues operating during network partitions |
Network partitions are unavoidable in distributed systems. During a partition, the system must choose between:
- CP (Consistency over Availability): Reject requests rather than return potentially stale data
- AP (Availability over Consistency): Return available data (potentially stale) rather than reject requests
PACELC Extension
PACELC extends CAP to address normal operation (no partition):
- If Partition: Choose Availability or Consistency
- Else (normal operation): Choose Latency or Consistency
Strong consistency requires waiting for replicas to confirm, increasing latency. Lower latency requires accepting potentially stale reads.
Consistency Models
Strong Consistency
Every read returns the most recent write, as if the system were a single machine.
| Aspect | Description |
|---|---|
| Implementation | Single leader handles all writes, synchronously replicates before confirming |
| Trade-off | Slower writes, reduced availability during failures |
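A minimal sketch of the single-leader approach, using in-memory dictionaries as stand-ins for replicas (class and method names are illustrative, not a real database API): the leader confirms a write only after every replica has acknowledged it.

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledge the write


class SyncLeader:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        # Confirm only after every replica acknowledges: reads from any
        # replica now return the latest value, at the cost of write latency
        # and of availability when a replica is unreachable.
        acks = [replica.apply(key, value) for replica in self.replicas]
        return all(acks)


replicas = [Replica(), Replica()]
leader = SyncLeader(replicas)
leader.write("balance", 100)
assert all(r.data["balance"] == 100 for r in replicas)
```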
Eventual Consistency
If writes stop, all replicas eventually converge to the same state. Immediately after a write, different replicas may return different values.
| Aspect | Description |
|---|---|
| Implementation | Asynchronous replication; writes confirm before replicas synchronize |
| Trade-off | Stale reads are possible |
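For contrast, a sketch of asynchronous replication (again with illustrative in-memory classes): the leader confirms immediately and replicas catch up later, so a read from a lagging replica can be stale until replication runs.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}


class AsyncLeader:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # writes not yet pushed to replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        return True  # confirmed before any replica has the write

    def replicate_once(self):
        # Background work: push the oldest pending write to all replicas.
        if self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Replica()
leader = AsyncLeader([replica])
leader.write("balance", 100)
print(replica.data.get("balance"))  # None -- stale read before replication
leader.replicate_once()
print(replica.data.get("balance"))  # 100 -- replicas converge once writes stop
```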
Read-Your-Writes
A client always sees its own writes, though it may not see other clients' recent writes.
| Aspect | Description |
|---|---|
| Implementation | Route reads to the server that handled the client's write, or track versions |
| Trade-off | Additional routing or tracking overhead |
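A sketch of the version-tracking approach (hypothetical names): the leader returns a version token with each write, the client presents the token on reads, and a replica serves the read only once it has caught up to that version.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.version = 0

    def read(self, key, min_version):
        # Serve the read only if this replica has applied the client's own write.
        if self.version < min_version:
            raise RuntimeError("replica lagging behind client's write; retry or reroute")
        return self.data.get(key)


class Leader:
    def __init__(self, replicas):
        self.data = {}
        self.version = 0
        self.replicas = replicas

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version  # version token handed back to the client

    def replicate_up_to(self, version):
        # Stand-in for asynchronous catch-up of the replicas.
        for replica in self.replicas:
            replica.data = dict(self.data)
            replica.version = version


replica = Replica()
leader = Leader([replica])
token = leader.write("profile", "updated")
try:
    replica.read("profile", token)        # fails: replica has not caught up
except RuntimeError:
    pass
leader.replicate_up_to(token)
print(replica.read("profile", token))     # "updated"
```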
Causal Consistency
If event A caused event B, all replicas see A before B. Causally unrelated events can appear in any order.
| Aspect | Description |
|---|---|
| Implementation | Vector clocks to track causal dependencies |
| Trade-off | Complexity of tracking causality |
Consensus
Consensus algorithms enable multiple machines to agree on a value despite message loss and machine failures.
Raft Algorithm
Raft is a consensus algorithm designed for understandability.
| Phase | Description |
|---|---|
| Leader Election | Nodes vote to elect one leader |
| Log Replication | Leader accepts writes and replicates to followers |
| Safety | At most one leader can be elected per term; committed log entries are never lost |
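A sketch of just the vote-granting rule from leader election (not a full Raft implementation; class and method names are illustrative): a node grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Follower:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0   # term of this node's last log entry
    last_log_index: int = 0  # index of this node's last log entry

    def handle_request_vote(self, term, candidate_id, last_log_term, last_log_index):
        # A newer term cancels any vote cast in an older term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        if term < self.current_term:
            return False  # reject stale candidates
        # Election restriction: candidate's log must be at least as up to date.
        log_ok = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id  # at most one vote per term
            return True
        return False


f = Follower(current_term=3, last_log_term=3, last_log_index=7)
print(f.handle_request_vote(4, "node-b", 3, 7))  # True: newer term, log up to date
print(f.handle_request_vote(4, "node-c", 3, 9))  # False: vote already cast for term 4
```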
Paxos
Paxos is an older consensus algorithm with the same goals as Raft but greater complexity.
Consensus Applications
| Application | Description |
|---|---|
| Leader election | Selecting which node coordinates operations |
| Distributed locks | ZooKeeper, etcd |
| Configuration management | Agreement on current system configuration |
| Replicated databases | Determining the order of writes |
Distributed Transactions
Distributed transactions coordinate updates across multiple services atomically.
Two-Phase Commit (2PC)
| Phase | Description |
|---|---|
| Prepare | Coordinator asks all participants if they are ready to commit |
| Commit | If all participants agree, coordinator instructs them to commit; otherwise, abort |
Limitation: If the coordinator fails after the prepare phase, participants that voted to commit remain blocked (often holding locks) until the coordinator recovers.
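A minimal 2PC coordinator over in-memory participants (illustrative names; a real implementation also needs durable logging and timeouts, and still blocks if the coordinator dies after prepare).

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if the local transaction can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): every participant must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): instruct everyone to commit.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote (or a timeout, omitted here) aborts the whole transaction.
    for p in participants:
        p.abort()
    return "aborted"


print(two_phase_commit([Participant("orders"), Participant("payments")]))         # committed
print(two_phase_commit([Participant("orders"), Participant("payments", False)]))  # aborted
```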
Saga Pattern
Sagas break a transaction into a sequence of local transactions. Each step has a compensating action that undoes it if a later step fails.
| Style | Description |
|---|---|
| Choreography | Services coordinate via events (decentralized) |
| Orchestration | Central service directs the transaction flow |
Sagas avoid 2PC's blocking and provide better availability, but failure handling is more complex: intermediate states are visible to other transactions, and every step needs a correct compensating action.
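A sketch of the orchestration style, with hypothetical step names: the orchestrator runs local steps in order and, on failure, executes the compensating actions of the completed steps in reverse.

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Undo the already-completed steps in reverse order.
            for undo in reversed(completed):
                undo()
            return "rolled back"
    return "completed"


def reserve_inventory(): print("inventory reserved")
def release_inventory(): print("inventory released")
def charge_payment(): raise RuntimeError("card declined")
def refund_payment(): print("payment refunded")


print(run_saga([
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
]))
# inventory reserved
# inventory released
# rolled back
```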
Time and Ordering
Logical Clocks
Logical clocks order events without relying on physical time.
Lamport Timestamps: Each node keeps a counter that increments on every local event. On sending a message, the sender attaches its current counter; on receiving, the node sets its counter to max(local, received) + 1.
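A sketch of a Lamport clock following that rule (node and message structure are illustrative):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event: increment, then attach the timestamp to the message.
        self.time += 1
        return self.time

    def receive(self, message_time):
        # Jump past both our own clock and the sender's, then count the receive.
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
t_send = a.send()           # a.time == 1
t_recv = b.receive(t_send)  # b.time == 2, so the receive orders after the send
```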
Vector Clocks: Each node tracks its own counter plus its knowledge of other nodes' counters. This enables detecting concurrent events (neither caused the other).
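A sketch of vector clocks with the comparison that detects concurrency (node ids and helpers are illustrative): event A happened before B only if A's clock is less than or equal to B's in every entry; if neither dominates, the events are concurrent.

```python
class VectorClock:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # Local event: advance only this node's entry.
        self.clock[self.node_id] += 1
        return dict(self.clock)

    def merge(self, received_clock):
        # On receive: element-wise max with the sender's clock, then tick.
        for node, count in received_clock.items():
            self.clock[node] = max(self.clock.get(node, 0), count)
        return self.tick()


def happened_before(a, b):
    return all(a[n] <= b[n] for n in a) and a != b


def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)


nodes = ["A", "B"]
x = VectorClock("A", nodes).tick()  # {"A": 1, "B": 0}
y = VectorClock("B", nodes).tick()  # {"A": 0, "B": 1}
print(concurrent(x, y))             # True: neither event caused the other
```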
Physical Clocks
Physical clocks are necessary for some applications (audit logs, time-based expiration).
| Challenge | Description |
|---|---|
| Clock drift | Server clocks diverge from true time |
| NTP limitations | NTP provides approximate, not perfect, synchronization |
| Leap seconds | Periodic adjustments cause edge cases |
TrueTime (Google Spanner): GPS and atomic clock references provide each server with time as an interval of bounded uncertainty (the true time lies between a lower and an upper bound).
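A small sketch of interval-based ordering in that style (illustrative, not the Spanner API): two events can be definitively ordered only when their uncertainty intervals do not overlap.

```python
from dataclasses import dataclass


@dataclass
class TimeInterval:
    earliest: float  # lower bound on true time, in seconds
    latest: float    # upper bound on true time, in seconds


def definitely_before(a: TimeInterval, b: TimeInterval) -> bool:
    # Safe to order a before b only if a's latest possible time precedes
    # b's earliest possible time, i.e. the intervals do not overlap.
    return a.latest < b.earliest


t1 = TimeInterval(10.000, 10.007)  # 7 ms of uncertainty
t2 = TimeInterval(10.005, 10.012)
print(definitely_before(t1, t2))                            # False: overlap, order unknown
print(definitely_before(t1, TimeInterval(10.010, 10.017)))  # True
```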
Distributed Data Patterns
Replication
| Pattern | Description | Trade-off |
|---|---|---|
| Single Leader | One server handles writes; replicas handle reads | Simple but write bottleneck |
| Multi-Leader | Multiple servers accept writes | Requires conflict resolution |
| Leaderless | Any node handles any request; uses quorums | W + R > N makes read and write quorums overlap (see the sketch below) |
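A sketch of the leaderless quorum condition from the table above (versioning here is a plain counter; real systems use vector clocks or last-write-wins): because W + R > N, any read quorum overlaps any write quorum, so at least one read reply carries the latest acknowledged write.

```python
class Node:
    def __init__(self):
        self.data = {}  # key -> (version, value)


def quorum_write(write_set, key, version, value):
    # The write is confirmed once W replicas have accepted it.
    for node in write_set:
        node.data[key] = (version, value)


def quorum_read(read_set, key):
    # Collect R replies and keep the value with the highest version.
    replies = [node.data.get(key, (0, None)) for node in read_set]
    return max(replies)[1]


N, W, R = 3, 2, 2                      # W + R > N
nodes = [Node() for _ in range(N)]
quorum_write(nodes[:W], "k", 1, "v1")  # write quorum: nodes 0 and 1
print(quorum_read(nodes[-R:], "k"))    # read quorum: nodes 1 and 2 -> "v1"
```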
Partitioning (Sharding)
| Strategy | Description | Trade-off |
|---|---|---|
| Range partitioning | Partition by value ranges | Hot spots if distribution uneven |
| Hash partitioning | Hash the key, mod by partition count | Even distribution, but range queries are inefficient and changing the partition count reshuffles most keys |
| Consistent hashing | Keys and nodes on a ring | Adding/removing nodes affects only neighbors |
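A minimal consistent-hashing ring using Python's bisect module (a sketch only; production implementations add virtual nodes for balance): keys and nodes are hashed onto the same ring, each key maps to the first node clockwise, and adding a node remaps only the keys between it and its predecessor.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def remove(self, node):
        self._ring.remove((self._hash(node), node))

    def lookup(self, key):
        # First node at or clockwise after the key's position, wrapping around.
        i = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[i % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in ("user:1", "user:2", "user:3", "user:4")}
ring.add("node-d")
moved = [k for k in before if ring.lookup(k) != before[k]]
print(moved)  # only keys falling between node-d and its predecessor move
```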
Design Considerations
| Topic | Considerations |
|---|---|
| Trade-offs | Consistency vs availability vs latency; articulate which to prioritize and why |
| Use case matching | Strong consistency for financial systems; eventual consistency acceptable for social feeds |
| CAP in practice | During a network partition, does the system reject requests or return potentially stale data? |
| Failure modes | What happens when specific components fail? When the network partitions? |
| Operational complexity | Distributed systems are operationally complex; acknowledge this in designs |