Availability Patterns
High availability keeps a system operational despite component failures; it is usually quantified as a percentage of uptime ("the nines").
Measuring Availability
The Nines
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three 9s) | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.99% (four 9s) | 52.60 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five 9s) | 5.26 minutes | 26.30 seconds | 6.05 seconds |
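Every cell above is just downtime = (1 - availability) x period. A quick way to verify the figures (plain Python; the 365.25-day average year is an assumption that happens to match the numbers in the table):

```python
HOURS_PER_YEAR = 365.25 * 24        # average year length, consistent with the table above
HOURS_PER_MONTH = HOURS_PER_YEAR / 12
HOURS_PER_WEEK = 7 * 24

def downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime in minutes over a period at a given availability."""
    return (1 - availability_pct / 100) * period_hours * 60

# Reproduce the "four 9s" row
for label, hours in [("year", HOURS_PER_YEAR),
                     ("month", HOURS_PER_MONTH),
                     ("week", HOURS_PER_WEEK)]:
    print(f"99.99% per {label}: {downtime_minutes(99.99, hours):.2f} min")
# -> 52.60, 4.38, 1.01
```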
Calculating Availability
Serial components (both must work):
Availability = A1 x A2
Example: Web server (99.9%) -> Database (99.9%)
Total = 0.999 x 0.999 = 0.998001 ≈ 99.8%
Parallel components (either works):
Availability = 1 - (1 - A1) x (1 - A2)
Example: Two web servers, each 99.9%
Total = 1 - (0.001 x 0.001) = 0.999999 = 99.9999%
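Both formulas are easy to wrap in small helpers for quick what-if calculations (an illustrative sketch; the function names are mine, not from any library):

```python
from functools import reduce

def serial(*availabilities: float) -> float:
    """All components must work: multiply availabilities."""
    return reduce(lambda acc, a: acc * a, availabilities, 1.0)

def parallel(*availabilities: float) -> float:
    """Any one component suffices: one minus the chance that all fail."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), availabilities, 1.0)

print(serial(0.999, 0.999))    # 0.998001  -> ~99.8%
print(parallel(0.999, 0.999))  # 0.999999  -> 99.9999%
```

Chaining components in series always lowers overall availability, while adding redundant components in parallel raises it.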
Redundancy Patterns
Active-Passive (Failover)
One active server, standby takes over on failure.
Operation:
- Primary handles all traffic
- Standby monitors primary via heartbeat
- On failure detection, standby becomes active
- DNS or VIP switches to new active
| Advantages | Disadvantages |
|---|---|
| Simple to implement | Standby resources idle |
| Clear failover path | Failover time (seconds to minutes) |
| Works for stateful services | Data sync complexity |
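A stripped-down version of that failover loop might look like the sketch below; `probe_primary` and `promote_standby` are placeholders for a real health check and for switching the VIP or DNS record, not any particular HA tool's API.

```python
import time

class ActivePassiveMonitor:
    """Toy failover loop for an active-passive pair (illustrative only)."""

    def __init__(self, probe_primary, promote_standby,
                 interval=5.0, failure_threshold=3):
        self.probe_primary = probe_primary      # heartbeat / health probe
        self.promote_standby = promote_standby  # switch VIP/DNS to standby
        self.interval = interval
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def run(self):
        while True:
            if self.probe_primary():                 # heartbeat succeeded
                self.consecutive_failures = 0
            else:
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    self.promote_standby()           # standby becomes active
                    return
            time.sleep(self.interval)
```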
Active-Active
Multiple servers handle traffic simultaneously.
Operation:
- All servers handle traffic
- Load balancer distributes requests
- If one fails, others absorb traffic
- No explicit failover required
| Advantages | Disadvantages |
|---|---|
| Full resource utilization | More complex (state sync) |
| No failover delay | Requires stateless design |
| Better capacity | Load balancer is potential SPOF |
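A minimal picture of "no explicit failover" is a load balancer that simply skips unhealthy nodes on the next request (illustrative sketch; the node names are made up):

```python
import itertools

class ActiveActivePool:
    """Round-robin over healthy nodes; a failed node is simply skipped,
    so there is no distinct failover step."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        self.healthy.add(node)

    def pick(self):
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")

pool = ActiveActivePool(["web-1", "web-2", "web-3"])
pool.mark_down("web-2")                  # health check failed
print([pool.pick() for _ in range(4)])   # web-2 is never chosen
```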
Comparison
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Resource usage | 50% (standby idle) | 100% |
| Failover time | Seconds-minutes | Instant |
| Complexity | Lower | Higher |
| Scaling | Replace standby | Add nodes |
| Use cases | Databases, stateful | Web servers, stateless |
Replication Patterns
Single Leader
All writes go through a single leader node, which replicates changes to followers; reads can be served by the leader or by (possibly stale) followers.
| Advantages | Disadvantages |
|---|---|
| Simple consistency | Leader is SPOF |
| Easy to understand | Write bottleneck |
| Conflict-free | Follower lag |
Multi-Leader
Several nodes accept writes (often one leader per datacenter or region) and replicate changes to each other asynchronously.
| Advantages | Disadvantages |
|---|---|
| Write scalability | Conflict resolution |
| Geographic distribution | Complexity |
| Tolerates leader failure | Eventual consistency |
Leaderless
Any replica can accept reads and writes (Dynamo-style). Consistency relies on quorums: with N replicas, a write must be acknowledged by W nodes and a read must query R nodes; if W + R > N, every read overlaps at least one node that saw the latest write (see the sketch after the table).
| Advantages | Disadvantages |
|---|---|
| No SPOF | Quorum overhead |
| Highest availability | Conflict handling |
| Write anywhere | Read repair needed |
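With N = 3 replicas, the common choice W = 2, R = 2 satisfies W + R > N, while W = 1, R = 1 does not; a trivial check (sketch only, not any store's API):

```python
def quorum_ok(n: int, w: int, r: int) -> bool:
    """Strict quorum condition: every read set overlaps every write set."""
    return w + r > n

print(quorum_ok(n=3, w=2, r=2))   # True  -> reads see the latest acknowledged write
print(quorum_ok(n=3, w=1, r=1))   # False -> a read may hit only stale replicas
```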
Failure Detection
Heartbeat
A monitor (or a peer) probes each node on a fixed schedule and marks it unhealthy only after several consecutive misses; a minimal sketch follows the list.
Parameters:
- Interval: how often to check (typically 5-30 seconds)
- Timeout: how long to wait for a response before counting the check as failed (2-10 seconds)
- Threshold: consecutive failed checks before marking the node unhealthy (2-5)
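A minimal checker built around those three parameters might look like this (sketch only; `probe` is a placeholder for an actual ping or HTTP health check):

```python
import time

class HeartbeatChecker:
    """Marks a node unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, probe, interval=10.0, timeout=5.0, threshold=3):
        self.probe = probe          # callable: True if node answers within timeout
        self.interval = interval
        self.timeout = timeout
        self.threshold = threshold
        self.missed = 0
        self.healthy = True

    def check_once(self):
        if self.probe(timeout=self.timeout):
            self.missed = 0
        else:
            self.missed += 1
        self.healthy = self.missed < self.threshold
        return self.healthy

    def run(self):
        while True:
            self.check_once()
            time.sleep(self.interval)
```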
Gossip Protocol
Decentralized failure detection.
Operation:
- Each node maintains membership list
- Periodically shares list with random nodes
- Failure detected when heartbeat counter stops increasing
- Multiple nodes must agree before marking failed
| Advantages | Disadvantages |
|---|---|
| Decentralized | Detection is eventual, not immediate |
| Scalable | Bandwidth overhead |
| Fault-tolerant | |
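The core bookkeeping is small: each node increments its own counter and merges the highest counters it hears about, and a peer whose counter stops rising is eventually suspected. A sketch with a gossip fanout of one (suspicion timeouts omitted):

```python
import random

class GossipNode:
    """Minimal gossip membership sketch with per-node heartbeat counters."""

    def __init__(self, name):
        self.name = name
        self.peers = []             # other GossipNode instances
        self.table = {name: 0}      # node name -> highest heartbeat counter seen

    def tick(self):
        self.table[self.name] += 1              # advertise our own liveness
        if self.peers:
            peer = random.choice(self.peers)    # share with one random peer
            peer.merge(self.table)

    def merge(self, remote_table):
        for node, counter in remote_table.items():
            if counter > self.table.get(node, -1):
                self.table[node] = counter

a, b = GossipNode("a"), GossipNode("b")
a.peers, b.peers = [b], [a]
a.tick(); b.tick()      # after a few rounds both membership tables converge
```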
Failover Strategies
Cold Standby
Standby is off, started on failure.
| Attribute | Cold Standby |
|---|---|
| Recovery time | Minutes to hours |
| Cost | Lowest |
| Data loss | Possible (back to last backup) |
| Use case | Non-critical systems |
Warm Standby
Standby runs but not serving traffic.
| Attribute | Warm Standby |
|---|---|
| Recovery time | Seconds to minutes |
| Cost | Medium |
| Data loss | Minimal (replication lag) |
| Use case | Important systems |
Hot Standby
Standby fully synchronized, instant takeover.
| Attribute | Hot Standby |
|---|---|
| Recovery time | Seconds |
| Cost | Highest |
| Data loss | None |
| Use case | Critical systems |
Comparison
| Aspect | Cold | Warm | Hot |
|---|---|---|---|
| Recovery time | Minutes to hours | Seconds to minutes | Seconds |
| Cost | Lowest | Medium | Highest |
| Data loss | Possible (last backup) | Minimal (replication lag) | None |
| Use case | Non-critical systems | Important systems | Critical systems |
Disaster Recovery
RPO and RTO
| Metric | Definition | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Max acceptable data loss | "1 hour of data loss acceptable" |
| RTO (Recovery Time Objective) | Max acceptable downtime | "Must recover in 4 hours" |
Multi-Region Architecture
Run the full stack in two or more regions and replicate data between them; if an entire region fails, traffic shifts to a surviving region via DNS or a global load balancer.
DR Strategies
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | Low |
| Pilot Light | 10-30 min | Minutes | Medium |
| Warm Standby | Minutes | Seconds | High |
| Multi-Site Active | Near zero | Near zero | Highest |
Circuit Breaker Pattern
Prevents cascading failures by stopping requests to failing services.
States:
- Closed: Normal operation, tracking failures
- Open: Requests fail immediately (no calls to service)
- Half-Open: Allow a limited number of trial requests to check whether the service has recovered
Parameters:
| Parameter | Purpose | Typical Value |
|---|---|---|
| Failure threshold | Failures to open | 5-10 failures |
| Timeout | Time in open state | 30-60 seconds |
| Success threshold | Successes to close | 2-5 successes |
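Putting the three states and three parameters together gives roughly the following (a sketch, not production code; wrap it around any callable that talks to the downstream service):

```python
import time

class CircuitBreaker:
    """Three-state breaker; parameter names mirror the table above."""

    def __init__(self, failure_threshold=5, timeout=30.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.success_threshold = success_threshold
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"        # timeout elapsed: let trial requests through
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half-open":
            self.state, self.opened_at = "open", time.monotonic()
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state, self.opened_at = "open", time.monotonic()

    def _on_success(self):
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "closed", 0
        else:
            self.failures = 0
```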
Bulkhead Pattern
Isolates components to prevent failure spread.
Implementation:
- Separate thread pools per service
- Separate connection pools
- Resource limits per tenant
- Microservice isolation
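The thread-pool variant can be as simple as giving each downstream dependency its own bounded executor, so one slow dependency cannot starve the others (sketch; the dependency names and pool sizes are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency: if "payments" hangs and
# exhausts its 4 threads, "search" still has its own 8 threads.
pools = {
    "payments": ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments"),
    "search": ThreadPoolExecutor(max_workers=8, thread_name_prefix="search"),
}

def call_dependency(name, fn, *args, **kwargs):
    """Run a dependency call inside that dependency's own bulkhead."""
    return pools[name].submit(fn, *args, **kwargs)   # returns a Future
```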
Graceful Degradation
Maintains partial functionality during failures.
Examples:
| Component Failed | Graceful Degradation |
|---|---|
| Recommendation engine | Show popular items |
| Search service | Show category browse |
| Payment processor | Queue order for retry |
| Image service | Show placeholder |
| Analytics | Skip tracking |
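In code, graceful degradation is usually just a guarded call with a cheaper fallback, as in the first row of the table (sketch; `recommender` and `popular_items` are hypothetical):

```python
def get_recommendations(user_id, recommender, popular_items):
    """Return personalized recommendations, falling back to popular items
    if the recommendation engine fails or times out."""
    try:
        return recommender.recommend(user_id, timeout=0.2)
    except Exception:
        return popular_items          # degraded, but the page still renders
```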
SLA/SLO/SLI
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | The metric actually measured | p99 request latency |
| SLO (Service Level Objective) | Internal target for the SLI | 99.9% of requests < 200 ms |
| SLA (Service Level Agreement) | External commitment, typically with penalties | 99.9% uptime or service credits |
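An SLO also implies an error budget: how much "bad" time (or how many bad requests) can be spent per period before the objective is breached. For an availability SLO (sketch):

```python
def error_budget_minutes(slo_pct: float, period_days: float = 30) -> float:
    """Downtime minutes allowed per period before an availability SLO is breached."""
    return (1 - slo_pct / 100) * period_days * 24 * 60

print(error_budget_minutes(99.9))   # ~43.2 minutes per 30 days
```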