Availability Patterns

High availability ensures systems remain operational despite failures.

Measuring Availability

The Nines

| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three 9s) | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.99% (four 9s) | 52.60 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five 9s) | 5.26 minutes | 26.30 seconds | 6.05 seconds |
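
The figures in the table follow from multiplying the unavailable fraction by the length of the period. A minimal sketch, assuming an average year of 365.25 days (which matches the table's rounding); the function name `downtime_seconds` is illustrative:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap years
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(availability_pct: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed in a period for an availability percentage."""
    return (1 - availability_pct / 100) * period_seconds


for pct in (99, 99.9, 99.99, 99.999):
    per_year_hours = downtime_seconds(pct) / 3600
    per_week_minutes = downtime_seconds(pct, SECONDS_PER_WEEK) / 60
    print(f"{pct}%: {per_year_hours:.2f} h/year, {per_week_minutes:.2f} min/week")
```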

Calculating Availability

Serial components (both must work):

Availability = A1 x A2

Example: Web server (99.9%) -> Database (99.9%)
Total = 0.999 x 0.999 = 0.998001 ≈ 99.8%

Parallel components (either works):

Availability = 1 - (1 - A1) x (1 - A2)

Example: Two web servers, each 99.9%
Total = 1 - (0.001 x 0.001) = 99.9999%
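
Both formulas generalize to any number of components and can be nested to model a whole stack. The sketch below assumes component failures are independent, which the formulas above also assume; the helper names `serial` and `parallel` are illustrative.

```python
import math


def serial(*availabilities: float) -> float:
    """All components must work: multiply the individual availabilities."""
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: 1 minus the chance that all of them fail."""
    return 1 - math.prod(1 - a for a in availabilities)


print(f"{serial(0.999, 0.999):.6f}")    # web server -> database
print(f"{parallel(0.999, 0.999):.6f}")  # two redundant web servers
# Redundant web tier in front of a single database:
print(f"{serial(parallel(0.999, 0.999), 0.999):.6f}")
```

Note that in the last case the single database dominates the result: redundancy in one tier does not compensate for a serial dependency elsewhere.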

Redundancy Patterns

Active-Passive (Failover)

One active server, standby takes over on failure.

Operation:

  1. Primary handles all traffic
  2. Standby monitors primary via heartbeat
  3. On failure detection, standby becomes active
  4. DNS or VIP switches to new active

| Advantages | Disadvantages |
|---|---|
| Simple to implement | Standby resources idle |
| Clear failover path | Failover time (seconds to minutes) |
| Works for stateful services | Data sync complexity |
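
The four operation steps can be sketched as a small state holder. This is a simplified illustration, not a production failover manager: the class `FailoverPair`, its method names, and the explicit `now` timestamps are assumptions made for the example, and the actual DNS or VIP switch is only marked by a comment.

```python
from typing import Optional


class FailoverPair:
    """Tracks which node is active; the standby takes over on missed heartbeats."""

    def __init__(self, primary: str, standby: str, timeout: float):
        self.active = primary
        self.standby: Optional[str] = standby
        self.timeout = timeout        # seconds without a heartbeat before failover
        self.last_heartbeat = 0.0

    def heartbeat(self, now: float) -> None:
        """Step 2: record a heartbeat received from the active node."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Steps 3-4: promote the standby if the active node has gone silent."""
        silent_for = now - self.last_heartbeat
        if self.standby is not None and silent_for > self.timeout:
            # A real system would move the VIP or update DNS here.
            self.active, self.standby = self.standby, None
            self.last_heartbeat = now
        return self.active


pair = FailoverPair("node-a", "node-b", timeout=3.0)
pair.heartbeat(now=1.0)
print(pair.check(now=2.0))   # node-a: last heartbeat is within the timeout
print(pair.check(now=10.0))  # node-b: 9 s of silence triggers the failover
```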

Active-Active

Multiple servers handle traffic simultaneously.

Operation:

  1. All servers handle traffic
  2. Load balancer distributes requests
  3. If one fails, others absorb traffic
  4. No explicit failover required

| Advantages | Disadvantages |
|---|---|
| Full resource utilization | More complex (state sync) |
| No failover delay | Requires stateless design |
| Better capacity | Load balancer is potential SPOF |
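
A minimal sketch of how a load balancer can absorb a node failure without an explicit failover step: requests rotate over whichever servers are currently marked healthy. The class `RoundRobinBalancer` and its methods are hypothetical names for illustration; how health is determined is left out.

```python
from itertools import count


class RoundRobinBalancer:
    """Distributes requests round-robin across the servers marked healthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._counter = count()

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, skipping any that are down."""
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
print([lb.pick() for _ in range(3)])  # all three servers take traffic
lb.mark_down("web-2")
print([lb.pick() for _ in range(4)])  # only web-1 and web-3 are chosen
```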

Comparison

| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Resource usage | 50% (standby idle) | 100% |
| Failover time | Seconds-minutes | Instant |
| Complexity | Lower | Higher |
| Scaling | Replace standby | Add nodes |
| Use cases | Databases, stateful | Web servers, stateless |

Replication Patterns

Single Leader

| Advantages | Disadvantages |
|---|---|
| Simple consistency | Leader is SPOF |
| Easy to understand | Write bottleneck |
| Conflict-free | Follower lag |

Multi-Leader

| Advantages | Disadvantages |
|---|---|
| Write scalability | Conflict resolution |
| Geographic distribution | Complexity |
| Tolerates leader failure | Eventual consistency |

Leaderless

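Uses quorums: with N replicas, each write must be acknowledged by W replicas and each read must query R replicas. Choosing W + R > N ensures every read set overlaps the most recent write set, so a read reaches at least one replica holding the latest acknowledged write.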

| Advantages | Disadvantages |
|---|---|
| No SPOF | Quorum overhead |
| Highest availability | Conflict handling |
| Write anywhere | Read repair needed |
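
The quorum condition W + R > N can be checked by brute force for small clusters: enumerate every possible write set and read set and confirm they always share a replica. This is an illustrative sketch; the function names are assumptions, and versioned reads are modeled as plain `(version, value)` tuples.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The quorum condition: W + R > N."""
    return w + r > n


def always_intersect(n: int, w: int, r: int) -> bool:
    """Brute force: does every W-node write set share a node with every R-node read set?"""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))


def read_latest(replies):
    """Pick the value with the highest version among (version, value) replies."""
    return max(replies, key=lambda reply: reply[0])[1]


print(quorums_overlap(3, 2, 2), always_intersect(3, 2, 2))  # True True
print(quorums_overlap(3, 1, 2), always_intersect(3, 1, 2))  # False False
# One replica is stale, but the overlap guarantees a fresh reply is included.
print(read_latest([(2, "new"), (1, "old")]))                # new
```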

Failure Detection

Heartbeat

Parameters:

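  • Interval: Time between checks (typically 5-30 seconds)
  • Timeout: How long to wait for a response before counting the check as failed (typically 2-10 seconds)
  • Threshold: Consecutive failed checks before marking the node unhealthy (typically 2-5)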
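
How timeout and threshold interact can be sketched as follows. The class `HeartbeatMonitor` is a hypothetical helper: the caller is assumed to invoke `observe` once per interval, passing the measured response time in seconds, or `None` if no reply arrived.

```python
from typing import Optional


class HeartbeatMonitor:
    """Marks a target unhealthy after `threshold` consecutive failed checks."""

    def __init__(self, timeout: float = 5.0, threshold: int = 3):
        self.timeout = timeout      # max seconds to wait for a reply
        self.threshold = threshold  # consecutive failures before unhealthy
        self.consecutive_failures = 0

    def observe(self, response_time: Optional[float]) -> bool:
        """Record one check; return True while the target counts as healthy."""
        passed = response_time is not None and response_time <= self.timeout
        if passed:
            self.consecutive_failures = 0   # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.threshold


monitor = HeartbeatMonitor(timeout=2.0, threshold=3)
for reply in (0.1, None, 5.0, None, 0.2):
    print(reply, monitor.observe(reply))
```

A single slow or lost reply does not flip the state; only the third consecutive failure does, which reduces false positives from transient network glitches at the cost of slower detection.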

Gossip Protocol

Decentralized failure detection.

Operation:

  1. Each node maintains membership list
  2. Periodically shares list with random nodes
  3. Failure detected when heartbeat counter stops increasing
  4. Multiple nodes must agree before marking failed

Advantages: Decentralized, scalable, fault-tolerant

Disadvantages: Eventual detection, bandwidth overhead
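
The operation steps above can be sketched with plain dictionaries that map node names to heartbeat counters. This is a deliberately simplified model: real implementations track the time since each counter last increased rather than comparing two snapshots, and the function names here are illustrative.

```python
from collections import Counter


def merge(local: dict, remote: dict) -> dict:
    """Step 2: combine two membership lists, keeping the highest counter per node."""
    merged = dict(local)
    for node, counter in remote.items():
        merged[node] = max(counter, merged.get(node, -1))
    return merged


def stalled(current: dict, previous: dict) -> set:
    """Step 3: nodes whose counter has not increased since the previous snapshot."""
    return {node for node, counter in current.items()
            if counter <= previous.get(node, -1)}


def confirmed_failed(suspect_sets, quorum: int) -> set:
    """Step 4: nodes suspected by at least `quorum` observers."""
    votes = Counter(node for suspects in suspect_sets for node in suspects)
    return {node for node, n in votes.items() if n >= quorum}


earlier = {"a": 5, "b": 3, "c": 7}   # node a's view one round ago
mine = {"a": 6, "b": 3, "c": 7}      # a has bumped its own counter
from_b = {"a": 5, "b": 4, "c": 7}    # list received from node b
current = merge(mine, from_b)
print(stalled(current, earlier))                          # {'c'}
print(confirmed_failed([{"c"}, {"c"}, set()], quorum=2))  # {'c'}
```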

Failover Strategies

Cold Standby

Standby is off, started on failure.

| Attribute | Value |
|---|---|
| Recovery time | Minutes to hours |
| Cost | Lowest |
| Data loss | Possible (last backup) |
| Use case | Non-critical systems |

Warm Standby

Standby runs but not serving traffic.

| Attribute | Value |
|---|---|
| Recovery time | Seconds to minutes |
| Cost | Medium |
| Data loss | Minimal (replication lag) |
| Use case | Important systems |

Hot Standby

Standby fully synchronized, instant takeover.

| Attribute | Value |
|---|---|
| Recovery time | Seconds |
| Cost | Highest |
| Data loss | None |
| Use case | Critical systems |

Comparison

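| Aspect | Cold Standby | Warm Standby | Hot Standby |
|---|---|---|---|
| Recovery time | Minutes to hours | Seconds to minutes | Seconds |
| Cost | Lowest | Medium | Highest |
| Data loss | Possible (last backup) | Minimal (replication lag) | None |
| Use case | Non-critical systems | Important systems | Critical systems |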

Disaster Recovery

RPO and RTO

| Metric | Definition | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Max acceptable data loss | "1 hour of data loss acceptable" |
| RTO (Recovery Time Objective) | Max acceptable downtime | "Must recover in 4 hours" |
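
One way to use these two metrics is as a pass/fail check on a recovery plan: the worst-case data loss is the gap between backups (or replication points), and the worst-case downtime is the time needed to restore. A minimal sketch under those assumptions, with hypothetical names and the example targets from the table:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # max acceptable data loss, expressed as time
    rto: timedelta  # max acceptable downtime


def meets_objectives(objectives: RecoveryObjectives,
                     backup_interval: timedelta,
                     restore_time: timedelta) -> bool:
    """Worst case: failure strikes just before the next backup would have run."""
    return (backup_interval <= objectives.rpo
            and restore_time <= objectives.rto)


targets = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
# Nightly backups lose up to 24 h of data, so the 1 h RPO is missed.
print(meets_objectives(targets, timedelta(hours=24), timedelta(hours=2)))
# Backups every 30 minutes with a 3 h restore satisfy both targets.
print(meets_objectives(targets, timedelta(minutes=30), timedelta(hours=3)))
```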

Multi-Region Architecture

DR Strategies

| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | Low |
| Pilot Light | 10-30 min | Minutes | Medium |
| Warm Standby | Minutes | Seconds | High |
| Multi-Site Active | Near zero | Near zero | Highest |

Circuit Breaker Pattern

Prevents cascading failures by stopping requests to failing services.

States:

  • Closed: Normal operation, tracking failures
  • Open: Requests fail immediately (no calls to service)
  • Half-Open: Allow a limited number of test requests to check recovery

Parameters:

| Parameter | Purpose | Typical Value |
|---|---|---|
| Failure threshold | Failures to open | 5-10 failures |
| Timeout | Time in open state | 30-60 seconds |
| Success threshold | Successes to close | 2-5 successes |
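
The three states and three parameters fit together as in the sketch below. It is a single-threaded illustration, not a hardened library: the class and exception names are assumptions, failures are counted consecutively, and the injectable `clock` exists only so the timeout can be tested without sleeping.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_timeout:
                raise CircuitOpenError("circuit open, call skipped")
            self.state = "half_open"  # timeout elapsed: allow test requests
            self.successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
        else:
            self.failures = 0  # a success resets the consecutive-failure count
```

Callers wrap each remote call as `breaker.call(fetch, url)` and treat `CircuitOpenError` as a signal to use a fallback instead of waiting on a service that is known to be failing.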

Bulkhead Pattern

Isolates components to prevent failure spread.

Implementation:

  • Separate thread pools per service
  • Separate connection pools
  • Resource limits per tenant
  • Microservice isolation
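
The common idea behind these options is a hard cap on how much of a shared resource any one dependency can hold. A minimal sketch using a semaphore as the compartment; the names `Bulkhead` and `BulkheadFullError` are illustrative, and rejecting immediately (rather than queueing) is a design choice made for this example.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps the number of concurrent calls allowed into one dependency."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Fail fast instead of queueing, so a slow dependency cannot
        # tie up every worker thread in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per downstream service, sized independently.
payments = Bulkhead(max_concurrent=10)
search = Bulkhead(max_concurrent=50)
print(payments.run(lambda: "charged"))
```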

Graceful Degradation

Maintains partial functionality during failures.

Examples:

| Component Failed | Graceful Degradation |
|---|---|
| Recommendation engine | Show popular items |
| Search service | Show category browse |
| Payment processor | Queue order for retry |
| Image service | Show placeholder |
| Analytics | Skip tracking |
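
In code, each row of the table amounts to pairing a primary call with a cheaper fallback. The sketch below shows the first row; the wrapper `with_fallback` and both recommendation functions are hypothetical, and the outage is simulated by raising an exception.

```python
def with_fallback(primary, fallback):
    """Return a function that tries `primary` and degrades to `fallback` on error."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # A real service would also log the failure and emit a metric here.
            return fallback(*args, **kwargs)
    return wrapped


def personalized_recommendations(user_id: str) -> list:
    raise ConnectionError("recommendation engine unavailable")  # simulated outage


def popular_items(user_id: str) -> list:
    return ["bestseller-1", "bestseller-2"]


recommend = with_fallback(personalized_recommendations, popular_items)
print(recommend("user-42"))  # the page still renders, with generic items
```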

SLA/SLO/SLI

| Term | Definition | Example |
|---|---|---|
| SLI | Metric measured | Request latency p99 |
| SLO | Internal target | 99.9% of requests < 200 ms |
| SLA | External commitment | 99.9% or service credits |
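
The relationship between an SLI and an SLO can be shown with a few lines: the SLI is computed from measurements, and the SLO is the target it is compared against. This sketch assumes a latency-based SLI with a 200 ms threshold; the function names and sample data are illustrative.

```python
def latency_sli(latencies_ms, threshold_ms: float = 200.0) -> float:
    """SLI: fraction of requests that completed faster than the threshold."""
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """SLO: the measured SLI must reach the internal target."""
    return sli >= slo_target


# 1000 sampled requests, two of them slow.
samples = [120.0] * 998 + [450.0, 900.0]
sli = latency_sli(samples)
print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli)}")
```

An SLA is typically set looser than the SLO, so that an internal miss like this one is caught before the external commitment is breached.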