Reliability

Reliability refers to a system's ability to continue operating correctly in the presence of failures. This section covers availability metrics, fault tolerance, reliability patterns, and design principles.

Availability

Availability measures the percentage of time a system is operational.

| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |

Each additional "nine" cuts the allowed downtime by a factor of 10 and typically requires disproportionately more engineering effort and cost.
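The figures in the table follow directly from the availability percentage. As a minimal sketch (the function name `allowed_downtime_seconds` is illustrative, and a year is assumed to be 365 days), the downtime budget can be computed like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years


def allowed_downtime_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the downtime budget, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        yearly = allowed_downtime_seconds(target)
        monthly = allowed_downtime_seconds(target, SECONDS_PER_YEAR / 12)
        print(f"{target}%: {yearly / 60:.1f} min/year, {monthly / 60:.2f} min/month")
```

Passing a different `period_seconds` gives the budget for any window, which is useful when an availability target is tracked per month or per quarter.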

Fault Tolerance

Fault tolerance enables systems to continue operating despite component failures.

Failure Categories

| Category | Examples |
|---|---|
| Hardware | Disk failures, network issues, power outages |
| Software | Process crashes, memory leaks, bugs |
| Human | Configuration errors, deployment mistakes, incorrect scripts |

Reliability Patterns

1. Redundancy

Redundancy eliminates single points of failure by duplicating critical components.

Application layer:

  • Multiple servers behind a load balancer
  • Health checks to detect failed servers
  • Automatic traffic routing to healthy instances

Database layer:

  • Primary database with replicas
  • Automatic failover when primary fails
  • Multi-region deployment for disaster recovery
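The application-layer items above can be sketched as a round-robin balancer that skips servers marked unhealthy. This is an illustrative in-memory model, not a real load balancer: the `Server` and `LoadBalancer` classes are hypothetical, and the `healthy` flag stands in for the result of a real health check.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True  # in practice, set by a periodic health check


class LoadBalancer:
    """Round-robin balancer that routes only to healthy servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # round-robin cursor

    def pick(self):
        # Try each server at most once, starting from the cursor.
        for offset in range(len(self.servers)):
            index = (self._next + offset) % len(self.servers)
            server = self.servers[index]
            if server.healthy:
                self._next = (index + 1) % len(self.servers)
                return server
        raise RuntimeError("no healthy servers available")
```

When a server's `healthy` flag flips to `False`, traffic shifts to the remaining servers without any change on the caller's side. Only when every server is unhealthy does `pick` fail.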

2. Replication

Replication maintains copies of data across multiple machines.

| Type | Description | Trade-off |
|---|---|---|
| Synchronous | Wait for replicas to confirm before acknowledging write | Consistent but slower |
| Asynchronous | Acknowledge write immediately, replicate in background | Faster but risk of data loss if primary fails before replication |
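To make the trade-off concrete, here is a small in-memory sketch of both modes. The `Primary` and `Replica` classes are hypothetical, and `replicate_pending` stands in for what would be a background process in a real database.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = list(replicas)
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet shipped (asynchronous mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: every replica applies the write before we acknowledge.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Asynchronous: acknowledge now, replicate later.
            self.pending.append((key, value))
        return "ack"

    def replicate_pending(self):
        """Stand-in for the background replication loop."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)
```

In asynchronous mode, anything still in `pending` when the primary fails is exactly the data that would be lost.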

3. Health Checks and Monitoring

Health checks detect failures; monitoring provides visibility into system behavior.

Key metrics to monitor:

| Metric | Description |
|---|---|
| Latency | Response times, especially p99 (the latency that only the slowest 1% of requests exceed) |
| Error rates | 5xx errors, application-specific errors |
| Resource usage | CPU, memory, disk utilization |
| Business metrics | Transaction volume, user signups, order completion |

Configure alerts for anomalies to enable rapid incident response.
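As a rough illustration of turning these metrics into an alert, the sketch below computes p99 latency with the nearest-rank method and a 5xx error rate. The thresholds (`p99_limit_ms=500`, `error_rate_limit=0.01`) and function names are assumptions chosen for the example, not recommended values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that were 5xx server errors."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(latencies_ms, status_codes, p99_limit_ms=500, error_rate_limit=0.01):
    """Alert when either p99 latency or the 5xx error rate exceeds its threshold."""
    return (
        percentile(latencies_ms, 99) > p99_limit_ms
        or error_rate(status_codes) > error_rate_limit
    )
```

Production monitoring systems usually compute percentiles over sliding time windows, but the underlying comparison against a threshold is the same.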

4. Circuit Breaker Pattern

Circuit breakers prevent cascading failures when downstream services fail.

Without a circuit breaker: Service A continues calling failing Service B, consuming resources and potentially causing both services to fail.

With a circuit breaker: Service A detects Service B failures and stops calling it, returning errors immediately.

States:

| State | Behavior |
|---|---|
| Closed | Normal operation; requests pass through |
| Open | Service is failing; requests fail immediately without calling downstream |
| Half-Open | Testing recovery; limited requests pass through |
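The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration: the class name, the `failure_threshold` and `recovery_timeout` parameters, and their defaults are assumptions, and a production breaker would also need thread safety and a limit on concurrent trial requests.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: allow a trial request through.
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial request, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Service A would wrap each call to Service B in `breaker.call(...)` and treat `CircuitOpenError` as a signal to return an error or a fallback immediately.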

5. Graceful Degradation

Graceful degradation provides reduced functionality when components fail, rather than complete failure.

| Failure Scenario | Degraded Response |
|---|---|
| Database slow | Serve cached (potentially stale) content |
| Recommendation service down | Show popular items instead of personalized recommendations |
| Heavy load | Disable non-critical features temporarily |
| Partial data available | Return available data with indication of incompleteness |
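The recommendation scenario can be sketched as a fallback wrapper. The function and parameter names here are hypothetical; `personalized_service` is assumed to be any callable that may raise when the recommendation service is unavailable.

```python
def get_recommendations(user_id, personalized_service, popular_items):
    """Return (items, degraded); degraded is True when the fallback list was used."""
    try:
        return personalized_service(user_id), False
    except Exception:
        # Recommendation service failed: serve a non-personalized list instead.
        # A real system would likely catch narrower exception types and log the failure.
        return list(popular_items), True
```

Returning the `degraded` flag alongside the data lets the caller indicate that the response is a reduced version, which matches the "indication of incompleteness" idea in the table.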

Design Principles

1. Assume Failure

Design with the assumption that every component will eventually fail:

  • Networks will partition
  • Servers will crash
  • Disks will fail
  • Third-party APIs will become unavailable
  • Deployments will introduce bugs

2. Design for Recovery

Fast recovery minimizes the impact of failures.

| Practice | Description |
|---|---|
| Automate recovery | Auto-restart, auto-failover |
| Regular backups | Schedule backups and test restoration procedures |
| Runbooks | Document incident response procedures |

3. Test Failure Scenarios

Validate failure handling through deliberate testing.

Chaos engineering techniques:

  • Terminate random server instances
  • Inject network latency and packet loss
  • Simulate region-level failures

Example: Netflix Chaos Monkey randomly terminates production instances to ensure the system handles failures gracefully.
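The shape of such an experiment can be illustrated with a toy simulation: terminate one instance at random, then check that requests still succeed. This is not how Chaos Monkey itself is implemented; the instance pool, the `serve` handler, and all function names are hypothetical.

```python
import random


def terminate_random_instance(instances, rng=random):
    """Remove one randomly chosen instance from the pool and return its name."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def serve(instances):
    """Hypothetical request handler: succeeds while any instance is running."""
    if not instances:
        raise RuntimeError("no instances available")
    return "ok"


def run_experiment(instances, rng=random):
    """Terminate one instance, then report whether requests still succeed."""
    victim = terminate_random_instance(instances, rng)
    try:
        serve(instances)
        return victim, True
    except RuntimeError:
        return victim, False
```

A pool with a single instance fails the experiment, which is the kind of single point of failure this testing is meant to expose.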

4. Defense in Depth

Layer multiple failure protections:

| Protection | Description |
|---|---|
| Input validation | Reject malformed input |
| Rate limiting | Prevent resource exhaustion from excessive requests |
| Timeouts | Avoid indefinite waits for responses |
| Retries with backoff | Retry failed operations with increasing delays |
| Circuit breakers | Stop calling failing dependencies |
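Retries with backoff are one of the easier protections to get subtly wrong, so a short sketch may help. It uses exponential backoff with random jitter; the function name, parameter names, and default values are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call operation(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt (0.1, 0.2, 0.4, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random amount up to the ceiling so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Capping the number of attempts matters as much as the delay: unbounded retries against a failing dependency add load at the moment it can least handle it, which is why retries are usually paired with timeouts and circuit breakers.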

Design Considerations

| Topic | Considerations |
|---|---|
| Single points of failure | Identify components where failure causes complete system failure |
| Trade-offs | Higher availability increases cost and may reduce consistency |
| Cost analysis | Each nine of availability has a cost; align with business requirements |
| Monitoring strategy | Define metrics, thresholds, and alert recipients |
| Recovery procedures | Document failover procedures and test them regularly |