Reliability
Reliability refers to a system's ability to continue operating correctly in the presence of failures. This section covers availability metrics, fault tolerance, reliability patterns, and design principles.
Availability
Availability measures the percentage of time a system is operational.
| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
Each additional "nine" represents a 10x reduction in allowed downtime and typically requires exponentially more engineering effort and cost.
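The table values follow directly from the downtime fraction `1 - availability`. A minimal sketch of that arithmetic, assuming a 365-day year and a 730-hour month (365 × 24 / 12):

```python
# Compute allowed downtime for a given availability target.
# Assumes a 365-day year and a 730-hour month (365 * 24 / 12).

HOURS_PER_YEAR = 365 * 24               # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730

def downtime_per_year_hours(availability_pct: float) -> float:
    """Hours of allowed downtime per year at the given availability (%)."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

def downtime_per_month_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per month at the given availability (%)."""
    return HOURS_PER_MONTH * (1 - availability_pct / 100) * 60

# 99.9% -> 8.76 hours/year and 43.8 minutes/month, matching the table.
```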
Fault Tolerance
Fault tolerance enables systems to continue operating despite component failures.
Failure Categories
| Category | Examples |
|---|---|
| Hardware | Disk failures, network issues, power outages |
| Software | Process crashes, memory leaks, bugs |
| Human | Configuration errors, deployment mistakes, incorrect scripts |
Reliability Patterns
1. Redundancy
Redundancy eliminates single points of failure by duplicating critical components.
Application layer:
- Multiple servers behind a load balancer
- Health checks to detect failed servers
- Automatic traffic routing to healthy instances
Database layer:
- Primary database with replicas
- Automatic failover when primary fails
- Multi-region deployment for disaster recovery
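The application-layer points above can be sketched as a toy load balancer: it re-probes all servers on each health-check pass (so recovered instances rejoin rotation) and round-robins requests across the healthy set. The class and the `check` callable are illustrative, not a real load-balancer API:

```python
class LoadBalancer:
    """Round-robin over instances that passed their last health check."""

    def __init__(self, servers, check):
        self.servers = list(servers)
        self.check = check              # health-check callable: server -> bool
        self.healthy = list(self.servers)
        self._next = 0

    def run_health_checks(self):
        # Re-probe every server, not just the currently healthy ones,
        # so recovered instances are put back into rotation.
        self.healthy = [s for s in self.servers if self.check(s)]

    def route(self):
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        server = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return server
```

With servers `["a", "b", "c"]` and a check that fails `"b"`, routing only ever returns `"a"` or `"c"`.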
2. Replication
Replication maintains copies of data across multiple machines.
| Type | Description | Trade-off |
|---|---|---|
| Synchronous | Wait for replicas to confirm before acknowledging write | Consistent but slower |
| Asynchronous | Acknowledge write immediately, replicate in background | Faster but risk of data loss if primary fails before replication |
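The trade-off in the table can be made concrete with an in-memory sketch (real systems replicate over the network; these classes are illustrative): a synchronous primary applies each write to every replica before acknowledging, while an asynchronous primary acknowledges first and replicates later, so un-flushed writes are lost if the primary dies:

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    def __init__(self, replicas, synchronous=True):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self._pending = []   # writes not yet replicated (async mode)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Consistent but slower: wait for every replica before acking.
            for r in self.replicas:
                r.apply(key, value)
        else:
            # Faster but lossy: ack now, replicate later. If the primary
            # dies before flush() runs, these writes never reach replicas.
            self._pending.append((key, value))
        return "ack"

    def flush(self):
        """Background replication step in the asynchronous mode."""
        for key, value in self._pending:
            for r in self.replicas:
                r.apply(key, value)
        self._pending.clear()
```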
3. Health Checks and Monitoring
Health checks detect failures; monitoring provides visibility into system behavior.
Key metrics to monitor:
| Metric | Description |
|---|---|
| Latency | Response times, especially p99 (the latency that 99% of requests fall under; the slowest 1% exceed it) |
| Error rates | 5xx errors, application-specific errors |
| Resource usage | CPU, memory, disk utilization |
| Business metrics | Transaction volume, user signups, order completion |
Configure alerts for anomalies to enable rapid incident response.
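A short sketch of why p99 matters more than the mean, using the nearest-rank method over a window of samples (production monitoring systems typically use histograms or sketches instead of sorting raw samples):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample >= pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier dominates p99 even though the mean looks healthy.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]
p99 = percentile(latencies_ms, 99)   # -> 250
```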
4. Circuit Breaker Pattern
Circuit breakers prevent cascading failures when downstream services fail.
Without a circuit breaker: Service A continues calling failing Service B, consuming resources and potentially causing both services to fail.
With a circuit breaker: Service A detects Service B failures and stops calling it, returning errors immediately.
States:
| State | Behavior |
|---|---|
| Closed | Normal operation; requests pass through |
| Open | Service is failing; requests fail immediately without calling downstream |
| Half-Open | Testing recovery; limited requests pass through |
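The three states above can be sketched as a small class: the circuit opens after a run of consecutive failures, fails fast while open, and transitions to half-open once a reset timeout elapses, where a single success closes it and a failure reopens it. Threshold and timeout values are illustrative; the clock is injectable so the behavior is testable:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, fails fast while open, and allows trial calls (half-open)
    after `reset_timeout` seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None => closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail immediately without calling the downstream service.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.state == "half-open":
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```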
5. Graceful Degradation
Graceful degradation provides reduced functionality when components fail, rather than complete failure.
| Failure Scenario | Degraded Response |
|---|---|
| Database slow | Serve cached (potentially stale) content |
| Recommendation service down | Show popular items instead of personalized recommendations |
| Heavy load | Disable non-critical features temporarily |
| Partial data available | Return available data with indication of incompleteness |
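The recommendation-service row can be sketched as a fallback wrapper. `fetch_personalized` and the fallback list are hypothetical names; the key idea is to return degraded data with a flag rather than propagate the failure:

```python
# Assumed fallback data; a real system might cache a "popular items" list.
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]

def recommendations(user_id, fetch_personalized):
    """Return personalized picks, falling back to popular items with a
    flag so the caller knows the response is degraded."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except Exception:
        return {"items": POPULAR_ITEMS, "degraded": True}
```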
Design Principles
1. Assume Failure
Design with the assumption that every component will eventually fail:
- Networks will partition
- Servers will crash
- Disks will fail
- Third-party APIs will become unavailable
- Deployments will introduce bugs
2. Design for Recovery
Fast recovery minimizes the impact of failures.
| Practice | Description |
|---|---|
| Automate recovery | Auto-restart, auto-failover |
| Regular backups | Schedule backups and test restoration procedures |
| Runbooks | Document incident response procedures |
3. Test Failure Scenarios
Validate failure handling through deliberate testing.
Chaos engineering techniques:
- Terminate random server instances
- Inject network latency and packet loss
- Simulate region-level failures
Example: Netflix Chaos Monkey randomly terminates production instances to ensure the system handles failures gracefully.
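One lightweight way to exercise these techniques in tests is a fault-injecting wrapper. This is a hypothetical sketch, not Chaos Monkey itself: it makes a call fail at a configurable rate so the caller's error-handling paths get exercised:

```python
import random

def chaos_wrap(fn, failure_rate=0.2, rng=random.random):
    """Wrap a call so it fails randomly, exercising the same error paths
    a real outage would. Rate and names are illustrative; `rng` is
    injectable so tests can be deterministic."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```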
4. Defense in Depth
Layer multiple failure protections:
| Protection | Description |
|---|---|
| Input validation | Reject malformed input |
| Rate limiting | Prevent resource exhaustion from excessive requests |
| Timeouts | Avoid indefinite waits for responses |
| Retries with backoff | Retry failed operations with increasing delays |
| Circuit breakers | Stop calling failing dependencies |
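The "retries with backoff" row can be sketched as a helper that doubles the delay after each failed attempt and re-raises once attempts are exhausted. Parameter values are illustrative, and `sleep` is injectable so tests can record delays instead of waiting:

```python
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn with exponentially increasing delays: base, 2x, 4x, ...
    Re-raises the last exception after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Production variants usually add jitter to the delays so many clients retrying at once don't hammer the recovering service in lockstep.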
Design Considerations
| Topic | Considerations |
|---|---|
| Single points of failure | Identify components where failure causes complete system failure |
| Trade-offs | Higher availability increases cost and may reduce consistency |
| Cost analysis | Each nine of availability has a cost; align with business requirements |
| Monitoring strategy | Define metrics, thresholds, and alert recipients |
| Recovery procedures | Document failover procedures and test them regularly |