Reliability
Reliability refers to a system's ability to continue operating correctly in the presence of failures. This section covers availability metrics, fault tolerance, reliability patterns, and design principles.
Availability
Availability measures the percentage of time a system is operational.
| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
Each additional "nine" represents a 10x reduction in allowed downtime and typically requires exponentially more engineering effort and cost.
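The table values follow directly from the downtime fraction `1 - availability`. A minimal sketch of that arithmetic, assuming a 365-day year and a 730-hour month (365 × 24 / 12):

```python
# Compute allowed downtime for a given availability target.
# Assumes a 365-day year and a 730-hour month (365 * 24 / 12).

HOURS_PER_YEAR = 365 * 24               # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730

def downtime_per_year_hours(availability_pct: float) -> float:
    """Hours of allowed downtime per year at the given availability (%)."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

def downtime_per_month_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per month at the given availability (%)."""
    return HOURS_PER_MONTH * (1 - availability_pct / 100) * 60

# 99.9% -> 8.76 hours/year and 43.8 minutes/month, matching the table.
```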
Fault Tolerance
Fault tolerance enables systems to continue operating despite component failures.
Failure Categories
| Category | Examples |
|---|---|
| Hardware | Disk failures, network issues, power outages |
| Software | Process crashes, memory leaks, bugs |
| Human | Configuration errors, deployment mistakes, incorrect scripts |
Reliability Patterns
1. Redundancy
Redundancy eliminates single points of failure by duplicating critical components.
Application layer:
- Multiple servers behind a load balancer
- Health checks to detect failed servers
- Automatic traffic routing to healthy instances
Database layer:
- Primary database with replicas
- Automatic failover when primary fails
- Multi-region deployment for disaster recovery
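The application-layer points above can be sketched as a toy load balancer: it re-probes all servers on each health-check pass (so recovered instances rejoin rotation) and round-robins requests across the healthy set. The class and the `check` callable are illustrative, not a real load-balancer API:

```python
class LoadBalancer:
    """Round-robin over instances that passed their last health check."""

    def __init__(self, servers, check):
        self.servers = list(servers)
        self.check = check              # health-check callable: server -> bool
        self.healthy = list(self.servers)
        self._next = 0

    def run_health_checks(self):
        # Re-probe every server, not just the currently healthy ones,
        # so recovered instances are put back into rotation.
        self.healthy = [s for s in self.servers if self.check(s)]

    def route(self):
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        server = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return server
```

With servers `["a", "b", "c"]` and a check that fails `"b"`, routing only ever returns `"a"` or `"c"`.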
2. Replication
Replication maintains copies of data across multiple machines.
| Type | Description | Trade-off |
|---|---|---|
| Synchronous | Wait for replicas to confirm before acknowledging write | Consistent but slower |
| Asynchronous | Acknowledge write immediately, replicate in background | Faster but risk of data loss if primary fails before replication |
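The trade-off in the table can be made concrete with an in-memory sketch (real systems replicate over the network; these classes are illustrative): a synchronous primary applies each write to every replica before acknowledging, while an asynchronous primary acknowledges first and replicates later, so un-flushed writes are lost if the primary dies:

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    def __init__(self, replicas, synchronous=True):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self._pending = []   # writes not yet replicated (async mode)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Consistent but slower: wait for every replica before acking.
            for r in self.replicas:
                r.apply(key, value)
        else:
            # Faster but lossy: ack now, replicate later. If the primary
            # dies before flush() runs, these writes never reach replicas.
            self._pending.append((key, value))
        return "ack"

    def flush(self):
        """Background replication step in the asynchronous mode."""
        for key, value in self._pending:
            for r in self.replicas:
                r.apply(key, value)
        self._pending.clear()
```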
3. Health Checks and Monitoring
Health checks detect failures; monitoring provides visibility into system behavior.
Key metrics to monitor:
| Metric | Description |
|---|---|
| Latency | Response times, especially p99 (the latency that 99% of requests fall under; the slowest 1% exceed it) |
| Error rates | 5xx errors, application-specific errors |
| Resource usage | CPU, memory, disk utilization |
| Business metrics | Transaction volume, user signups, order completion |
Configure alerts for anomalies to enable rapid incident response.
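A short sketch of why p99 matters more than the mean, using the nearest-rank method over a window of samples (production monitoring systems typically use histograms or sketches instead of sorting raw samples):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample >= pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier dominates p99 even though the mean looks healthy.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]
p99 = percentile(latencies_ms, 99)   # -> 250
```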
4. Circuit Breaker Pattern
Circuit breakers prevent cascading failures when downstream services fail.
Without a circuit breaker: Service A continues calling failing Service B, consuming resources and potentially causing both services to fail.
With a circuit breaker: Service A detects Service B failures and stops calling it, returning errors immediately.
States:
| State | Behavior |
|---|---|
| Closed | Normal operation; requests pass through |
| Open | Service is failing; requests fail immediately without calling downstream |
| Half-Open | Testing recovery; limited requests pass through |
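The three states above can be sketched as a small class: the circuit opens after a run of consecutive failures, fails fast while open, and transitions to half-open once a reset timeout elapses, where a single success closes it and a failure reopens it. Threshold and timeout values are illustrative; the clock is injectable so the behavior is testable:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, fails fast while open, and allows trial calls (half-open)
    after `reset_timeout` seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None => closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail immediately without calling the downstream service.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.state == "half-open":
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```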
5. Graceful Degradation
Graceful degradation provides reduced functionality when components fail, rather than complete failure.
| Failure Scenario | Degraded Response |
|---|---|
| Database slow | Serve cached (potentially stale) content |
| Recommendation service down | Show popular items instead of personalized recommendations |
| Heavy load | Disable non-critical features temporarily |
| Partial data available | Return available data with indication of incompleteness |
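The recommendation-service row can be sketched as a fallback wrapper. `fetch_personalized` and the fallback list are hypothetical names; the key idea is to return degraded data with a flag rather than propagate the failure:

```python
# Assumed fallback data; a real system might cache a "popular items" list.
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]

def recommendations(user_id, fetch_personalized):
    """Return personalized picks, falling back to popular items with a
    flag so the caller knows the response is degraded."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except Exception:
        return {"items": POPULAR_ITEMS, "degraded": True}
```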
Design Principles
1. Assume Failure
Design with the assumption that every component will eventually fail:
- Networks will partition
- Servers will crash
- Disks will fail
- Third-party APIs will become unavailable
- Deployments will introduce bugs
2. Design for Recovery
Fast recovery minimizes the impact of failures.
| Practice | Description |
|---|---|
| Automate recovery | Auto-restart, auto-failover |
| Regular backups | Schedule backups and test restoration procedures |
| Runbooks | Document incident response procedures |
3. Test Failure Scenarios
Validate failure handling through deliberate testing.
Chaos engineering techniques:
- Terminate random server instances
- Inject network latency and packet loss
- Simulate region-level failures
Example: Netflix Chaos Monkey randomly terminates production instances to ensure the system handles failures gracefully.
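One lightweight way to exercise these techniques in tests is a fault-injecting wrapper. This is a hypothetical sketch, not Chaos Monkey itself: it makes a call fail at a configurable rate so the caller's error-handling paths get exercised:

```python
import random

def chaos_wrap(fn, failure_rate=0.2, rng=random.random):
    """Wrap a call so it fails randomly, exercising the same error paths
    a real outage would. Rate and names are illustrative; `rng` is
    injectable so tests can be deterministic."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```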
4. Defense in Depth
Layer multiple failure protections:
| Protection | Description |
|---|---|
| Input validation | Reject malformed input |
| Rate limiting | Prevent resource exhaustion from excessive requests |
| Timeouts | Avoid indefinite waits for responses |
| Retries with backoff | Retry failed operations with increasing delays |
| Circuit breakers | Stop calling failing dependencies |
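The "retries with backoff" row can be sketched as a helper that doubles the delay after each failed attempt and re-raises once attempts are exhausted. Parameter values are illustrative, and `sleep` is injectable so tests can record delays instead of waiting:

```python
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn with exponentially increasing delays: base, 2x, 4x, ...
    Re-raises the last exception after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Production variants usually add jitter to the delays so many clients retrying at once don't hammer the recovering service in lockstep.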
Design Considerations
| Topic | Considerations |
|---|---|
| Single points of failure | Identify components where failure causes complete system failure |
| Trade-offs | Higher availability increases cost and may reduce consistency |
| Cost analysis | Each nine of availability has a cost; align with business requirements |
| Monitoring strategy | Define metrics, thresholds, and alert recipients |
| Recovery procedures | Document failover procedures and test them regularly |