Database Scaling
Database scaling addresses the bottleneck that occurs when a single database cannot handle application load.
Scaling Strategies Overview
| Strategy | Scaling Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Vertical | Larger hardware | Simple, no code changes | Hardware limits, expensive |
| Replication | Increased read capacity | Read scalability, high availability | Write bottleneck remains |
| Sharding | Data distribution | Near-linear scaling | Complex queries, joins |
| Federation | Functional split | Independent scaling | Cross-function queries difficult |
Replication
Master-Slave (Primary-Replica)
One master handles writes; replicas handle reads.
Operation:
- All writes go to master
- Master logs changes to binary log
- Replicas pull and apply changes
- Reads distributed across replicas
Advantages:
- Read scalability (add more replicas)
- High availability (promote replica if master fails)
- Backups without impacting master
Disadvantages:
- Write bottleneck (single master)
- Replication lag (eventual consistency)
- Failover complexity
Master-Master (Multi-Primary)
Multiple nodes accept writes.
Advantages:
- Write scalability
- Geographic distribution
- No single point of failure
Disadvantages:
- Conflict resolution required
- Increased latency for consistency
- Complex management
Replication Modes
| Mode | Consistency | Latency | Data Safety |
|---|---|---|---|
| Synchronous | Strong | Higher | No data loss |
| Asynchronous | Eventual | Lower | Possible data loss |
| Semi-synchronous | Strong for 1 replica | Medium | Balanced |
Sharding (Horizontal Partitioning)
Sharding splits data across multiple databases based on a shard key.
Sharding Strategies
Range-Based Sharding
Advantages: Efficient range queries, simple implementation Disadvantages: Hot spots with skewed data, uneven distribution
Hash-Based Sharding
Advantages: Even distribution, no hot spots Disadvantages: Range queries span all shards, resharding is expensive
Directory-Based Sharding
Advantages: Flexible mapping, easy resharding Disadvantages: Lookup service is single point of failure, extra latency
Sharding Comparison
| Strategy | Distribution | Range Queries | Resharding | Complexity |
|---|---|---|---|---|
| Range | Can be uneven | Efficient | Medium | Low |
| Hash | Even | Expensive | Hard | Medium |
| Directory | Flexible | Via lookup | Easy | High |
Shard Key Selection
| Good Shard Key Properties | Bad Shard Key Properties |
|---|---|
| High cardinality (many values) | Low cardinality (few values) |
| Even distribution | Skewed distribution |
| Frequently used in queries | Rarely queried |
| Immutable | Changes frequently |
Examples:
- Suitable: user_id, order_id, tenant_id
- Unsuitable: country (skewed), created_date (hot spot on current date)
Sharding Challenges
Cross-Shard Queries
Problem: Query must hit all shards and merge results Solution: Denormalize data to keep related data on same shard
Cross-Shard Joins
Solutions:
- Denormalize (store user info in orders)
- Application-level joins
- Use same shard key for related data
Resharding
| Approach | Downtime | Complexity |
|---|---|---|
| Stop writes, migrate | Yes | Low |
| Double-write during migration | No | High |
| Consistent hashing | Minimal | Medium |
Federation (Functional Partitioning)
Federation splits databases by function or domain.
Advantages:
- Independent scaling per function
- Smaller databases, faster queries
- Team autonomy
Disadvantages:
- Cross-function joins impossible
- More databases to manage
- Application complexity
Denormalization
Denormalization trades storage for read performance by duplicating data.
Normal vs Denormalized
| Aspect | Normalized | Denormalized |
|---|---|---|
| Storage | Minimal | Duplicated |
| Write speed | Faster | Slower (update copies) |
| Read speed | Slower (joins) | Faster (single query) |
| Consistency | Easy | Must sync duplicates |
Appropriate use cases:
- Read-heavy workloads (10:1+ read:write ratio)
- Expensive joins required frequently
- Data changes infrequently
SQL vs NoSQL
| Aspect | SQL (Relational) | NoSQL |
|---|---|---|
| Schema | Fixed, structured | Flexible, dynamic |
| Scaling | Vertical (primarily) | Horizontal (designed for it) |
| Consistency | ACID | BASE (typically) |
| Joins | Native support | Application-level |
| Use cases | Complex queries, transactions | Simple queries, high scale |
NoSQL Types
| Type | Data Model | Examples | Use Cases |
|---|---|---|---|
| Key-Value | Simple key to value | Redis, DynamoDB | Caching, sessions |
| Document | JSON documents | MongoDB, Couchbase | Content, catalogs |
| Wide Column | Column families | Cassandra, HBase | Time series, IoT |
| Graph | Nodes and edges | Neo4j, Neptune | Social networks, recommendations |
Scaling Decision Tree
Production Examples
| Company | Strategy | Scale |
|---|---|---|
| MySQL + sharding by user_id | Billions of users | |
| PostgreSQL + sharding | 2B+ monthly users | |
| Uber | MySQL to Schemaless (custom) | Millions of trips/day |
| Netflix | Cassandra (NoSQL) | 200M+ subscribers |
| MySQL sharding + HBase | Billions of pins |
Performance Reference
| Configuration | Approximate Throughput |
|---|---|
| Single MySQL | ~10K queries/second |
| With replicas | Multiply read capacity |
| Sharded | Near-linear scaling |