Databases
Database selection depends on data structure, access patterns, consistency requirements, and scale. This section covers SQL and NoSQL databases, indexing, scaling strategies, and the CAP theorem.
SQL vs NoSQL
SQL (Relational) Databases
Examples: PostgreSQL, MySQL, SQLite
SQL databases store data in structured tables with defined schemas and support ACID transactions.
| Feature | Description |
|---|---|
| Schema enforcement | Data consistency through defined schemas |
| ACID transactions | Atomicity, Consistency, Isolation, Durability |
| JOINs | Query data across related tables |
| Standard query language | SQL is widely understood |
Appropriate use cases:
- Data with relationships between entities (users, orders, items)
- Transactions requiring ACID guarantees (financial systems)
- Ad-hoc query requirements (analytics, debugging)
- General-purpose applications where access patterns are not yet defined
NoSQL Databases
NoSQL encompasses multiple database types, each suited for different use cases.
| Type | Examples | Use Case |
|---|---|---|
| Document | MongoDB, CouchDB | Nested/hierarchical data, flexible schemas |
| Key-Value | Redis, DynamoDB | Fast lookups by key |
| Wide-Column | Cassandra, HBase | Time-series data, high write throughput |
| Graph | Neo4j, Neptune | Relationship-centric data (social graphs, recommendations) |
Appropriate use cases:
- Horizontal write scaling requirements
- Data that does not fit tabular structure
- Known, simple access patterns
- Applications that do not require JOINs or complex queries
Database Indexing
Indexes improve query performance by avoiding full table scans. Without indexes, the database must scan every row to find matching records. Indexes add write overhead because each write operation must update the index.
B-Tree Indexes
B-tree indexes are the default index type. They support equality comparisons, range queries, and sorting operations.
Hash Indexes
Hash indexes provide O(1) lookup performance for exact match queries (WHERE id = 123). They do not support range queries or sorting.
Composite Indexes
Composite indexes span multiple columns. Column order determines which queries can use the index.
Example: An index on (user_id, status) supports queries filtering on both columns or just user_id, but not queries filtering only on status.
| Query | Index Used |
|---|---|
WHERE user_id = 1 AND status = 'pending' | Yes |
WHERE user_id = 1 | Yes |
WHERE status = 'pending' | No |
The index is ordered by the first column, then the second. Queries filtering only on the second column cannot use the index efficiently.
Database Scaling
Vertical Scaling
Vertical scaling increases capacity by upgrading to more powerful hardware (more CPU, RAM, faster storage).
| Aspect | Description |
|---|---|
| Advantages | Simple implementation, no code changes required |
| Disadvantages | Hardware limits, single point of failure, cost increases non-linearly |
Read Replicas
Read replicas distribute read traffic across multiple database copies while the primary handles writes.
Replication lag consideration: Replicas may be slightly behind the primary. A user who writes data and immediately reads may not see their own write if the read hits a replica that has not yet received the update.
Sharding (Horizontal Partitioning)
Sharding distributes data across multiple databases based on a shard key.
| Strategy | Description | Advantages | Disadvantages |
|---|---|---|---|
| Range-based | Partition by value ranges | Simple to understand | Hot spots if data is not evenly distributed |
| Hash-based | Hash the key, mod by shard count | Even distribution | Cannot perform range queries across shards |
| Directory-based | Lookup table maps keys to shards | Maximum flexibility | Additional lookup required for each request |
Sharding considerations:
- JOINs across shards are not possible or are expensive
- Rebalancing shards requires data migration
- Foreign key constraints cannot span shards
- Application complexity increases significantly
Recommendation: Delay sharding until vertical scaling and read replicas are insufficient. Sharding adds complexity that is difficult to remove.
CAP Theorem
The CAP theorem states that a distributed database can provide at most two of three guarantees:
| Property | Description |
|---|---|
| Consistency | Every read returns the most recent write |
| Availability | Every request receives a response |
| Partition Tolerance | The system continues operating during network partitions |
Network partitions are unavoidable in distributed systems. During a partition, the system must choose between consistency (reject requests) or availability (serve potentially stale data).
| Database | CAP Classification |
|---|---|
| PostgreSQL | CP |
| MongoDB | CP (default) |
| Cassandra | AP |
| DynamoDB | Configurable |
Design Guidelines
| Guideline | Description |
|---|---|
| Define access patterns | Understand query requirements before selecting a database |
| Start with PostgreSQL | Relational databases handle more use cases than commonly assumed |
| Index strategically | Missing indexes degrade read performance; excessive indexes degrade write performance |
| Plan for growth | Design to support future scaling without requiring immediate implementation |
| Match consistency to requirements | Financial systems require strong consistency; social feeds can tolerate eventual consistency |