Storage Engines
Storage engines determine how databases write data to disk and retrieve it. Different engines make different trade-offs between read performance, write throughput, and space efficiency.
Primary Storage Engine Types
Most databases use one of two approaches: B-trees or LSM-trees.
B-Trees
B-trees have been the standard since 1970 and power most relational databases. Data is organized in a tree of fixed-size pages (typically 4KB).
Write operation:
- Find the correct leaf page
- Update it in place
- If the page is full, split it
Read operation:
- Start at root
- Follow pointers down the tree
- Completes in O(log n) page reads
Characteristics:
- Predictable read performance
- Suitable for read-heavy workloads
- 50+ years of production use
Limitation: Every write touches the disk at a random location. If a record is updated 10 times, the entire page is rewritten 10 times.
Used by: PostgreSQL, MySQL/InnoDB, SQL Server, Oracle
LSM-Trees
LSM-trees (Log-Structured Merge-trees) batch writes in memory, then flush sorted runs to disk.
Write -> Memtable (RAM) -> SSTable (Disk) -> Compacted SSTables
|
WAL (for durability)
Write operation:
- Append to write-ahead log (durability)
- Insert into in-memory buffer (memtable)
- When buffer fills, flush to disk as sorted file (SSTable)
- Background process merges SSTables
Read operation:
- Check memtable first
- Check SSTables from newest to oldest
- Use bloom filters to skip files that do not contain the key
Characteristics:
- Writes are sequential (faster)
- Suitable for write-heavy workloads
- Better compression (sorted data compresses well)
Limitation: Reads may check multiple files. Compaction uses disk bandwidth.
Used by: Cassandra, RocksDB, LevelDB, HBase
Selection Criteria
| Workload | Recommended | Rationale |
|---|---|---|
| Read-heavy OLTP | B-tree | Consistent read latency |
| Write-heavy logging | LSM-tree | Sequential writes |
| Mixed workload | Depends | Test both |
| Analytics | Columnar | Different access pattern |
General guidance:
- Traditional web applications with more reads than writes: B-tree (PostgreSQL, MySQL)
- Event ingestion, logs, time-series data: LSM-tree (Cassandra, RocksDB)
Indexes
Indexes are separate data structures that enable row retrieval without full table scans.
Primary index: Determines how the actual data is stored. In a clustered index, rows are physically sorted by the primary key.
Secondary index: A separate structure pointing to the primary data. Multiple secondary indexes are possible, but each one increases write overhead.
Covering index: Includes extra columns to answer queries from the index alone, without accessing the main table.
Trade-off: Every index speeds up reads but slows down writes.
Column-Oriented Storage
For analytics, row-based storage (B-tree or LSM) is suboptimal.
Row-oriented storage:
Row 1: [user_id, name, email, created_at, last_login, ...]
Row 2: [user_id, name, email, created_at, last_login, ...]
Column-oriented storage:
user_ids: [1, 2, 3, 4, 5, ...]
names: [alice, bob, carol, ...]
emails: [a@x.com, b@x.com, ...]
For a query like SELECT AVG(age) FROM users, a row store reads every field for every user. A column store reads only the age column. For tables with 100 columns and billions of rows, this difference is significant.
Used by: Redshift, BigQuery, Snowflake, ClickHouse
In-Memory Databases
Keeping data in RAM provides significant speed improvements (RAM is approximately 100,000x faster than SSD for random access). In-memory databases still write to disk for durability.
- Redis: Periodic snapshots + append-only file (AOF)
- VoltDB: Replication + snapshots
- Memcached: No durability (pure cache)
Limitation: Storage capacity is limited to available memory.
Summary
| Engine Type | Read Performance | Write Performance | Use Case |
|---|---|---|---|
| B-trees | Consistent | Random writes | Traditional OLTP |
| LSM-trees | Variable | Sequential writes | Write-heavy workloads |
| Column stores | Analytics-optimized | Not for transactions | Data warehousing |
| In-memory | Fastest | Fastest | Cache, real-time |