Data Warehousing
Data warehouses are optimized for analytical queries over large datasets. Unlike operational databases designed for transaction processing, warehouses support complex aggregations and historical analysis.
OLTP vs OLAP
| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Transaction processing | Analytical queries |
| Queries | Simple, high volume | Complex, lower volume |
| Design | Normalized (3NF) | Denormalized (star schema) |
| Updates | Continuous, real-time | Batch loads |
Production databases serve OLTP workloads. Data warehouses serve OLAP workloads. Mixing these workloads degrades performance for both.
Dimensional Modeling
Dimensional modeling organizes data around business questions rather than entity relationships.
Star Schema
A central fact table surrounded by dimension tables.
In a star schema, the central fact table (e.g., fact_sales) contains measures and foreign keys that reference multiple dimension tables (e.g., dim_date, dim_product, dim_store). The structure resembles a star with the fact table at the center and dimension tables radiating outward.
Star schema provides simple query patterns. One fact table joined to dimension tables. Modern warehouses optimize for this pattern.
Fact Tables
Fact tables contain quantitative data for analysis.
| Component | Description |
|---|---|
| Measures | Numeric values to aggregate (revenue, quantity, clicks) |
| Foreign keys | References to dimension tables |
| Grain | Level of detail (one row per sale, per click, per impression) |
Grain definition is critical. "One row per order line item" differs from "one row per order."
Dimension Tables
Dimension tables provide context for filtering and grouping.
| Component | Description |
|---|---|
| Descriptive attributes | Product name, store city, customer segment |
| Slowly changing dimensions | Historical tracking of attribute changes |
| Hierarchies | Product to category to department |
Slowly Changing Dimensions
Slowly changing dimensions (SCD) handle attribute changes over time.
| Type | Behavior | History Preserved |
|---|---|---|
| Type 1 | Overwrite current value | No |
| Type 2 | Add new row with date range | Full |
| Type 3 | Add column for previous value | Limited (one prior value) |
Type 2 example:
A customer who moved from NYC to LA would have two rows: one with NYC (valid from 2020-01-01 to 2023-01-01, is_current = false) and one with LA (valid from 2023-01-01 to 9999-12-31, is_current = true).
Type 2 is the most common approach. It enables queries like "customer's city at time of order."
Modern Data Stack
The primary architectural change: transformations occur inside the warehouse (ELT) rather than before loading (ETL). Cloud warehouses provide scalable compute, making in-warehouse SQL transformations cost-effective.
Interview Topics
| Topic | Key Points |
|---|---|
| Star vs Snowflake schema | Trade-offs of dimension normalization |
| Fact table design | Grain definition, additive vs semi-additive measures |
| Late-arriving data | Handling facts that arrive after dimension changes |
| Partitioning | Date-based partitioning for query performance |
| Query optimization | Partition pruning, avoiding full table scans |