Data Warehousing

Data warehouses are optimized for analytical queries over large datasets. Unlike operational databases designed for transaction processing, warehouses support complex aggregations and historical analysis.

OLTP vs OLAP

Aspect	OLTP	OLAP
Purpose	Transaction processing	Analytical queries
Queries	Simple, high volume	Complex, lower volume
Design	Normalized (3NF)	Denormalized (star schema)
Updates	Continuous, real-time	Batch loads

Production databases serve OLTP workloads. Data warehouses serve OLAP workloads. Mixing these workloads degrades performance for both.

Dimensional Modeling

Dimensional modeling organizes data around business questions rather than entity relationships.

Star Schema

A central fact table surrounded by dimension tables.

In a star schema, the central fact table (e.g., fact_sales) contains measures and foreign keys that reference multiple dimension tables (e.g., dim_date, dim_product, dim_store). The structure resembles a star with the fact table at the center and dimension tables radiating outward.

Star schema provides simple query patterns. One fact table joined to dimension tables. Modern warehouses optimize for this pattern.

Fact Tables

Fact tables contain quantitative data for analysis.

Component	Description
Measures	Numeric values to aggregate (revenue, quantity, clicks)
Foreign keys	References to dimension tables
Grain	Level of detail (one row per sale, per click, per impression)

Grain definition is critical. "One row per order line item" differs from "one row per order."

Dimension Tables

Dimension tables provide context for filtering and grouping.

Component	Description
Descriptive attributes	Product name, store city, customer segment
Slowly changing dimensions	Historical tracking of attribute changes
Hierarchies	Product to category to department

Slowly Changing Dimensions

Slowly changing dimensions (SCD) handle attribute changes over time.

Type	Behavior	History Preserved
Type 1	Overwrite current value	No
Type 2	Add new row with date range	Full
Type 3	Add column for previous value	Limited (one prior value)

Type 2 example:

A customer who moved from NYC to LA would have two rows: one with NYC (valid from 2020-01-01 to 2023-01-01, is_current = false) and one with LA (valid from 2023-01-01 to 9999-12-31, is_current = true).

Type 2 is the most common approach. It enables queries like "customer's city at time of order."

Modern Data Stack

Loading diagram...

The primary architectural change: transformations occur inside the warehouse (ELT) rather than before loading (ETL). Cloud warehouses provide scalable compute, making in-warehouse SQL transformations cost-effective.

Interview Topics

Topic	Key Points
Star vs Snowflake schema	Trade-offs of dimension normalization
Fact table design	Grain definition, additive vs semi-additive measures
Late-arriving data	Handling facts that arrive after dimension changes
Partitioning	Date-based partitioning for query performance
Query optimization	Partition pruning, avoiding full table scans

OLTP vs OLAP​

Dimensional Modeling​

Star Schema​

Fact Tables​

Dimension Tables​

Slowly Changing Dimensions​

Modern Data Stack​

Interview Topics​

Table of Contents