Skip to main content

Data Warehousing

Data warehouses are optimized for analytical queries over large datasets. Unlike operational databases designed for transaction processing, warehouses support complex aggregations and historical analysis.

OLTP vs OLAP

AspectOLTPOLAP
PurposeTransaction processingAnalytical queries
QueriesSimple, high volumeComplex, lower volume
DesignNormalized (3NF)Denormalized (star schema)
UpdatesContinuous, real-timeBatch loads

Production databases serve OLTP workloads. Data warehouses serve OLAP workloads. Mixing these workloads degrades performance for both.

Dimensional Modeling

Dimensional modeling organizes data around business questions rather than entity relationships.

Star Schema

A central fact table surrounded by dimension tables.

In a star schema, the central fact table (e.g., fact_sales) contains measures and foreign keys that reference multiple dimension tables (e.g., dim_date, dim_product, dim_store). The structure resembles a star with the fact table at the center and dimension tables radiating outward.

Star schema provides simple query patterns. One fact table joined to dimension tables. Modern warehouses optimize for this pattern.

Fact Tables

Fact tables contain quantitative data for analysis.

ComponentDescription
MeasuresNumeric values to aggregate (revenue, quantity, clicks)
Foreign keysReferences to dimension tables
GrainLevel of detail (one row per sale, per click, per impression)

Grain definition is critical. "One row per order line item" differs from "one row per order."

Dimension Tables

Dimension tables provide context for filtering and grouping.

ComponentDescription
Descriptive attributesProduct name, store city, customer segment
Slowly changing dimensionsHistorical tracking of attribute changes
HierarchiesProduct to category to department

Slowly Changing Dimensions

Slowly changing dimensions (SCD) handle attribute changes over time.

TypeBehaviorHistory Preserved
Type 1Overwrite current valueNo
Type 2Add new row with date rangeFull
Type 3Add column for previous valueLimited (one prior value)

Type 2 example:

A customer who moved from NYC to LA would have two rows: one with NYC (valid from 2020-01-01 to 2023-01-01, is_current = false) and one with LA (valid from 2023-01-01 to 9999-12-31, is_current = true).

Type 2 is the most common approach. It enables queries like "customer's city at time of order."

Modern Data Stack

Loading diagram...

The primary architectural change: transformations occur inside the warehouse (ELT) rather than before loading (ETL). Cloud warehouses provide scalable compute, making in-warehouse SQL transformations cost-effective.

Interview Topics

TopicKey Points
Star vs Snowflake schemaTrade-offs of dimension normalization
Fact table designGrain definition, additive vs semi-additive measures
Late-arriving dataHandling facts that arrive after dimension changes
PartitioningDate-based partitioning for query performance
Query optimizationPartition pruning, avoiding full table scans