Data Pipelines

A data pipeline moves data from source systems to destination systems, applying transformations as needed. Pipeline reliability determines data availability for downstream consumers.

ETL vs ELT

| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Before loading (staging infrastructure) | After loading (warehouse) |
| Compute | Pipeline infrastructure | Warehouse resources |
| Schema flexibility | Schema defined early | Raw data preserved |
| Use case | Structured, stable data | Exploration, evolving requirements |

Modern data stacks predominantly use ELT. Cloud warehouses provide scalable compute, making in-warehouse transformations more cost-effective than dedicated transformation infrastructure.
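The ELT pattern can be illustrated with a minimal sketch. The example below uses SQLite as a stand-in for a cloud warehouse, and the table names (raw_orders, orders_clean) are hypothetical: raw records land untransformed, then a SQL statement running inside the "warehouse" produces the cleaned model.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database. In practice this would be
# Snowflake, BigQuery, Redshift, etc.; table and column names are illustrative.
conn = sqlite3.connect(":memory:")

# 1. Load: land the raw data as-is, with no transformation.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "complete"), (2, "5.00", "refunded"), (3, "bad", "complete")],
)

# 2. Transform: run SQL inside the warehouse to build a cleaned table,
#    using warehouse compute rather than separate pipeline infrastructure.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'complete' AND CAST(amount AS REAL) > 0
""")

print(conn.execute("SELECT * FROM orders_clean").fetchall())  # [(1, 19.99)]
```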

Pipeline Architecture

A typical pipeline flows through five layers: Sources (databases, APIs, files) feed into Ingestion (Kafka, Fivetran), which passes data to Transformation (Spark, dbt, Airflow), which loads into Storage (data lake, warehouse), which in turn serves Consumers (BI tools, ML models).

Each layer has a single responsibility, simplifying debugging and maintenance.
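A rough sketch of that separation, with each layer as its own function (all names here are illustrative, not part of the original architecture):

```python
from typing import Iterable

def extract_source() -> Iterable[dict]:              # Sources
    return [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]

def ingest(records: Iterable[dict]) -> list[dict]:   # Ingestion
    return list(records)                              # e.g. publish to Kafka

def transform(records: list[dict]) -> list[dict]:    # Transformation
    return [{"id": r["id"], "amount": float(r["amount"])} for r in records]

def load(records: list[dict], storage: list) -> None:  # Storage
    storage.extend(records)                             # e.g. write to a warehouse table

warehouse: list[dict] = []
load(transform(ingest(extract_source())), warehouse)
print(warehouse)  # Consumers (BI tools, ML models) read from here
```

Because each layer has one job, swapping an implementation (a new source, a different warehouse) touches a single function rather than the whole pipeline.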

Orchestration

Orchestration defines task dependencies and execution order. Example: "Run transform after extract completes. Retry twice on failure."

Airflow DAG Example

An Airflow DAG (Directed Acyclic Graph) defines a daily ETL pipeline scheduled to run at 2 AM. It contains three tasks: extract, transform, and load. The dependency chain (extract before transform before load) ensures correct execution order. Each task uses a PythonOperator that calls a Python function to perform the actual work.
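A minimal sketch of that DAG, assuming the Airflow 2.x Python API; the DAG id and the extract/transform/load callables are placeholder stubs rather than a real implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task logic; real tasks would read from sources and write to the warehouse.
def extract_data():
    print("extracting")

def transform_data():
    print("transforming")

def load_data():
    print("loading")

with DAG(
    dag_id="daily_etl",                  # illustrative name
    schedule_interval="0 2 * * *",       # run daily at 2 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},         # retry twice on failure
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    # Dependency chain: extract must finish before transform, then load.
    extract >> transform >> load
```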

Common orchestration tools:

| Tool | Description |
|---|---|
| Airflow | Most widely adopted, large ecosystem |
| Dagster | Software-defined assets, type checking |
| Prefect | Dynamic workflows, Python-native |

Data Quality

Data quality issues propagate through pipelines and affect downstream consumers.

Validation Checks

| Check Type | Description | Example |
|---|---|---|
| Schema validation | Column existence and types | Expected columns present |
| Null checks | Required fields populated | Non-null IDs |
| Range checks | Values within bounds | Positive revenue values |
| Freshness checks | Recent data received | New data within 24 hours |
| Volume checks | Row count within range | Within 20% of prior day |

Validation implementation: A validation function checks that the dataframe has at least one row, all ID values are non-null, and all amount values fall within a valid range (0 to 1,000,000). Assertions raise errors if any check fails.
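A minimal pandas sketch of those checks; the column names (`id`, `amount`) and the 0 to 1,000,000 bounds come from the description above, while the function name is illustrative:

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Raise AssertionError if the batch fails any basic quality check."""
    # Volume check: at least one row arrived.
    assert len(df) > 0, "dataframe is empty"

    # Null check: every record must have an id.
    assert df["id"].notna().all(), "found null ids"

    # Range check: amounts must fall within 0..1,000,000.
    assert df["amount"].between(0, 1_000_000).all(), "amount out of range"

    return df

# Usage: validate a batch before loading it downstream.
batch = pd.DataFrame({"id": [1, 2, 3], "amount": [19.99, 250.0, 0.0]})
validate_orders(batch)
```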

Reliability Patterns

| Pattern | Description |
|---|---|
| Idempotency | Multiple executions produce identical results; prevents duplicate data |
| Retry with backoff | Automatic retry with increasing delays between attempts (see the sketch below) |
| Dead letter queue | Failed records stored separately for investigation |
| Checkpointing | Save progress for recovery without full restart |
| Stage validation | Validate data at each pipeline stage |
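
A generic sketch of retry with exponential backoff; the function name and delay values are illustrative, and the wrapped API call in the usage comment is hypothetical:

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on failure with exponentially increasing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted; let the failure surface
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage (hypothetical): wrap a flaky extract call.
# retry_with_backoff(lambda: fetch_from_api("https://example.com/orders"))
```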