Data Pipelines

A data pipeline moves data from source systems to destination systems, applying transformations as needed. Pipeline reliability determines data availability for downstream consumers.

ETL vs ELT

| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Before loading (staging infrastructure) | After loading (warehouse) |
| Compute | Pipeline infrastructure | Warehouse resources |
| Schema flexibility | Schema defined early | Raw data preserved |
| Use case | Structured, stable data | Exploration, evolving requirements |

Modern data stacks predominantly use ELT. Cloud warehouses provide scalable compute, making in-warehouse transformations more cost-effective than dedicated transformation infrastructure.
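The ELT pattern can be illustrated with a minimal sketch. The example below uses SQLite as a stand-in for a cloud warehouse, and the table names (raw_orders, orders_clean) are hypothetical: raw records land untransformed, then a SQL statement running inside the "warehouse" produces the cleaned model.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database. In practice this would be
# Snowflake, BigQuery, Redshift, etc.; table and column names are illustrative.
conn = sqlite3.connect(":memory:")

# 1. Load: land the raw data as-is, with no transformation.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "complete"), (2, "5.00", "refunded"), (3, "bad", "complete")],
)

# 2. Transform: run SQL inside the warehouse to build a cleaned table,
#    using warehouse compute rather than separate pipeline infrastructure.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'complete' AND CAST(amount AS REAL) > 0
""")

print(conn.execute("SELECT * FROM orders_clean").fetchall())  # [(1, 19.99)]
```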

Pipeline Architecture

A typical pipeline flows through five layers: Sources (databases, APIs, files) feed into Ingestion (Kafka, Fivetran), which passes data to Transformation (Spark, dbt, Airflow), which loads into Storage (data lake, warehouse), which in turn serves Consumers (BI tools, ML models).

Each layer has a single responsibility, simplifying debugging and maintenance.
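A rough sketch of that separation, with each layer as its own function (all names here are illustrative, not part of the original architecture):

```python
from typing import Iterable

def extract_source() -> Iterable[dict]:              # Sources
    return [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]

def ingest(records: Iterable[dict]) -> list[dict]:   # Ingestion
    return list(records)                              # e.g. publish to Kafka

def transform(records: list[dict]) -> list[dict]:    # Transformation
    return [{"id": r["id"], "amount": float(r["amount"])} for r in records]

def load(records: list[dict], storage: list) -> None:  # Storage
    storage.extend(records)                             # e.g. write to a warehouse table

warehouse: list[dict] = []
load(transform(ingest(extract_source())), warehouse)
print(warehouse)  # Consumers (BI tools, ML models) read from here
```

Because each layer has one job, swapping an implementation (a new source, a different warehouse) touches a single function rather than the whole pipeline.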

Orchestration

Orchestration defines task dependencies and execution order. Example: "Run transform after extract completes. Retry twice on failure."

Airflow DAG Example

An Airflow DAG (Directed Acyclic Graph) defines a daily ETL pipeline scheduled to run at 2 AM. It contains three tasks: extract, transform, and load. The dependency chain (extract before transform before load) ensures correct execution order. Each task uses a PythonOperator that calls a Python function to perform the actual work.
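A minimal sketch of that DAG, assuming the Airflow 2.x Python API; the DAG id and the extract/transform/load callables are placeholder stubs rather than a real implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task logic; real tasks would read from sources and write to the warehouse.
def extract_data():
    print("extracting")

def transform_data():
    print("transforming")

def load_data():
    print("loading")

with DAG(
    dag_id="daily_etl",                  # illustrative name
    schedule_interval="0 2 * * *",       # run daily at 2 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},         # retry twice on failure
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    # Dependency chain: extract must finish before transform, then load.
    extract >> transform >> load
```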

Common orchestration tools:

| Tool | Description |
|---|---|
| Airflow | Most widely adopted, large ecosystem |
| Dagster | Software-defined assets, type checking |
| Prefect | Dynamic workflows, Python-native |

Data Quality

Data quality issues propagate through pipelines and affect downstream consumers.

Validation Checks

| Check Type | Description | Example |
|---|---|---|
| Schema validation | Column existence and types | Expected columns present |
| Null checks | Required fields populated | Non-null IDs |
| Range checks | Values within bounds | Positive revenue values |
| Freshness checks | Recent data received | New data within 24 hours |
| Volume checks | Row count within range | Within 20% of prior day |

Validation implementation: A validation function checks that the dataframe has at least one row, all ID values are non-null, and all amount values fall within a valid range (0 to 1,000,000). Assertions raise errors if any check fails.
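A minimal pandas sketch of those checks; the column names (`id`, `amount`) and the 0 to 1,000,000 bounds come from the description above, while the function name is illustrative:

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Raise AssertionError if the batch fails any basic quality check."""
    # Volume check: at least one row arrived.
    assert len(df) > 0, "dataframe is empty"

    # Null check: every record must have an id.
    assert df["id"].notna().all(), "found null ids"

    # Range check: amounts must fall within 0..1,000,000.
    assert df["amount"].between(0, 1_000_000).all(), "amount out of range"

    return df

# Usage: validate a batch before loading it downstream.
batch = pd.DataFrame({"id": [1, 2, 3], "amount": [19.99, 250.0, 0.0]})
validate_orders(batch)
```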

Reliability Patterns

| Pattern | Description |
|---|---|
| Idempotency | Multiple executions produce identical results; prevents duplicate data |
| Retry with backoff | Automatic retry with increasing delays between attempts (see the sketch below) |
| Dead letter queue | Failed records stored separately for investigation |
| Checkpointing | Save progress for recovery without full restart |
| Stage validation | Validate data at each pipeline stage |
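
A generic sketch of retry with exponential backoff; the function name and delay values are illustrative, and the wrapped API call in the usage comment is hypothetical:

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on failure with exponentially increasing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted; let the failure surface
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage (hypothetical): wrap a flaky extract call.
# retry_with_backoff(lambda: fetch_from_api("https://example.com/orders"))
```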