Data Pipelines
A data pipeline moves data from source systems to destination systems, applying transformations as needed. Pipeline reliability determines data availability for downstream consumers.
ETL vs ELT
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Before loading (staging infrastructure) | After loading (warehouse) |
| Compute | Pipeline infrastructure | Warehouse resources |
| Schema flexibility | Schema defined early | Raw data preserved |
| Use case | Structured, stable data | Exploration, evolving requirements |
Modern data stacks predominantly use ELT. Cloud warehouses provide scalable compute, making in-warehouse transformations more cost-effective than dedicated transformation infrastructure.
Pipeline Architecture
A typical pipeline flows through five layers: Sources (databases, APIs, files) feed into Ingestion (Kafka, Fivetran), which passes data to Transformation (Spark, dbt, Airflow), which loads it into Storage (data lake, warehouse), which in turn serves Consumers (BI tools, ML models).
Each layer has a single responsibility, simplifying debugging and maintenance.
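A toy sketch of those layers as plain functions, assuming a simple batch flow; the function names, record shape, and filter logic are illustrative only, not a prescribed implementation:

```python
def extract_from_sources() -> list[dict]:
    """Source layer: pull raw records from databases, APIs, or files."""
    return [{"id": 1, "amount": 42.0}]  # placeholder records


def ingest(records: list[dict]) -> list[dict]:
    """Ingestion layer: land raw records, e.g. into a topic or landing zone."""
    return records


def transform(records: list[dict]) -> list[dict]:
    """Transformation layer: clean, join, and reshape for analytics."""
    return [r for r in records if r["amount"] >= 0]


def load(records: list[dict]) -> None:
    """Storage layer: write transformed records to the lake or warehouse."""
    ...


def run_pipeline() -> None:
    """Wire the layers together; each function owns a single responsibility."""
    load(transform(ingest(extract_from_sources())))
```

Because each stage is a separate unit, a failure can be traced to one layer and that layer can be tested or replaced in isolation.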
Orchestration
Orchestration defines task dependencies and execution order. Example: "Run transform after extract completes. Retry twice on failure."
Airflow DAG Example
An Airflow DAG (Directed Acyclic Graph) defines a daily ETL pipeline scheduled to run at 2 AM. It contains three tasks: extract, transform, and load. The dependency chain (extract before transform before load) ensures correct execution order. Each task uses a PythonOperator that calls a Python function to perform the actual work.
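A minimal sketch of such a DAG, assuming Airflow 2.x with the classic PythonOperator style; the dag_id, start date, and function bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull raw data from source systems (placeholder)."""


def transform():
    """Apply transformations to the extracted data (placeholder)."""


def load():
    """Write transformed data to the warehouse (placeholder)."""


with DAG(
    dag_id="daily_etl",
    schedule_interval="0 2 * * *",        # run daily at 2 AM (newer Airflow versions use `schedule`)
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},          # retry twice on failure, per the orchestration example
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependency chain: extract before transform before load
    extract_task >> transform_task >> load_task
```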
Common orchestration tools:
| Tool | Description |
|---|---|
| Airflow | Most widely adopted, large ecosystem |
| Dagster | Software-defined assets, type checking |
| Prefect | Dynamic workflows, Python-native |
Data Quality
Data quality issues propagate through pipelines and affect downstream consumers.
Validation Checks
| Check Type | Description | Example |
|---|---|---|
| Schema validation | Column existence and types | Expected columns present |
| Null checks | Required fields populated | Non-null IDs |
| Range checks | Values within bounds | Positive revenue values |
| Freshness checks | Recent data received | New data within 24 hours |
| Volume checks | Row count within range | Within 20% of prior day |
Validation implementation: A validation function checks that the dataframe has at least one row, all ID values are non-null, and all amount values fall within a valid range (0 to 1,000,000). Assertions raise errors if any check fails.
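A minimal sketch of that validation function using pandas; the column names `id` and `amount` are assumptions for illustration:

```python
import pandas as pd


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run basic quality checks before the dataframe moves downstream."""
    # Volume check: at least one row arrived
    assert len(df) > 0, "dataframe is empty"
    # Null check: every record has an identifier
    assert df["id"].notnull().all(), "found null IDs"
    # Range check: amounts fall within the expected bounds
    assert df["amount"].between(0, 1_000_000).all(), "amount out of range"
    return df
```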
Reliability Patterns
| Pattern | Description |
|---|---|
| Idempotency | Multiple executions produce identical results; prevents duplicate data |
| Retry with backoff | Automatic retry with increasing delays between attempts (see the sketch after this table) |
| Dead letter queue | Failed records stored separately for investigation |
| Checkpointing | Save progress for recovery without full restart |
| Stage validation | Validate data at each pipeline stage |
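A minimal sketch of retry with backoff, assuming a callable task step; the attempt count and delay values are illustrative:

```python
import random
import time


def retry_with_backoff(fn, max_attempts=3, base_delay=1.0):
    """Call fn, retrying on failure with exponentially increasing delays.

    Retries are only safe when fn is idempotent, i.e. repeated runs
    produce identical results rather than duplicate data.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            time.sleep(delay)
```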