
Design a Data Pipeline

Concepts covered: CDC, Medallion Architecture (Bronze/Silver/Gold), dbt, Airflow, schema evolution, data contracts, idempotency, backfilling, lineage tracking

This design covers an end-to-end pipeline that ingests data from multiple sources, transforms it, and serves it to analysts and ML models.

Requirements

Functional Requirements

  • Ingest data from multiple sources: databases, APIs, files, event streams
  • Clean and transform data
  • Load into warehouse for analytics
  • Support batch and near-real-time use cases
  • Track data lineage
  • Handle schema changes without pipeline failures

Non-Functional Requirements

| Requirement  | Target                                        | Rationale                                   |
|--------------|-----------------------------------------------|---------------------------------------------|
| Latency      | Under 1 hour (batch), under 5 min (streaming) | Timely data for decision-making             |
| Reliability  | 99.9% pipeline success rate                   | Data availability for downstream consumers  |
| Scalability  | Handle 10x data growth                        | Accommodate annual data volume increases    |
| Data quality | 99.99% accuracy                               | Correct data for analytics and reporting    |

Scale Estimation

  • 50 TB raw data per day
  • 200+ source tables
  • 1000+ downstream tables
  • 500 concurrent analytical queries

These parameters influence architecture decisions. 50 TB/day works out to roughly 580 MB/s of sustained ingest (50 × 10^12 bytes ÷ 86,400 s), which rules out single-node ETL and requires distributed processing infrastructure.

High-Level Architecture

[Diagram: high-level pipeline architecture]

Data Ingestion Patterns

Pattern 1: Change Data Capture (CDC)

CDC streams insert, update, and delete operations from operational databases.

[Diagram: CDC ingestion flow]

Comparison of database ingestion approaches:

| Approach                | Latency         | Source Load            | Complexity |
|-------------------------|-----------------|------------------------|------------|
| Full dump               | Hours           | High (full table scan) | Low        |
| Incremental (timestamp) | Minutes-hours   | Medium                 | Medium     |
| CDC                     | Seconds-minutes | Low                    | High       |

CDC is the preferred approach for database sources:

  • Near real-time updates without polling
  • Minimal load on production database (reads transaction log)
  • Captures all change types including deletes
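
The sketch below shows what consuming CDC events can look like, assuming a Debezium-style JSON envelope on Kafka and the kafka-python client; the topic name, connection details, and `order_id` key are illustrative assumptions, not part of the design above.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "cdc.public.orders",                       # hypothetical Debezium topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

replica = {}  # order_id -> latest row image, a stand-in for the real sink

for message in consumer:
    event = message.value
    if event is None:                          # tombstone record, nothing to apply
        continue
    payload = event.get("payload", event)      # envelope may or may not wrap the payload
    op = payload["op"]                         # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        replica[row["order_id"]] = row         # upsert keeps the replica current
    elif op == "d":
        row = payload["before"]
        replica.pop(row["order_id"], None)     # deletes are captured, unlike timestamp polling
```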

Pattern 2: API Ingestion

Third-party APIs (Salesforce, Stripe, HubSpot) require different handling.

API ingestion approach:

  1. Initialize with API client and storage destination
  2. Paginate through results using cursor-based pagination
  3. For each page: fetch records within the date range and write to storage partitioned by date
  4. Continue until no more pages (has_more is false)
  5. Log ingestion metadata including source name, record count, and timestamp for monitoring
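
A minimal sketch of this loop, assuming a Stripe-style API (`has_more` flag plus `starting_after` cursor) and the requests library; the endpoint, query parameters, and local landing path are placeholders, and a real pipeline would write to object storage rather than local disk.

```python
import json
import logging
from datetime import date
from pathlib import Path

import requests

logging.basicConfig(level=logging.INFO)


def ingest_orders(api_url: str, api_key: str, day: date, out_dir: str) -> int:
    """Pull one day of records via cursor pagination and land them partitioned by date."""
    total, page_num, cursor = 0, 0, None
    while True:
        params = {"created_on": day.isoformat(), "limit": 100}   # hypothetical query params
        if cursor:
            params["starting_after"] = cursor                    # cursor-based pagination
        resp = requests.get(api_url, params=params,
                            headers={"Authorization": f"Bearer {api_key}"}, timeout=30)
        resp.raise_for_status()
        page = resp.json()

        # Write each page to storage partitioned by date (local path as a stand-in).
        partition = Path(out_dir) / f"dt={day.isoformat()}"
        partition.mkdir(parents=True, exist_ok=True)
        (partition / f"page_{page_num:05d}.json").write_text(json.dumps(page["data"]))

        total += len(page["data"])
        page_num += 1
        if not page.get("has_more"):             # stop when the API reports no more pages
            break
        cursor = page["data"][-1]["id"]

    # Log ingestion metadata for monitoring (source, record count, timestamp in the log line).
    logging.info("ingested source=%s records=%d partition=dt=%s", api_url, total, day)
    return total
```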

Pattern 3: Event Stream Ingestion

High-volume event data (clickstream, logs, IoT) uses a separate path.

[Diagram: event stream ingestion path]
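
As a rough illustration, the sketch below micro-batches events from a Kafka topic into date/hour-partitioned files; the topic, brokers, batch size, and landing path are assumptions, and production pipelines typically use a managed sink or connector rather than a hand-rolled consumer.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "clickstream",                             # hypothetical high-volume topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
    auto_offset_reset="latest",
)

BATCH_SIZE = 10_000
buffer = []

for message in consumer:
    buffer.append(message.value)
    if len(buffer) >= BATCH_SIZE:
        # Flush a micro-batch to a date/hour-partitioned landing zone.
        now = datetime.now(timezone.utc)
        partition = Path("landing/events") / f"dt={now:%Y-%m-%d}" / f"hr={now:%H}"
        partition.mkdir(parents=True, exist_ok=True)
        with open(partition / f"{now:%H%M%S%f}.jsonl", "w") as f:
            f.writelines(json.dumps(e) + "\n" for e in buffer)
        buffer.clear()
```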

Transformation Layer

Medallion Architecture (Bronze/Silver/Gold)

The three tiers refine data progressively: Bronze stores raw data exactly as received from the source, Silver holds cleaned and deduplicated records, and Gold serves business-ready aggregates.

[Diagram: Bronze/Silver/Gold data flow]

dbt Transformation Example

dbt executes SQL transformations inside the warehouse.

Silver layer (orders_cleaned): This incremental model processes raw orders from the bronze layer. It:

  1. Filters for only new records since the last run (using _ingested_at timestamp)
  2. Deduplicates by order_id using ROW_NUMBER, keeping the most recent version
  3. Cleans data: replaces null customer_id with 'UNKNOWN', casts types, normalizes status to uppercase
  4. Filters out invalid records (null order_id, negative amounts)
  5. Partitions by order_date for query performance

Gold layer (daily_revenue): This aggregation model summarizes cleaned orders by date:

  1. Counts distinct orders and customers per day
  2. Sums gross revenue
  3. Calculates refunds separately
  4. Computes net revenue (gross minus refunds)
  5. Includes only completed and refunded orders
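
To make the silver and gold logic above concrete, here is the same SQL sketched as standalone queries run through DuckDB purely for illustration; in the actual pipeline each query would live in its own dbt model with the incremental and partitioning configuration described above, and the sample rows and column names are made up.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE bronze_orders AS SELECT * FROM (VALUES
        ('o1', 'c1', DATE '2024-05-01', 120.00, 'completed', TIMESTAMP '2024-05-01 01:00:00'),
        ('o1', 'c1', DATE '2024-05-01', 120.00, 'COMPLETED', TIMESTAMP '2024-05-01 02:00:00'),
        ('o2', NULL, DATE '2024-05-01',  80.00, 'refunded',  TIMESTAMP '2024-05-01 01:30:00'),
        ('o3', 'c2', DATE '2024-05-01', -10.00, 'completed', TIMESTAMP '2024-05-01 01:45:00')
    ) AS t(order_id, customer_id, order_date, total_amount, status, _ingested_at)
""")

# Silver: deduplicate by order_id (latest _ingested_at wins), clean, and filter
# out invalid rows. The dbt incremental filter on _ingested_at is omitted here.
con.execute("""
    CREATE TABLE orders_cleaned AS
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id
                                     ORDER BY _ingested_at DESC) AS rn
        FROM bronze_orders
    )
    SELECT order_id,
           COALESCE(customer_id, 'UNKNOWN')     AS customer_id,
           order_date,
           CAST(total_amount AS DECIMAL(12, 2)) AS total_amount,
           UPPER(status)                        AS status,
           _ingested_at
    FROM ranked
    WHERE rn = 1 AND order_id IS NOT NULL AND total_amount >= 0
""")

# Gold: daily revenue rollup over completed and refunded orders only.
print(con.execute("""
    SELECT order_date,
           COUNT(DISTINCT order_id)    AS orders,
           COUNT(DISTINCT customer_id) AS customers,
           SUM(total_amount)           AS gross_revenue,
           SUM(CASE WHEN status = 'REFUNDED' THEN total_amount ELSE 0 END) AS refunds,
           SUM(total_amount)
             - SUM(CASE WHEN status = 'REFUNDED' THEN total_amount ELSE 0 END) AS net_revenue
    FROM orders_cleaned
    WHERE status IN ('COMPLETED', 'REFUNDED')
    GROUP BY order_date
""").fetchall())
```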

Orchestration

Airflow DAG Example

This DAG waits for upstream ingestion, runs dbt models, tests results, and notifies downstream.

DAG configuration:

  • Scheduled to run daily at 6 AM
  • Retries failed tasks up to 3 times with 5-minute delays
  • Sends email on failure

Task sequence:

  1. wait_for_ingestion: ExternalTaskSensor waits for the CDC ingestion DAG to complete (up to 1 hour timeout)
  2. run_silver: Executes dbt silver layer models tagged with 'silver' and 'orders'
  3. run_gold: Executes dbt gold layer models for aggregations
  4. run_tests: Runs dbt data quality tests on all orders models
  5. notify_completion: Sends notification upon successful completion

Tasks execute sequentially to ensure data quality at each stage before proceeding.
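
A minimal sketch of this DAG, assuming Airflow 2.x with dbt invoked through BashOperator; the DAG id, upstream DAG name, dbt tags, and notification command are assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-oncall@example.com"],
}

with DAG(
    dag_id="orders_transformation",
    schedule_interval="0 6 * * *",           # daily at 6 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion",
        external_dag_id="cdc_ingestion",     # upstream ingestion DAG (assumed name)
        timeout=3600,                        # give up after 1 hour
        mode="reschedule",
    )
    run_silver = BashOperator(
        task_id="run_silver",
        bash_command="dbt run --select tag:silver,tag:orders",
    )
    run_gold = BashOperator(
        task_id="run_gold",
        bash_command="dbt run --select tag:gold,tag:orders",
    )
    run_tests = BashOperator(
        task_id="run_tests",
        bash_command="dbt test --select tag:orders",
    )
    notify_completion = BashOperator(
        task_id="notify_completion",
        bash_command="echo 'orders models refreshed'",  # stand-in for a real notification hook
    )

    # Sequential execution so each stage is validated before the next runs.
    wait_for_ingestion >> run_silver >> run_gold >> run_tests >> notify_completion
```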

Data Quality

Quality Checks

| Check Type     | Example                    | Execution Timing |
|----------------|----------------------------|------------------|
| Schema         | Columns exist, types match | On ingestion     |
| Null           | Critical fields not null   | Per batch        |
| Uniqueness     | Primary keys unique        | Per batch        |
| Referential    | Foreign keys valid         | Daily            |
| Business rules | Amount > 0                 | Per batch        |
| Freshness      | Data updated within SLA    | Continuous       |

Great Expectations Example

Creating an expectation suite:

  1. Define schema expectations: verify required columns exist and have correct types (e.g., order_date is DATE type)
  2. Add null checks: ensure critical columns like order_id are never null
  3. Add business rule validations: verify total_amount falls between 0 and 1,000,000

Validation process:

  • Run the expectation suite against a data batch
  • If any expectation fails, raise a DataQualityException with failure statistics
  • This prevents bad data from propagating downstream
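
A sketch of those steps using Great Expectations' legacy pandas-dataset interface; method names and the recommended workflow vary across library versions, so treat the exact calls as illustrative. The `DataQualityException` class and sample DataFrame are stand-ins.

```python
import great_expectations as ge
import pandas as pd


class DataQualityException(Exception):
    """Raised when a batch fails its expectation suite."""


df = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_date": pd.to_datetime(["2024-05-01", "2024-05-01"]),
    "total_amount": [120.0, 80.0],
})
batch = ge.from_pandas(df)

# Schema and null expectations.
batch.expect_column_to_exist("order_id")
batch.expect_column_to_exist("order_date")
batch.expect_column_values_to_not_be_null("order_id")

# Business rule: amounts must fall in a plausible range.
batch.expect_column_values_to_be_between("total_amount", min_value=0, max_value=1_000_000)

results = batch.validate()
if not results.success:
    # Fail the pipeline so bad data never propagates downstream.
    raise DataQualityException(f"Data quality checks failed: {results.statistics}")
```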

Schema Evolution

Strategy 1: Schema Registry

A schema registry validates new schemas for compatibility before allowing changes.

[Diagram: schema registry compatibility check flow]
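
A sketch of the pre-deployment check, assuming a Confluent-style Schema Registry reachable over HTTP; the registry URL, subject name, and Avro schema are illustrative.

```python
import json

import requests

REGISTRY = "http://schema-registry:8081"    # assumed registry endpoint
SUBJECT = "orders-value"                    # assumed subject name

new_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "total_amount", "type": "double"},
        # Backward-compatible change: new optional field with a default value.
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(new_schema)}),
    timeout=10,
)
resp.raise_for_status()

if not resp.json().get("is_compatible", False):
    raise RuntimeError("Proposed schema is not compatible; change rejected")
```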

Strategy 2: Schema Evolution Modes

| Mode     | Definition                | Safe Changes                  |
|----------|---------------------------|-------------------------------|
| Backward | New code reads old data   | Add optional column           |
| Forward  | Old code reads new data   | Remove optional column        |
| Full     | Both directions supported | Add optional, remove optional |

Implementation Examples

Column additions are safe. Renames and type changes require migration.

Safe column addition (backward compatible): Add a new column with a default value. Existing code continues to work, new code can use the column.

Safe column removal (forward compatible): Three-step process: (1) stop writing to the column, (2) wait for all readers to update their queries, (3) drop the column.

Type change (requires data migration): Four-step process: (1) add new column with desired type, (2) backfill by casting existing values, (3) drop the old column, (4) rename the new column to the original name.
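
The three patterns above can be sketched as plain DDL; the snippet below runs them through DuckDB only so the example is executable, and real warehouses differ in ALTER TABLE syntax and in how they handle online migrations. Table and column names are made up.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders (order_id VARCHAR, total_amount INTEGER)")
con.execute("INSERT INTO orders VALUES ('o1', 120), ('o2', 80)")

# 1. Safe, backward-compatible addition: existing readers simply ignore the column.
con.execute("ALTER TABLE orders ADD COLUMN currency VARCHAR DEFAULT 'USD'")

# 2. Safe removal happens in steps: stop writes, wait for all readers to migrate,
#    then drop. Only the final drop is shown here.
con.execute("ALTER TABLE orders DROP COLUMN currency")

# 3. Type change: add a new column, backfill by casting, drop the old one, rename.
con.execute("ALTER TABLE orders ADD COLUMN total_amount_new DECIMAL(12, 2)")
con.execute("UPDATE orders SET total_amount_new = CAST(total_amount AS DECIMAL(12, 2))")
con.execute("ALTER TABLE orders DROP COLUMN total_amount")
con.execute("ALTER TABLE orders RENAME COLUMN total_amount_new TO total_amount")
```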

Monitoring and Alerting

Key Metrics

| Metric                  | Target               | Alert Threshold |
|-------------------------|----------------------|-----------------|
| Pipeline success rate   | 99.9%                | Below 99%       |
| Data freshness          | Under 1 hour         | Over 2 hours    |
| Processing time         | Under 30 min         | Over 1 hour     |
| Data volume             | +/- 20% vs prior day | +/- 50%         |
| Quality check pass rate | 100%                 | Below 99%       |
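
As one example, the freshness metric can be checked by comparing the newest `_ingested_at` value in a table against the SLA and alert threshold above; in the sketch below the DuckDB connection, table name, and logging-based alerting are stand-ins for the real warehouse and paging hook.

```python
import logging
from datetime import datetime, timedelta, timezone

import duckdb

FRESHNESS_SLA = timedelta(hours=1)      # target from the metrics table
ALERT_THRESHOLD = timedelta(hours=2)    # page when exceeded

con = duckdb.connect("warehouse.duckdb")   # stand-in for the warehouse connection
latest = con.execute("SELECT MAX(_ingested_at) FROM orders_cleaned").fetchone()[0]

if latest is None:
    logging.error("orders_cleaned has no data at all")
else:
    lag = datetime.now(timezone.utc) - latest.replace(tzinfo=timezone.utc)
    if lag > ALERT_THRESHOLD:
        logging.error("orders_cleaned is %s behind - page on-call", lag)
    elif lag > FRESHNESS_SLA:
        logging.warning("orders_cleaned freshness degraded: %s behind SLA", lag)
```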

Observability Stack

[Diagram: observability stack]

Interview Discussion Points

  1. Requirements clarification: Establish latency requirements first, as they determine architecture choices.

  2. Ingestion selection: CDC for databases (low source load, captures deletes, near real-time). Incremental pulls for APIs with idempotent writes.

  3. Medallion architecture: Bronze is raw (exact source copy), Silver is cleaned and deduplicated, Gold is business-ready aggregates.

  4. Data quality: Validate at every stage. Fail fast on critical issues, quarantine invalid data, alert on anomalies.

  5. Operations: Monitoring, alerting, and lineage tracking are required for production pipelines.

Summary

| Decision      | Options                       | Recommendation                                             |
|---------------|-------------------------------|------------------------------------------------------------|
| Ingestion     | Full dump, Incremental, CDC   | CDC for databases, incremental for APIs                    |
| Storage       | Data lake, Warehouse          | Both: lake for raw data, warehouse for serving             |
| Processing    | Spark, dbt, Both              | dbt for SQL transformations, Spark for complex processing  |
| Orchestration | Airflow, Dagster, Prefect     | Airflow for ecosystem maturity, Dagster for new projects   |
| Quality       | Great Expectations, dbt tests | Both (different validation types)                          |
| Format        | Parquet, Delta, Iceberg       | Delta or Iceberg (ACID transactions, time travel)          |