Design a Data Pipeline
Concepts covered: CDC, Medallion Architecture (Bronze/Silver/Gold), dbt, Airflow, schema evolution, data contracts, idempotency, backfilling, lineage tracking
This design covers an end-to-end pipeline that ingests data from multiple sources, transforms it, and serves it to analysts and ML models.
Requirements
Functional Requirements
- Ingest data from multiple sources: databases, APIs, files, event streams
- Clean and transform data
- Load into warehouse for analytics
- Support batch and near-real-time use cases
- Track data lineage
- Handle schema changes without pipeline failures
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | Under 1 hour (batch), under 5 min (streaming) | Timely data for decision-making |
| Reliability | 99.9% pipeline success rate | Data availability for downstream consumers |
| Scalability | Handle 10x data growth | Accommodate annual data volume increases |
| Data quality | 99.99% accuracy | Correct data for analytics and reporting |
Scale Estimation
- 50 TB raw data per day
- 200+ source tables
- 1000+ downstream tables
- 500 concurrent analytical queries
These parameters influence architecture decisions. 50 TB/day requires distributed processing infrastructure.
High-Level Architecture
Data Ingestion Patterns
Pattern 1: Change Data Capture (CDC)
CDC streams insert, update, and delete operations from operational databases.
CDC characteristics:
| Approach | Latency | Source Load | Complexity |
|---|---|---|---|
| Full dump | Hours | High (full table scan) | Low |
| Incremental (timestamp) | Minutes-hours | Medium | Medium |
| CDC | Seconds-minutes | Low | High |
CDC is the preferred approach for database sources (see the sketch after this list):
- Near real-time updates without polling
- Minimal load on production database (reads transaction log)
- Captures all change types including deletes
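To make the bronze landing format concrete, the sketch below flattens Debezium-style change events into append-only records. The envelope fields (`op`, `before`, `after`, `ts_ms`) and the orders example are assumptions; the exact payload depends on the CDC tool and connector configuration.

```python
import json
from datetime import datetime, timezone

# Illustrative Debezium-style change events; real envelopes vary by CDC tool and connector.
raw_events = [
    '{"op": "c", "before": null, "after": {"order_id": "o1", "status": "placed"}, "ts_ms": 1718000000000}',
    '{"op": "u", "before": {"order_id": "o1", "status": "placed"}, "after": {"order_id": "o1", "status": "shipped"}, "ts_ms": 1718000060000}',
    '{"op": "d", "before": {"order_id": "o1", "status": "shipped"}, "after": null, "ts_ms": 1718000120000}',
]

def to_bronze_row(event: dict) -> dict:
    """Flatten one change event into an append-only bronze record."""
    image = event["after"] or event["before"]
    return {
        "op": event["op"],                 # c = insert, u = update, d = delete
        "order_id": image["order_id"],
        "payload": event["after"],         # None for deletes
        "source_ts": datetime.fromtimestamp(event["ts_ms"] / 1000, tz=timezone.utc).isoformat(),
        "_ingested_at": datetime.now(tz=timezone.utc).isoformat(),
    }

bronze_rows = [to_bronze_row(json.loads(e)) for e in raw_events]
for row in bronze_rows:
    print(row)
```

Keeping deletes as explicit rows (rather than dropping them) is what lets the silver layer reconstruct current state later.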
Pattern 2: API Ingestion
Third-party APIs (Salesforce, Stripe, HubSpot) require different handling.
API ingestion approach (sketched in the example after this list):
- Initialize with API client and storage destination
- Paginate through results using cursor-based pagination
- For each page: fetch records within the date range and write to storage partitioned by date
- Continue until no more pages (has_more is false)
- Log ingestion metadata including source name, record count, and timestamp for monitoring
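A minimal sketch of this loop, assuming a generic cursor-paginated REST endpoint; the `/records` path, `has_more`/`next_cursor` fields, and parameter names are illustrative rather than any specific vendor's API.

```python
import json
from datetime import date
from pathlib import Path

import requests


def ingest_api_source(base_url: str, api_key: str, source_name: str,
                      start: date, end: date, out_dir: Path) -> int:
    """Paginate through a date window and land records as date-partitioned JSONL."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {api_key}"

    by_partition: dict[str, list[dict]] = {}
    cursor = None
    while True:
        params = {"created_after": start.isoformat(),
                  "created_before": end.isoformat(),
                  "limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = session.get(f"{base_url}/records", params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()

        # Group records by their creation date so each partition is written once.
        for record in page["data"]:
            by_partition.setdefault(record["created_at"][:10], []).append(record)

        if not page.get("has_more"):       # stop when the API reports no more pages
            break
        cursor = page["next_cursor"]

    record_count = 0
    for partition_date, records in by_partition.items():
        partition = out_dir / source_name / f"dt={partition_date}"
        partition.mkdir(parents=True, exist_ok=True)
        # Overwriting the partition file keeps re-runs of the same window idempotent.
        with open(partition / "records.jsonl", "w", encoding="utf-8") as fh:
            fh.writelines(json.dumps(r) + "\n" for r in records)
        record_count += len(records)

    # Ingestion metadata for monitoring (a real pipeline would emit this to a metadata store).
    print(f"source={source_name} records={record_count} window={start}..{end}")
    return record_count
```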
Pattern 3: Event Stream Ingestion
High-volume event data (clickstream, logs, IoT) takes a separate path: events are buffered in a message broker (e.g., Kafka) and written to the lake in micro-batches or by a streaming job, rather than pulled from the source on a schedule.
Transformation Layer
Medallion Architecture (Bronze/Silver/Gold)
Data moves through three quality tiers of progressive refinement: Bronze (raw data, an exact copy of the source), Silver (cleaned and deduplicated), and Gold (business-ready aggregates).
dbt Transformation Example
dbt executes SQL transformations inside the warehouse.
Silver layer (orders_cleaned): This incremental model processes raw orders from the bronze layer (see the sketch after this list). It:
- Filters for only new records since the last run (using _ingested_at timestamp)
- Deduplicates by order_id using ROW_NUMBER, keeping the most recent version
- Cleans data: replaces null customer_id with 'UNKNOWN', casts types, normalizes status to uppercase
- Filters out invalid records (null order_id, negative amounts)
- Partitions by order_date for query performance
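The document describes this as a dbt SQL model; the sketch below mirrors the same logic as a dbt Python model, which assumes an adapter that supports Python models (e.g., dbt-databricks on Spark). Model and column names (`bronze_orders`, `_ingested_at`, `total_amount`) are illustrative.

```python
# models/silver/orders_cleaned.py -- sketch of the described silver model as a dbt Python model.
import pyspark.sql.functions as F
from pyspark.sql.window import Window


def model(dbt, session):
    dbt.config(materialized="incremental")
    orders = dbt.ref("bronze_orders")

    # Only process records ingested since the last successful run.
    if dbt.is_incremental:
        max_ts = session.sql(f"select max(_ingested_at) from {dbt.this}").collect()[0][0]
        orders = orders.filter(F.col("_ingested_at") > F.lit(max_ts))

    # Deduplicate by order_id, keeping the most recently ingested version.
    w = Window.partitionBy("order_id").orderBy(F.col("_ingested_at").desc())
    deduped = (orders.withColumn("rn", F.row_number().over(w))
                     .filter(F.col("rn") == 1)
                     .drop("rn"))

    # Clean: default missing customer_id, cast types, normalize status.
    cleaned = (deduped
               .withColumn("customer_id", F.coalesce(F.col("customer_id"), F.lit("UNKNOWN")))
               .withColumn("total_amount", F.col("total_amount").cast("decimal(18,2)"))
               .withColumn("status", F.upper(F.col("status"))))

    # Drop invalid records (missing key or negative amount).
    return cleaned.filter(F.col("order_id").isNotNull() & (F.col("total_amount") >= 0))
```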
Gold layer (daily_revenue): This aggregation model summarizes cleaned orders by date (sketched after this list):
- Counts distinct orders and customers per day
- Sums gross revenue
- Calculates refunds separately
- Computes net revenue (gross minus refunds)
- Includes only completed and refunded orders
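A matching sketch for the gold aggregation under the same assumptions (Python model on a Spark adapter; in practice this would typically be plain dbt SQL):

```python
# models/gold/daily_revenue.py -- sketch of the gold aggregation as a dbt Python model.
import pyspark.sql.functions as F


def model(dbt, session):
    dbt.config(materialized="table")
    orders = dbt.ref("orders_cleaned")

    return (orders
            # Only completed and refunded orders contribute to revenue.
            .filter(F.col("status").isin("COMPLETED", "REFUNDED"))
            .groupBy("order_date")
            .agg(
                F.countDistinct("order_id").alias("order_count"),
                F.countDistinct("customer_id").alias("customer_count"),
                F.sum("total_amount").alias("gross_revenue"),
                F.sum(F.when(F.col("status") == "REFUNDED", F.col("total_amount"))
                       .otherwise(F.lit(0))).alias("refunds"),
            )
            # Net revenue = gross revenue minus refunds.
            .withColumn("net_revenue", F.col("gross_revenue") - F.col("refunds")))
```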
Orchestration
Airflow DAG Example
This DAG waits for upstream ingestion, runs dbt models, tests results, and notifies downstream; a sketch follows the task list below.
DAG configuration:
- Scheduled to run daily at 6 AM
- Retries failed tasks up to 3 times with 5-minute delays
- Sends email on failure
Task sequence:
- wait_for_ingestion: ExternalTaskSensor waits for the CDC ingestion DAG to complete (up to 1 hour timeout)
- run_silver: Executes dbt silver layer models tagged with 'silver' and 'orders'
- run_gold: Executes dbt gold layer models for aggregations
- run_tests: Runs dbt data quality tests on all orders models
- notify_completion: Sends notification upon successful completion
Tasks execute sequentially to ensure data quality at each stage before proceeding.
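A sketch of this DAG, assuming a recent Airflow 2.x release with dbt invoked through its CLI; the DAG ids, dbt selectors, and alert email address are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

default_args = {
    "owner": "data-eng",
    "retries": 3,                              # retry failed tasks up to 3 times
    "retry_delay": timedelta(minutes=5),       # with 5-minute delays
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],      # illustrative alert address
}

with DAG(
    dag_id="orders_transformations",           # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",                      # daily at 06:00
    catchup=False,
    default_args=default_args,
):
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion",
        external_dag_id="cdc_ingestion",       # illustrative upstream ingestion DAG
        timeout=3600,                          # give up after 1 hour
        mode="reschedule",                     # free the worker slot while waiting
    )
    run_silver = BashOperator(
        task_id="run_silver",
        bash_command="dbt run --select tag:silver,tag:orders",
    )
    run_gold = BashOperator(
        task_id="run_gold",
        bash_command="dbt run --select tag:gold,tag:orders",
    )
    run_tests = BashOperator(
        task_id="run_tests",
        bash_command="dbt test --select tag:orders",
    )
    notify_completion = EmptyOperator(task_id="notify_completion")  # placeholder for a real notification hook

    wait_for_ingestion >> run_silver >> run_gold >> run_tests >> notify_completion
```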
Data Quality
Quality Checks
| Check Type | Example | Execution Timing |
|---|---|---|
| Schema | Columns exist, types match | On ingestion |
| Null | Critical fields not null | Per batch |
| Uniqueness | Primary keys unique | Per batch |
| Referential | Foreign keys valid | Daily |
| Business rules | Amount > 0 | Per batch |
| Freshness | Data updated within SLA | Continuous |
Great Expectations Example
Creating an expectation suite:
- Define schema expectations: verify required columns exist and have correct types (e.g., order_date is DATE type)
- Add null checks: ensure critical columns like order_id are never null
- Add business rule validations: verify total_amount falls between 0 and 1,000,000
Validation process (sketched below):
- Run the expectation suite against a data batch
- If any expectation fails, raise a DataQualityException with failure statistics
- This prevents bad data from propagating downstream
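A minimal sketch of these checks, assuming the legacy pandas-backed Great Expectations interface (`great_expectations.from_pandas`, pre-1.0 releases); newer releases expose a different API, and `DataQualityException` is a hypothetical exception class.

```python
import great_expectations as ge
import pandas as pd


class DataQualityException(Exception):
    """Raised when a batch fails validation (hypothetical exception type)."""


def validate_orders_batch(batch: pd.DataFrame) -> None:
    df = ge.from_pandas(batch)   # wrap the batch so expect_* methods are available

    results = [
        # Schema expectations: required columns exist.
        df.expect_column_to_exist("order_id"),
        df.expect_column_to_exist("order_date"),
        df.expect_column_to_exist("total_amount"),
        # Null check: order_id is never null.
        df.expect_column_values_to_not_be_null("order_id"),
        # Business rule: total_amount within a plausible range.
        df.expect_column_values_to_be_between("total_amount", min_value=0, max_value=1_000_000),
    ]

    failed = [r for r in results if not r.success]
    if failed:
        # Fail fast so bad data never propagates downstream.
        raise DataQualityException(f"{len(failed)} expectation(s) failed: {failed}")
```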
Schema Evolution
Strategy 1: Schema Registry
A schema registry validates new schemas for compatibility before allowing changes.
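As one way to wire this into CI, the sketch below checks a candidate schema against the latest registered version before deployment; it assumes a Confluent-compatible Schema Registry and its REST compatibility endpoint, and the subject name, registry URL, and example Avro schema are placeholders.

```python
import json

import requests


def is_compatible(registry_url: str, subject: str, candidate_schema: dict) -> bool:
    """Ask the registry whether candidate_schema can replace the latest version of subject."""
    resp = requests.post(
        f"{registry_url}/compatibility/subjects/{subject}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": json.dumps(candidate_schema)}),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["is_compatible"]


# Example: an orders schema with a newly added optional field (placeholder subject and URL).
new_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "total_amount", "type": "double"},
        {"name": "coupon_code", "type": ["null", "string"], "default": None},  # added, optional
    ],
}

if not is_compatible("http://schema-registry:8081", "orders-value", new_schema):
    raise RuntimeError("Schema change rejected: incompatible with the registered version")
```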
Strategy 2: Schema Evolution Modes
| Mode | Definition | Safe Changes |
|---|---|---|
| Backward | New code reads old data | Add optional column |
| Forward | Old code reads new data | Remove optional column |
| Full | Both directions supported | Add optional, remove optional |
Implementation Examples
Column additions are safe. Renames and type changes require migration.
Safe column addition (backward compatible): Add a new column with a default value. Existing code continues to work, new code can use the column.
Safe column removal (forward compatible): Three-step process: (1) stop writing to the column, (2) wait for all readers to update their queries, (3) drop the column.
Type change (requires data migration): Four-step process: (1) add new column with desired type, (2) backfill by casting existing values, (3) drop the old column, (4) rename the new column to the original name.
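The type-change sequence as executable statements, using an in-memory SQLite database as a stand-in for the warehouse; column names are illustrative, production warehouses have their own ALTER TABLE dialects, and steps 3 and 4 require SQLite 3.35+ and 3.25+ respectively.

```python
import sqlite3

# Stand-in for a warehouse connection; the same four steps apply (with
# dialect-specific syntax) to Snowflake, BigQuery, Redshift, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, total_amount TEXT)")
conn.execute("INSERT INTO orders VALUES ('o1', '19.99'), ('o2', '5.00')")

migration = [
    # Step 1: add a new column with the desired type.
    "ALTER TABLE orders ADD COLUMN total_amount_v2 REAL",
    # Step 2: backfill by casting existing values.
    "UPDATE orders SET total_amount_v2 = CAST(total_amount AS REAL)",
    # Step 3: drop the old column (SQLite >= 3.35).
    "ALTER TABLE orders DROP COLUMN total_amount",
    # Step 4: rename the new column to the original name (SQLite >= 3.25).
    "ALTER TABLE orders RENAME COLUMN total_amount_v2 TO total_amount",
]
for statement in migration:
    conn.execute(statement)
conn.commit()

print(conn.execute("SELECT order_id, total_amount FROM orders").fetchall())
```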
Monitoring and Alerting
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Pipeline success rate | 99.9% | Below 99% |
| Data freshness | Under 1 hour | Over 2 hours |
| Processing time | Under 30 min | Over 1 hour |
| Data volume | +/- 20% vs prior day | +/- 50% |
| Quality check pass rate | 100% | Below 99% |
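As one example of how these thresholds become alerts, a sketch of a data-freshness check; the warehouse query and alerting hook are left abstract, and the two-hour threshold comes from the table above.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_ALERT_THRESHOLD = timedelta(hours=2)   # matches the alert threshold above


def check_freshness(table: str, last_updated_at: datetime,
                    now: datetime | None = None) -> bool:
    """Return True if the table is fresh; otherwise emit an alert and return False.

    last_updated_at would normally come from a warehouse query such as
    SELECT MAX(_ingested_at) FROM <table>; alerting here is just a print.
    """
    now = now or datetime.now(tz=timezone.utc)
    lag = now - last_updated_at
    if lag > FRESHNESS_ALERT_THRESHOLD:
        print(f"ALERT: {table} is stale by {lag} (threshold {FRESHNESS_ALERT_THRESHOLD})")
        return False
    return True


# Example usage with a deliberately stale timestamp.
check_freshness("gold.daily_revenue",
                datetime.now(tz=timezone.utc) - timedelta(hours=3))
```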
Observability Stack
Interview Discussion Points
- Requirements clarification: Establish latency requirements first, as they determine architecture choices.
- Ingestion selection: CDC for databases (low source load, captures deletes, near real-time). Incremental pulls for APIs with idempotent writes.
- Medallion architecture: Bronze is raw (exact source copy), Silver is cleaned and deduplicated, Gold is business-ready aggregates.
- Data quality: Validate at every stage. Fail fast on critical issues, quarantine invalid data, alert on anomalies.
- Operations: Monitoring, alerting, and lineage tracking are required for production pipelines.
Summary
| Decision | Options | Recommendation |
|---|---|---|
| Ingestion | Full dump, Incremental, CDC | CDC for databases, incremental for APIs |
| Storage | Data lake, Warehouse | Both: lake for raw data, warehouse for serving |
| Processing | Spark, dbt, Both | dbt for SQL transformations, Spark for complex processing |
| Orchestration | Airflow, Dagster, Prefect | Airflow for ecosystem maturity, Dagster for new projects |
| Quality | Great Expectations, dbt tests | Both (different validation types) |
| Format | Parquet, Delta, Iceberg | Delta or Iceberg (ACID transactions, time travel) |