
Data Processing Pipeline

Data pipelines make up the bulk of ML system infrastructure (a commonly cited heuristic puts it near 90%), with model code accounting for the rest. Data ingestion, cleaning, transformation, and serving determine model quality and reliability.

Pipeline Architecture


Data Ingestion

Batch Ingestion

Batch ingestion processes data at scheduled intervals. It is suitable when daily or hourly freshness is acceptable.

| Use Case | Description |
| --- | --- |
| Historical training data | Past events for model training |
| Daily aggregations | Metrics computed over time windows |
| Periodic exports | Data from external systems |

Technologies:

  • Apache Spark for large-scale transformations
  • Apache Airflow for workflow orchestration
  • AWS Glue for AWS-native pipelines

Daily Batch Job Process (see the sketch after this list):

  1. Schedule the job to run at a fixed time (e.g., 2 AM daily)
  2. Query the source database for yesterday's events
  3. Apply data cleaning transformations
  4. Compute features from the cleaned data
  5. Write computed features to the feature store
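A minimal sketch of steps 2–5 in Python with pandas, omitting the scheduler and the database and feature-store I/O; the column names (`user_id`, `event_type`, `amount`) and the specific features are illustrative assumptions, not part of any particular system:

```python
from datetime import date, timedelta

import pandas as pd


def compute_daily_features(events: pd.DataFrame, target_day: date) -> pd.DataFrame:
    """Clean one day of raw events and compute simple per-user features."""
    # Data cleaning: drop exact duplicates and rows missing required keys.
    cleaned = events.drop_duplicates().dropna(subset=["user_id", "event_type"])

    # Feature computation: per-user aggregates for the day.
    features = (
        cleaned.groupby("user_id")
        .agg(
            purchase_count=("event_type", lambda s: int((s == "purchase").sum())),
            total_spend=("amount", "sum"),
        )
        .reset_index()
    )
    features["feature_date"] = target_day
    return features


if __name__ == "__main__":
    # In a real pipeline the frame would come from the source database and the
    # result would be written to the feature store instead of printed.
    yesterday = date.today() - timedelta(days=1)
    raw_events = pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "event_type": ["purchase", "click", "purchase"],
            "amount": [20.0, 0.0, 35.0],
        }
    )
    print(compute_daily_features(raw_events, yesterday))
```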

Streaming Ingestion

Streaming ingestion processes data in real time as events occur. It is required when features must reflect recent user activity.

| Use Case | Description |
| --- | --- |
| Real-time features | User clicks in last 5 minutes |
| Event-driven updates | Immediate propagation of changes |
| Low-latency serving | When batch staleness affects predictions |

Technologies:

  • Apache Kafka for event streaming
  • Apache Flink for stream processing
  • Amazon Kinesis for AWS-native streaming

Streaming Pipeline Process:

For each incoming event, update real-time features immediately (sketched in code after this list):

  • Increment counters (e.g., click_count for the user)
  • Update timestamps (e.g., last_active time)
  • Maintain rolling aggregations (e.g., clicks in last 5 minutes)
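A minimal in-memory sketch of these per-event updates. A production pipeline would consume events from a stream such as Kafka and write results to a key-value store, but the per-event update logic would look much the same; the names and window length here are illustrative:

```python
import time
from collections import defaultdict, deque

# In-memory stand-ins for an online feature store.
click_count = defaultdict(int)          # lifetime click counters
last_active = {}                        # last activity timestamp per user
recent_clicks = defaultdict(deque)      # click timestamps per user

WINDOW_SECONDS = 5 * 60                 # rolling 5-minute window


def process_event(user_id: str, event_type: str, ts: float) -> None:
    """Update real-time features for a single incoming event."""
    last_active[user_id] = ts                      # update timestamp feature
    if event_type == "click":
        click_count[user_id] += 1                  # increment counter
        window = recent_clicks[user_id]
        window.append(ts)
        # Maintain the rolling aggregation by evicting old timestamps.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()


def clicks_last_5_minutes(user_id: str) -> int:
    return len(recent_clicks[user_id])


# Example: feed a couple of events through the handler.
now = time.time()
process_event("u1", "click", now - 10)
process_event("u1", "click", now)
print(click_count["u1"], clicks_last_5_minutes("u1"))  # 2 2
```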

Data Processing

ETL vs ELT

| Approach | Description | Use Case |
| --- | --- | --- |
| ETL | Transform before loading | Structured data with known schemas |
| ELT | Load then transform | Exploratory analysis, schema evolution |

Common Transformations

Data cleaning:

  • Missing value handling
  • Duplicate removal
  • Data type corrections
  • Outlier detection

Data validation checks:

| Check Type | Description | Example |
| --- | --- | --- |
| Schema validation | Columns match expected schema | Verify all expected columns are present |
| Range validation | Values fall within valid ranges | Age between 0 and 120 |
| Null checks | Required fields are not null | user_id must not be null |
| Statistical checks | Aggregate statistics are reasonable | Mean price is positive |

Feature computation (see the sketch after this list):

  • Aggregations (count, sum, average)
  • Window functions (rolling averages)
  • Cross-source joins
  • Encoding transformations
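As a brief illustration, the following sketch shows a rolling-window aggregation, a cross-source join, and a simple encoding with pandas; the tables and column names are made up for the example:

```python
import pandas as pd

# Illustrative order events and a user dimension table from another source.
orders = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "order_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10", "2024-01-02"]),
        "amount": [20.0, 35.0, 10.0, 50.0],
    }
)
users = pd.DataFrame({"user_id": [1, 2], "country": ["DE", "US"]})

# Window function: 7-day rolling average order amount per user.
orders = orders.sort_values(["user_id", "order_date"])
rolling_avg = (
    orders.groupby("user_id")
    .rolling("7D", on="order_date")["amount"]
    .mean()
)
orders["rolling_avg_7d"] = rolling_avg.reset_index(level=0, drop=True)

# Cross-source join plus a simple encoding transformation.
features = orders.merge(users, on="user_id", how="left")
features["country_code"] = features["country"].astype("category").cat.codes
print(features)
```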

Feature Store

A feature store provides centralized storage and serving for features. It ensures consistency between training and serving, enables feature reuse, and maintains point-in-time correctness.

Benefits

| Benefit | Description |
| --- | --- |
| Feature reuse | Compute once, use across multiple models |
| Consistency | Training and serving use identical feature computation code |
| Point-in-time correctness | Features reflect their state at event time, not query time |
| Discovery | Teams can find and reuse existing features |
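The point-in-time correctness benefit is worth making concrete. The sketch below uses `pandas.merge_asof` to attach to each training example the latest feature value known at its event time, never a later one; the frames and column names are illustrative:

```python
import pandas as pd

# Training events (labels) and historical feature snapshots.
labels = pd.DataFrame(
    {
        "user_id": [1, 1],
        "event_time": pd.to_datetime(["2024-03-01", "2024-03-10"]),
        "converted": [0, 1],
    }
)
feature_history = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "feature_time": pd.to_datetime(["2024-02-20", "2024-03-05", "2024-03-12"]),
        "total_purchases": [3, 4, 6],
    }
)

# Point-in-time join: each label gets the most recent feature value computed
# at or before its event time, avoiding leakage from the future.
training = pd.merge_asof(
    labels.sort_values("event_time"),
    feature_history.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training[["user_id", "event_time", "total_purchases", "converted"]])
```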

Architecture


Feature Types

| Type | Computation | Storage | Example |
| --- | --- | --- | --- |
| Batch | Scheduled jobs | Offline store | User lifetime value |
| Streaming | Real-time | Online store | Recent click count |
| On-demand | Request time | Computed | Current time features |

Feature Definition

A feature group specification includes:

| Component | Description | Example |
| --- | --- | --- |
| Name | Identifier for the feature group | user_features |
| Entity | Primary key for feature lookup | user_id |
| Features | List of feature names and types | total_purchases (INT64), avg_order_value (DOUBLE) |
| Online flag | Whether to materialize to online store | True for real-time serving |
| TTL | How long features remain valid | 24 hours |
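Concrete APIs differ between feature store implementations, so the following is a deliberately library-agnostic sketch of how such a specification might look in code; the `FeatureGroup` class and its fields are hypothetical and simply mirror the table above:

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class FeatureGroup:
    """Library-agnostic sketch of a feature group specification."""

    name: str                                                # identifier for the group
    entity: str                                              # primary key used for lookups
    features: dict[str, str] = field(default_factory=dict)   # feature name -> type
    online: bool = False                                     # materialize to the online store?
    ttl: timedelta = timedelta(hours=24)                     # how long values remain valid


user_features = FeatureGroup(
    name="user_features",
    entity="user_id",
    features={"total_purchases": "INT64", "avg_order_value": "DOUBLE"},
    online=True,
    ttl=timedelta(hours=24),
)
print(user_features)
```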

Data Quality

Validation Dimensions

| Dimension | Description |
| --- | --- |
| Completeness | Null rate within expected bounds |
| Accuracy | Values reflect reality |
| Consistency | Same entity has consistent values across sources |
| Timeliness | Data freshness meets requirements |
| Uniqueness | No unintended duplicates |

Validation Pipeline

Define an expectation suite with validation rules:

| Expectation | Description |
| --- | --- |
| Column not null | user_id cannot be null |
| Column value range | age must be between 0 and 120 |
| Column value set | country must be in list of valid countries |
| Table row count | Row count between 1,000 and 1,000,000 |

When validating a batch, run all expectations and alert the data team if any validation fails. Failed validations should block downstream processing until resolved.
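Dedicated tools (Great Expectations, for example) provide expectation suites out of the box; the following library-agnostic sketch shows the same idea with plain pandas, using the expectations from the table above. The column names, valid-country set, and alerting behavior are illustrative:

```python
import pandas as pd

VALID_COUNTRIES = {"US", "DE", "FR"}  # illustrative


def validate_batch(df: pd.DataFrame) -> list[str]:
    """Run the expectation suite and return a list of failure messages."""
    failures = []
    if df["user_id"].isna().any():
        failures.append("user_id contains nulls")
    if not df["age"].between(0, 120).all():
        failures.append("age outside [0, 120]")
    if not df["country"].isin(VALID_COUNTRIES).all():
        failures.append("unexpected country values")
    if not (1_000 <= len(df) <= 1_000_000):
        failures.append(f"row count {len(df)} outside expected range")
    return failures


batch = pd.DataFrame({"user_id": [1, 2], "age": [34, 29], "country": ["US", "DE"]})
failures = validate_batch(batch)
if failures:
    # In production: alert the data team and block downstream processing.
    print("Validation failed:", "; ".join(failures))
```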

Monitoring Metrics

| Metric | Purpose |
| --- | --- |
| Missing value rates | Detect data source issues |
| Schema changes | Identify upstream modifications |
| Distribution shifts | Detect drift |
| Volume anomalies | Identify pipeline failures |
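A small sketch of how a few of these metrics might be computed against a reference batch, using pandas and a two-sample Kolmogorov-Smirnov test from SciPy for distribution shift; the column name and the choice of test are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def monitor_batch(current: pd.DataFrame, reference: pd.DataFrame) -> dict:
    """Compute a handful of pipeline monitoring metrics."""
    report = {
        # Missing value rates per column.
        "null_rates": current.isna().mean().to_dict(),
        # Schema changes: columns added or dropped relative to the reference.
        "schema_diff": set(current.columns) ^ set(reference.columns),
        # Volume anomaly: ratio of current to reference row count.
        "volume_ratio": len(current) / max(len(reference), 1),
    }
    # Distribution shift on a numeric column via a two-sample KS test.
    if "amount" in current and "amount" in reference:
        stat, p_value = ks_2samp(current["amount"].dropna(), reference["amount"].dropna())
        report["amount_drift_p_value"] = p_value
    return report


reference = pd.DataFrame({"amount": np.random.default_rng(0).normal(50, 10, 1000)})
current = pd.DataFrame({"amount": np.random.default_rng(1).normal(60, 10, 1000)})
print(monitor_batch(current, reference))
```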

Data Challenges

Missing Data

| Strategy | Use Case |
| --- | --- |
| Drop rows | Few missing values, random missingness |
| Impute mean/median | Numerical features, random missingness |
| Impute mode | Categorical features |
| Model-based imputation | Complex patterns in missingness |
| Create indicator | Missingness is informative |
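A short sketch of several of these strategies using pandas and scikit-learn's `SimpleImputer`; the toy frame and column names are illustrative, and in practice imputers should be fit on training data only (see the data leakage section below):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
        "age": [25, np.nan, 40, 31],
        "city": ["Berlin", np.nan, "Paris", "Berlin"],
        "income": [50_000, 62_000, np.nan, 45_000],
    }
)

# Indicator feature: the fact that income is missing may itself be informative.
df["income_missing"] = df["income"].isna().astype(int)

# Median imputation for numerical features.
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Mode imputation for a categorical feature.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```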

Class Imbalance

| Strategy | Description |
| --- | --- |
| Oversampling | Duplicate minority samples (SMOTE) |
| Undersampling | Remove majority samples |
| Class weights | Adjust loss function weights |
| Threshold tuning | Adjust decision boundary |
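A sketch of the class-weight and threshold-tuning strategies with scikit-learn; oversampling techniques such as SMOTE live in the separate imbalanced-learn package and are not shown. The toy dataset is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced dataset: roughly 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# Option 1: let the estimator reweight the loss automatically.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: compute explicit weights and pass them in.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf2 = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}).fit(X, y)

# Threshold tuning: shift the decision boundary away from the default 0.5.
proba = clf.predict_proba(X)[:, 1]
preds = (proba >= 0.3).astype(int)
print(preds.sum(), "positive predictions at threshold 0.3")
```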

Data Leakage

Data leakage occurs when information unavailable at prediction time is used during training.

| Cause | Description |
| --- | --- |
| Future information | Using data from after the prediction time |
| Target leakage | Features derived from the label |
| Train-test contamination | Normalization fit on all data |

Prevention (see the sketch after this list):

  • Use time-based splits for temporal data
  • Verify each feature can be computed before outcome observation
  • Fit transformations on training data only
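A minimal sketch of the first and third prevention steps: a time-based split followed by fitting a scaler on the training portion only. The frame, cutoff date, and column names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {
        "event_time": pd.date_range("2024-01-01", periods=100, freq="D"),
        "feature": range(100),
        "label": [i % 2 for i in range(100)],
    }
)

# Time-based split: train on the past, evaluate on the future.
cutoff = pd.Timestamp("2024-03-01")
train = df[df["event_time"] < cutoff]
test = df[df["event_time"] >= cutoff]

# Fit transformations on training data only, then apply to both splits.
scaler = StandardScaler().fit(train[["feature"]])
train_scaled = scaler.transform(train[["feature"]])
test_scaled = scaler.transform(test[["feature"]])  # no refit: avoids contamination
```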

Reference

| Topic | Approach |
| --- | --- |
| Feature consistency | Feature store with shared code |
| Late-arriving data | Backfill pipelines, retraining triggers, or robust feature designs |
| Data validation | Schema checks, statistical tests, alerts on anomalies |
| Real-time features | Streaming pipeline writing to key-value store |
| Schema evolution | Backwards-compatible changes, versioned schemas, migration scripts |