
Data Processing Pipeline

Data pipelines make up the bulk of ML system infrastructure (a commonly cited heuristic puts it near 90%), with model code accounting for the rest. Data ingestion, cleaning, transformation, and serving determine model quality and reliability.

Pipeline Architecture


Data Ingestion

Batch Ingestion

Batch ingestion processes data at scheduled intervals. It is suitable when daily or hourly freshness is acceptable.

| Use Case | Description |
| --- | --- |
| Historical training data | Past events for model training |
| Daily aggregations | Metrics computed over time windows |
| Periodic exports | Data from external systems |

Technologies:

  • Apache Spark for large-scale transformations
  • Apache Airflow for workflow orchestration
  • AWS Glue for AWS-native pipelines

Daily Batch Job Process (see the sketch after this list):

  1. Schedule the job to run at a fixed time (e.g., 2 AM daily)
  2. Query the source database for yesterday's events
  3. Apply data cleaning transformations
  4. Compute features from the cleaned data
  5. Write computed features to the feature store
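A minimal sketch of steps 2–5 in Python with pandas, omitting the scheduler and the database and feature-store I/O; the column names (`user_id`, `event_type`, `amount`) and the specific features are illustrative assumptions, not part of any particular system:

```python
from datetime import date, timedelta

import pandas as pd


def compute_daily_features(events: pd.DataFrame, target_day: date) -> pd.DataFrame:
    """Clean one day of raw events and compute simple per-user features."""
    # Data cleaning: drop exact duplicates and rows missing required keys.
    cleaned = events.drop_duplicates().dropna(subset=["user_id", "event_type"])

    # Feature computation: per-user aggregates for the day.
    features = (
        cleaned.groupby("user_id")
        .agg(
            purchase_count=("event_type", lambda s: int((s == "purchase").sum())),
            total_spend=("amount", "sum"),
        )
        .reset_index()
    )
    features["feature_date"] = target_day
    return features


if __name__ == "__main__":
    # In a real pipeline the frame would come from the source database and the
    # result would be written to the feature store instead of printed.
    yesterday = date.today() - timedelta(days=1)
    raw_events = pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "event_type": ["purchase", "click", "purchase"],
            "amount": [20.0, 0.0, 35.0],
        }
    )
    print(compute_daily_features(raw_events, yesterday))
```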

Streaming Ingestion

Streaming ingestion processes data in real time as events occur. It is required when features must reflect recent user activity.

| Use Case | Description |
| --- | --- |
| Real-time features | User clicks in last 5 minutes |
| Event-driven updates | Immediate propagation of changes |
| Low-latency serving | When batch staleness affects predictions |

Technologies:

  • Apache Kafka for event streaming
  • Apache Flink for stream processing
  • Amazon Kinesis for AWS-native streaming

Streaming Pipeline Process:

For each incoming event, update real-time features immediately (sketched in code after this list):

  • Increment counters (e.g., click_count for the user)
  • Update timestamps (e.g., last_active time)
  • Maintain rolling aggregations (e.g., clicks in last 5 minutes)
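A minimal in-memory sketch of these per-event updates. A production pipeline would consume events from a stream such as Kafka and write results to a key-value store, but the per-event update logic would look much the same; the names and window length here are illustrative:

```python
import time
from collections import defaultdict, deque

# In-memory stand-ins for an online feature store.
click_count = defaultdict(int)          # lifetime click counters
last_active = {}                        # last activity timestamp per user
recent_clicks = defaultdict(deque)      # click timestamps per user

WINDOW_SECONDS = 5 * 60                 # rolling 5-minute window


def process_event(user_id: str, event_type: str, ts: float) -> None:
    """Update real-time features for a single incoming event."""
    last_active[user_id] = ts                      # update timestamp feature
    if event_type == "click":
        click_count[user_id] += 1                  # increment counter
        window = recent_clicks[user_id]
        window.append(ts)
        # Maintain the rolling aggregation by evicting old timestamps.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()


def clicks_last_5_minutes(user_id: str) -> int:
    return len(recent_clicks[user_id])


# Example: feed a couple of events through the handler.
now = time.time()
process_event("u1", "click", now - 10)
process_event("u1", "click", now)
print(click_count["u1"], clicks_last_5_minutes("u1"))  # 2 2
```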

Data Processing

ETL vs ELT

| Approach | Description | Use Case |
| --- | --- | --- |
| ETL | Transform before loading | Structured data with known schemas |
| ELT | Load then transform | Exploratory analysis, schema evolution |

Common Transformations

Data cleaning:

  • Missing value handling
  • Duplicate removal
  • Data type corrections
  • Outlier detection

Data validation checks:

| Check Type | Description | Example |
| --- | --- | --- |
| Schema validation | Columns match expected schema | Verify all expected columns are present |
| Range validation | Values fall within valid ranges | Age between 0 and 120 |
| Null checks | Required fields are not null | user_id must not be null |
| Statistical checks | Aggregate statistics are reasonable | Mean price is positive |

Feature computation (see the sketch after this list):

  • Aggregations (count, sum, average)
  • Window functions (rolling averages)
  • Cross-source joins
  • Encoding transformations
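As a brief illustration, the following sketch shows a rolling-window aggregation, a cross-source join, and a simple encoding with pandas; the tables and column names are made up for the example:

```python
import pandas as pd

# Illustrative order events and a user dimension table from another source.
orders = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "order_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10", "2024-01-02"]),
        "amount": [20.0, 35.0, 10.0, 50.0],
    }
)
users = pd.DataFrame({"user_id": [1, 2], "country": ["DE", "US"]})

# Window function: 7-day rolling average order amount per user.
orders = orders.sort_values(["user_id", "order_date"])
rolling_avg = (
    orders.groupby("user_id")
    .rolling("7D", on="order_date")["amount"]
    .mean()
)
orders["rolling_avg_7d"] = rolling_avg.reset_index(level=0, drop=True)

# Cross-source join plus a simple encoding transformation.
features = orders.merge(users, on="user_id", how="left")
features["country_code"] = features["country"].astype("category").cat.codes
print(features)
```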

Feature Store

A feature store provides centralized storage and serving for features. It ensures consistency between training and serving, enables feature reuse, and maintains point-in-time correctness.

Benefits

| Benefit | Description |
| --- | --- |
| Feature reuse | Compute once, use across multiple models |
| Consistency | Training and serving use identical feature computation code |
| Point-in-time correctness | Features reflect their state at event time, not query time |
| Discovery | Teams can find and reuse existing features |
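The point-in-time correctness benefit is worth making concrete. The sketch below uses `pandas.merge_asof` to attach to each training example the latest feature value known at its event time, never a later one; the frames and column names are illustrative:

```python
import pandas as pd

# Training events (labels) and historical feature snapshots.
labels = pd.DataFrame(
    {
        "user_id": [1, 1],
        "event_time": pd.to_datetime(["2024-03-01", "2024-03-10"]),
        "converted": [0, 1],
    }
)
feature_history = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "feature_time": pd.to_datetime(["2024-02-20", "2024-03-05", "2024-03-12"]),
        "total_purchases": [3, 4, 6],
    }
)

# Point-in-time join: each label gets the most recent feature value computed
# at or before its event time, avoiding leakage from the future.
training = pd.merge_asof(
    labels.sort_values("event_time"),
    feature_history.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training[["user_id", "event_time", "total_purchases", "converted"]])
```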

Architecture


Feature Types

| Type | Computation | Storage | Example |
| --- | --- | --- | --- |
| Batch | Scheduled jobs | Offline store | User lifetime value |
| Streaming | Real-time | Online store | Recent click count |
| On-demand | Request time | Computed | Current time features |

Feature Definition

A feature group specification includes:

| Component | Description | Example |
| --- | --- | --- |
| Name | Identifier for the feature group | user_features |
| Entity | Primary key for feature lookup | user_id |
| Features | List of feature names and types | total_purchases (INT64), avg_order_value (DOUBLE) |
| Online flag | Whether to materialize to online store | True for real-time serving |
| TTL | How long features remain valid | 24 hours |
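Concrete APIs differ between feature store implementations, so the following is a deliberately library-agnostic sketch of how such a specification might look in code; the `FeatureGroup` class and its fields are hypothetical and simply mirror the table above:

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class FeatureGroup:
    """Library-agnostic sketch of a feature group specification."""

    name: str                                                # identifier for the group
    entity: str                                              # primary key used for lookups
    features: dict[str, str] = field(default_factory=dict)   # feature name -> type
    online: bool = False                                     # materialize to the online store?
    ttl: timedelta = timedelta(hours=24)                     # how long values remain valid


user_features = FeatureGroup(
    name="user_features",
    entity="user_id",
    features={"total_purchases": "INT64", "avg_order_value": "DOUBLE"},
    online=True,
    ttl=timedelta(hours=24),
)
print(user_features)
```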

Data Quality

Validation Dimensions

| Dimension | Description |
| --- | --- |
| Completeness | Null rate within expected bounds |
| Accuracy | Values reflect reality |
| Consistency | Same entity has consistent values across sources |
| Timeliness | Data freshness meets requirements |
| Uniqueness | No unintended duplicates |

Validation Pipeline

Define an expectation suite with validation rules:

| Expectation | Description |
| --- | --- |
| Column not null | user_id cannot be null |
| Column value range | age must be between 0 and 120 |
| Column value set | country must be in list of valid countries |
| Table row count | Row count between 1,000 and 1,000,000 |

When validating a batch, run all expectations and alert the data team if any validation fails. Failed validations should block downstream processing until resolved.
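Dedicated tools (Great Expectations, for example) provide expectation suites out of the box; the following library-agnostic sketch shows the same idea with plain pandas, using the expectations from the table above. The column names, valid-country set, and alerting behavior are illustrative:

```python
import pandas as pd

VALID_COUNTRIES = {"US", "DE", "FR"}  # illustrative


def validate_batch(df: pd.DataFrame) -> list[str]:
    """Run the expectation suite and return a list of failure messages."""
    failures = []
    if df["user_id"].isna().any():
        failures.append("user_id contains nulls")
    if not df["age"].between(0, 120).all():
        failures.append("age outside [0, 120]")
    if not df["country"].isin(VALID_COUNTRIES).all():
        failures.append("unexpected country values")
    if not (1_000 <= len(df) <= 1_000_000):
        failures.append(f"row count {len(df)} outside expected range")
    return failures


batch = pd.DataFrame({"user_id": [1, 2], "age": [34, 29], "country": ["US", "DE"]})
failures = validate_batch(batch)
if failures:
    # In production: alert the data team and block downstream processing.
    print("Validation failed:", "; ".join(failures))
```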

Monitoring Metrics

| Metric | Purpose |
| --- | --- |
| Missing value rates | Detect data source issues |
| Schema changes | Identify upstream modifications |
| Distribution shifts | Detect drift |
| Volume anomalies | Identify pipeline failures |
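A small sketch of how a few of these metrics might be computed against a reference batch, using pandas and a two-sample Kolmogorov-Smirnov test from SciPy for distribution shift; the column name and the choice of test are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def monitor_batch(current: pd.DataFrame, reference: pd.DataFrame) -> dict:
    """Compute a handful of pipeline monitoring metrics."""
    report = {
        # Missing value rates per column.
        "null_rates": current.isna().mean().to_dict(),
        # Schema changes: columns added or dropped relative to the reference.
        "schema_diff": set(current.columns) ^ set(reference.columns),
        # Volume anomaly: ratio of current to reference row count.
        "volume_ratio": len(current) / max(len(reference), 1),
    }
    # Distribution shift on a numeric column via a two-sample KS test.
    if "amount" in current and "amount" in reference:
        stat, p_value = ks_2samp(current["amount"].dropna(), reference["amount"].dropna())
        report["amount_drift_p_value"] = p_value
    return report


reference = pd.DataFrame({"amount": np.random.default_rng(0).normal(50, 10, 1000)})
current = pd.DataFrame({"amount": np.random.default_rng(1).normal(60, 10, 1000)})
print(monitor_batch(current, reference))
```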

Data Challenges

Missing Data

| Strategy | Use Case |
| --- | --- |
| Drop rows | Few missing values, random missingness |
| Impute mean/median | Numerical features, random missingness |
| Impute mode | Categorical features |
| Model-based imputation | Complex patterns in missingness |
| Create indicator | Missingness is informative |
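A short sketch of several of these strategies using pandas and scikit-learn's `SimpleImputer`; the toy frame and column names are illustrative, and in practice imputers should be fit on training data only (see the data leakage section below):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
        "age": [25, np.nan, 40, 31],
        "city": ["Berlin", np.nan, "Paris", "Berlin"],
        "income": [50_000, 62_000, np.nan, 45_000],
    }
)

# Indicator feature: the fact that income is missing may itself be informative.
df["income_missing"] = df["income"].isna().astype(int)

# Median imputation for numerical features.
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Mode imputation for a categorical feature.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```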

Class Imbalance

| Strategy | Description |
| --- | --- |
| Oversampling | Duplicate minority samples (SMOTE) |
| Undersampling | Remove majority samples |
| Class weights | Adjust loss function weights |
| Threshold tuning | Adjust decision boundary |
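A sketch of the class-weight and threshold-tuning strategies with scikit-learn; oversampling techniques such as SMOTE live in the separate imbalanced-learn package and are not shown. The toy dataset is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced dataset: roughly 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# Option 1: let the estimator reweight the loss automatically.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: compute explicit weights and pass them in.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf2 = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}).fit(X, y)

# Threshold tuning: shift the decision boundary away from the default 0.5.
proba = clf.predict_proba(X)[:, 1]
preds = (proba >= 0.3).astype(int)
print(preds.sum(), "positive predictions at threshold 0.3")
```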

Data Leakage

Data leakage occurs when information unavailable at prediction time is used during training.

| Cause | Description |
| --- | --- |
| Future information | Using data from after the prediction time |
| Target leakage | Features derived from the label |
| Train-test contamination | Normalization fit on all data |

Prevention (see the sketch after this list):

  • Use time-based splits for temporal data
  • Verify each feature can be computed before outcome observation
  • Fit transformations on training data only
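A minimal sketch of the first and third prevention steps: a time-based split followed by fitting a scaler on the training portion only. The frame, cutoff date, and column names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {
        "event_time": pd.date_range("2024-01-01", periods=100, freq="D"),
        "feature": range(100),
        "label": [i % 2 for i in range(100)],
    }
)

# Time-based split: train on the past, evaluate on the future.
cutoff = pd.Timestamp("2024-03-01")
train = df[df["event_time"] < cutoff]
test = df[df["event_time"] >= cutoff]

# Fit transformations on training data only, then apply to both splits.
scaler = StandardScaler().fit(train[["feature"]])
train_scaled = scaler.transform(train[["feature"]])
test_scaled = scaler.transform(test[["feature"]])  # no refit: avoids contamination
```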

Reference

| Topic | Approach |
| --- | --- |
| Feature consistency | Feature store with shared code |
| Late-arriving data | Backfill pipelines, retraining triggers, or robust feature designs |
| Data validation | Schema checks, statistical tests, alerts on anomalies |
| Real-time features | Streaming pipeline writing to key-value store |
| Schema evolution | Backwards-compatible changes, versioned schemas, migration scripts |