Data Processing Pipeline
Data pipelines make up the large majority of ML system infrastructure; model code is only a small fraction of the overall system. Data ingestion, cleaning, transformation, and serving largely determine model quality and reliability.
Pipeline Architecture
Data Ingestion
Batch Ingestion
Batch ingestion processes data at scheduled intervals. It is suitable when daily or hourly data freshness is acceptable.
| Use Case | Description |
|---|---|
| Historical training data | Past events for model training |
| Daily aggregations | Metrics computed over time windows |
| Periodic exports | Data from external systems |
Technologies:
- Apache Spark for large-scale transformations
- Apache Airflow for workflow orchestration
- AWS Glue for AWS-native pipelines
Daily Batch Job Process (sketched in code below):
- Schedule the job to run at a fixed time (e.g., 2 AM daily)
- Query the source database for yesterday's events
- Apply data cleaning transformations
- Compute features from the cleaned data
- Write computed features to the feature store
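A minimal sketch of these steps in Python with pandas. The loader is a hypothetical stand-in for a warehouse query, the output path is illustrative, and Parquet output assumes pyarrow is installed; a production job would typically run under an orchestrator such as Airflow on the 2 AM schedule described above.

```python
from datetime import date, timedelta
from pathlib import Path

import pandas as pd


def load_yesterdays_events(day: date) -> pd.DataFrame:
    # Hypothetical source query; stubbed with in-memory data so the sketch runs.
    # In practice this would query the source database for events on `day`.
    return pd.DataFrame({
        "user_id": [1, 1, 2, 2],
        "event_ts": pd.to_datetime([f"{day} 09:00", f"{day} 09:00",
                                    f"{day} 10:30", f"{day} 11:15"]),
        "amount": [20.0, 20.0, None, 35.5],
    })


def run_daily_batch(run_date: date, store_root: str = "offline_store") -> pd.DataFrame:
    yesterday = run_date - timedelta(days=1)

    # 1. Query the source for yesterday's events.
    events = load_yesterdays_events(yesterday)

    # 2. Clean: remove duplicates, fill missing amounts.
    events = events.drop_duplicates().fillna({"amount": 0.0})

    # 3. Compute per-user features.
    features = (
        events.groupby("user_id")
        .agg(purchase_count=("event_ts", "count"), total_amount=("amount", "sum"))
        .reset_index()
        .assign(feature_date=yesterday)
    )

    # 4. Write to the offline feature store (date-partitioned Parquet here).
    out_dir = Path(store_root) / "user_daily_features"
    out_dir.mkdir(parents=True, exist_ok=True)
    features.to_parquet(out_dir / f"{yesterday}.parquet", index=False)
    return features


if __name__ == "__main__":
    # Triggered by the scheduler, e.g. a fixed 2 AM daily run.
    print(run_daily_batch(date.today()))
```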
Streaming Ingestion
Streaming ingestion processes data in real time as events occur. It is required when features must reflect recent user activity.
| Use Case | Description |
|---|---|
| Real-time features | User clicks in last 5 minutes |
| Event-driven updates | Immediate propagation of changes |
| Low-latency serving | When batch staleness affects predictions |
Technologies:
- Apache Kafka for event streaming
- Apache Flink for stream processing
- Amazon Kinesis for AWS-native streaming
Streaming Pipeline Process:
For each incoming event, update real-time features immediately (see the sketch after this list):
- Increment counters (e.g., click_count for the user)
- Update timestamps (e.g., last_active time)
- Maintain rolling aggregations (e.g., clicks in last 5 minutes)
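A minimal in-process sketch of these updates using in-memory state; in a real pipeline this logic would run inside a stream processor (e.g., Flink) consuming from Kafka and persist its state to an online store rather than local dictionaries.

```python
import time
from collections import defaultdict, deque


class RealTimeFeatures:
    """Per-user streaming features: counters, timestamps, rolling windows."""

    WINDOW_SECONDS = 5 * 60  # rolling 5-minute window

    def __init__(self):
        self.click_count = defaultdict(int)      # lifetime counter per user
        self.last_active = {}                    # latest event timestamp per user
        self.recent_clicks = defaultdict(deque)  # event timestamps inside the window

    def on_event(self, user_id: str, event_ts: float) -> None:
        # Increment the counter.
        self.click_count[user_id] += 1
        # Update the last-active timestamp.
        self.last_active[user_id] = event_ts
        # Maintain the rolling aggregation: append, then evict expired entries.
        window = self.recent_clicks[user_id]
        window.append(event_ts)
        while window and window[0] < event_ts - self.WINDOW_SECONDS:
            window.popleft()

    def clicks_last_5_min(self, user_id: str) -> int:
        # Note: eviction happens on write in this sketch, so this count can
        # lag slightly if no new events have arrived for the user.
        return len(self.recent_clicks[user_id])


features = RealTimeFeatures()
now = time.time()
features.on_event("u1", now - 400)       # outside the 5-minute window by the end
features.on_event("u1", now - 10)
features.on_event("u1", now)
print(features.click_count["u1"])        # 3 lifetime clicks
print(features.clicks_last_5_min("u1"))  # 2 clicks within the window
```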
Data Processing
ETL vs ELT
| Approach | Description | Use Case |
|---|---|---|
| ETL | Transform before loading | Structured data with known schemas |
| ELT | Load then transform | Exploratory analysis, schema evolution |
Common Transformations
Data cleaning (an example follows this list):
- Missing value handling
- Duplicate removal
- Data type corrections
- Outlier detection
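A brief pandas sketch of the type-correction and outlier-detection steps. The data and column names are illustrative, and the IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["1", "2", "2", "3"],
    "price": ["19.99", "250000", "250000", "24.50"],  # numeric values stored as strings
})

# Data type corrections: cast strings to their intended numeric types.
df["user_id"] = df["user_id"].astype("int64")
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Duplicate removal.
df = df.drop_duplicates()

# Outlier detection with a simple IQR rule.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
print(df[outliers])  # rows flagged for review rather than silently dropped
```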
Data validation checks (a runnable sketch follows the table):
| Check Type | Description | Example |
|---|---|---|
| Schema validation | Columns match expected schema | Verify all expected columns are present |
| Range validation | Values fall within valid ranges | Age between 0 and 120 |
| Null checks | Required fields are not null | user_id must not be null |
| Statistical checks | Aggregate statistics are reasonable | Mean price is positive |
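A plain-pandas sketch of these four check types, not tied to any particular validation library; the column names and bounds mirror the examples in the table.

```python
import pandas as pd


def validate(df: pd.DataFrame) -> list[str]:
    failures = []

    # Schema validation: all expected columns are present.
    expected = {"user_id", "age", "price"}
    missing = expected - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures  # remaining checks assume the columns exist

    # Null check: required field must not be null.
    if df["user_id"].isna().any():
        failures.append("user_id contains nulls")

    # Range validation: age within [0, 120].
    if not df["age"].between(0, 120).all():
        failures.append("age outside [0, 120]")

    # Statistical check: aggregate statistic is reasonable.
    if df["price"].mean() <= 0:
        failures.append("mean price is not positive")

    return failures


df = pd.DataFrame({"user_id": [1, 2, None], "age": [34, 150, 28], "price": [9.9, 20.0, 5.5]})
print(validate(df))  # ['user_id contains nulls', 'age outside [0, 120]']
```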
Feature computation (illustrated in code after this list):
- Aggregations (count, sum, average)
- Window functions (rolling averages)
- Cross-source joins
- Encoding transformations
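A short pandas sketch covering one example of each: an aggregation, a time-based rolling window, a join across two sources, and a one-hot encoding. The frames and column names are made up for illustration.

```python
import pandas as pd

clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 10:02",
                          "2024-05-01 10:20", "2024-05-01 10:05"]),
    "click": 1,
})
profiles = pd.DataFrame({"user_id": [1, 2], "country": ["DE", "US"]})

# Aggregation: total clicks per user.
totals = clicks.groupby("user_id").size().rename("total_clicks").reset_index()

# Window function: clicks in a rolling 5-minute window per user.
rolling = (
    clicks.set_index("ts")
    .groupby("user_id")["click"]
    .rolling("5min").sum()
    .rename("clicks_5min")
    .reset_index()
)

# Cross-source join: attach profile attributes to the aggregates.
features = totals.merge(profiles, on="user_id", how="left")

# Encoding transformation: one-hot encode the categorical country column.
features = pd.get_dummies(features, columns=["country"], prefix="country")
print(features)
print(rolling)
```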
Feature Store
A feature store provides centralized storage and serving for features. It ensures consistency between training and serving, enables feature reuse, and maintains point-in-time correctness.
Benefits
| Benefit | Description |
|---|---|
| Feature reuse | Compute once, use across multiple models |
| Consistency | Training and serving use identical feature computation code |
| Point-in-time correctness | Features reflect their state at event time, not query time |
| Discovery | Teams can find and reuse existing features |
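Point-in-time correctness is the subtlest of these benefits. A minimal pandas sketch of a point-in-time join with `merge_asof`: each training event picks up the most recent feature value known at or before the event time, never a later one. The data is illustrative.

```python
import pandas as pd

# Label events (when predictions would have been made).
events = pd.DataFrame({
    "user_id": [1, 1],
    "event_ts": pd.to_datetime(["2024-05-01", "2024-06-01"]),
    "label": [0, 1],
})

# Feature values with the time at which each value became known.
features = pd.DataFrame({
    "user_id": [1, 1, 1],
    "feature_ts": pd.to_datetime(["2024-04-15", "2024-05-20", "2024-06-10"]),
    "total_purchases": [3, 5, 9],
})

# Point-in-time join: for each event, take the latest feature value
# observed at or before event_ts, never a future value.
training = pd.merge_asof(
    events.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
print(training[["user_id", "event_ts", "total_purchases", "label"]])
# The 2024-05-01 row gets 3 (from 2024-04-15); the 2024-06-01 row gets 5, not 9.
```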
Architecture
Feature Types
| Type | Computation | Storage | Example |
|---|---|---|---|
| Batch | Scheduled jobs | Offline store | User lifetime value |
| Streaming | Real-time | Online store | Recent click count |
| On-demand | Request time | Computed at request time, not stored | Current time features |
Feature Definition
A feature group specification includes the following (a sketch follows the table):
| Component | Description | Example |
|---|---|---|
| Name | Identifier for the feature group | user_features |
| Entity | Primary key for feature lookup | user_id |
| Features | List of feature names and types | total_purchases (INT64), avg_order_value (DOUBLE) |
| Online flag | Whether to materialize to online store | True for real-time serving |
| TTL | How long features remain valid | 24 hours |
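A hedged sketch of such a specification as a plain Python dataclass. Real feature-store frameworks have their own definition APIs, so the class and field names here are illustrative only.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class FeatureGroup:
    name: str                             # identifier for the feature group
    entity: str                           # primary key used for feature lookup
    features: dict                        # feature name -> type
    online: bool = False                  # materialize to the online store?
    ttl: timedelta = timedelta(hours=24)  # how long feature values remain valid


user_features = FeatureGroup(
    name="user_features",
    entity="user_id",
    features={"total_purchases": "INT64", "avg_order_value": "DOUBLE"},
    online=True,                          # needed for real-time serving
    ttl=timedelta(hours=24),
)
print(user_features)
```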
Data Quality
Validation Dimensions
| Dimension | Description |
|---|---|
| Completeness | Null rate within expected bounds |
| Accuracy | Values reflect reality |
| Consistency | Same entity has consistent values across sources |
| Timeliness | Data freshness meets requirements |
| Uniqueness | No unintended duplicates |
Validation Pipeline
Define an expectation suite with validation rules:
| Expectation | Description |
|---|---|
| Column not null | user_id cannot be null |
| Column value range | age must be between 0 and 120 |
| Column value set | country must be in list of valid countries |
| Table row count | Row count between 1,000 and 1,000,000 |
When validating a batch, run all expectations and alert the data team if any validation fails. Failed validations should block downstream processing until resolved; a minimal runner is sketched below.
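A framework-agnostic sketch of such a suite as plain Python predicates over a pandas DataFrame; a dedicated validation library would express the same rules declaratively. The country list and the alerting function are stand-ins.

```python
import pandas as pd

VALID_COUNTRIES = {"US", "DE", "FR"}  # illustrative value set

# Expectation suite: (name, predicate over the batch) pairs.
EXPECTATIONS = [
    ("user_id not null", lambda df: df["user_id"].notna().all()),
    ("age in [0, 120]", lambda df: df["age"].between(0, 120).all()),
    ("country in valid set", lambda df: df["country"].isin(VALID_COUNTRIES).all()),
    ("row count in [1k, 1M]", lambda df: 1_000 <= len(df) <= 1_000_000),
]


def alert_data_team(failed: list) -> None:
    # Stand-in for a pager or chat alert integration.
    print(f"ALERT: failed expectations: {failed}")


def validate_batch(df: pd.DataFrame) -> None:
    failed = [name for name, check in EXPECTATIONS if not check(df)]
    if failed:
        alert_data_team(failed)
        # Raising blocks downstream processing until the failure is resolved.
        raise ValueError(f"Validation failed: {failed}")
```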
Monitoring Metrics
| Metric | Purpose |
|---|---|
| Missing value rates | Detect data source issues |
| Schema changes | Identify upstream modifications |
| Distribution shifts | Detect drift |
| Volume anomalies | Identify pipeline failures |
Data Challenges
Missing Data
| Strategy | Use Case |
|---|---|
| Drop rows | Few missing values, random missingness |
| Impute mean/median | Numerical features, random missingness |
| Impute mode | Categorical features |
| Model-based imputation | Complex patterns in missingness |
| Create indicator | Missingness is informative |
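A short scikit-learn sketch combining two of these strategies: median imputation plus a missingness indicator, which preserves the signal when missingness itself is informative. The toy array is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 3.0],
              [np.nan, 1.0],
              [40.0, np.nan]])

# Median imputation; add_indicator appends a binary "was missing" column
# for each feature that contained missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# Two imputed feature columns followed by two missingness-indicator columns.
```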
Class Imbalance
| Strategy | Description |
|---|---|
| Oversampling | Duplicate minority samples or synthesize new ones (e.g., SMOTE) |
| Undersampling | Remove majority samples |
| Class weights | Adjust loss function weights |
| Threshold tuning | Adjust decision boundary |
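A brief scikit-learn sketch of the class-weight strategy, which reweights the loss rather than resampling; the synthetic imbalanced dataset is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Imbalanced toy data: only a few percent positive labels.
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(scale=0.5, size=1000)) > 1.8).astype(int)

# 'balanced' weights each class inversely proportional to its frequency,
# so minority-class errors contribute more to the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # probabilities for the positive class
```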
Data Leakage
Data leakage occurs when information unavailable at prediction time is used during training.
| Cause | Description |
|---|---|
| Future information | Using data from after the prediction time |
| Target leakage | Features derived from the label |
| Train-test contamination | Preprocessing (e.g., normalization) fit on the full dataset, including test data |
Prevention (see the sketch after this list):
- Use time-based splits for temporal data
- Verify each feature can be computed before outcome observation
- Fit transformations on training data only
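A compact scikit-learn sketch of the first and last points: split on time rather than at random, then fit the scaler on the training portion only and reuse it unchanged on the held-out period. The dataset and cutoff date are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy temporal dataset, ordered by event time.
df = pd.DataFrame({
    "event_ts": pd.date_range("2024-01-01", periods=100, freq="D"),
    "feature": np.arange(100, dtype=float),
    "label": np.random.default_rng(0).integers(0, 2, 100),
})

# Time-based split: everything before the cutoff trains, the rest evaluates.
cutoff = pd.Timestamp("2024-03-01")
train = df[df["event_ts"] < cutoff]
test = df[df["event_ts"] >= cutoff]

# Fit the transformation on training data only; apply it unchanged to test data.
scaler = StandardScaler().fit(train[["feature"]])
train_scaled = scaler.transform(train[["feature"]])
test_scaled = scaler.transform(test[["feature"]])  # no refit: avoids train-test contamination
```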
Reference
| Topic | Approach |
|---|---|
| Feature consistency | Feature store with shared code |
| Late-arriving data | Backfill pipelines, retraining triggers, or robust feature designs |
| Data validation | Schema checks, statistical tests, alerts on anomalies |
| Real-time features | Streaming pipeline writing to key-value store |
| Schema evolution | Backwards-compatible changes, versioned schemas, migration scripts |