Batch vs Streaming
Batch processing operates on complete, bounded datasets. Stream processing operates on continuous, unbounded data flows. The choice between them depends on latency requirements, data-completeness needs, and tolerance for operational complexity.
Comparison
| Aspect | Batch | Streaming |
|---|---|---|
| Data | Complete, bounded | Continuous, unbounded |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | High | Moderate |
| Complexity | Lower | Higher |
| Failure recovery | Rerun job | Checkpoints, replay |
Batch: Process accumulated data (e.g., yesterday's transactions). Streaming: Process events as they occur (e.g., real-time fraud detection).
Batch Processing
Architecture
Batch processing reads a complete input dataset, transforms it in bulk, and writes a complete output.
The MapReduce model:
- Map: Transform each record independently
- Shuffle: Group by key, distribute to reducers
- Reduce: Aggregate each group
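A minimal pure-Python sketch of the three phases, using the canonical word-count example (a real framework distributes the shuffle across machines):

```python
from collections import defaultdict

def map_phase(records):
    # Map: transform each record independently into (key, value) pairs
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key (a real framework also
    # partitions the groups across reducer machines)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group to a single result
    return {key: sum(values) for key, values in groups.items()}

records = ["to be or not to be"]
print(reduce_phase(shuffle_phase(map_phase(records))))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```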
Modern frameworks (Spark, Hive, Presto) provide SQL-like interfaces over this model.
Characteristics
| Characteristic | Description |
|---|---|
| Throughput | Bulk operations enable optimization (sequential reads, sorted joins, columnar formats) |
| Simplicity | Complete input available; no late data or ordering concerns |
| Determinism | Same input produces same output; enables testing and debugging |
Use Cases
| Use Case | Rationale |
|---|---|
| Daily ETL pipelines | Data arrives in batches |
| ML model training | Requires complete dataset |
| Historical analysis | Queries over past data |
| Reporting | Scheduled, not latency-sensitive |
| Data warehouse loading | Bulk inserts are efficient |
Batch processing is appropriate when latency tolerance permits and complete data is required.
Stream Processing
Architecture
Stream processors consume events continuously and emit results incrementally, rather than waiting for a complete input.
Challenges: the input never terminates, and events may arrive late or out of order.
Key Concepts
Event time vs Processing time:
| Concept | Definition |
|---|---|
| Event time | Timestamp when event occurred |
| Processing time | Timestamp when system processed event |
Example: A mobile app event generated at 2:00 PM may arrive at 2:15 PM due to a connectivity delay. Windowing by processing time would assign it to the 2:15 PM window instead of the 2:00 PM window where it occurred.
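A small sketch of the distinction, with hypothetical field names:

```python
from datetime import datetime, timedelta

def window_start(ts: datetime) -> datetime:
    # Truncate a timestamp to the start of its 5-minute window
    return ts - timedelta(minutes=ts.minute % 5,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

# Event occurred at 2:00 PM but arrived at 2:15 PM
event = {"event_time": datetime(2024, 1, 1, 14, 0, 30),
         "processing_time": datetime(2024, 1, 1, 14, 15, 2)}

print(window_start(event["event_time"]))       # 14:00 window (correct)
print(window_start(event["processing_time"]))  # 14:15 window (wrong)
```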
Windowing:
| Window Type | Behavior |
|---|---|
| Tumbling | Fixed buckets (e.g., every 5 minutes) |
| Sliding | Overlapping windows (e.g., 5-minute window, updated every minute) |
| Session | Based on activity gaps (e.g., 30-minute inactivity ends session) |
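A sketch of how each window type maps an event timestamp to its window(s), using integer epoch seconds:

```python
def tumbling(ts: int, size: int = 300) -> tuple:
    # Tumbling: each event falls in exactly one fixed bucket
    start = ts - ts % size
    return (start, start + size)

def sliding(ts: int, size: int = 300, step: int = 60) -> list:
    # Sliding: every window [start, start + size) containing ts,
    # with window starts aligned to multiples of `step`
    latest = ts - ts % step
    return [(s, s + size) for s in range(latest, ts - size, -step)][::-1]

def sessionize(timestamps: list, gap: int = 1800) -> list:
    # Session: silence longer than `gap` seconds closes the session;
    # assumes a non-empty input
    ordered = sorted(timestamps)
    sessions, current = [], [ordered[0]]
    for ts in ordered[1:]:
        if ts - current[-1] > gap:
            sessions.append((current[0], current[-1]))
            current = [ts]
        else:
            current.append(ts)
    sessions.append((current[0], current[-1]))
    return sessions

print(tumbling(1230))              # (1200, 1500): one 5-minute bucket
print(sliding(1230))               # five overlapping 5-minute windows
print(sessionize([0, 600, 9000]))  # [(0, 600), (9000, 9000)]
```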
Watermarks: System estimate of event time progress. Events arriving after the watermark are considered "late."
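A common heuristic is to set the watermark to the maximum event time seen so far minus an allowed delay; a minimal sketch:

```python
class BoundedLatenessWatermark:
    """Heuristic watermark: max event time seen minus a fixed delay."""

    def __init__(self, max_delay_seconds: int):
        self.max_delay = max_delay_seconds
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> bool:
        """Record an event; return True if it is late (behind the watermark)."""
        is_late = event_time < self.watermark()
        self.max_event_time = max(self.max_event_time, event_time)
        return is_late

    def watermark(self) -> float:
        return self.max_event_time - self.max_delay

wm = BoundedLatenessWatermark(max_delay_seconds=60)
wm.observe(1000)         # watermark advances to 940
print(wm.observe(900))   # True: 900 < 940, considered late
print(wm.observe(1005))  # False: ahead of the watermark
```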
Use Cases
| Use Case | Rationale |
|---|---|
| Fraud detection | Immediate action required |
| Real-time dashboards | Users expect live data |
| Alerting | Time-sensitive notifications |
| User session analysis | State tracking during activity |
| Change data capture | Real-time database synchronization |
Streaming is appropriate when sub-minute latency is a business requirement.
Architectural Patterns
Lambda Architecture
Parallel batch and streaming paths with merged results.
Components:
- Batch layer: Processes all historical data periodically
- Speed layer: Processes recent data in real-time
- Serving layer: Merges results from both layers
Advantages: Accurate historical results plus fast recent results.
Disadvantages: Two codebases for similar logic. Result merging complexity. Debugging difficulty.
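A toy sketch of the serving layer's merge, assuming the batch view is recomputed nightly and the speed view covers only events since the last batch run:

```python
# Batch view: page-view counts through the end of yesterday (recomputed nightly)
batch_view = {"/home": 10_400, "/pricing": 2_310}

# Speed view: incremental counts for today only, updated per event
speed_view = {"/home": 37, "/checkout": 5}

def serve(page: str) -> int:
    # Serving layer: merge the accurate-but-stale batch result
    # with the fresh-but-partial speed result
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve("/home"))      # 10437
print(serve("/checkout"))  # 5
```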
Kappa Architecture
Streaming-only architecture. Historical reprocessing via replay.
Components:
- All data flows through durable log (Kafka)
- Stream processor reads from log
- Reprocessing: Start a new consumer from the beginning of the log
Advantages: Single codebase. Simpler mental model.
Disadvantages: Full history replay can be slow. Log storage costs.
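A sketch of the reprocessing step using the kafka-python client, assuming a broker at localhost:9092 and a hypothetical topic named events; a consumer with a fresh group ID reading from the earliest offset replays the full retained history:

```python
from kafka import KafkaConsumer  # pip install kafka-python

def process(value: bytes) -> None:
    # Stand-in for the stream job's actual (possibly updated) logic
    print(value)

# A fresh group_id plus auto_offset_reset="earliest" starts this
# consumer at the beginning of the log, replaying all retained history
consumer = KafkaConsumer(
    "events",                           # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="reprocess-v2",            # new group => no committed offsets
    auto_offset_reset="earliest",
)

for message in consumer:
    process(message.value)
```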
Selection Criteria
| Factor | Lambda | Kappa |
|---|---|---|
| Team size | Larger (maintains two systems) | Smaller |
| Reprocessing needs | Frequent, must be fast | Rare, can be slow |
| Query patterns | Complex ad-hoc queries | Well-defined queries |
| Complexity tolerance | High | Low |
Kappa architecture is generally preferred for simplicity. Add batch capabilities if specific limitations arise.
Delivery Guarantees
| Guarantee | Definition | Trade-off |
|---|---|---|
| At-most-once | Messages may be lost, never duplicated | Fast, simple |
| At-least-once | Messages never lost, may be duplicated | Requires idempotent consumers |
| Exactly-once | No loss, no duplicates | Complex, slower |
At-most-once: No retry on failure. Data loss possible.
At-least-once: Retry until acknowledgment. Duplicates possible if failure occurs after processing but before acknowledgment.
Exactly-once: Requires transactions (Kafka), checkpointing (Flink), or sink deduplication.
Most production systems implement at-least-once delivery with idempotent operations. Exactly-once adds latency and complexity.
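A sketch of this pattern with hypothetical message fields: redelivery is tolerated because processing is keyed on a stable message ID:

```python
processed_ids = set()  # in production: a persistent store, e.g. a DB table

def apply_side_effects(message: dict) -> None:
    print("charged account", message["account"], message["amount"])

def handle(message: dict) -> None:
    # At-least-once delivery means this may run twice for the same
    # message; the ID check makes the second run a no-op. In production,
    # the side effect and the ID write should share one transaction.
    if message["id"] in processed_ids:
        return
    apply_side_effects(message)
    processed_ids.add(message["id"])

handle({"id": "tx-1", "account": "a1", "amount": 40})
handle({"id": "tx-1", "account": "a1", "amount": 40})  # duplicate: ignored
```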
Technologies
| Type | Examples | Use Case |
|---|---|---|
| Batch | Spark, Hive, Presto, Trino | High-throughput SQL analytics |
| Streaming | Kafka Streams, Flink, Spark Streaming | Low-latency event processing |
| Unified | Apache Beam, Dataflow | Same code for batch and streaming |
| Log | Kafka, Pulsar | Durable message storage, replay |
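As an illustration of the unified model, a minimal Apache Beam (Python SDK) pipeline; the transforms are source-agnostic, so swapping the bounded source for an unbounded one (e.g., Pub/Sub) turns the same code into a streaming job:

```python
import apache_beam as beam  # pip install apache-beam

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        # Bounded source: batch execution. An unbounded source
        # (e.g., ReadFromPubSub) would run the same transforms as a stream.
        | beam.Create(["click home", "click pricing", "click home"])
        | beam.Map(lambda line: (line.split()[1], 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print)  # ('home', 2), ('pricing', 1)
    )
```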
Selection Guidelines
| Requirement | Technology |
|---|---|
| SQL analytics on large datasets | Spark SQL or Presto |
| Complex stateful stream processing | Flink |
| Kafka-based, simple transformations | Kafka Streams |
| GCP environment | Dataflow (Beam) |
| Combined batch and streaming | Spark (handles both) |
Interview Questions
How do you handle late-arriving data?
Use event-time processing with watermarks. Define a lateness threshold appropriate to the use case (e.g., accept data up to 1 hour after the window closes).
Late event options:
- Drop (simple, data loss)
- Trigger window recomputation (correct, expensive)
- Send to side output for separate handling (flexible)
Trade-off: Completeness versus latency and complexity.
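A sketch of the second and third options combined, assuming integer epoch-second timestamps: window state is retained for an allowed-lateness period so a late event can trigger a recomputed result, and anything older goes to a side output:

```python
from collections import defaultdict

ALLOWED_LATENESS = 3600  # accept data up to 1 hour after the window closes
WINDOW = 300

window_counts = defaultdict(int)  # window start -> count (kept past closing)
side_output = []                  # events too late even for recomputation

def on_event(event_time: int, watermark: int) -> None:
    start = event_time - event_time % WINDOW
    window_end = start + WINDOW
    if watermark <= window_end:
        window_counts[start] += 1       # on time: normal path
    elif watermark <= window_end + ALLOWED_LATENESS:
        window_counts[start] += 1       # late: recompute and re-emit
        print("updated result for window", start, "->", window_counts[start])
    else:
        side_output.append(event_time)  # too late: route to side output

on_event(event_time=900, watermark=800)    # on time
on_event(event_time=900, watermark=1400)   # late, within bound: re-emitted
on_event(event_time=900, watermark=99999)  # far too late: side output
```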
When would you choose Kappa over Lambda?
Kappa is the default choice due to simplicity (single codebase, single mental model).
Add a batch layer (Lambda) when:
- Reprocessing history is prohibitively slow in streaming
- Ad-hoc SQL queries over historical data are required
- Cost difference between batch and streaming is significant
How do you achieve exactly-once processing?
End-to-end exactly-once requires:
- Source: Replayable input (Kafka retains messages, so consumers can re-read from a stored offset)
- Processing: Atomic checkpoints (Flink) or transactions (Kafka Streams)
- Sink: Idempotent writes or transactional commits
In practice, design for at-least-once with idempotent downstream processing.
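A sketch of an idempotent sink using SQLite's upsert (requires SQLite >= 3.24); replaying the same keyed write any number of times leaves the table unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (key TEXT PRIMARY KEY, value INTEGER)")

def idempotent_write(key: str, value: int) -> None:
    # Upsert keyed on the primary key: a redelivered write overwrites
    # with the same value instead of inserting a duplicate row
    conn.execute(
        "INSERT INTO results (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )

idempotent_write("window-1400", 42)
idempotent_write("window-1400", 42)  # replay after failure: no duplicate
print(conn.execute("SELECT * FROM results").fetchall())
# [('window-1400', 42)]
```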
Summary
| Consideration | Guidance |
|---|---|
| Latency vs Throughput | Batch for throughput, streaming for latency |
| Time semantics | Event time processing over processing time |
| Architecture | Start with Kappa, add batch if needed |
| Delivery guarantee | At-least-once with idempotent operations |
| Cost factors | Include development, operations, compute, and storage |
| Default choice | Batch if latency permits; simpler implementation |