Batch vs Streaming
Batch processing operates on complete, bounded datasets. Stream processing operates on continuous, unbounded data flows. The choice between them depends on latency requirements, data-completeness needs, and tolerance for operational complexity.
Comparison
| Aspect | Batch | Streaming |
|---|---|---|
| Data | Complete, bounded | Continuous, unbounded |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | High | Moderate |
| Complexity | Lower | Higher |
| Failure recovery | Rerun job | Checkpoints, replay |
Batch: Process accumulated data (e.g., yesterday's transactions). Streaming: Process events as they occur (e.g., real-time fraud detection).
Batch Processing
Architecture
Batch processing reads a complete input dataset, transforms it in bulk, and writes a complete output.
The MapReduce model:
- Map: Transform each record independently
- Shuffle: Group by key, distribute to reducers
- Reduce: Aggregate each group
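A minimal pure-Python sketch of the three phases, using the canonical word-count example (a real framework distributes the shuffle across machines):

```python
from collections import defaultdict

def map_phase(records):
    # Map: transform each record independently into (key, value) pairs
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key (a real framework also
    # partitions the groups across reducer machines)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group to a single result
    return {key: sum(values) for key, values in groups.items()}

records = ["to be or not to be"]
print(reduce_phase(shuffle_phase(map_phase(records))))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```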
Modern frameworks (Spark, Hive, Presto) provide SQL-like interfaces over this model.
Characteristics
| Characteristic | Description |
|---|---|
| Throughput | Bulk operations enable optimization (sequential reads, sorted joins, columnar formats) |
| Simplicity | Complete input available; no late data or ordering concerns |
| Determinism | Same input produces same output; enables testing and debugging |
Use Cases
| Use Case | Rationale |
|---|---|
| Daily ETL pipelines | Data arrives in batches |
| ML model training | Requires complete dataset |
| Historical analysis | Queries over past data |
| Reporting | Scheduled, not latency-sensitive |
| Data warehouse loading | Bulk inserts are efficient |
Batch processing is appropriate when latency tolerance permits and complete data is required.
Stream Processing
Architecture
Stream processors consume events continuously and emit results incrementally, rather than waiting for a complete input.
Challenges: the input never terminates, and events may arrive late or out of order.
Key Concepts
Event time vs Processing time:
| Concept | Definition |
|---|---|
| Event time | Timestamp when event occurred |
| Processing time | Timestamp when system processed event |
Example: A mobile app event generated at 2:00 PM may arrive at 2:15 PM due to a connectivity delay. Windowing by processing time would assign it to the 2:15 PM window instead of the 2:00 PM window where it occurred.
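A small sketch of the distinction, with hypothetical field names:

```python
from datetime import datetime, timedelta

def window_start(ts: datetime) -> datetime:
    # Truncate a timestamp to the start of its 5-minute window
    return ts - timedelta(minutes=ts.minute % 5,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

# Event occurred at 2:00 PM but arrived at 2:15 PM
event = {"event_time": datetime(2024, 1, 1, 14, 0, 30),
         "processing_time": datetime(2024, 1, 1, 14, 15, 2)}

print(window_start(event["event_time"]))       # 14:00 window (correct)
print(window_start(event["processing_time"]))  # 14:15 window (wrong)
```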
Windowing:
| Window Type | Behavior |
|---|---|
| Tumbling | Fixed buckets (e.g., every 5 minutes) |
| Sliding | Overlapping windows (e.g., 5-minute window, updated every minute) |
| Session | Based on activity gaps (e.g., 30-minute inactivity ends session) |
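A sketch of how each window type maps an event timestamp to its window(s), using integer epoch seconds:

```python
def tumbling(ts: int, size: int = 300) -> tuple:
    # Tumbling: each event falls in exactly one fixed bucket
    start = ts - ts % size
    return (start, start + size)

def sliding(ts: int, size: int = 300, step: int = 60) -> list:
    # Sliding: every window [start, start + size) containing ts,
    # with window starts aligned to multiples of `step`
    latest = ts - ts % step
    return [(s, s + size) for s in range(latest, ts - size, -step)][::-1]

def sessionize(timestamps: list, gap: int = 1800) -> list:
    # Session: silence longer than `gap` seconds closes the session;
    # assumes a non-empty input
    ordered = sorted(timestamps)
    sessions, current = [], [ordered[0]]
    for ts in ordered[1:]:
        if ts - current[-1] > gap:
            sessions.append((current[0], current[-1]))
            current = [ts]
        else:
            current.append(ts)
    sessions.append((current[0], current[-1]))
    return sessions

print(tumbling(1230))              # (1200, 1500): one 5-minute bucket
print(sliding(1230))               # five overlapping 5-minute windows
print(sessionize([0, 600, 9000]))  # [(0, 600), (9000, 9000)]
```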
Watermarks: System estimate of event time progress. Events arriving after the watermark are considered "late."
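A common heuristic is to set the watermark to the maximum event time seen so far minus an allowed delay; a minimal sketch:

```python
class BoundedLatenessWatermark:
    """Heuristic watermark: max event time seen minus a fixed delay."""

    def __init__(self, max_delay_seconds: int):
        self.max_delay = max_delay_seconds
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> bool:
        """Record an event; return True if it is late (behind the watermark)."""
        is_late = event_time < self.watermark()
        self.max_event_time = max(self.max_event_time, event_time)
        return is_late

    def watermark(self) -> float:
        return self.max_event_time - self.max_delay

wm = BoundedLatenessWatermark(max_delay_seconds=60)
wm.observe(1000)         # watermark advances to 940
print(wm.observe(900))   # True: 900 < 940, considered late
print(wm.observe(1005))  # False: ahead of the watermark
```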
Use Cases
| Use Case | Rationale |
|---|---|
| Fraud detection | Immediate action required |
| Real-time dashboards | Users expect live data |
| Alerting | Time-sensitive notifications |
| User session analysis | State tracking during activity |
| Change data capture | Real-time database synchronization |
Streaming is appropriate when sub-minute latency is a business requirement.
Architectural Patterns
Lambda Architecture
Parallel batch and streaming paths with merged results.
Components:
- Batch layer: Processes all historical data periodically
- Speed layer: Processes recent data in real-time
- Serving layer: Merges results from both layers
Advantages: Accurate historical results plus fast recent results.
Disadvantages: Two codebases for similar logic. Result merging complexity. Debugging difficulty.
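A toy sketch of the serving layer's merge, assuming the batch view is recomputed nightly and the speed view covers only events since the last batch run:

```python
# Batch view: page-view counts through the end of yesterday (recomputed nightly)
batch_view = {"/home": 10_400, "/pricing": 2_310}

# Speed view: incremental counts for today only, updated per event
speed_view = {"/home": 37, "/checkout": 5}

def serve(page: str) -> int:
    # Serving layer: merge the accurate-but-stale batch result
    # with the fresh-but-partial speed result
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve("/home"))      # 10437
print(serve("/checkout"))  # 5
```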
Kappa Architecture
Streaming-only architecture. Historical reprocessing via replay.
Components:
- All data flows through durable log (Kafka)
- Stream processor reads from log
- Reprocessing: Start a new consumer from the beginning of the log
Advantages: Single codebase. Simpler mental model.
Disadvantages: Full history replay can be slow. Log storage costs.
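A sketch of the reprocessing step using the kafka-python client, assuming a broker at localhost:9092 and a hypothetical topic named events; a consumer with a fresh group ID reading from the earliest offset replays the full retained history:

```python
from kafka import KafkaConsumer  # pip install kafka-python

def process(value: bytes) -> None:
    # Stand-in for the stream job's actual (possibly updated) logic
    print(value)

# A fresh group_id plus auto_offset_reset="earliest" starts this
# consumer at the beginning of the log, replaying all retained history
consumer = KafkaConsumer(
    "events",                           # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="reprocess-v2",            # new group => no committed offsets
    auto_offset_reset="earliest",
)

for message in consumer:
    process(message.value)
```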
Selection Criteria
| Factor | Lambda | Kappa |
|---|---|---|
| Team size | Larger (maintains two systems) | Smaller |
| Reprocessing needs | Frequent, must be fast | Rare, can be slow |
| Query patterns | Complex ad-hoc queries | Well-defined queries |
| Complexity tolerance | High | Low |
Kappa architecture is generally preferred for simplicity. Add batch capabilities if specific limitations arise.
Delivery Guarantees
| Guarantee | Definition | Trade-off |
|---|---|---|
| At-most-once | Messages may be lost, never duplicated | Fast, simple |
| At-least-once | Messages never lost, may be duplicated | Requires idempotent consumers |
| Exactly-once | No loss, no duplicates | Complex, slower |
At-most-once: No retry on failure. Data loss possible.
At-least-once: Retry until acknowledgment. Duplicates possible if failure occurs after processing but before acknowledgment.
Exactly-once: Requires transactions (Kafka), checkpointing (Flink), or sink deduplication.
Most production systems implement at-least-once delivery with idempotent operations. Exactly-once adds latency and complexity.
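A sketch of this pattern with hypothetical message fields: redelivery is tolerated because processing is keyed on a stable message ID:

```python
processed_ids = set()  # in production: a persistent store, e.g. a DB table

def apply_side_effects(message: dict) -> None:
    print("charged account", message["account"], message["amount"])

def handle(message: dict) -> None:
    # At-least-once delivery means this may run twice for the same
    # message; the ID check makes the second run a no-op. In production,
    # the side effect and the ID write should share one transaction.
    if message["id"] in processed_ids:
        return
    apply_side_effects(message)
    processed_ids.add(message["id"])

handle({"id": "tx-1", "account": "a1", "amount": 40})
handle({"id": "tx-1", "account": "a1", "amount": 40})  # duplicate: ignored
```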
Technologies
| Type | Examples | Use Case |
|---|---|---|
| Batch | Spark, Hive, Presto, Trino | High-throughput SQL analytics |
| Streaming | Kafka Streams, Flink, Spark Streaming | Low-latency event processing |
| Unified | Apache Beam, Dataflow | Same code for batch and streaming |
| Log | Kafka, Pulsar | Durable message storage, replay |
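As an illustration of the unified model, a minimal Apache Beam (Python SDK) pipeline; the transforms are source-agnostic, so swapping the bounded source for an unbounded one (e.g., Pub/Sub) turns the same code into a streaming job:

```python
import apache_beam as beam  # pip install apache-beam

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        # Bounded source: batch execution. An unbounded source
        # (e.g., ReadFromPubSub) would run the same transforms as a stream.
        | beam.Create(["click home", "click pricing", "click home"])
        | beam.Map(lambda line: (line.split()[1], 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print)  # ('home', 2), ('pricing', 1)
    )
```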
Selection Guidelines
| Requirement | Technology |
|---|---|
| SQL analytics on large datasets | Spark SQL or Presto |
| Complex stateful stream processing | Flink |
| Kafka-based, simple transformations | Kafka Streams |
| GCP environment | Dataflow (Beam) |
| Combined batch and streaming | Spark (handles both) |
Interview Questions
How do you handle late-arriving data?
Use event-time processing with watermarks. Define a lateness threshold appropriate to the use case (e.g., accept data up to 1 hour after the window closes).
Late event options:
- Drop (simple, data loss)
- Trigger window recomputation (correct, expensive)
- Send to side output for separate handling (flexible)
Trade-off: Completeness versus latency and complexity.
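A sketch of the second and third options combined, assuming integer epoch-second timestamps: window state is retained for an allowed-lateness period so a late event can trigger a recomputed result, and anything older goes to a side output:

```python
from collections import defaultdict

ALLOWED_LATENESS = 3600  # accept data up to 1 hour after the window closes
WINDOW = 300

window_counts = defaultdict(int)  # window start -> count (kept past closing)
side_output = []                  # events too late even for recomputation

def on_event(event_time: int, watermark: int) -> None:
    start = event_time - event_time % WINDOW
    window_end = start + WINDOW
    if watermark <= window_end:
        window_counts[start] += 1       # on time: normal path
    elif watermark <= window_end + ALLOWED_LATENESS:
        window_counts[start] += 1       # late: recompute and re-emit
        print("updated result for window", start, "->", window_counts[start])
    else:
        side_output.append(event_time)  # too late: route to side output

on_event(event_time=900, watermark=800)    # on time
on_event(event_time=900, watermark=1400)   # late, within bound: re-emitted
on_event(event_time=900, watermark=99999)  # far too late: side output
```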
When would you choose Kappa over Lambda?
Kappa is the default choice due to simplicity (single codebase, single mental model).
Add a batch layer (Lambda) when:
- Reprocessing history is prohibitively slow in streaming
- Ad-hoc SQL queries over historical data are required
- Cost difference between batch and streaming is significant
How do you achieve exactly-once processing?
End-to-end exactly-once requires:
- Source: Replayable input (Kafka retains messages, so consumers can re-read from a stored offset)
- Processing: Atomic checkpoints (Flink) or transactions (Kafka Streams)
- Sink: Idempotent writes or transactional commits
In practice, design for at-least-once with idempotent downstream processing.
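A sketch of an idempotent sink using SQLite's upsert (requires SQLite >= 3.24); replaying the same keyed write any number of times leaves the table unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (key TEXT PRIMARY KEY, value INTEGER)")

def idempotent_write(key: str, value: int) -> None:
    # Upsert keyed on the primary key: a redelivered write overwrites
    # with the same value instead of inserting a duplicate row
    conn.execute(
        "INSERT INTO results (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )

idempotent_write("window-1400", 42)
idempotent_write("window-1400", 42)  # replay after failure: no duplicate
print(conn.execute("SELECT * FROM results").fetchall())
# [('window-1400', 42)]
```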
Summary
| Consideration | Guidance |
|---|---|
| Latency vs Throughput | Batch for throughput, streaming for latency |
| Time semantics | Event time processing over processing time |
| Architecture | Start with Kappa, add batch if needed |
| Delivery guarantee | At-least-once with idempotent operations |
| Cost factors | Include development, operations, compute, and storage |
| Default choice | Batch if latency permits; simpler implementation |