
Batch vs Streaming

Batch processing operates on complete, bounded datasets. Stream processing operates on continuous, unbounded data flows. The choice between them depends on latency requirements, data completeness needs, and tolerance for operational complexity.

Comparison

| Aspect | Batch | Streaming |
| --- | --- | --- |
| Data | Complete, bounded | Continuous, unbounded |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | High | Moderate |
| Complexity | Lower | Higher |
| Failure recovery | Rerun job | Checkpoints, replay |

Batch: Process accumulated data (e.g., yesterday's transactions). Streaming: Process events as they occur (e.g., real-time fraud detection).

Batch Processing

Architecture

Batch processing transforms input data in bulk: read the full input, apply transformations, write the complete output.

The MapReduce model:

  1. Map: Transform each record independently
  2. Shuffle: Group by key, distribute to reducers
  3. Reduce: Aggregate each group
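The three steps above can be sketched in plain Python as a toy word count, independent of any framework:

```python
from collections import defaultdict

def map_phase(records):
    # Map: transform each record independently into (key, value) pairs
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key (in a cluster, this distributes keys to reducers)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real cluster the shuffle is the expensive step: it sorts and moves data across the network so that all values for one key land on the same reducer.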

Modern frameworks (Spark, Hive, Presto) provide SQL-like interfaces over this model.

Characteristics

| Characteristic | Description |
| --- | --- |
| Throughput | Bulk operations enable optimization (sequential reads, sorted joins, columnar formats) |
| Simplicity | Complete input available; no late data or ordering concerns |
| Determinism | Same input produces same output; enables testing and debugging |

Use Cases

| Use Case | Rationale |
| --- | --- |
| Daily ETL pipelines | Data arrives in batches |
| ML model training | Requires complete dataset |
| Historical analysis | Queries over past data |
| Reporting | Scheduled, not latency-sensitive |
| Data warehouse loading | Bulk inserts are efficient |

Batch processing is appropriate when latency tolerance permits and complete data is required.

Stream Processing

Architecture

Stream processors handle continuous data flows with immediate processing.


Challenges: the data never terminates, and events may arrive late or out of order.

Key Concepts

Event time vs Processing time:

| Concept | Definition |
| --- | --- |
| Event time | Timestamp when the event occurred |
| Processing time | Timestamp when the system processed the event |

Example: A mobile app event at 2:00 PM may arrive at 2:15 PM due to connectivity delay. Processing time would assign it to the wrong time window.
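A minimal sketch of why the distinction matters, assuming 15-minute tumbling windows (the window size is illustrative, not from any particular system):

```python
from datetime import datetime

def window_start(ts: datetime) -> datetime:
    # Align a timestamp to the start of its 15-minute window
    return ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)

event_time = datetime(2024, 1, 1, 14, 0)        # event occurred at 2:00 PM
processing_time = datetime(2024, 1, 1, 14, 15)  # arrived at 2:15 PM

# Windowing by event time places the event in the correct 2:00-2:15 bucket;
# windowing by processing time shifts it into the next bucket.
assert window_start(event_time) == datetime(2024, 1, 1, 14, 0)
assert window_start(processing_time) == datetime(2024, 1, 1, 14, 15)
```

Any aggregate keyed on processing time (e.g., "events per 15 minutes") would count this event in the wrong bucket.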

Windowing:

| Window Type | Behavior |
| --- | --- |
| Tumbling | Fixed buckets (e.g., every 5 minutes) |
| Sliding | Overlapping windows (e.g., 5-minute window, updated every minute) |
| Session | Based on activity gaps (e.g., 30-minute inactivity ends session) |
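Tumbling and sliding assignment can be sketched with integer timestamps (a simplification; real engines work with event-time instants):

```python
def tumbling_windows(ts, size):
    # Each timestamp belongs to exactly one fixed bucket
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    # Each timestamp belongs to every overlapping window that covers it
    first = ((ts - size) // slide + 1) * slide
    return [(start, start + size)
            for start in range(max(first, 0), ts + 1, slide)
            if start <= ts < start + size]

assert tumbling_windows(7, 5) == [(5, 10)]
# A size-5 window sliding by 1 covers timestamp 7 five times
assert sliding_windows(7, 5, 1) == [(3, 8), (4, 9), (5, 10), (6, 11), (7, 12)]
```

The overlap is why sliding windows cost more: each event contributes to size/slide windows instead of one.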

Watermarks: System estimate of event time progress. Events arriving after the watermark are considered "late."
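A minimal watermark sketch, assuming the common heuristic where the watermark trails the maximum observed event time by a fixed allowed lateness:

```python
class Watermark:
    """Estimates event-time progress as (max event time seen) - (allowed lateness)."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        # Advance on every incoming event; watermarks never move backward
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.allowed_lateness

    def is_late(self, event_time):
        # Events behind the watermark are considered late
        return event_time < self.current()

wm = Watermark(allowed_lateness=10)
for t in [100, 105, 112]:
    wm.observe(t)
assert wm.current() == 102
assert wm.is_late(95)        # behind the watermark: late
assert not wm.is_late(104)   # still within bounds
```

Once the watermark passes a window's end, the system can emit that window's result, accepting that anything later counts as late.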

Use Cases

| Use Case | Rationale |
| --- | --- |
| Fraud detection | Immediate action required |
| Real-time dashboards | Users expect live data |
| Alerting | Time-sensitive notifications |
| User session analysis | State tracking during activity |
| Change data capture | Real-time database synchronization |

Streaming is appropriate when sub-minute latency is a business requirement.

Architectural Patterns

Lambda Architecture

Parallel batch and streaming paths with merged results.


Components:

  • Batch layer: Processes all historical data periodically
  • Speed layer: Processes recent data in real-time
  • Serving layer: Merges results from both layers

Advantages: Accurate historical results plus fast recent results.

Disadvantages: Two codebases for similar logic. Result merging complexity. Debugging difficulty.

Kappa Architecture

Streaming-only architecture. Historical reprocessing via replay.


Components:

  • All data flows through durable log (Kafka)
  • Stream processor reads from log
  • Reprocessing: Start new consumer from beginning
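Reprocessing can be sketched with an in-memory stand-in for the durable log: a new consumer starts at offset 0 and rebuilds its state by replay (the Kafka equivalent is a fresh consumer group seeking to the beginning of the topic):

```python
class Log:
    """Append-only log; a toy stand-in for a Kafka topic."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]

def build_state(log, from_offset=0):
    # A consumer rebuilds its materialized view by replaying the log
    state = {}
    for key, value in log.read_from(from_offset):
        state[key] = value  # last-write-wins view
    return state

log = Log()
for rec in [("user:1", "alice"), ("user:2", "bob"), ("user:1", "alicia")]:
    log.append(rec)

# A new (v2) consumer replays from offset 0 and derives the same state
assert build_state(log) == {"user:1": "alicia", "user:2": "bob"}
```

Because the log is the source of truth, deploying new processing logic means starting a new consumer from the beginning and switching reads over once it catches up.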

Advantages: Single codebase. Simpler mental model.

Disadvantages: Full history replay can be slow. Log storage costs.

Selection Criteria

| Factor | Lambda | Kappa |
| --- | --- | --- |
| Team size | Larger (maintains two systems) | Smaller |
| Reprocessing needs | Frequent, must be fast | Rare, can be slow |
| Query patterns | Complex ad-hoc queries | Well-defined queries |
| Complexity tolerance | High | Low |

Kappa architecture is generally preferred for simplicity. Add batch capabilities if specific limitations arise.

Delivery Guarantees

| Guarantee | Definition | Trade-off |
| --- | --- | --- |
| At-most-once | Messages may be lost, never duplicated | Fast, simple |
| At-least-once | Messages never lost, may be duplicated | Requires idempotent consumers |
| Exactly-once | No loss, no duplicates | Complex, slower |

At-most-once: No retry on failure. Data loss possible.

At-least-once: Retry until acknowledgment. Duplicates possible if failure occurs after processing but before acknowledgment.
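The standard mitigation is an idempotent consumer: track processed message IDs so a redelivery has no effect. A minimal sketch (in production the seen-ID set lives in durable storage, often the same transactional store as the result):

```python
class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()  # must be durable in a real system
        self.total = 0

    def process(self, message_id, amount):
        # Skip duplicates produced by at-least-once retries
        if message_id in self.seen_ids:
            return
        self.seen_ids.add(message_id)
        self.total += amount

consumer = IdempotentConsumer()
consumer.process("m1", 10)
consumer.process("m2", 5)
consumer.process("m1", 10)  # redelivery after a missed acknowledgment
assert consumer.total == 15
```

With this pattern, at-least-once delivery plus deduplication yields effectively-once results without the cost of end-to-end exactly-once machinery.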

Exactly-once: Requires transactions (Kafka), checkpointing (Flink), or sink deduplication.

Most production systems implement at-least-once delivery with idempotent operations. Exactly-once adds latency and complexity.

Technologies

| Type | Examples | Use Case |
| --- | --- | --- |
| Batch | Spark, Hive, Presto, Trino | High-throughput SQL analytics |
| Streaming | Kafka Streams, Flink, Spark Streaming | Low-latency event processing |
| Unified | Apache Beam, Dataflow | Same code for batch and streaming |
| Log | Kafka, Pulsar | Durable message storage, replay |

Selection Guidelines

| Requirement | Technology |
| --- | --- |
| SQL analytics on large datasets | Spark SQL or Presto |
| Complex stateful stream processing | Flink |
| Kafka-based, simple transformations | Kafka Streams |
| GCP environment | Dataflow (Beam) |
| Combined batch and streaming | Spark (handles both) |

Interview Questions

How do you handle late-arriving data?

Use event-time processing with watermarks. Define lateness threshold for the use case (e.g., accept data up to 1 hour after window closes).

Late event options:

  1. Drop (simple, data loss)
  2. Trigger window recomputation (correct, expensive)
  3. Send to side output for separate handling (flexible)
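Option 3 can be sketched as a router that compares each event's timestamp against the current watermark (names are illustrative, not a framework API):

```python
def route(events, watermark):
    # Split events into the main path and a side output for late arrivals
    on_time, late = [], []
    for event_time, payload in events:
        (late if event_time < watermark else on_time).append(payload)
    return on_time, late

events = [(100, "a"), (90, "b"), (101, "c")]
on_time, late = route(events, watermark=95)
assert on_time == ["a", "c"]
assert late == ["b"]
```

The side output can then feed a slower correction path (e.g., a periodic batch job that patches affected aggregates) without blocking the main pipeline.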

Trade-off: Completeness versus latency and complexity.

When would you choose Kappa over Lambda?

Kappa is the default choice due to simplicity (single codebase, single mental model).

Add a batch layer (Lambda) when:

  • Reprocessing history is prohibitively slow in streaming
  • Ad-hoc SQL queries over historical data are required
  • Cost difference between batch and streaming is significant

How do you achieve exactly-once processing?

End-to-end exactly-once requires:

  1. Source: Replay capability (Kafka stores offsets)
  2. Processing: Atomic checkpoints (Flink) or transactions (Kafka Streams)
  3. Sink: Idempotent writes or transactional commits

In practice, design for at-least-once with idempotent downstream processing.

Summary

| Consideration | Guidance |
| --- | --- |
| Latency vs Throughput | Batch for throughput, streaming for latency |
| Time semantics | Event time processing over processing time |
| Architecture | Start with Kappa, add batch if needed |
| Delivery guarantee | At-least-once with idempotent operations |
| Cost factors | Include development, operations, compute, and storage |
| Default choice | Batch if latency permits; simpler implementation |