
Design a Real-Time Analytics System

Concepts covered: Stream processing (Flink/Spark Streaming), Kafka, windowing strategies, watermarks, late data handling, pre-aggregation, time-series databases, materialized views

This design covers a system that powers live dashboards: it ingests millions of events per second and serves pre-aggregated metrics with sub-second query latency.

Requirements

Functional Requirements

  • Ingest high-volume event streams (clicks, transactions, sensor data)
  • Compute real-time aggregations: counts, sums, averages, percentiles
  • Support multiple time windows: 1 min, 5 min, 1 hour, 24 hours
  • Serve queries with sub-second latency
  • Handle late-arriving data
  • Filter by dimensions: region, product, user segment

Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Ingestion latency | Under 100ms | Events must appear promptly |
| Query latency | Under 500ms p99 | Dashboards require immediate response |
| Throughput | 1M+ events/second | High-traffic system |
| Availability | 99.99% | Dashboards cannot be unavailable during peak hours |

Scale Estimation

  • 1 million events per second at peak
  • 100 concurrent dashboards
  • 50 different metric types
  • 30-day retention for detailed data

Volume calculation:

  • Event size: ~500 bytes average
  • Daily volume: 1M events/sec x 86400 sec x 500B = 43 TB/day

At this scale, raw event queries are not feasible. Pre-aggregation is required.

High-Level Architecture

The pipeline: producers publish events to Kafka; a Flink job computes windowed aggregates; results are written to a time-series database (Druid or ClickHouse); a query service with caching serves dashboards, which receive live updates over WebSocket.

Stream Processing

Windowing Strategies

Streams are unbounded. Windows define aggregation boundaries.

| Window Type | Behavior | Use Case |
| --- | --- | --- |
| Tumbling | Fixed buckets, no overlap | "Requests per minute" |
| Sliding | Overlapping windows | "5-minute moving average, updated every minute" |
| Session | Based on activity gaps | "Session ends after 30 minutes of inactivity" |
| Global | All-time, no window | "Total users ever" |
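
In Flink these window types map onto built-in window assigners. A brief sketch for reference; the holder class and field names are ours, purely for illustration:

```java
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.time.Time;

class WindowChoices {
    // Tumbling: fixed 1-minute buckets, no overlap ("requests per minute")
    static final WindowAssigner<Object, ?> tumbling =
            TumblingEventTimeWindows.of(Time.minutes(1));
    // Sliding: 5-minute window recomputed every minute (moving average)
    static final WindowAssigner<Object, ?> sliding =
            SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1));
    // Session: window closes after 30 minutes of inactivity per key
    static final WindowAssigner<Object, ?> session =
            EventTimeSessionWindows.withGap(Time.minutes(30));
    // Global: a single all-time window (needs a custom trigger to ever fire)
    static final WindowAssigner<Object, ?> global = GlobalWindows.create();
}
```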

Example pipeline: aggregate page views by region and page using 1-minute tumbling windows.

Event schema: Each PageViewEvent contains userId, pageId, region, timestamp, and responseTimeMs.
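
As a concrete sketch, that schema could be a small Java record (the field types are assumptions; the original does not specify them):

```java
// One page-view event as consumed from Kafka; timestamp is event time in epoch milliseconds.
public record PageViewEvent(
        String userId,
        String pageId,
        String region,
        long timestamp,
        long responseTimeMs) {}
```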

Processing pipeline:

  1. Source: Consume from Kafka topic "pageviews"
  2. Watermarks: Assign timestamps from event time with 5-second allowed out-of-orderness
  3. Key by: Group events by (region, pageId) combination
  4. Window: Apply 1-minute tumbling windows based on event time
  5. Aggregate: For each window, compute count, unique users, average response time, and p95 response time
  6. Sink: Write aggregated metrics to time-series database (Druid)

Aggregation logic:

  • Create accumulator to track count, total response time, unique user set, and response time distribution
  • For each event: increment count, add to total response time, add user to unique set, add to distribution
  • On window close: output metrics with count, unique user count, average, and p95
  • Merge function combines accumulators for parallel processing
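
A condensed sketch of this pipeline in Flink's Java DataStream API, assuming the PageViewEvent record above. PageViewEventSchema (a Kafka deserializer), DruidSink, and the PageViewMetrics field names are hypothetical, and the p95 is computed from a sorted list purely for brevity; a production job would use a quantile sketch and an approximate distinct-count structure instead of a raw set.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PageViewAggregationJob {

    // Output record per (region, pageId, window). Field names are illustrative.
    public record PageViewMetrics(long count, long uniqueUsers, double avgResponseMs, long p95ResponseMs) {}

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1. Source: consume the "pageviews" Kafka topic
        KafkaSource<PageViewEvent> source = KafkaSource.<PageViewEvent>builder()
                .setBootstrapServers("kafka:9092")                    // placeholder broker address
                .setTopics("pageviews")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new PageViewEventSchema())  // hypothetical deserializer
                .build();

        env.fromSource(source,
                        // 2. Watermarks: event time, 5-second allowed out-of-orderness
                        WatermarkStrategy.<PageViewEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, ts) -> event.timestamp()),
                        "pageviews")
                // 3. Key by (region, pageId)
                .keyBy(e -> e.region() + "|" + e.pageId())
                // 4. 1-minute tumbling event-time windows
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // 5. Incremental per-window aggregation
                .aggregate(new PageViewAggregator())
                // 6. Sink: write aggregates to the time-series database
                .sinkTo(new DruidSink());                             // hypothetical Sink implementation

        env.execute("pageview-aggregation");
    }

    // Accumulator tracks count, total response time, unique users, and the
    // response-time distribution; merge() combines partial accumulators.
    public static class PageViewAggregator
            implements AggregateFunction<PageViewEvent, PageViewAggregator.Acc, PageViewMetrics> {

        public static class Acc {
            long count;
            long totalResponseMs;
            Set<String> users = new HashSet<>();
            List<Long> responseTimes = new ArrayList<>();
        }

        @Override public Acc createAccumulator() { return new Acc(); }

        @Override public Acc add(PageViewEvent e, Acc acc) {
            acc.count++;
            acc.totalResponseMs += e.responseTimeMs();
            acc.users.add(e.userId());
            acc.responseTimes.add(e.responseTimeMs());
            return acc;
        }

        @Override public PageViewMetrics getResult(Acc acc) {
            Collections.sort(acc.responseTimes);
            long p95 = acc.responseTimes.isEmpty() ? 0
                    : acc.responseTimes.get((int) Math.floor(0.95 * (acc.responseTimes.size() - 1)));
            double avg = acc.count == 0 ? 0 : (double) acc.totalResponseMs / acc.count;
            return new PageViewMetrics(acc.count, acc.users.size(), avg, p95);
        }

        @Override public Acc merge(Acc a, Acc b) {
            a.count += b.count;
            a.totalResponseMs += b.totalResponseMs;
            a.users.addAll(b.users);
            a.responseTimes.addAll(b.responseTimes);
            return a;
        }
    }
}
```

Because the aggregation is incremental (an AggregateFunction), Flink keeps one accumulator per key and window in state rather than buffering every event, which is what makes 1-minute windows viable at 1M events/sec.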

Handling Late Data

Events may arrive after their window has closed due to network delays or offline devices.


Late data strategies:

| Strategy | Advantages | Disadvantages |
| --- | --- | --- |
| Drop late events | Simple implementation | Data loss |
| Allowed lateness | Handles normal delays | Increased window state retention |
| Side output | No data loss | Requires separate processing path |
| Batch recomputation | Perfect accuracy | Delayed corrections |

Standard approach: Use allowed lateness (e.g., 5 seconds) for streaming, recompute in batch for events arriving later.
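
Continuing the Flink sketch above (with the watermarked stream bound to events), the streaming half of this approach is a few operator calls; lateEventsSink is a placeholder for wherever very-late events are parked for batch correction.

```java
// Events later than the watermark but within 5 seconds still update their window;
// anything later is routed to a side output instead of being dropped.
final OutputTag<PageViewEvent> lateTag = new OutputTag<PageViewEvent>("late-pageviews") {};

var aggregated = events
        .keyBy(e -> e.region() + "|" + e.pageId())
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.seconds(5))      // retain window state 5 s past the watermark
        .sideOutputLateData(lateTag)           // capture events that miss even that
        .aggregate(new PageViewAggregator());

// Persist very-late events so batch recomputation can correct the affected windows.
aggregated.getSideOutput(lateTag).sinkTo(lateEventsSink);   // placeholder sink
```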

Pre-Aggregation Strategy

Multi-Resolution Rollups

Store aggregates at multiple granularities. Select appropriate resolution per query.


Query routing:

| Time Range | Granularity |
| --- | --- |
| Last 1 hour | 1-minute rollups |
| Last 24 hours | 5-minute rollups |
| Last 30 days | 1-hour rollups |
| Longer | 1-day rollups |

Materialized Views

Pre-compute results for common query patterns.

ClickHouse materialized view approach:

  • Create a materialized view that automatically aggregates incoming pageview data
  • Group by region and truncated minute timestamp
  • Calculate metrics: view count, unique users, average response time, p95 response time
  • Use SummingMergeTree engine to efficiently merge incremental aggregations
  • Partition by day for query performance and data lifecycle management
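
A sketch of such a view, expressed as ClickHouse DDL and issued over JDBC to keep the examples in Java. The database, table, and column names (analytics, pageviews, event_time, response_time_ms) are assumptions. Only summable columns belong in a SummingMergeTree: averages are derived at query time as total / views, while unique users and p95 would be stored as AggregatingMergeTree state columns (uniqState, quantileState) in a companion view.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePageviewRollup {
    public static void main(String[] args) throws Exception {
        // Assumes the ClickHouse JDBC driver is on the classpath and a raw
        // `pageviews` table already exists in the `analytics` database.
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/analytics");
             Statement stmt = conn.createStatement()) {
            stmt.execute("""
                CREATE MATERIALIZED VIEW IF NOT EXISTS pageview_metrics_1m
                ENGINE = SummingMergeTree
                PARTITION BY toDate(minute)              -- daily partitions for lifecycle management
                ORDER BY (region, minute)
                AS SELECT
                    region,
                    toStartOfMinute(event_time)  AS minute,
                    count()                      AS views,
                    sum(response_time_ms)        AS total_response_ms
                FROM pageviews
                GROUP BY region, minute
                """);
        }
    }
}
```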

Serving Layer

Time-Series Database Selection

| Database | Characteristics | Use Case |
| --- | --- | --- |
| Druid | High ingestion, OLAP queries | Large-scale analytics |
| ClickHouse | Fast aggregations, SQL interface | Ad-hoc queries |
| TimescaleDB | PostgreSQL compatibility | Smaller scale, familiar stack |
| InfluxDB | Purpose-built time-series | Metrics and monitoring |

Druid and ClickHouse are standard choices for real-time analytics at scale.

Query API Design

Query parameters:

  • metricName: The metric to retrieve (required)
  • dimensions: Filters to apply (e.g., region = "US", product in ["A", "B"])
  • timeRange: Start and end timestamps (required)
  • granularity: Time bucket size (MINUTE, FIVE_MINUTES, HOUR, DAY)
  • aggregations: Statistics to compute (COUNT, SUM, AVG, P50, P95, P99, UNIQUE_COUNT)

Dimension filter: Specifies the dimension name, allowed values, and comparison operator.

Design rationale: This flexible schema allows dashboards to request exactly the data they need, with appropriate granularity for the time range.
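
A minimal sketch of this request model as Java records; the field names follow the parameters listed above, while the record and enum type names are illustrative.

```java
import java.util.List;

enum Granularity { MINUTE, FIVE_MINUTES, HOUR, DAY }
enum Aggregation { COUNT, SUM, AVG, P50, P95, P99, UNIQUE_COUNT }
enum FilterOperator { EQUALS, IN }

// Dimension name, allowed values, and comparison operator (e.g. region IN ["US"]).
record DimensionFilter(String dimension, List<String> values, FilterOperator operator) {}

record MetricQuery(
        String metricName,                  // required
        List<DimensionFilter> dimensions,   // optional filters
        long startMillis,                   // required time range (epoch ms)
        long endMillis,
        Granularity granularity,
        List<Aggregation> aggregations) {}
```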

Query Optimization

Query service workflow:

  1. Check cache: Build cache key from request parameters, return cached result if available
  2. Select granularity: Choose appropriate rollup based on time range (minute for up to 1 hour, five_minute for up to 24 hours, hour for up to 30 days, day for longer)
  3. Execute query: Build and run query against time-series database
  4. Cache result: Store with TTL based on data recency

Cache TTL strategy:

  • Recent data (within 5 minutes of now): 10-second TTL (changes frequently)
  • Data within last hour: 60-second TTL
  • Historical data: 300-second TTL (stable)
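
Steps 2 and 4 of the workflow above, together with the earlier granularity routing table, reduce to two small pure functions. A sketch, reusing the hypothetical Granularity enum from the query-model sketch:

```java
import java.time.Duration;
import java.time.Instant;

class QueryRouting {

    // Choose rollup resolution from the width of the queried time range.
    static Granularity selectGranularity(Instant start, Instant end) {
        Duration range = Duration.between(start, end);
        if (range.compareTo(Duration.ofHours(1)) <= 0)  return Granularity.MINUTE;
        if (range.compareTo(Duration.ofHours(24)) <= 0) return Granularity.FIVE_MINUTES;
        if (range.compareTo(Duration.ofDays(30)) <= 0)  return Granularity.HOUR;
        return Granularity.DAY;
    }

    // Choose cache TTL from how recent the requested data is.
    static Duration cacheTtl(Instant rangeEnd, Instant now) {
        Duration age = Duration.between(rangeEnd, now);
        if (age.compareTo(Duration.ofMinutes(5)) <= 0) return Duration.ofSeconds(10);   // still changing
        if (age.compareTo(Duration.ofHours(1)) <= 0)   return Duration.ofSeconds(60);
        return Duration.ofSeconds(300);                                                 // stable history
    }
}
```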

Real-Time Dashboard

Push vs Pull

| Approach | Latency | Server Load | Complexity |
| --- | --- | --- | --- |
| Polling | Seconds (interval-dependent) | Higher | Lower |
| WebSocket | Sub-second | Lower (no repeated connections) | Higher |
| Server-Sent Events | Sub-second | Medium | Medium |

For true real-time updates, push mechanisms (WebSocket or SSE) are required.

WebSocket Implementation

Clients subscribe to specific queries. Server pushes updates when new aggregates arrive.

Server-side:

  • Maintain a map of client IDs to their subscribed queries
  • On connection: register client subscription and send initial data snapshot
  • On new data arrival: iterate through subscriptions, check if the update matches each client's query criteria, and push relevant updates

Client-side (React hook):

  • Open WebSocket connection on component mount
  • Send subscribe message with query parameters
  • On message received: merge update into existing state
  • Clean up connection on unmount

This pattern ensures dashboards receive real-time updates without polling overhead.
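
One way to sketch the server side, assuming a Jakarta WebSocket container; the Codec helper (JSON parsing, initial snapshot, filter matching) and the MetricQuery/PageViewMetrics types from the earlier sketches are assumptions, not a prescribed API.

```java
import jakarta.websocket.OnClose;
import jakarta.websocket.OnMessage;
import jakarta.websocket.OnOpen;
import jakarta.websocket.Session;
import jakarta.websocket.server.ServerEndpoint;

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

@ServerEndpoint("/metrics")
public class DashboardSocket {

    // clientId -> (session, subscribed query)
    private static final Map<String, Subscription> subscriptions = new ConcurrentHashMap<>();

    record Subscription(Session session, MetricQuery query) {}

    @OnOpen
    public void onOpen(Session session) {
        // Registration happens when the client sends an explicit subscribe message.
    }

    @OnMessage
    public void onSubscribe(String subscribeJson, Session session) throws IOException {
        MetricQuery query = Codec.parseQuery(subscribeJson);              // hypothetical JSON codec
        subscriptions.put(session.getId(), new Subscription(session, query));
        session.getBasicRemote().sendText(Codec.initialSnapshot(query));  // initial data snapshot
    }

    @OnClose
    public void onClose(Session session) {
        subscriptions.remove(session.getId());
    }

    // Called by the ingestion path whenever a new aggregate lands.
    public static void onNewAggregate(PageViewMetrics update, String region, String pageId) {
        for (Subscription sub : subscriptions.values()) {
            if (Codec.matches(sub.query(), region, pageId)) {             // does the update match the query?
                try {
                    sub.session().getBasicRemote().sendText(Codec.toJson(update));
                } catch (IOException ignored) {
                    // drop on send failure; the client will reconnect
                }
            }
        }
    }
}
```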

Fault Tolerance

Exactly-Once Semantics

Exactly-once processing requires coordination across the entire pipeline.


Requirements:

  1. Source supports replay (Kafka retains offsets)
  2. Processor checkpoints state (Flink saves to S3)
  3. Sink supports idempotent writes (or transactions)

All three components must participate. Failure in any component results in duplicates or data loss.

Checkpoint Configuration

Flink checkpoint settings:

  • Interval: Checkpoint every 60 seconds
  • Mode: Exactly-once semantics for data integrity
  • Minimum pause: At least 30 seconds between checkpoints to avoid overload
  • Timeout: Fail checkpoint if not complete within 120 seconds
  • State backend: RocksDB with S3 storage for durability and scalability
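
These settings map directly onto Flink's checkpointing API; a sketch, with the S3 path as a placeholder:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpoints = env.getCheckpointConfig();
        checkpoints.setMinPauseBetweenCheckpoints(30_000);   // at least 30 s between checkpoints
        checkpoints.setCheckpointTimeout(120_000);           // fail a checkpoint after 120 s
        checkpoints.setCheckpointStorage("s3://analytics-checkpoints/flink");  // durable storage

        // RocksDB state backend for large keyed window state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        return env;
    }
}
```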

Monitoring

Key Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Event ingestion rate | Stable | 50%+ drop from baseline |
| Processing lag | Under 1 min | Over 5 min |
| Query latency p99 | Under 500ms | Over 1s |
| Checkpoint duration | Under 30s | Over 60s |
| Consumer group lag | Under 10K messages | Over 100K messages |

Consumer lag indicates whether processing keeps pace with ingestion.

Observability Stack

Components:

  • Prometheus: Time-series database for metrics collection, configured via prometheus.yml
  • Grafana: Visualization and dashboarding, connects to Prometheus as data source
  • Kafka exporter: Exports Kafka metrics (consumer lag, throughput) to Prometheus format

This stack provides end-to-end visibility into pipeline health and performance.

Interview Discussion Points

  1. Latency requirements: Sub-second query latency requires pre-aggregated rollups. Raw event queries are not feasible at this scale.

  2. Windowing: Tumbling windows for regular metrics, with allowed lateness for delayed events.

  3. Storage tiers: Multi-resolution rollups (1-minute, 5-minute, hourly). Query routing selects appropriate granularity.

  4. Exactly-once semantics: Flink checkpoints, Kafka offsets, and idempotent writes must work together.

  5. Late data handling: Accept 5 seconds of lateness in streaming path. Batch recomputation handles events arriving later.

Summary

| Decision | Options | Recommendation |
| --- | --- | --- |
| Stream processing | Flink, Spark Streaming, Kafka Streams | Flink (true streaming, exactly-once support) |
| Message queue | Kafka, Pulsar, Kinesis | Kafka (mature, high throughput) |
| Time-series DB | Druid, ClickHouse, TimescaleDB | Druid or ClickHouse (scale-dependent) |
| Windowing | Tumbling, Sliding, Session | Tumbling for metrics, with allowed lateness |
| Dashboard updates | Polling, WebSocket, SSE | WebSocket for true real-time |
| Aggregation | On-read, Pre-computed | Pre-computed (raw queries infeasible at 1M/sec) |