Design a Real-Time Analytics System
Concepts covered: Stream processing (Flink/Spark Streaming), Kafka, windowing strategies, watermarks, late data handling, pre-aggregation, time-series databases, materialized views
This design covers a system that powers live dashboards: it ingests millions of events per second, computes aggregations continuously, and serves metrics with sub-second query latency.
Requirements
Functional Requirements
- Ingest high-volume event streams (clicks, transactions, sensor data)
- Compute real-time aggregations: counts, sums, averages, percentiles
- Support multiple time windows: 1 min, 5 min, 1 hour, 24 hours
- Serve queries with sub-second latency
- Handle late-arriving data
- Filter by dimensions: region, product, user segment
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Ingestion latency | Under 100ms | New events must become visible in metrics almost immediately |
| Query latency | Under 500ms p99 | Dashboards must feel interactive |
| Throughput | 1M+ events/second | Peak load for a high-traffic system |
| Availability | 99.99% | Dashboards are business-critical; the budget is roughly 52 minutes of downtime per year |
Scale Estimation
- 1 million events per second at peak
- 100 concurrent dashboards
- 50 different metric types
- 30-day retention for detailed data
Volume calculation:
- Event size: ~500 bytes average
- Daily volume: 1M events/sec × 86,400 sec/day × 500 B ≈ 43 TB/day
At this scale (roughly 1.3 PB of raw events over the 30-day retention window), scanning raw events at query time is not feasible. Pre-aggregation is required.
High-Level Architecture
Producers publish events to Kafka; a Flink job aggregates them over time windows and writes rollups to a time-series database (Druid or ClickHouse); a query service answers dashboard queries and pushes live updates over WebSockets.
Stream Processing
Windowing Strategies
Streams are unbounded. Windows define aggregation boundaries.
| Window Type | Behavior | Use Case |
|---|---|---|
| Tumbling | Fixed buckets, no overlap | "Requests per minute" |
| Sliding | Overlapping windows | "5-minute moving average, updated every minute" |
| Session | Based on activity gaps | "Session ends after 30 minutes of inactivity" |
| Global | All-time, no window | "Total users ever" |
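For concreteness, here is how each window type is expressed with Flink's Java DataStream API. This is a minimal sketch: `events` is assumed to be the keyed stream of `PageViewEvent` records built in the job sketched below.

```java
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowingExamples {

    // `events` is a keyed stream of page views, as built in the job sketched below.
    static void applyWindows(KeyedStream<PageViewEvent, String> events) {
        // Tumbling: fixed 1-minute buckets, no overlap ("requests per minute")
        events.window(TumblingEventTimeWindows.of(Time.minutes(1)));

        // Sliding: 5-minute window that advances every minute (moving average)
        events.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)));

        // Session: the window closes after 30 minutes of inactivity for a key
        events.window(EventTimeSessionWindows.withGap(Time.minutes(30)));

        // Global: one all-time window; only emits results with a custom trigger
        events.window(GlobalWindows.create());
    }
}
```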
Flink Processing Job
Aggregates page views by region and page using 1-minute tumbling windows.
Event schema: Each PageViewEvent contains userId, pageId, region, timestamp, and responseTimeMs.
Processing pipeline:
- Source: Consume from Kafka topic "pageviews"
- Watermarks: Assign timestamps from event time with 5-second allowed out-of-orderness
- Key by: Group events by (region, pageId) combination
- Window: Apply 1-minute tumbling windows based on event time
- Aggregate: For each window, compute count, unique users, average response time, and p95 response time
- Sink: Write aggregated metrics to time-series database (Druid)
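A minimal sketch of this pipeline using Flink's Java DataStream API. `PageViewEventSchema` (the Kafka deserializer) and `DruidMetricsSink` are placeholders for components not specified here, and `PageViewAggregator` is sketched after the aggregation-logic list below.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Event schema from above; public fields keep the POJO short.
class PageViewEvent {
    public String userId;
    public String pageId;
    public String region;
    public long timestamp;        // event time, epoch millis
    public long responseTimeMs;
}

public class PageViewAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: consume raw page views from the "pageviews" Kafka topic
        KafkaSource<PageViewEvent> source = KafkaSource.<PageViewEvent>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("pageviews")
                .setGroupId("pageview-aggregator")
                .setValueOnlyDeserializer(new PageViewEventSchema())   // placeholder deserializer
                .build();

        // Watermarks: event-time timestamps with 5 seconds of allowed out-of-orderness
        WatermarkStrategy<PageViewEvent> watermarks = WatermarkStrategy
                .<PageViewEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTimestamp) -> event.timestamp);

        env.fromSource(source, watermarks, "pageviews")
                // Key by (region, pageId) so each combination aggregates independently
                .keyBy(event -> event.region + "|" + event.pageId)
                // 1-minute tumbling windows on event time
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Count, unique users, avg and p95 response time (sketched below)
                .aggregate(new PageViewAggregator())
                // Placeholder sink; in practice often Flink -> Kafka -> Druid ingestion
                .sinkTo(new DruidMetricsSink());

        env.execute("pageview-aggregation");
    }
}
```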
Aggregation logic:
- Create accumulator to track count, total response time, unique user set, and response time distribution
- For each event: increment count, add to total response time, add user to unique set, add to distribution
- On window close: output metrics with count, unique user count, average, and p95
- Merge function combines accumulators for parallel processing
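A sketch of `PageViewAggregator` following that logic. Keeping every response time in a list makes the p95 exact but memory-hungry; a production job would typically substitute a quantile sketch such as t-digest or HDRHistogram. The `PageViewMetrics` output type is illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.flink.api.common.functions.AggregateFunction;

// Per-window output; the record name and fields are illustrative.
record PageViewMetrics(long count, long uniqueUsers, double avgResponseMs, long p95ResponseMs) {}

public class PageViewAggregator
        implements AggregateFunction<PageViewEvent, PageViewAggregator.Acc, PageViewMetrics> {

    // Accumulator: count, total response time, unique user set, response-time distribution
    public static class Acc {
        long count;
        long totalResponseTimeMs;
        Set<String> uniqueUsers = new HashSet<>();
        List<Long> responseTimes = new ArrayList<>();
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    // For each event: increment count, add response time, track user and distribution
    @Override
    public Acc add(PageViewEvent event, Acc acc) {
        acc.count++;
        acc.totalResponseTimeMs += event.responseTimeMs;
        acc.uniqueUsers.add(event.userId);
        acc.responseTimes.add(event.responseTimeMs);
        return acc;
    }

    // On window close: emit count, unique users, average, and p95
    @Override
    public PageViewMetrics getResult(Acc acc) {
        if (acc.count == 0) {
            return new PageViewMetrics(0, 0, 0.0, 0);
        }
        Collections.sort(acc.responseTimes);
        int p95Index = Math.max((int) Math.ceil(0.95 * acc.responseTimes.size()) - 1, 0);
        return new PageViewMetrics(
                acc.count,
                acc.uniqueUsers.size(),
                (double) acc.totalResponseTimeMs / acc.count,
                acc.responseTimes.get(p95Index));
    }

    // Merge combines partial accumulators when Flink parallelizes the window
    @Override
    public Acc merge(Acc a, Acc b) {
        a.count += b.count;
        a.totalResponseTimeMs += b.totalResponseTimeMs;
        a.uniqueUsers.addAll(b.uniqueUsers);
        a.responseTimes.addAll(b.responseTimes);
        return a;
    }
}
```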
Handling Late Data
Events may arrive after their window has closed due to network delays or offline devices.
Late data strategies:
| Strategy | Advantages | Disadvantages |
|---|---|---|
| Drop late events | Simple implementation | Data loss |
| Allowed lateness | Handles normal delays | Increased window state retention |
| Side output | No data loss | Requires separate processing path |
| Batch recomputation | Perfect accuracy | Delayed corrections |
Standard approach: Use allowed lateness (e.g., 5 seconds) for streaming, recompute in batch for events arriving later.
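In Flink, the allowed-lateness and side-output strategies attach directly to the window definition. A sketch continuing the job above; `lateEventsSink`, the destination for too-late events, is assumed.

```java
import org.apache.flink.api.connector.sink2.Sink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataHandling {

    static void buildAggregation(DataStream<PageViewEvent> events, Sink<PageViewEvent> lateEventsSink) {
        // Tag for events arriving even after the allowed lateness has expired
        OutputTag<PageViewEvent> lateTag = new OutputTag<PageViewEvent>("late-pageviews") {};

        SingleOutputStreamOperator<PageViewMetrics> aggregated = events
                .keyBy(e -> e.region + "|" + e.pageId)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .allowedLateness(Time.seconds(5))   // re-fire the window for slightly late events
                .sideOutputLateData(lateTag)        // capture anything later than that
                .aggregate(new PageViewAggregator());

        // Route too-late events to a separate path (e.g. a Kafka topic) for batch recomputation
        aggregated.getSideOutput(lateTag).sinkTo(lateEventsSink);
    }
}
```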
Pre-Aggregation Strategy
Multi-Resolution Rollups
Store aggregates at multiple granularities. Select appropriate resolution per query.
Query routing:
| Time Range | Granularity |
|---|---|
| Last 1 hour | 1-minute rollups |
| Last 24 hours | 5-minute rollups |
| Last 30 days | 1-hour rollups |
| Longer | 1-day rollups |
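A sketch of this routing rule as it might live in the query service; the `Granularity` enum and thresholds mirror the table above.

```java
import java.time.Duration;

public class GranularityRouter {

    enum Granularity { MINUTE, FIVE_MINUTES, HOUR, DAY }

    // Pick the coarsest rollup that still matches the query-routing table above.
    static Granularity forRange(Duration range) {
        if (range.compareTo(Duration.ofHours(1)) <= 0) {
            return Granularity.MINUTE;        // last hour: 1-minute rollups
        }
        if (range.compareTo(Duration.ofHours(24)) <= 0) {
            return Granularity.FIVE_MINUTES;  // last 24 hours: 5-minute rollups
        }
        if (range.compareTo(Duration.ofDays(30)) <= 0) {
            return Granularity.HOUR;          // last 30 days: 1-hour rollups
        }
        return Granularity.DAY;               // anything longer: 1-day rollups
    }
}
```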
Materialized Views
Pre-compute results for common query patterns.
ClickHouse materialized view approach:
- Create a materialized view that automatically aggregates incoming pageview data
- Group by region and truncated minute timestamp
- Calculate metrics: view count, unique users, average response time, p95 response time
- Use an AggregatingMergeTree engine (storing aggregate-function states) so that partial results, including unique-user and percentile states, merge correctly; SummingMergeTree suffices only when the view holds purely additive counters
- Partition by day for query performance and data lifecycle management
Serving Layer
Time-Series Database Selection
| Database | Characteristics | Use Case |
|---|---|---|
| Druid | High ingestion, OLAP queries | Large-scale analytics |
| ClickHouse | Fast aggregations, SQL interface | Ad-hoc queries |
| TimescaleDB | PostgreSQL compatibility | Smaller scale, familiar stack |
| InfluxDB | Purpose-built time-series | Metrics and monitoring |
Druid and ClickHouse are standard choices for real-time analytics at scale.
Query API Design
Query parameters:
- metricName: The metric to retrieve (required)
- dimensions: Filters to apply (e.g., region = "US", product in ["A", "B"])
- timeRange: Start and end timestamps (required)
- granularity: Time bucket size (MINUTE, FIVE_MINUTES, HOUR, DAY)
- aggregations: Statistics to compute (COUNT, SUM, AVG, P50, P95, P99, UNIQUE_COUNT)
Dimension filter: Specifies the dimension name, allowed values, and comparison operator.
Design rationale: This flexible schema allows dashboards to request exactly the data they need, with appropriate granularity for the time range.
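One way to model the request as plain Java value types; the names and structure are illustrative, mirroring the parameters above, not a fixed API.

```java
import java.time.Instant;
import java.util.List;

// Illustrative request model for the query API described above.
public record MetricQuery(
        String metricName,                  // required, e.g. "pageviews"
        List<DimensionFilter> dimensions,   // optional dimension filters
        Instant start,                      // required time range
        Instant end,
        Granularity granularity,            // time bucket size
        List<Aggregation> aggregations) {   // statistics to compute

    public enum Granularity { MINUTE, FIVE_MINUTES, HOUR, DAY }

    public enum Aggregation { COUNT, SUM, AVG, P50, P95, P99, UNIQUE_COUNT }

    public enum Operator { EQUALS, IN, NOT_IN }

    // Dimension filter: dimension name, allowed values, and comparison operator,
    // e.g. region EQUALS "US" or product IN ["A", "B"].
    public record DimensionFilter(String name, List<String> values, Operator operator) {}
}
```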
Query Optimization
Query service workflow:
- Check cache: Build cache key from request parameters, return cached result if available
- Select granularity: Choose the appropriate rollup for the time range (MINUTE for ranges up to 1 hour, FIVE_MINUTES up to 24 hours, HOUR up to 30 days, DAY for anything longer)
- Execute query: Build and run query against time-series database
- Cache result: Store with TTL based on data recency
Cache TTL strategy:
- Recent data (within 5 minutes of now): 10-second TTL (changes frequently)
- Data within last hour: 60-second TTL
- Historical data: 300-second TTL (stable)
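A sketch of the cache-key and TTL logic; the key format is an assumption, and granularity selection reuses the routing rule from the rollup section.

```java
import java.time.Duration;
import java.time.Instant;

public class QueryCachePolicy {

    // Deterministic cache key built from the request parameters.
    static String cacheKey(MetricQuery query) {
        return String.join(":",
                query.metricName(),
                String.valueOf(query.dimensions()),
                query.start().toString(),
                query.end().toString(),
                query.granularity().name());
    }

    // TTL depends on how recent the requested data is: fresh aggregates still change,
    // historical aggregates are stable.
    static Duration ttlFor(MetricQuery query, Instant now) {
        Duration age = Duration.between(query.end(), now);
        if (age.compareTo(Duration.ofMinutes(5)) < 0) {
            return Duration.ofSeconds(10);    // within 5 minutes of now
        }
        if (age.compareTo(Duration.ofHours(1)) < 0) {
            return Duration.ofSeconds(60);    // within the last hour
        }
        return Duration.ofSeconds(300);       // historical data
    }
}
```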
Real-Time Dashboard
Push vs Pull
| Approach | Latency | Server Load | Complexity |
|---|---|---|---|
| Polling | Seconds (interval-dependent) | Higher | Lower |
| WebSocket | Sub-second | Lower (no repeated connections) | Higher |
| Server-Sent Events | Sub-second | Medium | Medium |
For true real-time updates, push mechanisms (WebSocket or SSE) are required.
WebSocket Implementation
Clients subscribe to specific queries. Server pushes updates when new aggregates arrive.
Server-side:
- Maintain a map of client IDs to their subscribed queries
- On connection: register client subscription and send initial data snapshot
- On new data arrival: iterate through subscriptions, check if the update matches each client's query criteria, and push relevant updates
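A server-side sketch of the subscription registry. `ClientConnection` stands in for whatever session type the chosen WebSocket framework provides, and the matching rule is deliberately minimal.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DashboardPushService {

    // Stand-in for the WebSocket session type of whichever framework is used.
    public interface ClientConnection {
        String id();
        void send(String json);
    }

    // clientId -> subscribed query, clientId -> open connection
    private final Map<String, MetricQuery> subscriptions = new ConcurrentHashMap<>();
    private final Map<String, ClientConnection> connections = new ConcurrentHashMap<>();

    // On connection: register the subscription and send an initial snapshot.
    public void onSubscribe(ClientConnection client, MetricQuery query, String initialSnapshotJson) {
        subscriptions.put(client.id(), query);
        connections.put(client.id(), client);
        client.send(initialSnapshotJson);
    }

    public void onDisconnect(String clientId) {
        subscriptions.remove(clientId);
        connections.remove(clientId);
    }

    // On new aggregate arrival: push to every client whose query matches the update.
    public void onNewAggregate(String metricName, Map<String, String> updateDimensions, String payloadJson) {
        subscriptions.forEach((clientId, query) -> {
            if (matches(query, metricName, updateDimensions)) {
                ClientConnection connection = connections.get(clientId);
                if (connection != null) {
                    connection.send(payloadJson);
                }
            }
        });
    }

    // Minimal matching: same metric, and every filter's value list contains the
    // update's value for that dimension (NOT_IN handling omitted for brevity).
    private boolean matches(MetricQuery query, String metricName, Map<String, String> dims) {
        if (!query.metricName().equals(metricName)) {
            return false;
        }
        List<MetricQuery.DimensionFilter> filters = query.dimensions();
        return filters == null || filters.stream()
                .allMatch(f -> f.values().contains(dims.get(f.name())));
    }
}
```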
Client-side (React hook):
- Open WebSocket connection on component mount
- Send subscribe message with query parameters
- On message received: merge update into existing state
- Clean up connection on unmount
This pattern ensures dashboards receive real-time updates without polling overhead.
Fault Tolerance
Exactly-Once Semantics
Requires coordination across the entire pipeline.
Requirements:
- Source supports replay (Kafka retains offsets)
- Processor checkpoints state (Flink saves to S3)
- Sink supports idempotent writes (or transactions)
All three components must participate; if any one of them does not, replays after a failure produce duplicates or data loss.
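A sketch of an idempotent write keyed by (window_start, region, page_id): replaying a window after recovery overwrites the row instead of duplicating it. Shown here as a PostgreSQL-style upsert (e.g. against TimescaleDB from the comparison table); the table name and columns are assumptions, and Druid or ClickHouse achieve the same effect through segment replacement or replacing merges.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class IdempotentMetricsWriter {

    // Primary key (window_start, region, page_id) makes retries and replays safe.
    private static final String UPSERT = """
            INSERT INTO pageview_metrics
                (window_start, region, page_id, view_count, unique_users, avg_response_ms, p95_response_ms)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (window_start, region, page_id)
            DO UPDATE SET view_count = EXCLUDED.view_count,
                          unique_users = EXCLUDED.unique_users,
                          avg_response_ms = EXCLUDED.avg_response_ms,
                          p95_response_ms = EXCLUDED.p95_response_ms
            """;

    public void write(Connection conn, long windowStartMillis, String region, String pageId,
                      PageViewMetrics metrics) throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(UPSERT)) {
            stmt.setLong(1, windowStartMillis);
            stmt.setString(2, region);
            stmt.setString(3, pageId);
            stmt.setLong(4, metrics.count());
            stmt.setLong(5, metrics.uniqueUsers());
            stmt.setDouble(6, metrics.avgResponseMs());
            stmt.setLong(7, metrics.p95ResponseMs());
            stmt.executeUpdate();
        }
    }
}
```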
Checkpoint Configuration
Flink checkpoint settings:
- Interval: Checkpoint every 60 seconds
- Mode: Exactly-once semantics for data integrity
- Minimum pause: At least 30 seconds between checkpoints to avoid overload
- Timeout: Fail checkpoint if not complete within 120 seconds
- State backend: RocksDB with S3 storage for durability and scalability
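The same settings expressed against Flink's checkpointing API; the S3 bucket path is a placeholder.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettings {

    static void configure(StreamExecutionEnvironment env) {
        // Checkpoint every 60 seconds with exactly-once semantics
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Leave at least 30 seconds between checkpoints so they never dominate processing
        config.setMinPauseBetweenCheckpoints(30_000L);
        // Abort a checkpoint that does not complete within 120 seconds
        config.setCheckpointTimeout(120_000L);

        // RocksDB state backend with checkpoints persisted to S3
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        config.setCheckpointStorage("s3://analytics-checkpoints/flink");
    }
}
```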
Monitoring
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Event ingestion rate | Stable | 50%+ drop from baseline |
| Processing lag | Under 1 min | Over 5 min |
| Query latency p99 | Under 500ms | Over 1s |
| Checkpoint duration | Under 30s | Over 60s |
| Consumer group lag | Under 10K messages | Over 100K messages |
Consumer lag indicates whether processing keeps pace with ingestion.
Observability Stack
Components:
- Prometheus: Time-series database for metrics collection, configured via prometheus.yml
- Grafana: Visualization and dashboarding, connects to Prometheus as data source
- Kafka exporter: Exports Kafka metrics (consumer lag, throughput) to Prometheus format
This stack provides end-to-end visibility into pipeline health and performance.
Interview Discussion Points
- Latency requirements: Sub-second query latency requires pre-aggregated rollups. Raw event queries are not feasible at this scale.
- Windowing: Tumbling windows for regular metrics, with allowed lateness for delayed events.
- Storage tiers: Multi-resolution rollups (1-minute, 5-minute, hourly). Query routing selects the appropriate granularity.
- Exactly-once semantics: Flink checkpoints, Kafka offsets, and idempotent writes must work together.
- Late data handling: Accept 5 seconds of lateness in the streaming path. Batch recomputation handles events arriving later.
Summary
| Decision | Options | Recommendation |
|---|---|---|
| Stream processing | Flink, Spark Streaming, Kafka Streams | Flink (true streaming, exactly-once support) |
| Message queue | Kafka, Pulsar, Kinesis | Kafka (mature, high throughput) |
| Time-series DB | Druid, ClickHouse, TimescaleDB | Druid or ClickHouse (scale-dependent) |
| Windowing | Tumbling, Sliding, Session | Tumbling for metrics, with allowed lateness |
| Dashboard updates | Polling, WebSocket, SSE | WebSocket for true real-time |
| Aggregation | On-read, Pre-computed | Pre-computed (raw queries infeasible at 1M/sec) |