Design a Real-Time Analytics System
Concepts covered: Stream processing (Flink/Spark Streaming), Kafka, windowing strategies, watermarks, late data handling, pre-aggregation, time-series databases, materialized views
This design covers a system that powers live dashboards: it ingests millions of events per second, computes aggregations continuously, and serves metrics with sub-second query latency.
Requirements
Functional Requirements
- Ingest high-volume event streams (clicks, transactions, sensor data)
- Compute real-time aggregations: counts, sums, averages, percentiles
- Support multiple time windows: 1 min, 5 min, 1 hour, 24 hours
- Serve queries with sub-second latency
- Handle late-arriving data
- Filter by dimensions: region, product, user segment
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Ingestion latency | Under 100ms | New events must become visible in metrics almost immediately |
| Query latency | Under 500ms p99 | Dashboards must feel interactive |
| Throughput | 1M+ events/second | Peak load for a high-traffic system |
| Availability | 99.99% | Dashboards are business-critical; the budget is roughly 52 minutes of downtime per year |
Scale Estimation
- 1 million events per second at peak
- 100 concurrent dashboards
- 50 different metric types
- 30-day retention for detailed data
Volume calculation:
- Event size: ~500 bytes average
- Daily volume: 1M events/sec × 86,400 sec/day × 500 B ≈ 43 TB/day
At this scale (roughly 1.3 PB of raw events over the 30-day retention window), scanning raw events at query time is not feasible. Pre-aggregation is required.
High-Level Architecture
Producers publish events to Kafka; a Flink job aggregates them over time windows and writes rollups to a time-series database (Druid or ClickHouse); a query service answers dashboard queries and pushes live updates over WebSockets.
Stream Processing
Windowing Strategies
Streams are unbounded. Windows define aggregation boundaries.
| Window Type | Behavior | Use Case |
|---|---|---|
| Tumbling | Fixed buckets, no overlap | "Requests per minute" |
| Sliding | Overlapping windows | "5-minute moving average, updated every minute" |
| Session | Based on activity gaps | "Session ends after 30 minutes of inactivity" |
| Global | All-time, no window | "Total users ever" |
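For concreteness, here is how each window type is expressed with Flink's Java DataStream API. This is a minimal sketch: `events` is assumed to be the keyed stream of `PageViewEvent` records built in the job sketched below.

```java
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowingExamples {

    // `events` is a keyed stream of page views, as built in the job sketched below.
    static void applyWindows(KeyedStream<PageViewEvent, String> events) {
        // Tumbling: fixed 1-minute buckets, no overlap ("requests per minute")
        events.window(TumblingEventTimeWindows.of(Time.minutes(1)));

        // Sliding: 5-minute window that advances every minute (moving average)
        events.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)));

        // Session: the window closes after 30 minutes of inactivity for a key
        events.window(EventTimeSessionWindows.withGap(Time.minutes(30)));

        // Global: one all-time window; only emits results with a custom trigger
        events.window(GlobalWindows.create());
    }
}
```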
Flink Processing Job
Aggregates page views by region and page using 1-minute tumbling windows.
Event schema: Each PageViewEvent contains userId, pageId, region, timestamp, and responseTimeMs.
Processing pipeline:
- Source: Consume from Kafka topic "pageviews"
- Watermarks: Assign timestamps from event time with 5-second allowed out-of-orderness
- Key by: Group events by (region, pageId) combination
- Window: Apply 1-minute tumbling windows based on event time
- Aggregate: For each window, compute count, unique users, average response time, and p95 response time
- Sink: Write aggregated metrics to time-series database (Druid)
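A minimal sketch of this pipeline using Flink's Java DataStream API. `PageViewEventSchema` (the Kafka deserializer) and `DruidMetricsSink` are placeholders for components not specified here, and `PageViewAggregator` is sketched after the aggregation-logic list below.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Event schema from above; public fields keep the POJO short.
class PageViewEvent {
    public String userId;
    public String pageId;
    public String region;
    public long timestamp;        // event time, epoch millis
    public long responseTimeMs;
}

public class PageViewAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: consume raw page views from the "pageviews" Kafka topic
        KafkaSource<PageViewEvent> source = KafkaSource.<PageViewEvent>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("pageviews")
                .setGroupId("pageview-aggregator")
                .setValueOnlyDeserializer(new PageViewEventSchema())   // placeholder deserializer
                .build();

        // Watermarks: event-time timestamps with 5 seconds of allowed out-of-orderness
        WatermarkStrategy<PageViewEvent> watermarks = WatermarkStrategy
                .<PageViewEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTimestamp) -> event.timestamp);

        env.fromSource(source, watermarks, "pageviews")
                // Key by (region, pageId) so each combination aggregates independently
                .keyBy(event -> event.region + "|" + event.pageId)
                // 1-minute tumbling windows on event time
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Count, unique users, avg and p95 response time (sketched below)
                .aggregate(new PageViewAggregator())
                // Placeholder sink; in practice often Flink -> Kafka -> Druid ingestion
                .sinkTo(new DruidMetricsSink());

        env.execute("pageview-aggregation");
    }
}
```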
Aggregation logic:
- Create accumulator to track count, total response time, unique user set, and response time distribution
- For each event: increment count, add to total response time, add user to unique set, add to distribution
- On window close: output metrics with count, unique user count, average, and p95
- Merge function combines accumulators for parallel processing
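A sketch of `PageViewAggregator` following that logic. Keeping every response time in a list makes the p95 exact but memory-hungry; a production job would typically substitute a quantile sketch such as t-digest or HDRHistogram. The `PageViewMetrics` output type is illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.flink.api.common.functions.AggregateFunction;

// Per-window output; the record name and fields are illustrative.
record PageViewMetrics(long count, long uniqueUsers, double avgResponseMs, long p95ResponseMs) {}

public class PageViewAggregator
        implements AggregateFunction<PageViewEvent, PageViewAggregator.Acc, PageViewMetrics> {

    // Accumulator: count, total response time, unique user set, response-time distribution
    public static class Acc {
        long count;
        long totalResponseTimeMs;
        Set<String> uniqueUsers = new HashSet<>();
        List<Long> responseTimes = new ArrayList<>();
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    // For each event: increment count, add response time, track user and distribution
    @Override
    public Acc add(PageViewEvent event, Acc acc) {
        acc.count++;
        acc.totalResponseTimeMs += event.responseTimeMs;
        acc.uniqueUsers.add(event.userId);
        acc.responseTimes.add(event.responseTimeMs);
        return acc;
    }

    // On window close: emit count, unique users, average, and p95
    @Override
    public PageViewMetrics getResult(Acc acc) {
        if (acc.count == 0) {
            return new PageViewMetrics(0, 0, 0.0, 0);
        }
        Collections.sort(acc.responseTimes);
        int p95Index = Math.max((int) Math.ceil(0.95 * acc.responseTimes.size()) - 1, 0);
        return new PageViewMetrics(
                acc.count,
                acc.uniqueUsers.size(),
                (double) acc.totalResponseTimeMs / acc.count,
                acc.responseTimes.get(p95Index));
    }

    // Merge combines partial accumulators when Flink parallelizes the window
    @Override
    public Acc merge(Acc a, Acc b) {
        a.count += b.count;
        a.totalResponseTimeMs += b.totalResponseTimeMs;
        a.uniqueUsers.addAll(b.uniqueUsers);
        a.responseTimes.addAll(b.responseTimes);
        return a;
    }
}
```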
Handling Late Data
Events may arrive after their window has closed due to network delays or offline devices.
Late data strategies:
| Strategy | Advantages | Disadvantages |
|---|---|---|
| Drop late events | Simple implementation | Data loss |
| Allowed lateness | Handles normal delays | Increased window state retention |
| Side output | No data loss | Requires separate processing path |
| Batch recomputation | Perfect accuracy | Delayed corrections |
Standard approach: Use allowed lateness (e.g., 5 seconds) for streaming, recompute in batch for events arriving later.
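In Flink, the allowed-lateness and side-output strategies attach directly to the window definition. A sketch continuing the job above; `lateEventsSink`, the destination for too-late events, is assumed.

```java
import org.apache.flink.api.connector.sink2.Sink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataHandling {

    static void buildAggregation(DataStream<PageViewEvent> events, Sink<PageViewEvent> lateEventsSink) {
        // Tag for events arriving even after the allowed lateness has expired
        OutputTag<PageViewEvent> lateTag = new OutputTag<PageViewEvent>("late-pageviews") {};

        SingleOutputStreamOperator<PageViewMetrics> aggregated = events
                .keyBy(e -> e.region + "|" + e.pageId)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .allowedLateness(Time.seconds(5))   // re-fire the window for slightly late events
                .sideOutputLateData(lateTag)        // capture anything later than that
                .aggregate(new PageViewAggregator());

        // Route too-late events to a separate path (e.g. a Kafka topic) for batch recomputation
        aggregated.getSideOutput(lateTag).sinkTo(lateEventsSink);
    }
}
```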
Pre-Aggregation Strategy
Multi-Resolution Rollups
Store aggregates at multiple granularities. Select appropriate resolution per query.
Query routing:
| Time Range | Granularity |
|---|---|
| Last 1 hour | 1-minute rollups |
| Last 24 hours | 5-minute rollups |
| Last 30 days | 1-hour rollups |
| Longer | 1-day rollups |
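A sketch of this routing rule as it might live in the query service; the `Granularity` enum and thresholds mirror the table above.

```java
import java.time.Duration;

public class GranularityRouter {

    enum Granularity { MINUTE, FIVE_MINUTES, HOUR, DAY }

    // Pick the coarsest rollup that still matches the query-routing table above.
    static Granularity forRange(Duration range) {
        if (range.compareTo(Duration.ofHours(1)) <= 0) {
            return Granularity.MINUTE;        // last hour: 1-minute rollups
        }
        if (range.compareTo(Duration.ofHours(24)) <= 0) {
            return Granularity.FIVE_MINUTES;  // last 24 hours: 5-minute rollups
        }
        if (range.compareTo(Duration.ofDays(30)) <= 0) {
            return Granularity.HOUR;          // last 30 days: 1-hour rollups
        }
        return Granularity.DAY;               // anything longer: 1-day rollups
    }
}
```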
Materialized Views
Pre-compute results for common query patterns.
ClickHouse materialized view approach:
- Create a materialized view that automatically aggregates incoming pageview data
- Group by region and truncated minute timestamp
- Calculate metrics: view count, unique users, average response time, p95 response time
- Use an AggregatingMergeTree engine (storing aggregate-function states) so that partial results, including unique-user and percentile states, merge correctly; SummingMergeTree suffices only when the view holds purely additive counters
- Partition by day for query performance and data lifecycle management
Serving Layer
Time-Series Database Selection
| Database | Characteristics | Use Case |
|---|---|---|
| Druid | High ingestion, OLAP queries | Large-scale analytics |
| ClickHouse | Fast aggregations, SQL interface | Ad-hoc queries |
| TimescaleDB | PostgreSQL compatibility | Smaller scale, familiar stack |
| InfluxDB | Purpose-built time-series | Metrics and monitoring |
Druid and ClickHouse are standard choices for real-time analytics at scale.
Query API Design
Query parameters:
- metricName: The metric to retrieve (required)
- dimensions: Filters to apply (e.g., region = "US", product in ["A", "B"])
- timeRange: Start and end timestamps (required)
- granularity: Time bucket size (MINUTE, FIVE_MINUTES, HOUR, DAY)
- aggregations: Statistics to compute (COUNT, SUM, AVG, P50, P95, P99, UNIQUE_COUNT)
Dimension filter: Specifies the dimension name, allowed values, and comparison operator.
Design rationale: This flexible schema allows dashboards to request exactly the data they need, with appropriate granularity for the time range.
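One way to model the request as plain Java value types; the names and structure are illustrative, mirroring the parameters above, not a fixed API.

```java
import java.time.Instant;
import java.util.List;

// Illustrative request model for the query API described above.
public record MetricQuery(
        String metricName,                  // required, e.g. "pageviews"
        List<DimensionFilter> dimensions,   // optional dimension filters
        Instant start,                      // required time range
        Instant end,
        Granularity granularity,            // time bucket size
        List<Aggregation> aggregations) {   // statistics to compute

    public enum Granularity { MINUTE, FIVE_MINUTES, HOUR, DAY }

    public enum Aggregation { COUNT, SUM, AVG, P50, P95, P99, UNIQUE_COUNT }

    public enum Operator { EQUALS, IN, NOT_IN }

    // Dimension filter: dimension name, allowed values, and comparison operator,
    // e.g. region EQUALS "US" or product IN ["A", "B"].
    public record DimensionFilter(String name, List<String> values, Operator operator) {}
}
```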
Query Optimization
Query service workflow:
- Check cache: Build cache key from request parameters, return cached result if available
- Select granularity: Choose the appropriate rollup for the time range (MINUTE for ranges up to 1 hour, FIVE_MINUTES up to 24 hours, HOUR up to 30 days, DAY for anything longer)
- Execute query: Build and run query against time-series database
- Cache result: Store with TTL based on data recency
Cache TTL strategy:
- Recent data (within 5 minutes of now): 10-second TTL (changes frequently)
- Data within last hour: 60-second TTL
- Historical data: 300-second TTL (stable)
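A sketch of the cache-key and TTL logic; the key format is an assumption, and granularity selection reuses the routing rule from the rollup section.

```java
import java.time.Duration;
import java.time.Instant;

public class QueryCachePolicy {

    // Deterministic cache key built from the request parameters.
    static String cacheKey(MetricQuery query) {
        return String.join(":",
                query.metricName(),
                String.valueOf(query.dimensions()),
                query.start().toString(),
                query.end().toString(),
                query.granularity().name());
    }

    // TTL depends on how recent the requested data is: fresh aggregates still change,
    // historical aggregates are stable.
    static Duration ttlFor(MetricQuery query, Instant now) {
        Duration age = Duration.between(query.end(), now);
        if (age.compareTo(Duration.ofMinutes(5)) < 0) {
            return Duration.ofSeconds(10);    // within 5 minutes of now
        }
        if (age.compareTo(Duration.ofHours(1)) < 0) {
            return Duration.ofSeconds(60);    // within the last hour
        }
        return Duration.ofSeconds(300);       // historical data
    }
}
```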
Real-Time Dashboard
Push vs Pull
| Approach | Latency | Server Load | Complexity |
|---|---|---|---|
| Polling | Seconds (interval-dependent) | Higher | Lower |
| WebSocket | Sub-second | Lower (no repeated connections) | Higher |
| Server-Sent Events | Sub-second | Medium | Medium |
For true real-time updates, push mechanisms (WebSocket or SSE) are required.
WebSocket Implementation
Clients subscribe to specific queries. Server pushes updates when new aggregates arrive.
Server-side:
- Maintain a map of client IDs to their subscribed queries
- On connection: register client subscription and send initial data snapshot
- On new data arrival: iterate through subscriptions, check if the update matches each client's query criteria, and push relevant updates
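A server-side sketch of the subscription registry. `ClientConnection` stands in for whatever session type the chosen WebSocket framework provides, and the matching rule is deliberately minimal.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DashboardPushService {

    // Stand-in for the WebSocket session type of whichever framework is used.
    public interface ClientConnection {
        String id();
        void send(String json);
    }

    // clientId -> subscribed query, clientId -> open connection
    private final Map<String, MetricQuery> subscriptions = new ConcurrentHashMap<>();
    private final Map<String, ClientConnection> connections = new ConcurrentHashMap<>();

    // On connection: register the subscription and send an initial snapshot.
    public void onSubscribe(ClientConnection client, MetricQuery query, String initialSnapshotJson) {
        subscriptions.put(client.id(), query);
        connections.put(client.id(), client);
        client.send(initialSnapshotJson);
    }

    public void onDisconnect(String clientId) {
        subscriptions.remove(clientId);
        connections.remove(clientId);
    }

    // On new aggregate arrival: push to every client whose query matches the update.
    public void onNewAggregate(String metricName, Map<String, String> updateDimensions, String payloadJson) {
        subscriptions.forEach((clientId, query) -> {
            if (matches(query, metricName, updateDimensions)) {
                ClientConnection connection = connections.get(clientId);
                if (connection != null) {
                    connection.send(payloadJson);
                }
            }
        });
    }

    // Minimal matching: same metric, and every filter's value list contains the
    // update's value for that dimension (NOT_IN handling omitted for brevity).
    private boolean matches(MetricQuery query, String metricName, Map<String, String> dims) {
        if (!query.metricName().equals(metricName)) {
            return false;
        }
        List<MetricQuery.DimensionFilter> filters = query.dimensions();
        return filters == null || filters.stream()
                .allMatch(f -> f.values().contains(dims.get(f.name())));
    }
}
```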
Client-side (React hook):
- Open WebSocket connection on component mount
- Send subscribe message with query parameters
- On message received: merge update into existing state
- Clean up connection on unmount
This pattern ensures dashboards receive real-time updates without polling overhead.
Fault Tolerance
Exactly-Once Semantics
Requires coordination across the entire pipeline.
Requirements:
- Source supports replay (Kafka retains offsets)
- Processor checkpoints state (Flink saves to S3)
- Sink supports idempotent writes (or transactions)
All three components must participate; if any one of them does not, replays after a failure produce duplicates or data loss.
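A sketch of an idempotent write keyed by (window_start, region, page_id): replaying a window after recovery overwrites the row instead of duplicating it. Shown here as a PostgreSQL-style upsert (e.g. against TimescaleDB from the comparison table); the table name and columns are assumptions, and Druid or ClickHouse achieve the same effect through segment replacement or replacing merges.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class IdempotentMetricsWriter {

    // Primary key (window_start, region, page_id) makes retries and replays safe.
    private static final String UPSERT = """
            INSERT INTO pageview_metrics
                (window_start, region, page_id, view_count, unique_users, avg_response_ms, p95_response_ms)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (window_start, region, page_id)
            DO UPDATE SET view_count = EXCLUDED.view_count,
                          unique_users = EXCLUDED.unique_users,
                          avg_response_ms = EXCLUDED.avg_response_ms,
                          p95_response_ms = EXCLUDED.p95_response_ms
            """;

    public void write(Connection conn, long windowStartMillis, String region, String pageId,
                      PageViewMetrics metrics) throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(UPSERT)) {
            stmt.setLong(1, windowStartMillis);
            stmt.setString(2, region);
            stmt.setString(3, pageId);
            stmt.setLong(4, metrics.count());
            stmt.setLong(5, metrics.uniqueUsers());
            stmt.setDouble(6, metrics.avgResponseMs());
            stmt.setLong(7, metrics.p95ResponseMs());
            stmt.executeUpdate();
        }
    }
}
```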
Checkpoint Configuration
Flink checkpoint settings:
- Interval: Checkpoint every 60 seconds
- Mode: Exactly-once semantics for data integrity
- Minimum pause: At least 30 seconds between checkpoints to avoid overload
- Timeout: Fail checkpoint if not complete within 120 seconds
- State backend: RocksDB with S3 storage for durability and scalability
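The same settings expressed against Flink's checkpointing API; the S3 bucket path is a placeholder.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettings {

    static void configure(StreamExecutionEnvironment env) {
        // Checkpoint every 60 seconds with exactly-once semantics
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Leave at least 30 seconds between checkpoints so they never dominate processing
        config.setMinPauseBetweenCheckpoints(30_000L);
        // Abort a checkpoint that does not complete within 120 seconds
        config.setCheckpointTimeout(120_000L);

        // RocksDB state backend with checkpoints persisted to S3
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        config.setCheckpointStorage("s3://analytics-checkpoints/flink");
    }
}
```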
Monitoring
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Event ingestion rate | Stable | 50%+ drop from baseline |
| Processing lag | Under 1 min | Over 5 min |
| Query latency p99 | Under 500ms | Over 1s |
| Checkpoint duration | Under 30s | Over 60s |
| Consumer group lag | Under 10K messages | Over 100K messages |
Consumer lag indicates whether processing keeps pace with ingestion.
Observability Stack
Components:
- Prometheus: Time-series database for metrics collection, configured via prometheus.yml
- Grafana: Visualization and dashboarding, connects to Prometheus as data source
- Kafka exporter: Exports Kafka metrics (consumer lag, throughput) to Prometheus format
This stack provides end-to-end visibility into pipeline health and performance.
Interview Discussion Points
- Latency requirements: Sub-second query latency requires pre-aggregated rollups. Raw event queries are not feasible at this scale.
- Windowing: Tumbling windows for regular metrics, with allowed lateness for delayed events.
- Storage tiers: Multi-resolution rollups (1-minute, 5-minute, hourly). Query routing selects the appropriate granularity.
- Exactly-once semantics: Flink checkpoints, Kafka offsets, and idempotent writes must work together.
- Late data handling: Accept 5 seconds of lateness in the streaming path. Batch recomputation handles events arriving later.
Summary
| Decision | Options | Recommendation |
|---|---|---|
| Stream processing | Flink, Spark Streaming, Kafka Streams | Flink (true streaming, exactly-once support) |
| Message queue | Kafka, Pulsar, Kinesis | Kafka (mature, high throughput) |
| Time-series DB | Druid, ClickHouse, TimescaleDB | Druid or ClickHouse (scale-dependent) |
| Windowing | Tumbling, Sliding, Session | Tumbling for metrics, with allowed lateness |
| Dashboard updates | Polling, WebSocket, SSE | WebSocket for true real-time |
| Aggregation | On-read, Pre-computed | Pre-computed (raw queries infeasible at 1M/sec) |