Design a Metrics and Monitoring System
Build a system that collects metrics from thousands of services, stores them efficiently, and alerts when thresholds are exceeded.
Related Concepts: Database Scaling, Message Queues, Availability Patterns
Step 1: Requirements and Scope
Functional Requirements
| Requirement | Description |
|---|---|
| Metric ingestion | Collect metrics from services, hosts, containers |
| Storage | Store time-series data for querying |
| Querying | Support aggregations, groupings, time ranges |
| Dashboards | Visualize metrics over time |
| Alerting | Trigger alerts when metrics cross thresholds |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Write throughput | 10M metrics/second | Large fleets generate massive data |
| Query latency | < 1s for common queries | Dashboard responsiveness |
| Retention | 30 days hot, 1 year cold | Balance storage cost with history |
| Availability | 99.9% | Missing alerts is unacceptable |
Scale Estimation
- Monitored hosts: 100,000 servers
- Metrics per host: 1,000 metrics
- Collection interval: 10 seconds
- Write rate: 100K hosts x 1K metrics / 10s = 10M writes/second
- Data points per day: 10M/second x 86,400 seconds = ~864 billion points
- Storage: ~14TB per day raw (~16 bytes/point), roughly 1-2TB per day compressed
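As a sanity check, the arithmetic above in a few lines of Python; the per-point size and compression ratio are assumptions:

```python
# Back-of-the-envelope check of the scale estimates above.
hosts = 100_000
metrics_per_host = 1_000
interval_s = 10

writes_per_second = hosts * metrics_per_host / interval_s   # 10,000,000
points_per_day = writes_per_second * 86_400                 # ~864 billion
raw_bytes_per_day = points_per_day * 16                     # assume ~16 bytes/point raw
compressed_bytes_per_day = raw_bytes_per_day / 10           # assume ~10x compression

print(f"{writes_per_second:,.0f} writes/s, {points_per_day:,.0f} points/day")
print(f"~{raw_bytes_per_day / 1e12:.1f} TB/day raw, "
      f"~{compressed_bytes_per_day / 1e12:.1f} TB/day compressed")
```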
Step 2: High-Level Architecture
Collection agents and scrapers gather metrics (push or pull), a Kafka buffer absorbs the write load, stream consumers pre-aggregate and batch-write into time-series storage, and a query engine serves dashboards and the alert evaluator.
Step 3: Data Model
Metric Structure
A metric is a named measurement with dimensions. It consists of a metric name (like "cpu.usage"), labels (key-value pairs like host, region, and environment), a value (such as 72.5), and a timestamp.
Cardinality: Each unique combination of labels is a separate time series. For example, a host label with 100,000 possible hosts creates 100,000 separate time series for a single metric name.
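A minimal sketch of this structure (names and types are illustrative): the series key is the metric name plus its sorted label set, so every distinct label combination is its own series.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Metric:
    name: str                       # e.g. "cpu.usage"
    labels: dict[str, str]          # e.g. {"host": "web-42", "region": "us-east-1"}
    value: float                    # e.g. 72.5
    timestamp: float = field(default_factory=time.time)

def series_key(m: Metric) -> str:
    """Each unique (name, labels) combination identifies one time series."""
    label_str = ",".join(f"{k}={v}" for k, v in sorted(m.labels.items()))
    return f"{m.name}{{{label_str}}}"

m = Metric("cpu.usage", {"host": "web-42", "region": "us-east-1", "env": "prod"}, 72.5)
print(series_key(m))  # cpu.usage{env=prod,host=web-42,region=us-east-1}
```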
Time Series Storage
Store series metadata separately from data points.
Series table: Contains a series ID as the primary key, the metric name, and labels stored as JSON. For example, series "abc123" maps to metric "cpu.usage" with labels indicating the host and other dimensions.
Points table: Contains the series ID, timestamp, and value. Multiple rows share the same series ID but have different timestamps and values. This structure allows efficient time-range queries for a specific series.
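A sketch of the two-table layout, using SQLite purely for illustration; a production system would use a TSDB or columnar store, and the table and column names here are assumptions:

```python
import sqlite3, json, time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Series metadata: one row per unique (metric name, label set).
    CREATE TABLE series (
        series_id TEXT PRIMARY KEY,
        metric    TEXT NOT NULL,
        labels    TEXT NOT NULL          -- labels stored as JSON
    );
    -- Data points: many rows per series, keyed by (series_id, timestamp).
    CREATE TABLE points (
        series_id TEXT NOT NULL,
        ts        INTEGER NOT NULL,
        value     REAL NOT NULL,
        PRIMARY KEY (series_id, ts)
    );
""")

now = int(time.time())
conn.execute("INSERT INTO series VALUES (?, ?, ?)",
             ("abc123", "cpu.usage", json.dumps({"host": "web-42", "env": "prod"})))
conn.executemany("INSERT INTO points VALUES (?, ?, ?)",
                 [("abc123", now - 20, 68.0), ("abc123", now - 10, 71.2), ("abc123", now, 72.5)])

# Efficient time-range query for a single series.
rows = conn.execute(
    "SELECT ts, value FROM points WHERE series_id = ? AND ts >= ? ORDER BY ts",
    ("abc123", now - 15)).fetchall()
print(rows)
```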
Step 4: Ingestion Pipeline
Push vs Pull
| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Push | Services send metrics to collector | Works through firewalls, NAT | Requires agent everywhere |
| Pull | Collector scrapes /metrics endpoint | Simple, no agent needed | Does not work behind NAT |
Recommendation: Support both. Pull for services you control, push for everything else. Prometheus uses pull, Datadog uses push.
Handling 10M writes/second
Direct writes of 10M rows per second to a database are not feasible. Buffer and aggregate.
Key techniques:
- Batching: Writers accumulate points and write in batches of 1,000+ (see the sketch after this list)
- Partitioning: Kafka partitions by series ID for parallelism
- Pre-aggregation: Aggregate raw 10-second data to 1-minute data before storing
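A minimal batching sketch, assuming a write_batch callable that stands in for a bulk insert into Kafka or the storage layer; the batch size and flush interval are illustrative:

```python
import queue, threading, time

BATCH_SIZE = 1000        # write in batches of 1,000+ points
FLUSH_INTERVAL_S = 1.0   # flush at least once per second even if the batch is small

class BatchingWriter:
    """Accumulate points in memory and flush them to storage in large batches."""

    def __init__(self, write_batch):
        self.write_batch = write_batch          # callable that persists a list of points
        self.buffer = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def add(self, point):
        self.buffer.put(point)

    def _flush_loop(self):
        while True:
            batch, deadline = [], time.time() + FLUSH_INTERVAL_S
            while len(batch) < BATCH_SIZE and time.time() < deadline:
                try:
                    batch.append(self.buffer.get(timeout=0.05))
                except queue.Empty:
                    pass
            if batch:
                self.write_batch(batch)          # one bulk write instead of thousands of rows

writer = BatchingWriter(write_batch=lambda b: print(f"flushed {len(b)} points"))
for i in range(2500):
    writer.add(("cpu.usage", time.time(), float(i)))
time.sleep(2)   # let the background thread flush
```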
Pre-Aggregation
10-second granularity is not needed for last week's data.
| Time Range | Resolution | Points per Series |
|---|---|---|
| Last 1 hour | 10 seconds | 360 points |
| Last 24 hours | 1 minute | 1,440 points |
| Last 7 days | 5 minutes | 2,016 points |
| Last 30 days | 1 hour | 720 points |
Aggregate as data ages. Store min, max, sum, count for each bucket to compute any aggregation later.
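A sketch of rolling raw points into coarser buckets while keeping min, max, sum, and count, so averages and other aggregations can still be derived later (bucket width and field names are assumptions):

```python
from collections import defaultdict

def rollup(points, bucket_s=60):
    """Aggregate (timestamp, value) points into bucket_s-wide buckets,
    keeping min, max, sum, and count so any aggregation can be derived later."""
    buckets = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in points:
        b = buckets[int(ts // bucket_s) * bucket_s]
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(buckets)

raw = [(t, 50 + (t % 30)) for t in range(0, 600, 10)]   # 10-second samples for 10 minutes
minute_buckets = rollup(raw, bucket_s=60)
for start, agg in sorted(minute_buckets.items()):
    print(start, agg, "avg:", agg["sum"] / agg["count"])
```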
Step 5: Time Series Storage
Time series data is write-heavy and append-only. General-purpose databases are suboptimal.
Storage Options
| Option | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| InfluxDB | Purpose-built TSDB | Easy to use, good compression | Scaling limits |
| TimescaleDB | PostgreSQL extension | SQL queries, mature ecosystem | Less optimized for extreme write volume |
| ClickHouse | Columnar OLAP | Fast aggregations, excellent compression | Operational complexity |
| Custom (LSM) | Log-structured merge tree | Maximum control | Build and maintain yourself |
Recommendation: ClickHouse for large scale. Excellent compression and aggregation performance. Handles 10M+ writes/second with proper partitioning.
Compression Techniques
Time series compress extremely well:
Gorilla compression (from Facebook): delta-of-delta encode timestamps and XOR adjacent values; consecutive samples are similar, so most XOR bits are zero. Achieves 10-12x compression on typical metrics.
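A simplified illustration of the value-compression idea; real Gorilla also delta-of-delta encodes timestamps and bit-packs the XOR results, so this sketch only shows why the XOR of adjacent samples is mostly zero bits:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit integer representation."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

values = [72.5, 72.5, 72.6, 72.4, 73.0]
prev = float_bits(values[0])
for v in values[1:]:
    cur = float_bits(v)
    xor = prev ^ cur
    # Adjacent samples are similar, so the XOR has long runs of zeros;
    # Gorilla stores only the short run of meaningful bits
    # (or a single 0 bit when the value is identical).
    print(f"{v:>6}: xor={xor:064b}")
    prev = cur
```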
Data Layout
Partition by time: Queries are usually time-bounded. Old partitions are dropped or archived.
Shard by series: Spread load across nodes. Queries for a specific series go to one shard.
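A sketch of how a point could be routed: the timestamp picks the partition (so old partitions can be dropped wholesale) and a hash of the series ID picks the shard; partition width and shard count are assumptions:

```python
import hashlib

PARTITION_S = 86_400      # one partition per day; old partitions are dropped or archived
NUM_SHARDS = 32

def locate(series_id: str, timestamp: int) -> tuple[int, int]:
    partition = timestamp // PARTITION_S                                        # time-based partition
    shard = int(hashlib.md5(series_id.encode()).hexdigest(), 16) % NUM_SHARDS   # series-based shard
    return partition, shard

print(locate("cpu.usage{host=web-42}", 1_700_000_000))
```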
Step 6: Query Engine
Query Language
Most time series databases have their own query language optimized for time series operations.
PromQL (Prometheus): Queries like computing the average rate of HTTP 500 errors over the last 5 minutes, grouped by service.
InfluxQL: SQL-like syntax for queries such as selecting the mean CPU usage from the last hour, grouped by 1-minute intervals and host.
Key operations (a rate and bucketing sketch follows this list):
- Rate: Calculate per-second rate from counters
- Aggregation: Sum, avg, min, max, percentiles
- Grouping: Aggregate across labels
- Time bucketing: Group into time windows
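A sketch of the rate and time-bucketing operations on raw samples; function names are illustrative, and a real rate() also has to handle counter resets:

```python
def rate(samples):
    """Per-second rate from a monotonically increasing counter: delta(value) / delta(time)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def bucketize(samples, width_s=60):
    """Group samples into fixed time windows and average each window."""
    buckets = {}
    for t, v in samples:
        buckets.setdefault(int(t // width_s) * width_s, []).append(v)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

counter = [(0, 100), (10, 130), (20, 170), (30, 200)]   # request-counter-style samples
print(rate(counter))                                    # ~3.3 requests/second
print(bucketize([(t, 50 + t % 30) for t in range(0, 300, 10)]))
```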
Query Execution
Parse the query, resolve label filters against the series index, fetch points from the relevant shards and storage tier for the requested time range, then aggregate and downsample before returning results.
Query Optimization
- Downsample at query time: Do not return 100K points for a dashboard chart
- Push down filters: Evaluate label filters at storage layer
- Use pre-aggregated data: For queries over long time ranges
- Cache hot queries: Dashboard queries repeat every refresh
Step 7: Alerting
Alert Configuration
An alert definition includes a name (like "HighCPUUsage"), an expression that evaluates the condition (average CPU usage in production exceeding 90%), a duration requirement (the condition must hold for 5 minutes), labels for routing (severity: warning), and annotations providing human-readable context (a summary message identifying the affected host).
Key properties:
- Threshold: When to fire (> 90%)
- Duration: How long before alerting (avoid flapping)
- Labels: For routing to the correct team
- Annotations: Human-readable context
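Putting these properties together, an alert definition might be modeled like this sketch (field names are assumptions based on the description above):

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    name: str                                         # "HighCPUUsage"
    expression: str                                   # condition evaluated against stored metrics
    threshold: float                                  # when to fire
    for_seconds: int                                  # condition must hold this long (avoid flapping)
    labels: dict = field(default_factory=dict)        # routing, e.g. {"severity": "warning"}
    annotations: dict = field(default_factory=dict)   # human-readable context

rule = AlertRule(
    name="HighCPUUsage",
    expression='avg(cpu.usage{env="prod"})',
    threshold=90.0,
    for_seconds=300,
    labels={"severity": "warning"},
    annotations={"summary": "CPU usage above 90% on {{host}}"},
)
```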
Alert Evaluation
Evaluate rules on a schedule (every 30 seconds). Track state transitions:
- OK -> Pending: Condition first becomes true
- Pending -> Firing: Condition true for duration
- Firing -> OK: Condition becomes false (resolved)
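A sketch of the evaluation loop as a small state machine; evaluate() would be called on each 30-second scheduler tick, and the names are illustrative:

```python
import time

OK, PENDING, FIRING = "ok", "pending", "firing"

class AlertState:
    def __init__(self, for_seconds: int):
        self.for_seconds = for_seconds
        self.state = OK
        self.pending_since = None

    def evaluate(self, condition_true: bool, now=None) -> str:
        now = time.time() if now is None else now
        if not condition_true:
            self.state, self.pending_since = OK, None       # Firing/Pending -> OK (resolved)
        elif self.state == OK:
            self.state, self.pending_since = PENDING, now   # OK -> Pending
        elif self.state == PENDING and now - self.pending_since >= self.for_seconds:
            self.state = FIRING                             # Pending -> Firing
        return self.state

alert = AlertState(for_seconds=300)
print(alert.evaluate(True,  now=0))     # pending
print(alert.evaluate(True,  now=200))   # pending (not yet 5 minutes)
print(alert.evaluate(True,  now=320))   # firing
print(alert.evaluate(False, now=350))   # ok (resolved)
```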
Notification Routing
Route based on labels. Critical alerts page someone. Warnings go to Slack. Group related alerts to avoid noise.
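A minimal label-based routing sketch; the channel names are assumptions:

```python
def route(alert_labels: dict) -> str:
    """Pick a notification channel based on alert labels."""
    if alert_labels.get("severity") == "critical":
        return "pagerduty"          # critical alerts page someone
    if alert_labels.get("severity") == "warning":
        return "slack"              # warnings go to a Slack channel
    return "email"                  # everything else is low urgency

print(route({"severity": "critical", "team": "payments"}))   # pagerduty
print(route({"severity": "warning"}))                        # slack
```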
Alert Reliability
The alerting system must be more reliable than what it monitors.
- Dead man's switch: Alert if no metrics received in 5 minutes
- Cross-region alerting: Evaluate alerts from multiple regions
- Synthetic monitoring: Probe the system externally
Step 8: Tiered Storage
Hot, Warm, Cold Architecture
| Tier | Storage | Resolution | Query Speed | Cost |
|---|---|---|---|---|
| Hot | SSD | 10 seconds | < 100ms | $$$ |
| Warm | HDD | 1 minute | < 1s | $$ |
| Cold | S3 | 1 hour | < 10s | $ |
Query planner routes to appropriate tier based on time range.
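A sketch of that routing decision; the cutoffs are illustrative and would follow each tier's retention policy:

```python
import time

def choose_tier(query_start_ts, now=None) -> str:
    """Route a query to hot/warm/cold storage based on how far back it reaches."""
    now = time.time() if now is None else now
    age_s = now - query_start_ts
    if age_s <= 24 * 3600:
        return "hot"        # SSD, 10-second resolution
    if age_s <= 7 * 24 * 3600:
        return "warm"       # HDD, 1-minute resolution
    return "cold"           # S3, 1-hour resolution

now = time.time()
print(choose_tier(now - 3600, now))        # hot
print(choose_tier(now - 3 * 86400, now))   # warm
print(choose_tier(now - 60 * 86400, now))  # cold
```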
Production Examples
| System | Notable Design Choice |
|---|---|
| Prometheus | Pull-based, local storage, PromQL |
| Datadog | Managed SaaS, agent-based, 15-month retention |
| InfluxDB | Push-based, Flux query language, clustering |
| Grafana Mimir | Prometheus-compatible, horizontally scalable |
| VictoriaMetrics | Prometheus-compatible, excellent compression |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Collection | Push, pull | Both - pull for services you control, push for everything else |
| Ingestion | Direct write, buffered | Kafka buffer with batched writes |
| Storage | TSDB, columnar OLAP, custom | ClickHouse for scale |
| Retention | Single tier, multi-tier | Hot/warm/cold with downsampling |
| Alerting | Poll-based, streaming | Poll-based with state machine |