Design a Metrics and Monitoring System

Build a system that collects metrics from thousands of services, stores them efficiently, and alerts when thresholds are exceeded.

Related Concepts: Database Scaling, Message Queues, Availability Patterns

Step 1: Requirements and Scope

Functional Requirements

Requirement | Description
Metric ingestion | Collect metrics from services, hosts, containers
Storage | Store time-series data for querying
Querying | Support aggregations, groupings, time ranges
Dashboards | Visualize metrics over time
Alerting | Trigger alerts when metrics cross thresholds

Non-Functional Requirements

Requirement | Target | Rationale
Write throughput | 10M metrics/second | Large fleets generate massive data
Query latency | < 1s for common queries | Dashboard responsiveness
Retention | 30 days hot, 1 year cold | Balance storage cost with history
Availability | 99.9% | Missing alerts is unacceptable

Scale Estimation

  • Monitored hosts: 100,000 servers
  • Metrics per host: 1,000 metrics
  • Collection interval: 10 seconds
  • Write rate: 100K hosts x 1K metrics / 10s = 10M writes/second
  • Data points per day: 10M/s x 86,400 seconds = 864 billion points
  • Storage: ~14TB per day raw (16 bytes per point); roughly 1-2TB per day after compression

Step 2: High-Level Architecture

Step 3: Data Model

Metric Structure

A metric is a named measurement with dimensions. It consists of a metric name (like "cpu.usage"), labels (key-value pairs like host, region, and environment), a value (such as 72.5), and a timestamp.

Cardinality: Each unique combination of labels is a separate time series. For example, a host label with 100,000 possible hosts creates 100,000 separate time series for a single metric name.
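
As a concrete sketch, one sample of one series might be represented like this (field names are illustrative, not taken from any particular system):

```go
package metrics

import "time"

// Sample is a single measurement of a single time series.
// The combination of Name plus the full Labels set identifies the series,
// so every distinct label combination is a separate series (cardinality).
type Sample struct {
	Name      string            // e.g. "cpu.usage"
	Labels    map[string]string // e.g. {"host": "web-1", "region": "us-east", "env": "prod"}
	Value     float64           // e.g. 72.5
	Timestamp time.Time
}
```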

Time Series Storage

Store series metadata separately from data points.

Series table: Contains a series ID as the primary key, the metric name, and labels stored as JSON. For example, series "abc123" maps to metric "cpu.usage" with labels indicating the host and other dimensions.

Points table: Contains the series ID, timestamp, and value. Multiple rows share the same series ID but have different timestamps and values. This structure allows efficient time-range queries for a specific series.
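
A common way to derive the series ID is to hash the metric name together with the labels in a canonical order; the sketch below assumes FNV-1a hashing, an arbitrary choice for illustration:

```go
package metrics

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// SeriesID derives a stable 64-bit identifier for a series from its metric
// name and labels. Label keys are sorted so the same label set always
// hashes identically regardless of map iteration order.
func SeriesID(name string, labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(name))
	for _, k := range keys {
		fmt.Fprintf(h, "\x00%s\x00%s", k, labels[k]) // NUL separators avoid ambiguous concatenations
	}
	return h.Sum64()
}
```

The points table then stores only this ID next to each (timestamp, value) pair, keeping rows small.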

Step 4: Ingestion Pipeline

Push vs Pull

Approach | Mechanism | Advantages | Disadvantages
Push | Services send metrics to collector | Works through firewalls, NAT | Requires agent everywhere
Pull | Collector scrapes /metrics endpoint | Simple, no agent needed | Does not work behind NAT

Recommendation: Support both. Pull for services you control, push for everything else. Prometheus uses pull, Datadog uses push.
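
For the pull path, each service exposes its current values over HTTP and the collector scrapes them on an interval. A minimal endpoint might look like the sketch below; the text format is a simplification for illustration, not an exact exposition spec:

```go
package main

import (
	"fmt"
	"net/http"
)

// metricsHandler writes current metric values as one line per series so a
// collector can scrape them periodically.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	// In a real service these values come from instrumented code, not constants.
	fmt.Fprintf(w, "cpu_usage{host=%q,env=%q} %g\n", "web-1", "prod", 72.5)
	fmt.Fprintf(w, "http_requests_total{host=%q,status=%q} %d\n", "web-1", "500", 1342)
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	http.ListenAndServe(":9100", nil) // the collector issues GET /metrics every 10 seconds
}
```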

Handling 10M writes/second

Direct writes of 10M rows per second to a database are not feasible. Buffer and aggregate.

Key techniques:

  • Batching: Writers accumulate points and write in batches of 1000+ (see the sketch after this list)
  • Partitioning: Kafka partitions by series ID for parallelism
  • Pre-aggregation: Aggregate raw 10-second data to 1-minute data before storing
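
A sketch of the batching step referenced above; the batch size, flush interval, and downstream write function are placeholders:

```go
package ingest

import "time"

// Point is one (seriesID, timestamp, value) row destined for storage.
type Point struct {
	SeriesID  uint64
	Timestamp int64
	Value     float64
}

// BatchWriter drains points from a channel and hands them to write() in
// batches of batchSize, or whenever flushEvery elapses with a partial batch.
func BatchWriter(in <-chan Point, batchSize int, flushEvery time.Duration, write func([]Point)) {
	batch := make([]Point, 0, batchSize)
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()

	flush := func() {
		if len(batch) > 0 {
			write(batch)
			batch = make([]Point, 0, batchSize)
		}
	}

	for {
		select {
		case p, ok := <-in:
			if !ok { // channel closed: flush the remainder and stop
				flush()
				return
			}
			batch = append(batch, p)
			if len(batch) >= batchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}
```

In production the write function would issue one bulk insert per batch (e.g. to ClickHouse) rather than one row at a time.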

Pre-Aggregation

10-second granularity is not needed for last week's data.

Time Range | Resolution | Points per Series
Last 1 hour | 10 seconds | 360 points
Last 24 hours | 1 minute | 1,440 points
Last 7 days | 5 minutes | 2,016 points
Last 30 days | 1 hour | 720 points

Aggregate as data ages. Store min, max, sum, count for each bucket to compute any aggregation later.
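
A sketch of that roll-up for a single series, assuming raw points arrive as parallel timestamp/value slices; each point is folded into a fixed-width bucket that keeps min, max, sum, and count:

```go
package rollup

// Bucket retains enough state to recompute min, max, sum, count, or avg later.
type Bucket struct {
	Min, Max, Sum float64
	Count         int64
}

// Downsample folds raw points (unix-second timestamps) into buckets of
// `width` seconds (e.g. 60 for 1-minute resolution), keyed by bucket start.
func Downsample(timestamps []int64, values []float64, width int64) map[int64]Bucket {
	out := make(map[int64]Bucket)
	for i, ts := range timestamps {
		key := ts - ts%width // truncate the timestamp to the bucket start
		v := values[i]
		b, seen := out[key]
		if !seen {
			b = Bucket{Min: v, Max: v}
		} else {
			if v < b.Min {
				b.Min = v
			}
			if v > b.Max {
				b.Max = v
			}
		}
		b.Sum += v
		b.Count++
		out[key] = b
	}
	return out
}
```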

Step 5: Time Series Storage

Time series data is write-heavy and append-only. General-purpose databases are suboptimal.

Storage Options

Option | Mechanism | Advantages | Disadvantages
InfluxDB | Purpose-built TSDB | Easy to use, good compression | Scaling limits
TimescaleDB | PostgreSQL extension | SQL queries, mature ecosystem | Not as optimized for pure time-series ingest
ClickHouse | Columnar OLAP | Fast aggregations, excellent compression | Operational complexity
Custom (LSM) | Log-structured merge tree | Maximum control | Build and maintain yourself

Recommendation: ClickHouse for large scale. Excellent compression and aggregation performance. Handles 10M+ writes/second with proper partitioning.

Compression Techniques

Time series compress extremely well:

Gorilla compression (from Facebook): XOR each value with the previous one. Because adjacent samples are similar, most bits of the XOR are zero and can be elided, achieving roughly 10-12x compression on typical metrics.
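
A simplified illustration of why the XOR step compresses well; this sketch only measures the redundancy, whereas a real Gorilla encoder goes on to bit-pack the leading-zero count, the meaningful-bit length, and the meaningful bits themselves:

```go
package compress

import (
	"math"
	"math/bits"
)

// XORStats describes the XOR of two adjacent samples' IEEE-754 bit patterns.
// Similar consecutive values yield XORs that are mostly zero bits, which is
// the redundancy Gorilla-style encoders exploit.
type XORStats struct {
	XOR               uint64
	Leading, Trailing int // zero bits at either end of the XOR
}

// AnalyzeXOR reports the XOR and its leading/trailing zero counts for each
// adjacent pair in values.
func AnalyzeXOR(values []float64) []XORStats {
	if len(values) < 2 {
		return nil
	}
	out := make([]XORStats, 0, len(values)-1)
	prev := math.Float64bits(values[0])
	for _, v := range values[1:] {
		cur := math.Float64bits(v)
		x := prev ^ cur
		out = append(out, XORStats{
			XOR:      x,
			Leading:  bits.LeadingZeros64(x),
			Trailing: bits.TrailingZeros64(x),
		})
		prev = cur
	}
	return out
}
```

When consecutive values are identical (common for gauges sampled faster than they change), the XOR is all zeros and the sample can be stored in a single bit.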

Data Layout

Partition by time: Queries are usually time-bounded. Old partitions are dropped or archived.

Shard by series: Spread load across nodes. Queries for a specific series go to one shard.

Step 6: Query Engine

Query Language

Most time series databases have their own query language optimized for time series operations.

PromQL (Prometheus): Queries like computing the average rate of HTTP 500 errors over the last 5 minutes, grouped by service.

InfluxQL: SQL-like syntax for queries such as selecting the mean CPU usage from the last hour, grouped by 1-minute intervals and host.

Key operations:

  • Rate: Calculate per-second rate from counters (sketched below)
  • Aggregation: Sum, avg, min, max, percentiles
  • Grouping: Aggregate across labels
  • Time bucketing: Group into time windows
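
A sketch of the rate operation over a monotonically increasing counter; treating any decrease as a counter reset (e.g. a process restart) is a simplification of what real query engines do:

```go
package query

// CounterSample is one (unix-second timestamp, counter value) observation.
type CounterSample struct {
	Timestamp int64
	Value     float64
}

// Rate returns the average per-second increase of a counter across the
// window covered by samples. A drop in value is treated as a counter reset,
// so only the post-reset value contributes to the total increase.
func Rate(samples []CounterSample) float64 {
	if len(samples) < 2 {
		return 0
	}
	var increase float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i].Value - samples[i-1].Value
		if delta < 0 { // counter reset: the counter restarted from zero
			delta = samples[i].Value
		}
		increase += delta
	}
	elapsed := float64(samples[len(samples)-1].Timestamp - samples[0].Timestamp)
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}
```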

Query Execution

Query Optimization

  1. Downsample at query time: Do not return 100K points for a dashboard chart
  2. Push down filters: Evaluate label filters at storage layer
  3. Use pre-aggregated data: For queries over long time ranges
  4. Cache hot queries: Dashboard queries repeat every refresh

Step 7: Alerting

Alert Configuration

An alert definition includes a name (like "HighCPUUsage"), an expression that evaluates the condition (average CPU usage in production exceeding 90%), a duration requirement (the condition must hold for 5 minutes), labels for routing (severity: warning), and annotations providing human-readable context (a summary message identifying the affected host).

Key properties:

  • Threshold: When to fire (> 90%)
  • Duration: How long before alerting (avoid flapping)
  • Labels: For routing to the correct team
  • Annotations: Human-readable context

Alert Evaluation

Evaluate rules on a schedule (every 30 seconds). Track state transitions, as sketched below:

  • OK -> Pending: Condition first becomes true
  • Pending -> Firing: Condition true for duration
  • Firing -> OK: Condition becomes false (resolved)
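
A minimal sketch of that state machine; evaluating the alert expression itself is abstracted into a boolean:

```go
package alerting

import "time"

type State int

const (
	OK State = iota
	Pending
	Firing
)

// Rule tracks one alert's state across scheduled evaluations.
type Rule struct {
	For          time.Duration // how long the condition must hold before firing
	state        State
	pendingSince time.Time
}

// Evaluate advances the state machine given whether the alert expression is
// currently true, and returns the resulting state.
func (r *Rule) Evaluate(conditionTrue bool, now time.Time) State {
	switch {
	case !conditionTrue:
		r.state = OK // condition cleared: resolved
	case r.state == OK:
		r.state = Pending // condition first became true
		r.pendingSince = now
	case r.state == Pending && now.Sub(r.pendingSince) >= r.For:
		r.state = Firing // condition held for the full duration
	}
	return r.state
}
```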

Notification Routing

Route based on labels. Critical alerts page someone. Warnings go to Slack. Group related alerts to avoid noise.
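
A sketch of label-based routing; the severity values and channel identifiers are illustrative:

```go
package alerting

// Alert carries routing labels plus human-readable context.
type Alert struct {
	Name    string
	Labels  map[string]string // e.g. {"severity": "critical", "team": "payments"}
	Summary string
}

// Route maps an alert to a notification channel: critical alerts page the
// on-call rotation, warnings go to chat, and everything else goes to email.
func Route(a Alert) string {
	team := a.Labels["team"]
	switch a.Labels["severity"] {
	case "critical":
		return "pager:" + team
	case "warning":
		return "slack:#" + team + "-alerts"
	default:
		return "email:" + team
	}
}
```

Grouping happens before this step: alerts sharing the same labels within a short window are collapsed into one notification.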

Alert Reliability

The alerting system must be more reliable than what it monitors.

  • Dead man's switch: Alert if no metrics received in 5 minutes (sketched below)
  • Cross-region alerting: Evaluate alerts from multiple regions
  • Synthetic monitoring: Probe the system externally
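
A sketch of the dead man's switch mentioned above; the ingest path calls Beat() on every received batch, and a separate evaluation loop checks Expired():

```go
package alerting

import (
	"sync/atomic"
	"time"
)

// DeadMansSwitch fires when no metrics have arrived for maxSilence.
type DeadMansSwitch struct {
	lastBeat   atomic.Int64 // unix nanoseconds of the last heartbeat
	maxSilence time.Duration
}

func NewDeadMansSwitch(maxSilence time.Duration) *DeadMansSwitch {
	d := &DeadMansSwitch{maxSilence: maxSilence}
	d.Beat()
	return d
}

// Beat records that metrics are still flowing; call it from the ingest path.
func (d *DeadMansSwitch) Beat() {
	d.lastBeat.Store(time.Now().UnixNano())
}

// Expired reports whether the ingest path has been silent for too long.
func (d *DeadMansSwitch) Expired() bool {
	return time.Since(time.Unix(0, d.lastBeat.Load())) > d.maxSilence
}
```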

Step 8: Tiered Storage

Hot, Warm, Cold Architecture

Tier | Storage | Resolution | Query Speed | Cost
Hot | SSD | 10 seconds | < 100ms | $$$
Warm | HDD | 1 minute | < 1s | $$
Cold | S3 | 1 hour | < 10s | $

The query planner routes each query to the appropriate tier based on its time range.
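
A sketch of that routing decision; the tier boundaries are parameters here because the exact cut-over points (e.g. 24 hours to warm, 30 days to cold) are a policy choice:

```go
package query

import "time"

type Tier string

const (
	Hot  Tier = "hot"  // SSD, 10-second resolution
	Warm Tier = "warm" // HDD, 1-minute resolution
	Cold Tier = "cold" // S3, 1-hour resolution
)

// ChooseTier picks the tier that holds data for a query whose range starts
// at `from`: older than coldAfter goes to cold storage, older than warmAfter
// to warm, everything newer to the hot tier. A query spanning a boundary
// would in practice be split and stitched across tiers.
func ChooseTier(from, now time.Time, warmAfter, coldAfter time.Duration) Tier {
	age := now.Sub(from)
	switch {
	case age > coldAfter:
		return Cold
	case age > warmAfter:
		return Warm
	default:
		return Hot
	}
}
```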

Production Examples

System | Notable Design Choice
Prometheus | Pull-based, local storage, PromQL
Datadog | Managed SaaS, agent-based, 15-month retention
InfluxDB | Push-based, Flux query language, clustering
Grafana Mimir | Prometheus-compatible, horizontally scalable
VictoriaMetrics | Prometheus-compatible, excellent compression

Summary: Key Design Decisions

Decision | Options | Recommendation
Collection | Push, pull | Both: pull for services, push for agents
Ingestion | Direct write, buffered | Kafka buffer with batched writes
Storage | TSDB, columnar OLAP, custom | ClickHouse for scale
Retention | Single tier, multi-tier | Hot/warm/cold with downsampling
Alerting | Poll-based, streaming | Poll-based with state machine