Design a Metrics and Monitoring System
Build a system that collects metrics from thousands of services, stores them efficiently, and alerts when thresholds are exceeded.
Related Concepts: Database Scaling, Message Queues, Availability Patterns
Step 1: Requirements and Scope
Functional Requirements
| Requirement | Description |
|---|---|
| Metric ingestion | Collect metrics from services, hosts, containers |
| Storage | Store time-series data for querying |
| Querying | Support aggregations, groupings, time ranges |
| Dashboards | Visualize metrics over time |
| Alerting | Trigger alerts when metrics cross thresholds |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Write throughput | 10M metrics/second | Large fleets generate massive data |
| Query latency | < 1s for common queries | Dashboard responsiveness |
| Retention | 30 days hot, 1 year cold | Balance storage cost with history |
| Availability | 99.9% | Missing alerts is unacceptable |
Scale Estimation
- Monitored hosts: 100,000 servers
- Metrics per host: 1,000 metrics
- Collection interval: 10 seconds
- Write rate: 100K hosts x 1K metrics / 10s = 10M writes/second
- Data points per day: 10M/second x 86,400 seconds = ~864 billion points
- Storage: ~14TB per day raw (~16 bytes/point), roughly 1-2TB per day compressed
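As a sanity check, the arithmetic above in a few lines of Python; the per-point size and compression ratio are assumptions:

```python
# Back-of-the-envelope check of the scale estimates above.
hosts = 100_000
metrics_per_host = 1_000
interval_s = 10

writes_per_second = hosts * metrics_per_host / interval_s   # 10,000,000
points_per_day = writes_per_second * 86_400                 # ~864 billion
raw_bytes_per_day = points_per_day * 16                     # assume ~16 bytes/point raw
compressed_bytes_per_day = raw_bytes_per_day / 10           # assume ~10x compression

print(f"{writes_per_second:,.0f} writes/s, {points_per_day:,.0f} points/day")
print(f"~{raw_bytes_per_day / 1e12:.1f} TB/day raw, "
      f"~{compressed_bytes_per_day / 1e12:.1f} TB/day compressed")
```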
Step 2: High-Level Architecture
Collection agents and scrapers gather metrics (push or pull), a Kafka buffer absorbs the write load, stream consumers pre-aggregate and batch-write into time-series storage, and a query engine serves dashboards and the alert evaluator.
Step 3: Data Model
Metric Structure
A metric is a named measurement with dimensions. It consists of a metric name (like "cpu.usage"), labels (key-value pairs like host, region, and environment), a value (such as 72.5), and a timestamp.
Cardinality: Each unique combination of labels is a separate time series. For example, a host label with 100,000 possible hosts creates 100,000 separate time series for a single metric name.
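A minimal sketch of this structure (names and types are illustrative): the series key is the metric name plus its sorted label set, so every distinct label combination is its own series.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Metric:
    name: str                       # e.g. "cpu.usage"
    labels: dict[str, str]          # e.g. {"host": "web-42", "region": "us-east-1"}
    value: float                    # e.g. 72.5
    timestamp: float = field(default_factory=time.time)

def series_key(m: Metric) -> str:
    """Each unique (name, labels) combination identifies one time series."""
    label_str = ",".join(f"{k}={v}" for k, v in sorted(m.labels.items()))
    return f"{m.name}{{{label_str}}}"

m = Metric("cpu.usage", {"host": "web-42", "region": "us-east-1", "env": "prod"}, 72.5)
print(series_key(m))  # cpu.usage{env=prod,host=web-42,region=us-east-1}
```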
Time Series Storage
Store series metadata separately from data points.
Series table: Contains a series ID as the primary key, the metric name, and labels stored as JSON. For example, series "abc123" maps to metric "cpu.usage" with labels indicating the host and other dimensions.
Points table: Contains the series ID, timestamp, and value. Multiple rows share the same series ID but have different timestamps and values. This structure allows efficient time-range queries for a specific series.
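A sketch of the two-table layout, using SQLite purely for illustration; a production system would use a TSDB or columnar store, and the table and column names here are assumptions:

```python
import sqlite3, json, time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Series metadata: one row per unique (metric name, label set).
    CREATE TABLE series (
        series_id TEXT PRIMARY KEY,
        metric    TEXT NOT NULL,
        labels    TEXT NOT NULL          -- labels stored as JSON
    );
    -- Data points: many rows per series, keyed by (series_id, timestamp).
    CREATE TABLE points (
        series_id TEXT NOT NULL,
        ts        INTEGER NOT NULL,
        value     REAL NOT NULL,
        PRIMARY KEY (series_id, ts)
    );
""")

now = int(time.time())
conn.execute("INSERT INTO series VALUES (?, ?, ?)",
             ("abc123", "cpu.usage", json.dumps({"host": "web-42", "env": "prod"})))
conn.executemany("INSERT INTO points VALUES (?, ?, ?)",
                 [("abc123", now - 20, 68.0), ("abc123", now - 10, 71.2), ("abc123", now, 72.5)])

# Efficient time-range query for a single series.
rows = conn.execute(
    "SELECT ts, value FROM points WHERE series_id = ? AND ts >= ? ORDER BY ts",
    ("abc123", now - 15)).fetchall()
print(rows)
```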
Step 4: Ingestion Pipeline
Push vs Pull
| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Push | Services send metrics to collector | Works through firewalls, NAT | Requires agent everywhere |
| Pull | Collector scrapes /metrics endpoint | Simple, no agent needed | Does not work behind NAT |
Recommendation: Support both. Pull for services you control, push for everything else. Prometheus uses pull, Datadog uses push.
Handling 10M writes/second
Direct writes of 10M rows per second to a database are not feasible. Buffer and aggregate.
Key techniques:
- Batching: Writers accumulate points and write in batches of 1,000+ (see the sketch after this list)
- Partitioning: Kafka partitions by series ID for parallelism
- Pre-aggregation: Aggregate raw 10-second data to 1-minute data before storing
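A minimal batching sketch, assuming a write_batch callable that stands in for a bulk insert into Kafka or the storage layer; the batch size and flush interval are illustrative:

```python
import queue, threading, time

BATCH_SIZE = 1000        # write in batches of 1,000+ points
FLUSH_INTERVAL_S = 1.0   # flush at least once per second even if the batch is small

class BatchingWriter:
    """Accumulate points in memory and flush them to storage in large batches."""

    def __init__(self, write_batch):
        self.write_batch = write_batch          # callable that persists a list of points
        self.buffer = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def add(self, point):
        self.buffer.put(point)

    def _flush_loop(self):
        while True:
            batch, deadline = [], time.time() + FLUSH_INTERVAL_S
            while len(batch) < BATCH_SIZE and time.time() < deadline:
                try:
                    batch.append(self.buffer.get(timeout=0.05))
                except queue.Empty:
                    pass
            if batch:
                self.write_batch(batch)          # one bulk write instead of thousands of rows

writer = BatchingWriter(write_batch=lambda b: print(f"flushed {len(b)} points"))
for i in range(2500):
    writer.add(("cpu.usage", time.time(), float(i)))
time.sleep(2)   # let the background thread flush
```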
Pre-Aggregation
10-second granularity is not needed for last week's data.
| Time Range | Resolution | Points per Series |
|---|---|---|
| Last 1 hour | 10 seconds | 360 points |
| Last 24 hours | 1 minute | 1,440 points |
| Last 7 days | 5 minutes | 2,016 points |
| Last 30 days | 1 hour | 720 points |
Aggregate as data ages. Store min, max, sum, count for each bucket to compute any aggregation later.
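A sketch of rolling raw points into coarser buckets while keeping min, max, sum, and count, so averages and other aggregations can still be derived later (bucket width and field names are assumptions):

```python
from collections import defaultdict

def rollup(points, bucket_s=60):
    """Aggregate (timestamp, value) points into bucket_s-wide buckets,
    keeping min, max, sum, and count so any aggregation can be derived later."""
    buckets = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in points:
        b = buckets[int(ts // bucket_s) * bucket_s]
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(buckets)

raw = [(t, 50 + (t % 30)) for t in range(0, 600, 10)]   # 10-second samples for 10 minutes
minute_buckets = rollup(raw, bucket_s=60)
for start, agg in sorted(minute_buckets.items()):
    print(start, agg, "avg:", agg["sum"] / agg["count"])
```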
Step 5: Time Series Storage
Time series data is write-heavy and append-only. General-purpose databases are suboptimal.
Storage Options
| Option | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| InfluxDB | Purpose-built TSDB | Easy to use, good compression | Scaling limits |
| TimescaleDB | PostgreSQL extension | SQL queries, mature ecosystem | Less optimized for extreme write volume |
| ClickHouse | Columnar OLAP | Fast aggregations, excellent compression | Operational complexity |
| Custom (LSM) | Log-structured merge tree | Maximum control | Build and maintain yourself |
Recommendation: ClickHouse for large scale. Excellent compression and aggregation performance. Handles 10M+ writes/second with proper partitioning.
Compression Techniques
Time series compress extremely well:
Gorilla compression (from Facebook): delta-of-delta encode timestamps and XOR adjacent values; consecutive samples are similar, so most XOR bits are zero. Achieves 10-12x compression on typical metrics.
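A simplified illustration of the value-compression idea; real Gorilla also delta-of-delta encodes timestamps and bit-packs the XOR results, so this sketch only shows why the XOR of adjacent samples is mostly zero bits:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit integer representation."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

values = [72.5, 72.5, 72.6, 72.4, 73.0]
prev = float_bits(values[0])
for v in values[1:]:
    cur = float_bits(v)
    xor = prev ^ cur
    # Adjacent samples are similar, so the XOR has long runs of zeros;
    # Gorilla stores only the short run of meaningful bits
    # (or a single 0 bit when the value is identical).
    print(f"{v:>6}: xor={xor:064b}")
    prev = cur
```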
Data Layout
Partition by time: Queries are usually time-bounded. Old partitions are dropped or archived.
Shard by series: Spread load across nodes. Queries for a specific series go to one shard.
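A sketch of how a point could be routed: the timestamp picks the partition (so old partitions can be dropped wholesale) and a hash of the series ID picks the shard; partition width and shard count are assumptions:

```python
import hashlib

PARTITION_S = 86_400      # one partition per day; old partitions are dropped or archived
NUM_SHARDS = 32

def locate(series_id: str, timestamp: int) -> tuple[int, int]:
    partition = timestamp // PARTITION_S                                        # time-based partition
    shard = int(hashlib.md5(series_id.encode()).hexdigest(), 16) % NUM_SHARDS   # series-based shard
    return partition, shard

print(locate("cpu.usage{host=web-42}", 1_700_000_000))
```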
Step 6: Query Engine
Query Language
Most time series databases have their own query language optimized for time series operations.
PromQL (Prometheus): Queries like computing the average rate of HTTP 500 errors over the last 5 minutes, grouped by service.
InfluxQL: SQL-like syntax for queries such as selecting the mean CPU usage from the last hour, grouped by 1-minute intervals and host.
Key operations (a rate and bucketing sketch follows this list):
- Rate: Calculate per-second rate from counters
- Aggregation: Sum, avg, min, max, percentiles
- Grouping: Aggregate across labels
- Time bucketing: Group into time windows
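A sketch of the rate and time-bucketing operations on raw samples; function names are illustrative, and a real rate() also has to handle counter resets:

```python
def rate(samples):
    """Per-second rate from a monotonically increasing counter: delta(value) / delta(time)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def bucketize(samples, width_s=60):
    """Group samples into fixed time windows and average each window."""
    buckets = {}
    for t, v in samples:
        buckets.setdefault(int(t // width_s) * width_s, []).append(v)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

counter = [(0, 100), (10, 130), (20, 170), (30, 200)]   # request-counter-style samples
print(rate(counter))                                    # ~3.3 requests/second
print(bucketize([(t, 50 + t % 30) for t in range(0, 300, 10)]))
```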
Query Execution
Parse the query, resolve label filters against the series index, fetch points from the relevant shards and storage tier for the requested time range, then aggregate and downsample before returning results.
Query Optimization
- Downsample at query time: Do not return 100K points for a dashboard chart
- Push down filters: Evaluate label filters at storage layer
- Use pre-aggregated data: For queries over long time ranges
- Cache hot queries: Dashboard queries repeat every refresh
Step 7: Alerting
Alert Configuration
An alert definition includes a name (like "HighCPUUsage"), an expression that evaluates the condition (average CPU usage in production exceeding 90%), a duration requirement (the condition must hold for 5 minutes), labels for routing (severity: warning), and annotations providing human-readable context (a summary message identifying the affected host).
Key properties:
- Threshold: When to fire (> 90%)
- Duration: How long before alerting (avoid flapping)
- Labels: For routing to the correct team
- Annotations: Human-readable context
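Putting these properties together, an alert definition might be modeled like this sketch (field names are assumptions based on the description above):

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    name: str                                         # "HighCPUUsage"
    expression: str                                   # condition evaluated against stored metrics
    threshold: float                                  # when to fire
    for_seconds: int                                  # condition must hold this long (avoid flapping)
    labels: dict = field(default_factory=dict)        # routing, e.g. {"severity": "warning"}
    annotations: dict = field(default_factory=dict)   # human-readable context

rule = AlertRule(
    name="HighCPUUsage",
    expression='avg(cpu.usage{env="prod"})',
    threshold=90.0,
    for_seconds=300,
    labels={"severity": "warning"},
    annotations={"summary": "CPU usage above 90% on {{host}}"},
)
```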
Alert Evaluation
Evaluate rules on a schedule (every 30 seconds). Track state transitions:
- OK -> Pending: Condition first becomes true
- Pending -> Firing: Condition true for duration
- Firing -> OK: Condition becomes false (resolved)
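A sketch of the evaluation loop as a small state machine; evaluate() would be called on each 30-second scheduler tick, and the names are illustrative:

```python
import time

OK, PENDING, FIRING = "ok", "pending", "firing"

class AlertState:
    def __init__(self, for_seconds: int):
        self.for_seconds = for_seconds
        self.state = OK
        self.pending_since = None

    def evaluate(self, condition_true: bool, now=None) -> str:
        now = time.time() if now is None else now
        if not condition_true:
            self.state, self.pending_since = OK, None       # Firing/Pending -> OK (resolved)
        elif self.state == OK:
            self.state, self.pending_since = PENDING, now   # OK -> Pending
        elif self.state == PENDING and now - self.pending_since >= self.for_seconds:
            self.state = FIRING                             # Pending -> Firing
        return self.state

alert = AlertState(for_seconds=300)
print(alert.evaluate(True,  now=0))     # pending
print(alert.evaluate(True,  now=200))   # pending (not yet 5 minutes)
print(alert.evaluate(True,  now=320))   # firing
print(alert.evaluate(False, now=350))   # ok (resolved)
```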
Notification Routing
Route based on labels. Critical alerts page someone. Warnings go to Slack. Group related alerts to avoid noise.
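A minimal label-based routing sketch; the channel names are assumptions:

```python
def route(alert_labels: dict) -> str:
    """Pick a notification channel based on alert labels."""
    if alert_labels.get("severity") == "critical":
        return "pagerduty"          # critical alerts page someone
    if alert_labels.get("severity") == "warning":
        return "slack"              # warnings go to a Slack channel
    return "email"                  # everything else is low urgency

print(route({"severity": "critical", "team": "payments"}))   # pagerduty
print(route({"severity": "warning"}))                        # slack
```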
Alert Reliability
The alerting system must be more reliable than what it monitors.
- Dead man's switch: Alert if no metrics received in 5 minutes
- Cross-region alerting: Evaluate alerts from multiple regions
- Synthetic monitoring: Probe the system externally
Step 8: Tiered Storage
Hot, Warm, Cold Architecture
| Tier | Storage | Resolution | Query Speed | Cost |
|---|---|---|---|---|
| Hot | SSD | 10 seconds | < 100ms | $$$ |
| Warm | HDD | 1 minute | < 1s | $$ |
| Cold | S3 | 1 hour | < 10s | $ |
Query planner routes to appropriate tier based on time range.
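A sketch of that routing decision; the cutoffs are illustrative and would follow each tier's retention policy:

```python
import time

def choose_tier(query_start_ts, now=None) -> str:
    """Route a query to hot/warm/cold storage based on how far back it reaches."""
    now = time.time() if now is None else now
    age_s = now - query_start_ts
    if age_s <= 24 * 3600:
        return "hot"        # SSD, 10-second resolution
    if age_s <= 7 * 24 * 3600:
        return "warm"       # HDD, 1-minute resolution
    return "cold"           # S3, 1-hour resolution

now = time.time()
print(choose_tier(now - 3600, now))        # hot
print(choose_tier(now - 3 * 86400, now))   # warm
print(choose_tier(now - 60 * 86400, now))  # cold
```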
Production Examples
| System | Notable Design Choice |
|---|---|
| Prometheus | Pull-based, local storage, PromQL |
| Datadog | Managed SaaS, agent-based, 15-month retention |
| InfluxDB | Push-based, Flux query language, clustering |
| Grafana Mimir | Prometheus-compatible, horizontally scalable |
| VictoriaMetrics | Prometheus-compatible, excellent compression |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Collection | Push, pull | Both - pull for services you control, push for everything else |
| Ingestion | Direct write, buffered | Kafka buffer with batched writes |
| Storage | TSDB, columnar OLAP, custom | ClickHouse for scale |
| Retention | Single tier, multi-tier | Hot/warm/cold with downsampling |
| Alerting | Poll-based, streaming | Poll-based with state machine |