Design a Notification System
Related Concepts: Message Queues (Kafka/SQS), Fan-Out Pattern, Push Services (APNs/FCM), Rate Limiting, Template Engine, User Preferences, Retry with Backoff, Idempotency, Dead Letter Queue
Design a system that sends notifications to users via multiple channels: push notifications, SMS, and email.
Step 1: Requirements and Scope
Functional Requirements
- Support multiple channels: push (iOS/Android), SMS, email
- Soft real-time delivery (slight delays acceptable)
- Users can configure notification preferences
- Support scheduled notifications
- Rate limiting to prevent notification spam
- Template-based content management
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Availability | 99.99% | Critical alerts must always work |
| Scalability | 1B notifications/day | Handle viral events, marketing blasts |
| Latency | < 30 seconds p99 | User expects timely notifications |
| Delivery | At-least-once | Better to duplicate than miss |
| Reliability | Handle provider failures | No single point of failure |
Scale Estimation
- 100 million users
- 1 billion notifications/day (~12K/second average)
- Peak: 10x average (during events)
- Storage: 1B x 1KB = 1TB/day for notification logs
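
The figures above follow from simple arithmetic. A minimal sketch that reproduces them, assuming 1KB means 1,000 bytes (the function and key names are illustrative, not part of the design):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(notifications_per_day: int, bytes_per_record: int, peak_factor: int) -> dict:
    """Back-of-envelope throughput and log-storage estimate."""
    avg_per_second = notifications_per_day / SECONDS_PER_DAY
    return {
        "avg_per_second": avg_per_second,
        "peak_per_second": avg_per_second * peak_factor,
        "log_bytes_per_day": notifications_per_day * bytes_per_record,
    }


# 1B notifications/day, ~1KB per log record, peak at 10x the average.
result = estimate(1_000_000_000, 1_000, 10)
```

The average works out to roughly 11.6K/second (rounded to ~12K above), so the peak to plan for is on the order of 120K/second.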
Step 2: Notification Channel Comparison
| Channel | Provider | Cost | Reach | Latency | Content |
|---|---|---|---|---|---|
| iOS Push | APNs | Free | Requires app | Sub-second | 4KB payload |
| Android Push | FCM | Free | Requires app | Sub-second | 4KB payload |
| SMS | Twilio, Nexmo | $0.01-0.05/msg | Universal | 1-5 seconds | 160 chars |
| Email | SendGrid, SES | $0.0001/msg | Universal | Seconds-minutes | Unlimited |
When to Use Each Channel
| Use Case | Channel | Reason |
|---|---|---|
| OTP/Security alerts | SMS | Most reliable, no app needed |
| Time-sensitive alerts | Push | Instant, free |
| Marketing campaigns | Email | Rich content, cost-effective |
| Critical transactions | Push + SMS fallback | Redundancy |
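
The "Push + SMS fallback" row can be sketched as an ordered list of channels tried in turn. This is a simplified illustration: `send_with_fallback` and the `senders` mapping are hypothetical names, and each sender is assumed to return `True` on success or raise on failure.

```python
from typing import Callable, Dict, List, Optional

# A sender takes (user_id, message) and reports whether delivery succeeded.
Sender = Callable[[str, str], bool]


def send_with_fallback(
    user_id: str,
    message: str,
    channels: List[str],
    senders: Dict[str, Sender],
) -> Optional[str]:
    """Try each channel in order; return the first that succeeds, else None."""
    for channel in channels:
        try:
            if senders[channel](user_id, message):
                return channel
        except Exception:
            # Sketch only: a real worker would log the error and classify it
            # before deciding whether to fall through to the next channel.
            continue
    return None
```

A critical transaction would call it with `channels=["push", "sms"]`, so SMS cost is only incurred when push fails.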
Step 3: High-Level Architecture
Step 4: Notification Flow
Step 5: User Preferences Design
Preference Settings
| Setting | Options | Default | Purpose |
|---|---|---|---|
| Channel enabled | true/false per channel | All enabled | Opt-out control |
| Quiet hours | Start/end time | None | No notifications during sleep |
| Timezone | IANA timezone | UTC | Respect local time |
| Frequency cap | Max per day/hour | Unlimited | Prevent spam |
| Priority threshold | Critical/High/Normal/Low | Low | Only receive notifications at or above the threshold |
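
A minimal sketch of how the quiet-hours, timezone, and priority-threshold settings might be evaluated together. The `Preferences` class and function names are hypothetical, and letting critical notifications bypass quiet hours is an assumption rather than something the table specifies.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone, tzinfo
from typing import Optional

PRIORITY_RANK = {"low": 0, "normal": 1, "high": 2, "critical": 3}


@dataclass
class Preferences:
    tz: tzinfo = timezone.utc             # default timezone: UTC
    quiet_start: Optional[time] = None    # default: no quiet hours
    quiet_end: Optional[time] = None
    priority_threshold: str = "low"       # default: receive everything


def in_quiet_hours(prefs: Preferences, now_utc: datetime) -> bool:
    """True if the user's local time falls inside the quiet window."""
    if prefs.quiet_start is None or prefs.quiet_end is None:
        return False
    local = now_utc.astimezone(prefs.tz).time()
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local < prefs.quiet_end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= prefs.quiet_start or local < prefs.quiet_end


def should_send_now(prefs: Preferences, priority: str, now_utc: datetime) -> bool:
    if PRIORITY_RANK[priority] < PRIORITY_RANK[prefs.priority_threshold]:
        return False
    if priority == "critical":
        return True  # assumption: critical alerts ignore quiet hours
    return not in_quiet_hours(prefs, now_utc)
```

In practice `tz` would typically be built from the stored IANA name, for example with `zoneinfo.ZoneInfo("America/New_York")`, and a suppressed notification would be deferred to the end of the quiet window rather than dropped.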
Notification Categories
| Category | Default Channel | User Configurable |
|---|---|---|
| Security (OTP, alerts) | SMS + Push | Channel only |
| Transactions | Push + Email | Yes |
| Social (likes, comments) | Push | Yes |
| Marketing | Email | Yes |
| Reminders | Push | Yes |
Step 6: Device Token Management
Token Lifecycle
Token Storage Schema
| Field | Type | Purpose |
|---|---|---|
| token_id | UUID | Primary key |
| user_id | BIGINT | Owner |
| device_token | VARCHAR(255) | Provider-specific token |
| platform | ENUM(ios, android) | Determines provider |
| app_version | VARCHAR(20) | For feature flags |
| created_at | TIMESTAMP | For cleanup |
| last_used_at | TIMESTAMP | Activity tracking |
Handling Invalid Tokens
| Provider Response | Action | Reason |
|---|---|---|
| Success | Update last_used_at | Token is valid |
| Invalid token | Delete immediately | Token no longer valid |
| Unregistered | Delete immediately | App uninstalled |
| Rate limited | Retry with backoff | Provider overloaded |
| Server error | Retry with backoff | Temporary issue |
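
The response-to-action table above maps naturally onto a lookup. A minimal sketch, assuming provider responses have already been normalized into the five outcomes listed (the enum names and the in-memory `tokens` dict are illustrative stand-ins for a real provider client and token table):

```python
from datetime import datetime
from enum import Enum
from typing import Dict


class ProviderResult(Enum):
    SUCCESS = "success"
    INVALID_TOKEN = "invalid_token"
    UNREGISTERED = "unregistered"
    RATE_LIMITED = "rate_limited"
    SERVER_ERROR = "server_error"


class TokenAction(Enum):
    UPDATE_LAST_USED = "update_last_used"
    DELETE = "delete"
    RETRY_WITH_BACKOFF = "retry_with_backoff"


ACTION_FOR_RESULT = {
    ProviderResult.SUCCESS: TokenAction.UPDATE_LAST_USED,
    ProviderResult.INVALID_TOKEN: TokenAction.DELETE,
    ProviderResult.UNREGISTERED: TokenAction.DELETE,
    ProviderResult.RATE_LIMITED: TokenAction.RETRY_WITH_BACKOFF,
    ProviderResult.SERVER_ERROR: TokenAction.RETRY_WITH_BACKOFF,
}


def handle_result(
    tokens: Dict[str, dict], token_id: str, result: ProviderResult, now: datetime
) -> TokenAction:
    """Apply the table's action to the token store and return it to the caller."""
    action = ACTION_FOR_RESULT[result]
    if action is TokenAction.UPDATE_LAST_USED:
        tokens[token_id]["last_used_at"] = now
    elif action is TokenAction.DELETE:
        tokens.pop(token_id, None)
    return action
```

Returning the action lets the calling worker schedule the retry itself; the token store is only touched for success and deletion.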
Step 7: Rate Limiting Strategy
Multi-Level Rate Limits
Rate Limit Configuration
| Level | Push | SMS | Email | Reason |
|---|---|---|---|---|
| Per user/hour | 10 | 5 | 20 | Prevent spam |
| Per user/day | 50 | 10 | 100 | Daily cap |
| Marketing/day | 1 | 0 | 3 | User experience |
| Provider/second | N/A | 100 | 1000 | API limits |
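
A minimal sketch of the per-user hourly and daily caps from the table, using fixed-window counters held in memory. The class name is hypothetical, the third numeric column is read as email, and a production system would more likely keep these counters in a shared store such as Redis with expiring keys, since this version never evicts old buckets.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Per-user limits taken from the configuration table above.
LIMITS: Dict[Tuple[str, str], int] = {
    ("push", "hour"): 10, ("sms", "hour"): 5, ("email", "hour"): 20,
    ("push", "day"): 50, ("sms", "day"): 10, ("email", "day"): 100,
}
WINDOW_SECONDS = {"hour": 3600, "day": 86400}


class FixedWindowLimiter:
    def __init__(self) -> None:
        self._counts: Dict[tuple, int] = defaultdict(int)

    def allow(self, user_id: str, channel: str, now: float) -> bool:
        """Return True and count the send only if every window has room."""
        keys = []
        for window, seconds in WINDOW_SECONDS.items():
            key = (user_id, channel, window, int(now // seconds))
            if self._counts[key] >= LIMITS[(channel, window)]:
                return False
            keys.append(key)
        # Increment only after all windows pass, so a rejected send
        # does not consume quota in any window.
        for key in keys:
            self._counts[key] += 1
        return True
```

Fixed windows allow short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more state.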
Step 8: Reliability Patterns
Retry Strategy
| Failure Type | Strategy | Max Retries | Backoff |
|---|---|---|---|
| Network timeout | Immediate retry | 3 | None |
| Provider 5xx | Exponential backoff | 5 | 1s, 2s, 4s, 8s, 16s |
| Rate limited | Fixed delay | 10 | Provider's retry-after |
| Invalid token | No retry | 0 | Delete token |
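
The retry table can be expressed as a function that returns the delay before each retry attempt. A minimal sketch: the failure-type strings and the 1-second default for a missing retry-after value are assumptions for illustration.

```python
from typing import List, Optional


def retry_delays(failure_type: str, retry_after: Optional[float] = None) -> List[float]:
    """Return the wait (in seconds) before each retry; an empty list means no retry."""
    if failure_type == "network_timeout":
        return [0.0] * 3                      # immediate retry, up to 3 times
    if failure_type == "provider_5xx":
        # Exponential backoff: 1s, 2s, 4s, 8s, 16s.
        return [1.0 * 2 ** attempt for attempt in range(5)]
    if failure_type == "rate_limited":
        # Fixed delay taken from the provider's retry-after value.
        delay = retry_after if retry_after is not None else 1.0  # assumed default
        return [delay] * 10
    if failure_type == "invalid_token":
        return []                             # never retry; delete the token instead
    raise ValueError(f"unknown failure type: {failure_type}")
```

Adding random jitter to the exponential delays is a common refinement that keeps many workers from retrying against a recovering provider at the same instant.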
Provider Failover
Idempotency
| Scenario | Without Idempotency | With Idempotency |
|---|---|---|
| Network retry | User gets duplicate | Only one delivered |
| Worker crash | May resend | Safe to replay |
| Queue replay | Duplicate notifications | Deduplicated |
Implementation: Store notification_id in Redis with TTL; skip if already processed.
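
A minimal sketch of that check, using an in-memory dictionary as a stand-in for Redis so the example is self-contained. `DedupStore` and `deliver_once` are hypothetical names; with Redis, the claim step would usually be a single atomic `SET key value NX EX ttl` command.

```python
import time
from typing import Callable, Dict


class DedupStore:
    """In-memory stand-in for a Redis key with a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def claim(self, notification_id: str) -> bool:
        """Return True the first time an ID is seen within the TTL, else False."""
        now = self._clock()
        expiry = self._expires_at.get(notification_id)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[notification_id] = now + self._ttl
        return True


def deliver_once(store: DedupStore, notification_id: str, send: Callable[[], None]) -> bool:
    """Send only if this notification_id has not been processed recently."""
    if not store.claim(notification_id):
        return False  # duplicate from a retry or queue replay; skip
    send()
    return True
```

Claiming before sending means a crash between the two steps suppresses the replay until the TTL expires; one option is to release the claim when `send` fails so that the retry path can proceed.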
Step 9: Priority Queue Design
Priority SLAs
| Priority | Max Latency | Use Cases |
|---|---|---|
| P1 Critical | 5 seconds | OTP, fraud alerts, password reset |
| P2 High | 30 seconds | Order confirmation, payment receipt |
| P3 Normal | 5 minutes | Likes, comments, follows |
| P4 Low | 30 minutes | Marketing, weekly digest |
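
A minimal sketch of strict priority ordering with a single in-process heap. This is a simplification for illustration: the class name is hypothetical, and a deployment at this scale would more likely use separate queues or topics per priority, each with its own worker pool.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Priority level -> maximum latency in seconds, from the SLA table above.
MAX_LATENCY_SECONDS = {1: 5, 2: 30, 3: 300, 4: 1800}


class PriorityDispatcher:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # Monotonic counter keeps FIFO order among equal priorities.
        self._sequence = itertools.count()

    def enqueue(self, priority: int, payload: Any) -> None:
        if priority not in MAX_LATENCY_SECONDS:
            raise ValueError(f"unknown priority: {priority}")
        heapq.heappush(self._heap, (priority, next(self._sequence), payload))

    def dequeue(self) -> Any:
        """Pop the most urgent payload (lowest priority number first)."""
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self) -> int:
        return len(self._heap)
```

Strict ordering can starve P4 traffic during a sustained burst of higher-priority work, which is one reason to give each priority level dedicated capacity.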
Step 10: Analytics and Monitoring
Delivery Funnel
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Delivery rate | > 95% | < 90% |
| P99 latency | < 30s | > 60s |
| Provider error rate | < 1% | > 5% |
| Invalid token rate | < 5% | > 10% |
| Queue depth | < 100K | > 500K |
| Rate limit hits | < 1% | > 5% |
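
The alert thresholds above can be checked with a small rule table. A minimal sketch, assuming rates are expressed as fractions and the metric names are illustrative keys rather than names from a real monitoring system:

```python
from typing import Dict, List, Tuple

# Metric -> (direction, threshold), taken from the Alert Threshold column.
ALERT_RULES: Dict[str, Tuple[str, float]] = {
    "delivery_rate": ("below", 0.90),
    "p99_latency_seconds": ("above", 60),
    "provider_error_rate": ("above", 0.05),
    "invalid_token_rate": ("above", 0.10),
    "queue_depth": ("above", 500_000),
    "rate_limit_hit_rate": ("above", 0.05),
}


def firing_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics that cross their alert threshold."""
    firing = []
    for name, (direction, threshold) in ALERT_RULES.items():
        if name not in metrics:
            continue  # metric not reported in this sample
        value = metrics[name]
        if direction == "below" and value < threshold:
            firing.append(name)
        elif direction == "above" and value > threshold:
            firing.append(name)
    return sorted(firing)
```

Values between the target and the alert threshold (for example a 93% delivery rate) do not fire, which leaves room for a separate warning tier.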
Step 11: Template System
Template Structure
| Component | Push | SMS | Email |
|---|---|---|---|
| Title | Yes (50 chars) | No | Yes (subject) |
| Body | Yes (200 chars) | Yes (160 chars) | Yes (HTML) |
| Rich content | Image, action buttons | No | Images, links, formatting |
| Personalization | Yes | Yes | Yes |
Localization
| Field | Storage | Fallback |
|---|---|---|
| Template ID | order_shipped | - |
| Language | User preference | en-US |
| Variables | {order_id}, {name} | Required |
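
A minimal sketch of template lookup with the en-US fallback and required variables described above. The in-memory `TEMPLATES` dict and the template text are invented for illustration; a real system would load templates from a database or a template service.

```python
from typing import Dict, Tuple

# (template_id, language) -> template body. Contents are illustrative only.
TEMPLATES: Dict[Tuple[str, str], str] = {
    ("order_shipped", "en-US"): "Hi {name}, your order {order_id} has shipped.",
    ("order_shipped", "es-ES"): "Hola {name}, tu pedido {order_id} ha sido enviado.",
}


def render(
    template_id: str,
    language: str,
    variables: Dict[str, str],
    fallback_language: str = "en-US",
) -> str:
    """Render a template in the user's language, falling back to en-US."""
    template = TEMPLATES.get((template_id, language))
    if template is None:
        template = TEMPLATES.get((template_id, fallback_language))
    if template is None:
        raise KeyError(f"no template for {template_id!r}")
    try:
        return template.format(**variables)
    except KeyError as exc:
        # Variables are required: fail loudly rather than send "{name}" to a user.
        raise ValueError(f"missing required variable: {exc.args[0]}") from exc
```

Failing on a missing variable at render time keeps half-filled messages out of the delivery queue; channel-specific length limits (such as 160 characters for SMS) would be enforced after rendering.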
Production Examples
| Company | Scale | Notable Features |
|---|---|---|
| Twilio Notify | Millions/day | Multi-channel API, delivery receipts |
| Firebase | Billions/day | Free push, topic subscriptions |
| Amazon SNS | Billions/day | AWS integration, fanout patterns |
| OneSignal | 10B+/day | Segmentation, A/B testing |
| Airship | Enterprise | ML-based send time optimization |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Delivery guarantee | At-most-once, At-least-once | At-least-once with idempotency |
| Queue architecture | Single queue, Per-channel, Per-priority | Per-channel with priority levels |
| Provider strategy | Single provider, Multi with failover | Multi-provider with automatic failover |
| Rate limiting | User-level, Global, Both | Multi-level (user + category + provider) |
| Token storage | In-memory, Database | Database with caching |
| Logging | All notifications, Failures only | All notifications for audit trail |