Design a Notification System

Related Concepts: Message Queues (Kafka/SQS), Fan-Out Pattern, Push Services (APNs/FCM), Rate Limiting, Template Engine, User Preferences, Retry with Backoff, Idempotency, Dead Letter Queue

Design a system that sends notifications to users via multiple channels: push notifications, SMS, and email.

Step 1: Requirements and Scope

Functional Requirements

  • Support multiple channels: push (iOS/Android), SMS, email
  • Soft real-time delivery (slight delays acceptable)
  • Users can configure notification preferences
  • Support scheduled notifications
  • Rate limiting to prevent notification spam
  • Template-based content management

Non-Functional Requirements

Requirement | Target | Rationale
Availability | 99.99% | Critical alerts must always work
Scalability | 1B notifications/day | Handle viral events, marketing blasts
Latency | < 30 seconds p99 | User expects timely notifications
Delivery | At-least-once | Better to duplicate than miss
Reliability | Handle provider failures | No single point of failure

Scale Estimation

  • 100 million users
  • 1 billion notifications/day (~12K/second average)
  • Peak: 10x average (during events)
  • Storage: 1B x 1KB = 1TB/day for notification logs
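
The figures above follow from simple arithmetic. The snippet below reproduces them as a sanity check; it assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), and the variable names are illustrative only.

```python
# Back-of-envelope capacity numbers derived from the estimates above.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

notifications_per_day = 1_000_000_000     # 1 billion/day
peak_multiplier = 10                      # peak traffic is 10x the average
log_bytes_per_notification = 1_000        # 1 KB per log record (decimal units)

avg_per_second = notifications_per_day / SECONDS_PER_DAY
peak_per_second = avg_per_second * peak_multiplier
log_tb_per_day = notifications_per_day * log_bytes_per_notification / 1e12

print(f"average: {avg_per_second:,.0f}/s")      # average: 11,574/s
print(f"peak:    {peak_per_second:,.0f}/s")     # peak:    115,741/s
print(f"logs:    {log_tb_per_day:.1f} TB/day")  # logs:    1.0 TB/day
```

The average rounds to the ~12K/second quoted above, so the workers and queues need to absorb roughly 120K notifications/second at peak.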

Step 2: Notification Channel Comparison

Channel | Provider | Cost | Reach | Latency | Content
iOS Push | APNs | Free | Requires app | Sub-second | 4KB payload
Android Push | FCM | Free | Requires app | Sub-second | 4KB payload
SMS | Twilio, Nexmo | $0.01-0.05/msg | Universal | 1-5 seconds | 160 chars
Email | SendGrid, SES | $0.0001/msg | Universal | Seconds-minutes | Unlimited

When to Use Each Channel

Use Case | Channel | Reason
OTP/Security alerts | SMS | Most reliable, no app needed
Time-sensitive alerts | Push | Instant, free
Marketing campaigns | Email | Rich content, cost-effective
Critical transactions | Push + SMS fallback | Redundancy
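
One way to express the "Push + SMS fallback" idea is an ordered channel list per use case, where the next channel is tried only if the previous one fails. The sketch below is a minimal illustration: the use-case keys, the `send_with_fallback` helper, and the `send` callback are all hypothetical names, not part of any provider SDK.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping: use case -> channels in the order they should be tried.
CHANNELS_BY_USE_CASE: Dict[str, List[str]] = {
    "otp": ["sms"],
    "time_sensitive": ["push"],
    "marketing": ["email"],
    "critical_transaction": ["push", "sms"],  # push first, SMS as fallback
}


def send_with_fallback(use_case: str, send: Callable[[str], bool]) -> Optional[str]:
    """Try each channel in order; return the one that succeeded, or None."""
    for channel in CHANNELS_BY_USE_CASE[use_case]:
        if send(channel):  # `send` returns True when the provider accepted it
            return channel
    return None


# Example: push delivery fails, so the SMS fallback is used.
delivered_via = send_with_fallback("critical_transaction", lambda ch: ch == "sms")
print(delivered_via)  # sms
```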

Step 3: High-Level Architecture

[Diagram: high-level architecture]

Step 4: Notification Flow

[Diagram: notification flow]

Step 5: User Preferences Design

Preference Settings

Setting | Options | Default | Purpose
Channel enabled | true/false per channel | All enabled | Opt-out control
Quiet hours | Start/end time | None | No notifications during sleep
Timezone | IANA timezone | UTC | Respect local time
Frequency cap | Max per day/hour | Unlimited | Prevent spam
Priority threshold | Critical/High/Normal/Low | Low | Only receive notifications at or above this priority

Notification Categories

Category | Default Channel | User Configurable
Security (OTP, alerts) | SMS + Push | Channel only
Transactions | Push + Email | Yes
Social (likes, comments) | Push | Yes
Marketing | Email | Yes
Reminders | Push | Yes
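
The preference settings above can be combined into a single "should this be sent now?" check. The following is a rough sketch under several assumptions: the `Preferences` class and `should_send` function are hypothetical, the caller has already converted the current time into the user's local time of day (for example from the stored IANA timezone), and critical notifications are allowed through quiet hours, which the source does not specify.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional, Set

PRIORITY_RANK = {"low": 0, "normal": 1, "high": 2, "critical": 3}


@dataclass
class Preferences:
    enabled_channels: Set[str] = field(default_factory=lambda: {"push", "sms", "email"})
    quiet_start: Optional[time] = None   # None means no quiet hours (the default)
    quiet_end: Optional[time] = None
    priority_threshold: str = "low"      # default: receive everything


def in_quiet_hours(prefs: Preferences, local_time: time) -> bool:
    if prefs.quiet_start is None or prefs.quiet_end is None:
        return False
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local_time < prefs.quiet_end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= prefs.quiet_start or local_time < prefs.quiet_end


def should_send(prefs: Preferences, channel: str, priority: str, local_time: time) -> bool:
    if channel not in prefs.enabled_channels:
        return False
    if PRIORITY_RANK[priority] < PRIORITY_RANK[prefs.priority_threshold]:
        return False
    # Assumption: critical notifications bypass quiet hours.
    if in_quiet_hours(prefs, local_time) and priority != "critical":
        return False
    return True


prefs = Preferences(quiet_start=time(22, 0), quiet_end=time(7, 0))
print(should_send(prefs, "push", "normal", time(23, 30)))    # False
print(should_send(prefs, "push", "critical", time(23, 30)))  # True
```

A notification suppressed by quiet hours could be dropped or rescheduled for the end of the window; which one is appropriate depends on the category.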

Step 6: Device Token Management

Token Lifecycle

[Diagram: token lifecycle]

Token Storage Schema

Field | Type | Purpose
token_id | UUID | Primary key
user_id | BIGINT | Owner
device_token | VARCHAR(255) | Provider-specific token
platform | ENUM(ios, android) | Determines provider
app_version | VARCHAR(20) | For feature flags
created_at | TIMESTAMP | For cleanup
last_used_at | TIMESTAMP | Activity tracking

Handling Invalid Tokens

Provider Response | Action | Reason
Success | Update last_used_at | Token is valid
Invalid token | Delete immediately | Token no longer valid
Unregistered | Delete immediately | App uninstalled
Rate limited | Retry with backoff | Provider overloaded
Server error | Retry with backoff | Temporary issue
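
The response table above maps naturally onto a small dispatch function. The sketch below uses an in-memory dict as a stand-in for the token table; `ProviderResult`, `Action`, and `apply_result` are hypothetical names, and real APNs/FCM responses would first have to be translated into these categories.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Dict


class ProviderResult(Enum):
    SUCCESS = "success"
    INVALID_TOKEN = "invalid_token"
    UNREGISTERED = "unregistered"
    RATE_LIMITED = "rate_limited"
    SERVER_ERROR = "server_error"


class Action(Enum):
    TOUCHED = "touched"   # last_used_at was updated
    DELETED = "deleted"   # token was removed
    RETRY = "retry"       # caller should retry with backoff


def apply_result(tokens: Dict[str, dict], token_id: str,
                 result: ProviderResult, now: datetime) -> Action:
    """Apply the action from the table above to an in-memory token store."""
    if result is ProviderResult.SUCCESS:
        tokens[token_id]["last_used_at"] = now
        return Action.TOUCHED
    if result in (ProviderResult.INVALID_TOKEN, ProviderResult.UNREGISTERED):
        tokens.pop(token_id, None)  # delete immediately; safe if already gone
        return Action.DELETED
    return Action.RETRY             # rate limited or server error


tokens = {"t1": {"last_used_at": None}, "t2": {"last_used_at": None}}
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(apply_result(tokens, "t1", ProviderResult.SUCCESS, now))       # Action.TOUCHED
print(apply_result(tokens, "t2", ProviderResult.UNREGISTERED, now))  # Action.DELETED
```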

Step 7: Rate Limiting Strategy

Multi-Level Rate Limits

[Diagram: multi-level rate limits]

Rate Limit Configuration

Level | Push | SMS | Email | Reason
Per user/hour | 10 | 5 | 20 | Prevent spam
Per user/day | 50 | 10 | 100 | Daily cap
Marketing/day | 1 | 0 | 3 | User experience
Provider/second | N/A | 100 | 1000 | API limits
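
As an illustration of the per-user limits, here is a minimal fixed-window counter that checks the hourly and daily caps together. It is only a sketch: the `FixedWindowLimiter` class is hypothetical, counts live in process memory and are never evicted, and a production system would more likely keep the counters in a shared store such as Redis with expiring keys.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

# Per-user limits taken from the table above: (channel, window) -> max sends.
LIMITS = {
    ("push", "hour"): 10, ("sms", "hour"): 5, ("email", "hour"): 20,
    ("push", "day"): 50, ("sms", "day"): 10, ("email", "day"): 100,
}
WINDOW_SECONDS = {"hour": 3600, "day": 86400}


class FixedWindowLimiter:
    def __init__(self) -> None:
        self._counts: DefaultDict[Tuple, int] = defaultdict(int)

    def allow(self, user_id: int, channel: str, now: float) -> bool:
        """Return True and record the send if every window still has room."""
        keys = []
        for window, seconds in WINDOW_SECONDS.items():
            bucket = int(now // seconds)  # index of the current fixed window
            key = (user_id, channel, window, bucket)
            if self._counts[key] >= LIMITS[(channel, window)]:
                return False              # reject without consuming quota
            keys.append(key)
        for key in keys:
            self._counts[key] += 1
        return True


limiter = FixedWindowLimiter()
results = [limiter.allow(42, "sms", now=0.0) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

Fixed windows allow short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more state.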

Step 8: Reliability Patterns

Retry Strategy

Failure Type | Strategy | Max Retries | Backoff
Network timeout | Immediate retry | 3 | None
Provider 5xx | Exponential backoff | 5 | 1s, 2s, 4s, 8s, 16s
Rate limited | Fixed delay | 10 | Provider's retry-after
Invalid token | No retry | 0 | Delete token
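
The exponential schedule in the "Provider 5xx" row (1s, 2s, 4s, 8s, 16s) can be generated rather than hard-coded. The sketch below shows one possible shape; `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the `sleep` parameter is injectable so the logic can be exercised without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as a provider 5xx."""


def backoff_delays(max_retries: int = 5, base: float = 1.0) -> List[float]:
    """Delays before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(max_retries)]


def call_with_retry(operation: Callable[[], T], delays: List[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying once per entry in `delays` on TransientError."""
    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()  # final attempt; any error propagates to the caller


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

When the final attempt also fails, the exception reaches the caller, which is a natural place to route the message to a dead letter queue.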

Provider Failover

[Diagram: provider failover]

Idempotency

Scenario | Without Idempotency | With Idempotency
Network retry | User gets duplicate | Only one delivered
Worker crash | May resend | Safe to replay
Queue replay | Duplicate notifications | Deduplicated

Implementation: Store notification_id in Redis with TTL; skip if already processed.
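
The deduplication check can be sketched as a "claim" operation that succeeds only for the first caller within the TTL. The example below uses an in-memory dict as a stand-in so it runs without a Redis server; the `DedupStore` class and `process` function are hypothetical. In Redis, the same claim is commonly expressed as a single `SET` with the `NX` and `EX` options, which makes the check-and-store step atomic.

```python
import time
from typing import Callable, Dict


class DedupStore:
    """In-memory stand-in for a Redis key with a TTL (illustration only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def claim(self, notification_id: str) -> bool:
        """Return True only for the first claim of an ID within the TTL."""
        now = self._clock()
        expires_at = self._expiry.get(notification_id)
        if expires_at is not None and expires_at > now:
            return False  # already processed recently: skip
        self._expiry[notification_id] = now + self._ttl
        return True


def process(store: DedupStore, notification_id: str,
            deliver: Callable[[str], None]) -> bool:
    """Deliver at most once per TTL window; return whether delivery ran."""
    if not store.claim(notification_id):
        return False
    deliver(notification_id)
    return True


sent = []
store = DedupStore(ttl_seconds=3600)
process(store, "n-1", sent.append)
process(store, "n-1", sent.append)  # replayed message is skipped
print(sent)  # ['n-1']
```

Claiming before delivery means a crash between the two steps suppresses the retry until the TTL expires. Marking the ID only after a successful send avoids that, at the cost of occasional duplicates, which fits the at-least-once goal.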

Step 9: Priority Queue Design

[Diagram: priority queues]

Priority SLAs

Priority | Max Latency | Use Cases
P1 Critical | 5 seconds | OTP, fraud alerts, password reset
P2 High | 30 seconds | Order confirmation, payment receipt
P3 Normal | 5 minutes | Likes, comments, follows
P4 Low | 30 minutes | Marketing, weekly digest
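
To illustrate the ordering these SLAs imply, the sketch below uses a single in-process heap in which lower priority numbers are served first and ties are broken by arrival order. This is a simplification: `PriorityNotificationQueue` is a hypothetical class, and a deployed system would more likely use separate broker queues or topics per priority with dedicated workers, partly so that low-priority traffic is not starved.

```python
import heapq
import itertools
from typing import List, Tuple

# Max latency per priority level, in seconds, from the table above.
MAX_LATENCY_SECONDS = {1: 5, 2: 30, 3: 300, 4: 1800}


class PriorityNotificationQueue:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def push(self, priority: int, notification_id: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), notification_id))

    def pop(self) -> Tuple[int, str]:
        priority, _, notification_id = heapq.heappop(self._heap)
        return priority, notification_id

    def __len__(self) -> int:
        return len(self._heap)


q = PriorityNotificationQueue()
q.push(4, "weekly-digest")
q.push(1, "otp-123")
q.push(3, "like-9")
q.push(1, "fraud-7")
print([q.pop()[1] for _ in range(len(q))])
# ['otp-123', 'fraud-7', 'like-9', 'weekly-digest']
```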

Step 10: Analytics and Monitoring

Delivery Funnel

[Diagram: delivery funnel]

Key Metrics

Metric | Target | Alert Threshold
Delivery rate | > 95% | < 90%
P99 latency | < 30s | > 60s
Provider error rate | < 1% | > 5%
Invalid token rate | < 5% | > 10%
Queue depth | < 100K | > 500K
Rate limit hits | < 1% | > 5%
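
The alert thresholds above can be encoded as data and evaluated against a metrics snapshot. The sketch below is illustrative only: the metric key names and the `firing_alerts` helper are hypothetical, and rates are expressed as fractions (0.90 for 90%).

```python
import operator
from typing import Callable, Dict, List, Tuple

# metric name -> (comparison that triggers the alert, threshold)
ALERT_RULES: Dict[str, Tuple[Callable[[float, float], bool], float]] = {
    "delivery_rate": (operator.lt, 0.90),
    "p99_latency_seconds": (operator.gt, 60),
    "provider_error_rate": (operator.gt, 0.05),
    "invalid_token_rate": (operator.gt, 0.10),
    "queue_depth": (operator.gt, 500_000),
    "rate_limit_hit_rate": (operator.gt, 0.05),
}


def firing_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics that crossed their alert threshold."""
    firing = []
    for name, value in metrics.items():
        rule = ALERT_RULES.get(name)
        if rule is not None and rule[0](value, rule[1]):
            firing.append(name)
    return sorted(firing)


snapshot = {"delivery_rate": 0.88, "p99_latency_seconds": 42, "queue_depth": 750_000}
print(firing_alerts(snapshot))  # ['delivery_rate', 'queue_depth']
```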

Step 11: Template System

Template Structure

Component | Push | SMS | Email
Title | Yes (50 chars) | No | Yes (subject)
Body | Yes (200 chars) | Yes (160 chars) | Yes (HTML)
Rich content | Image, action buttons | No | Images, links, formatting
Personalization | Yes | Yes | Yes

Localization

Field | Storage | Fallback
Template ID | order_shipped | -
Language | User preference | en-US
Variables | {order_id}, {name} | Required
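
The lookup-with-fallback behavior described above might look like the following sketch. The template texts, the `TEMPLATES` dict, and the `render` function are made up for illustration; only the `order_shipped` ID, the `en-US` fallback, and the `{order_id}` and `{name}` variables come from the table.

```python
from typing import Dict, Tuple

DEFAULT_LANGUAGE = "en-US"

# Hypothetical template store keyed by (template_id, language).
TEMPLATES: Dict[Tuple[str, str], str] = {
    ("order_shipped", "en-US"): "Hi {name}, your order {order_id} has shipped.",
    ("order_shipped", "es-ES"): "Hola {name}, tu pedido {order_id} ha sido enviado.",
}


def render(template_id: str, language: str, variables: Dict[str, str]) -> str:
    """Render a template in the user's language, falling back to en-US."""
    template = TEMPLATES.get((template_id, language))
    if template is None:
        template = TEMPLATES[(template_id, DEFAULT_LANGUAGE)]
    # Variables are required: str.format raises KeyError if one is missing.
    return template.format(**variables)


print(render("order_shipped", "fr-FR", {"name": "Ana", "order_id": "A123"}))
# Hi Ana, your order A123 has shipped.
```

Failing loudly on a missing variable keeps a half-filled message such as "Hi {name}" from ever reaching a user.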

Production Examples

Company | Scale | Notable Features
Twilio Notify | Millions/day | Multi-channel API, delivery receipts
Firebase | Billions/day | Free push, topic subscriptions
Amazon SNS | Billions/day | AWS integration, fanout patterns
OneSignal | 10B+/day | Segmentation, A/B testing
Airship | Enterprise | ML-based send time optimization

Summary: Key Design Decisions

Decision | Options | Recommendation
Delivery guarantee | At-most-once, At-least-once | At-least-once with idempotency
Queue architecture | Single queue, Per-channel, Per-priority | Per-channel with priority levels
Provider strategy | Single provider, Multi with failover | Multi-provider with automatic failover
Rate limiting | User-level, Global, Both | Multi-level (user + category + provider)
Token storage | In-memory, Database | Database with caching
Logging | All notifications, Failures only | All notifications for audit trail