Design a Chat System

Design a real-time chat application like WhatsApp, Slack, or Facebook Messenger.

Related Concepts: WebSocket Connections, Long Polling, Message Queues, Presence Service, Read Receipts, Push Notifications, Message Storage, End-to-End Encryption, Connection Gateway

Step 1: Requirements and Scope

Functional Requirements

  • One-on-one messaging
  • Group messaging (up to 500 members)
  • Online presence indicators
  • Read receipts (sent, delivered, read)
  • Push notifications for offline users
  • Message history with search

Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Latency | < 100 ms delivery | Real-time feel |
| Availability | 99.99% | Communication is critical |
| Ordering | Per-conversation | Messages must appear in order |
| Delivery | At-least-once | No lost messages |
| Persistence | Forever (or user-deleted) | Message history matters |
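
At-least-once delivery means a retry can deliver the same message twice, so the receiver has to deduplicate, and per-conversation ordering means late arrivals must still be shown in order. The following is a minimal sketch of both, assuming each message carries a unique, time-sorted `message_id`; the `ConversationView` class is a hypothetical helper, not part of any particular client library.

```python
import bisect


class ConversationView:
    """Client-side view of one conversation (hypothetical helper)."""

    def __init__(self):
        self._seen_ids = set()   # message_ids already applied
        self._messages = []      # (message_id, body), kept sorted by message_id

    def receive(self, message_id, body):
        """Apply a delivery; return False if it was a duplicate redelivery."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        # Insert in sorted position so out-of-order arrivals display in order.
        bisect.insort(self._messages, (message_id, body))
        return True

    def bodies(self):
        return [body for _, body in self._messages]


view = ConversationView()
view.receive(101, "hello")
view.receive(103, "are you there?")
view.receive(102, "hi")              # arrived late, still sorted into place
view.receive(103, "are you there?")  # retry of an already-applied message
print(view.bodies())  # ['hello', 'hi', 'are you there?']
```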

Scale Estimation

  • 100 million daily active users
  • 50 billion messages per day (~600K/second)
  • Average message size: 100 bytes
  • Peak: 5x average (~3M messages/second)
  • Storage: 50B x 100 bytes = 5TB/day
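
The estimates above can be checked with a few lines of arithmetic. This sketch uses only the figures from the list; the per-user and yearly numbers are derived values, and the storage figure is raw payload only (replication, indexes, and metadata are not modeled).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_active_users = 100_000_000
messages_per_day = 50_000_000_000
avg_message_bytes = 100
peak_factor = 5

avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * peak_factor
msgs_per_user_per_day = messages_per_day / daily_active_users
storage_tb_per_day = messages_per_day * avg_message_bytes / 1e12
storage_pb_per_year = storage_tb_per_day * 365 / 1000

print(f"average:  {avg_msgs_per_sec:,.0f} msg/s")      # about 580K
print(f"peak:     {peak_msgs_per_sec:,.0f} msg/s")     # about 2.9M
print(f"per user: {msgs_per_user_per_day:,.0f} msg/day")
print(f"storage:  {storage_tb_per_day:.1f} TB/day, {storage_pb_per_year:.2f} PB/year")
```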

Step 2: Communication Protocols

Protocol Comparison

| Protocol | Connection | Latency | Server Load | Use Case |
| --- | --- | --- | --- | --- |
| HTTP Polling | New per request | High | Very High | Fallback only |
| Long Polling | Held open | Medium | Medium | Simple real-time |
| WebSocket | Persistent | Low | Low | Chat, gaming |
| Server-Sent Events | Persistent (one-way) | Low | Low | Notifications |

WebSocket trade-offs:

| Pros | Cons |
| --- | --- |
| Full-duplex communication | Stateful (harder to scale) |
| Low latency | Connection management complexity |
| Efficient (no HTTP overhead) | Need fallback for firewalls |

Step 3: High-Level Architecture

Step 4: Message Flow

One-on-One Messaging

Group Messaging

Fan-Out Strategy for Groups

| Group Size | Strategy | Rationale |
| --- | --- | --- |
| Small (< 100) | Fan-out on write | Pre-deliver to all members |
| Large (100-500) | Fan-out on write | Still manageable |
| Very large (500+) | Fan-out on read | Avoid write amplification |
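
The strategy table can be sketched as a size check at send time. This is an illustrative in-memory model, assuming the 500-member threshold from the table; `FanOutStore` and its fields are hypothetical names, and a real system would write to per-user queues or a message store instead of Python lists.

```python
from collections import defaultdict

FAN_OUT_ON_WRITE_MAX_MEMBERS = 500  # threshold taken from the table


def choose_strategy(member_count):
    """Return 'write' for groups up to the threshold, 'read' above it."""
    return "write" if member_count <= FAN_OUT_ON_WRITE_MAX_MEMBERS else "read"


class FanOutStore:
    def __init__(self):
        self.inboxes = defaultdict(list)     # user_id -> pre-delivered messages
        self.group_logs = defaultdict(list)  # group_id -> messages read on demand

    def send(self, group_id, members, sender_id, body):
        strategy = choose_strategy(len(members))
        message = (group_id, sender_id, body)
        if strategy == "write":
            # Fan-out on write: one copy per recipient, cost grows with group size.
            for user_id in members:
                if user_id != sender_id:
                    self.inboxes[user_id].append(message)
        else:
            # Fan-out on read: a single write; members pull from the group log.
            self.group_logs[group_id].append(message)
        return strategy


store = FanOutStore()
print(store.send("team", ["ann", "bob", "cy"], "ann", "standup in 5"))  # write
```

With fan-out on write, one message to a 500-member group costs 499 inbox writes, which is the write amplification the table refers to.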

Step 5: Message Storage

Schema Design

| Table | Purpose | Key Structure |
| --- | --- | --- |
| messages | Store all messages | (conversation_id, message_id) |
| conversations | Conversation metadata | (conversation_id) |
| conversation_members | Who is in each conversation | (conversation_id, user_id) |
| user_conversations | User's conversation list | (user_id, conversation_id) |

Message Table Structure

| Column | Type | Purpose |
| --- | --- | --- |
| message_id | BIGINT | Snowflake ID (time-sorted) |
| conversation_id | BIGINT | Groups messages together |
| sender_id | BIGINT | Who sent it |
| content | TEXT | Message body |
| content_type | ENUM | text, image, video, file |
| created_at | TIMESTAMP | For display |
| status | ENUM | sent, delivered, read |
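
The `message_id` column is described as a time-sorted Snowflake ID. Below is a minimal generator sketch. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) and the custom epoch are assumptions based on the commonly cited Snowflake layout; the source does not specify them.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, arbitrary custom epoch (assumption)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id, clock_ms=None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeGenerator(worker_id=1)
first, second = gen.next_id(), gen.next_id()
print(first < second)  # True: later IDs sort after earlier ones
```

Because the timestamp occupies the high bits, sorting by `message_id` within a conversation also sorts by creation time, which is why the (conversation_id, message_id) key supports ordered reads.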

Database Choice

| Database | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Cassandra | High write throughput, linear scaling | Eventual consistency | Message storage |
| PostgreSQL | ACID, familiar | Vertical scaling limits | Metadata, small scale |
| ScyllaDB | Cassandra-compatible, faster | Less mature | High performance |
| TiDB | MySQL-compatible, distributed | Newer | Hybrid workloads |

Recommendation: Cassandra for messages, PostgreSQL for user/conversation metadata

Step 6: Online Presence

Presence Architecture

Presence States

| State | Indicator | Storage |
| --- | --- | --- |
| Online | Green dot | user:{id}:status = online, TTL 10s |
| Away | Yellow dot | user:{id}:status = away, TTL 30s |
| Offline | Gray dot | Key expired or deleted |
| Last seen | Timestamp | user:{id}:last_seen = timestamp |
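
The TTL scheme in the table means a user counts as offline as soon as heartbeats stop refreshing the status key. The sketch below models that with an in-memory dictionary standing in for a key-value store with key expiry; the `PresenceStore` class and its injectable clock are assumptions made so the example is self-contained.

```python
import time

TTL_SECONDS = {"online": 10, "away": 30}  # values from the table


class PresenceStore:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._status = {}     # "user:{id}:status" -> (status, expires_at)
        self._last_seen = {}  # "user:{id}:last_seen" -> timestamp

    def heartbeat(self, user_id, status="online"):
        """Refresh the status key; clients call this more often than the TTL."""
        now = self._clock()
        self._status[f"user:{user_id}:status"] = (status, now + TTL_SECONDS[status])
        self._last_seen[f"user:{user_id}:last_seen"] = now

    def get_status(self, user_id):
        entry = self._status.get(f"user:{user_id}:status")
        if entry is None or entry[1] <= self._clock():
            return "offline"  # key missing or expired
        return entry[0]

    def last_seen(self, user_id):
        return self._last_seen.get(f"user:{user_id}:last_seen")


fake_now = [0.0]  # controllable clock for the demonstration
store = PresenceStore(clock=lambda: fake_now[0])
store.heartbeat(42)
print(store.get_status(42))  # online
fake_now[0] = 11.0           # no heartbeat for longer than the 10s TTL
print(store.get_status(42))  # offline
```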

Scaling Presence Updates

| Approach | Description | When to Use |
| --- | --- | --- |
| Eager | Push status to all friends on change | Small friend lists (< 100) |
| Lazy | Send status when friend opens chat | Large friend lists |
| Hybrid | Eager for close friends, lazy for others | Best of both |
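
One way to read the hybrid row is as a rule for choosing who receives an immediate push. The sketch below is one possible interpretation, not a prescribed algorithm: the 100-friend threshold comes from the table, while the function names and the notion of a precomputed `close_friends` set are assumptions.

```python
EAGER_FRIEND_LIMIT = 100  # "small friend list" threshold from the table


def eager_recipients(friends, close_friends):
    """Friends who are pushed a status change immediately."""
    friends = set(friends)
    if len(friends) < EAGER_FRIEND_LIMIT:
        return friends                   # small list: push to everyone
    return friends & set(close_friends)  # large list: close friends only


def lazy_recipients(friends, close_friends):
    """Friends who fetch the status only when they open the chat."""
    return set(friends) - eager_recipients(friends, close_friends)


friends = range(500)
close = {3, 17, 42}
print(len(eager_recipients(friends, close)))  # 3
print(len(lazy_recipients(friends, close)))   # 497
```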

Step 7: Read Receipts

Message States

Each message moves through three states in order: sent, delivered, read.

Read Receipt Storage

| Field | Type | Purpose |
| --- | --- | --- |
| conversation_id | BIGINT | Which conversation |
| user_id | BIGINT | Who read |
| last_read_message_id | BIGINT | Up to which message |
| read_at | TIMESTAMP | When they read |

Efficiency: Last Read Pointer

Instead of tracking read status for each message, store one last-read position per user per conversation:

| Approach | Storage | Updates | Accuracy |
| --- | --- | --- | --- |
| Per-message status | High | Many | Exact |
| Last-read pointer | Low | One per read | Approximate |

Recommendation: Last-read pointer for efficiency
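
A last-read pointer can be sketched as one record per (conversation, user) that only moves forward. The example assumes message IDs are time-sorted, so "read up to ID N" implies every lower ID is read; the `ReadPointers` class is a hypothetical in-memory stand-in for the storage table above.

```python
class ReadPointers:
    def __init__(self):
        # (conversation_id, user_id) -> (last_read_message_id, read_at)
        self._pointers = {}

    def mark_read(self, conversation_id, user_id, message_id, read_at):
        key = (conversation_id, user_id)
        current = self._pointers.get(key)
        # Only advance: a stale update from another device must not move it back.
        if current is None or message_id > current[0]:
            self._pointers[key] = (message_id, read_at)

    def last_read(self, conversation_id, user_id):
        current = self._pointers.get((conversation_id, user_id))
        return current[0] if current else None

    def is_read(self, conversation_id, user_id, message_id):
        last = self.last_read(conversation_id, user_id)
        return last is not None and message_id <= last

    def unread_count(self, conversation_id, user_id, message_ids):
        last = self.last_read(conversation_id, user_id)
        if last is None:
            return len(message_ids)
        return sum(1 for m in message_ids if m > last)


pointers = ReadPointers()
pointers.mark_read(conversation_id=1, user_id=7, message_id=105, read_at=1_700_000_000)
print(pointers.is_read(1, 7, 103))                       # True
print(pointers.unread_count(1, 7, list(range(101, 109))))  # 3
```

The trade-off noted in the table shows up here: the pointer cannot express "read 105 but skipped 103", which is why its accuracy is listed as approximate.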

Step 8: Multi-Device Sync

Challenge

User A has a phone and a laptop online at the same time. Both devices must show the same conversation state.

Sync Protocol

| Event | Action |
| --- | --- |
| Device connects | Send messages since last_sync_id |
| New message received | Broadcast to all user's devices |
| Message sent | Sync to other devices |
| Device offline | Queue messages, sync on reconnect |
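
The four events in the table can be modeled with a per-user log and a cursor per device. This is a simplified in-memory sketch: `SyncServer`, its method names, and the use of a single global counter for `sync_id` are assumptions for illustration.

```python
from collections import defaultdict


class SyncServer:
    def __init__(self):
        self._log = defaultdict(list)     # user_id -> [(sync_id, payload)], ascending
        self._online = defaultdict(dict)  # user_id -> {device_id: pushed entries}
        self._next_sync_id = 1

    def connect(self, user_id, device_id, last_sync_id):
        """Device connects: register it and return everything after its cursor."""
        self._online[user_id][device_id] = []
        return [entry for entry in self._log[user_id] if entry[0] > last_sync_id]

    def disconnect(self, user_id, device_id):
        """Device offline: entries stay in the log until it reconnects."""
        self._online[user_id].pop(device_id, None)

    def publish(self, user_id, payload, from_device=None):
        """New or sent message: log it and push to the user's other online devices."""
        entry = (self._next_sync_id, payload)
        self._next_sync_id += 1
        self._log[user_id].append(entry)
        for device_id, pushed in self._online[user_id].items():
            if device_id != from_device:
                pushed.append(entry)
        return entry[0]

    def pushed_to(self, user_id, device_id):
        return list(self._online[user_id].get(device_id, []))


server = SyncServer()
server.connect("alice", "phone", last_sync_id=0)
cursor = server.publish("alice", "hi from bob")
server.publish("alice", "reply typed on phone", from_device="phone")
print(server.connect("alice", "laptop", last_sync_id=cursor))
# [(2, 'reply typed on phone')]
```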

Step 9: Connection Management

Sticky Sessions

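WebSocket connections are stateful: each connection lives on a single server, so a user must be routed to the same server consistently, and messages for that user must reach the server holding the connection.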
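
The summary at the end of this document recommends a consistent hash for sticky routing. Below is a minimal hash-ring sketch; the server names, the number of virtual nodes, and the choice of SHA-256 are illustrative assumptions.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) virtual nodes
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, server):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def server_for(self, user_id):
        if not self._ring:
            raise LookupError("no servers in ring")
        # First virtual node clockwise from the user's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(str(user_id)), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["gw-1", "gw-2", "gw-3"])
before = ring.server_for(12345)
ring.remove("gw-2" if before != "gw-2" else "gw-3")  # remove a server the user is not on
print(ring.server_for(12345) == before)  # True: unaffected users keep their server
```

The property that matters for chat is that removing or adding one server only remaps the users who were on that server's virtual nodes, so a single failure does not force every client to reconnect elsewhere.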

Handling Server Failures

| Scenario | Handling |
| --- | --- |
| Server crashes | Client reconnects, load balancer routes to new server |
| Graceful shutdown | Server notifies clients, triggers reconnect |
| Network partition | Heartbeat timeout, client reconnects |
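
Every row in the table ends with the client reconnecting. When one server drops many connections at once, simultaneous reconnects can overload the remaining servers, so clients commonly spread their retries out. The sketch below shows exponential backoff with full jitter; this technique and the `base` and `cap` values are assumptions added here, since the source only says that clients reconnect.

```python
import random


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


for attempt in range(5):
    # Upper bounds double each attempt: 0.5, 1, 2, 4, 8 seconds.
    print(attempt, round(reconnect_delay(attempt), 2))
```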

Step 10: Push Notifications

Offline Message Delivery

Push Notification Content

| Element | Recommendation |
| --- | --- |
| Title | Sender's name |
| Body | Message preview (truncated) |
| Badge | Unread count |
| Sound | User preference |
| Data | Conversation ID for deep link |
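
The recommendations in the table map directly onto a payload builder. The sketch below produces a generic dictionary, not the wire format of any specific push provider; the field names, the 80-character preview limit, and the `build_push_payload` function are assumptions for illustration.

```python
PREVIEW_CHARS = 80  # hypothetical truncation limit for the preview


def build_push_payload(sender_name, body, unread_count, conversation_id, sound="default"):
    """Assemble notification fields; pass sound=None for a muted conversation."""
    if len(body) > PREVIEW_CHARS:
        preview = body[: PREVIEW_CHARS - 1].rstrip() + "…"
    else:
        preview = body
    payload = {
        "title": sender_name,
        "body": preview,
        "badge": unread_count,
        "data": {"conversation_id": conversation_id},  # used for the deep link
    }
    if sound is not None:
        payload["sound"] = sound
    return payload


payload = build_push_payload("Alice", "Lunch at noon?", unread_count=3, conversation_id=42)
print(payload["title"], "|", payload["body"], "|", payload["badge"])
# Alice | Lunch at noon? | 3
```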

Real-World Systems

| System | Scale | Notable Features |
| --- | --- | --- |
| WhatsApp | 2B users | End-to-end encryption, Erlang backend |
| Slack | 20M DAU | Channels, threads, extensive integrations |
| Discord | 150M MAU | Voice/video, server-based communities |
| Telegram | 800M MAU | Cloud-based, bots, channels |
| Facebook Messenger | 1B+ users | Integrated with Facebook, rich media |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
| --- | --- | --- |
| Protocol | HTTP polling, WebSocket | WebSocket with HTTP fallback |
| Message storage | SQL, NoSQL | Cassandra for messages, PostgreSQL for metadata |
| Group fan-out | On write, on read | On write for groups up to 500 members |
| Presence | Eager, lazy | Hybrid based on relationship |
| Read receipts | Per-message, pointer | Last-read pointer |
| Connection routing | Random, sticky | Consistent hash for sticky sessions |