Design Google Drive

Design a cloud file storage and synchronization service that lets users store files, sync across devices, and collaborate.

Related Concepts: Block-Level Storage, Deduplication, Delta Sync, Conflict Resolution, Versioning, Chunking, Metadata Database, Long Polling/WebSocket, Offline Queue, Content-Addressable Storage

Step 1: Requirements and Scope

Functional Requirements

  • Upload and download files
  • Sync files across multiple devices
  • File versioning (view and restore previous versions)
  • Share files with other users
  • Notifications when files change
  • Support offline editing (sync when back online)

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Availability | 99.99% | Users must not lose access to files |
| Durability | 99.999999999% (11 9s) | Never lose user data |
| Latency | Sync < 5 s for small files | Must feel instant |
| Consistency | Eventual (strong for metadata) | Sync timing flexible |

Scale Estimation

  • 500 million users
  • 100 million daily active users
  • Average user has 200 files, 2 GB total
  • Average file size: 10 MB
  • 2 million files uploaded per minute at peak

Storage:

  • 500M users x 2 GB = 1 EB (exabyte) of raw storage
  • With replication (3x): 3 EB

Upload bandwidth:

  • 2M files/min x 10 MB = 20 TB/min at peak
  • ~333 GB/second
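
The arithmetic above can be re-checked with a few lines of Python. This is only a back-of-envelope sketch; it assumes decimal units (1 GB = 10^9 bytes), and the constant names are made up for illustration.

```python
# Back-of-envelope check of the estimates above (decimal units assumed).
USERS = 500_000_000
BYTES_PER_USER = 2 * 10**9        # 2 GB per user
REPLICATION_FACTOR = 3
PEAK_FILES_PER_MIN = 2_000_000
AVG_FILE_BYTES = 10 * 10**6       # 10 MB per file

raw_storage_eb = USERS * BYTES_PER_USER / 10**18
replicated_storage_eb = raw_storage_eb * REPLICATION_FACTOR
peak_tb_per_min = PEAK_FILES_PER_MIN * AVG_FILE_BYTES / 10**12
peak_gb_per_sec = peak_tb_per_min * 1000 / 60

print(f"raw storage:        {raw_storage_eb:.0f} EB")
print(f"with 3x replicas:   {replicated_storage_eb:.0f} EB")
print(f"peak upload volume: {peak_tb_per_min:.0f} TB/min")
print(f"peak upload rate:   {peak_gb_per_sec:.0f} GB/s")
```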

Step 2: High-Level Architecture

Key Components:

  • Metadata Service: Tracks files, folders, versions, permissions
  • Block Storage: Stores actual file content (chunked)
  • Sync Service: Coordinates changes across devices
  • Notification Service: Pushes updates to connected clients

Step 3: File Upload Design

The Block Approach

Files are broken into blocks rather than stored as single blobs.

Benefits of blocks:

| Benefit | Description |
|---|---|
| Incremental sync | Only upload changed blocks |
| Deduplication | Identical blocks stored once |
| Parallel upload | Upload multiple blocks simultaneously |
| Resume | Network failure does not restart entire file |

Block Size Trade-offs

| Block Size | Pros | Cons |
|---|---|---|
| Small (256 KB) | Better dedup, granular sync | More metadata overhead |
| Medium (4 MB) | Balanced | - |
| Large (64 MB) | Less overhead | Poor incremental sync |

Recommendation: 4 MB blocks (Dropbox uses 4 MB)
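
As a rough illustration of the block approach, the sketch below splits a stream into fixed-size blocks and hashes each one with SHA-256. The function names (`iter_blocks`, `block_list`) are hypothetical, and fixed-size splitting is an assumption; the text does not say whether block boundaries are fixed or content-defined.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, List, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, following the recommendation above


def iter_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[Tuple[str, bytes]]:
    """Yield (sha256_hex, data) per block, holding only one block in memory."""
    while True:
        data = stream.read(block_size)
        if not data:
            break
        yield hashlib.sha256(data).hexdigest(), data


def block_list(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> List[str]:
    """Ordered list of block hashes that describes the file."""
    return [digest for digest, _ in iter_blocks(stream, block_size)]


# Tiny demo with 4-byte blocks: an in-place edit changes only one block hash.
old = block_list(io.BytesIO(b"aaaabbbbcccc"), block_size=4)
new = block_list(io.BytesIO(b"aaaaXXXXcccc"), block_size=4)
changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
print(changed)
```

Only the block at index 1 differs, so only that block would need to be uploaded. Note that with fixed-size blocks an insertion near the start of a file shifts every later block, which is one reason real systems may choose other chunking schemes.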

Upload Flow

Deduplication

When a file is uploaded, each block is hashed. If that hash already exists in the system (from any user), it is not stored again.

Deduplication rates:

  • Average: 30-50% storage savings
  • Code repositories: 70-80% savings (many similar files)
  • Media files: 10-20% savings (high entropy)
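
The deduplication check described above might look like the following sketch: the client sends block hashes first, and uploads only the blocks the store reports as missing. `BlockStore` is a hypothetical in-memory stand-in, not a real service API.

```python
import hashlib
from typing import Dict, Iterable, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """In-memory stand-in for a content-addressed block store (illustrative only)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def missing(self, digests: Iterable[str]) -> List[str]:
        """Return the hashes not stored yet, so the client uploads only those."""
        wanted: List[str] = []
        for digest in digests:
            if digest not in self._blocks and digest not in wanted:
                wanted.append(digest)
        return wanted

    def put(self, data: bytes) -> bool:
        """Store a block under its hash; return False if it was already present."""
        digest = sha256_hex(data)
        if digest in self._blocks:
            return False
        self._blocks[digest] = data
        return True

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


store = BlockStore()
for block in (b"AAAA", b"BBBB"):          # first user's file
    store.put(block)

second_file = [b"AAAA", b"CCCC"]          # second user's file shares one block
to_upload = store.missing(sha256_hex(b) for b in second_file)
for block in second_file:
    if sha256_hex(block) in to_upload:
        store.put(block)

print(len(to_upload), store.stored_bytes())
```

Two files totalling 16 bytes occupy 12 bytes in the store, because the shared block is kept once.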

Step 4: Sync Protocol

Sync Challenges

| Challenge | Example | Solution |
|---|---|---|
| Multiple devices | Edit on laptop and phone | Server is source of truth |
| Offline editing | Edit without internet | Queue changes, sync later |
| Conflicts | Two people edit same file | Conflict resolution |
| Large folders | 100K files in folder | Incremental sync |

Change Detection

Two approaches:

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Polling | Client asks "what changed?" periodically | Simple | Wastes bandwidth, slow |
| Push | Server notifies clients of changes | Real-time | Requires persistent connection |

Recommendation: Push for real-time, polling as fallback

Sync Flow

Version Vectors

Each device tracks what it has synced using a version vector, which maps each participant (the server and every device) to the last version number known for it. For example, Device 1 might record the server at version 15 and itself at version 3, while Device 2 records the server at version 15 and itself at version 7. The server tracks all participants: itself at version 20, Device 1 at version 3, and Device 2 at version 7. Both devices last saw server version 15, so each still needs the server's changes from versions 16 through 20.

When syncing, devices compare version vectors to determine what changes are new and need to be transferred.
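
A minimal sketch of that comparison, using the numbers from the example. The helper names (`compare`, `stale_entries`) and the "missing entry counts as version 0" rule are assumptions made for illustration.

```python
from typing import Dict

VersionVector = Dict[str, int]


def compare(a: VersionVector, b: VersionVector) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    keys = set(a) | set(b)
    a_behind = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    b_behind = any(b.get(k, 0) < a.get(k, 0) for k in keys)
    if a_behind and b_behind:
        return "concurrent"   # each side has changes the other has not seen
    if a_behind:
        return "before"
    if b_behind:
        return "after"
    return "equal"


def stale_entries(local: VersionVector, remote: VersionVector) -> VersionVector:
    """Participants for which `local` has seen fewer versions than `remote`."""
    return {k: v for k, v in remote.items() if local.get(k, 0) < v}


server = {"server": 20, "device_1": 3, "device_2": 7}
device_1 = {"server": 15, "device_1": 3}

print(compare(device_1, server))          # device only needs to pull
print(stale_entries(device_1, server))

# After an offline edit, Device 1 is ahead on its own entry but behind on the server's.
device_1_edited = {"server": 15, "device_1": 4}
print(compare(device_1_edited, server))
```

A "concurrent" result is the case that the conflict resolution step has to handle.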

Step 5: Metadata Design

File Metadata Schema

The namespaces table represents workspaces or user accounts, storing a namespace ID, owner user ID, and storage quota in bytes.

The files table stores file and folder metadata:

  • file_id: Unique identifier (primary key)
  • namespace_id: Which workspace this belongs to
  • parent_folder_id: The containing folder (null for root)
  • name: File or folder name
  • is_folder: Boolean distinguishing files from folders
  • size_bytes: Total file size
  • latest_version: Current version number
  • created_at and modified_at: Timestamps

An index on namespace_id and parent_folder_id enables efficient folder listing queries.

The file_versions table tracks version history with a composite primary key of file_id and version number. Each version stores a JSON block_list (ordered list of block hashes that compose the file), the user who created this version, and a timestamp.

The blocks table uses the SHA256 hash as the primary key for content-addressed storage. It stores the block size, storage path, and a reference count tracking how many file versions reference this block (enabling garbage collection when count reaches zero).
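
One possible rendering of the four tables, run against an in-memory SQLite database so the folder listing query can be tried out. Column types, the column names `quota_bytes`, `version`, `created_by`, `block_hash`, and `ref_count`, and the index name are assumptions; the text names the fields only loosely.

```python
import sqlite3

SCHEMA = """
CREATE TABLE namespaces (
    namespace_id  INTEGER PRIMARY KEY,
    owner_user_id INTEGER NOT NULL,
    quota_bytes   INTEGER NOT NULL
);
CREATE TABLE files (
    file_id          INTEGER PRIMARY KEY,
    namespace_id     INTEGER NOT NULL REFERENCES namespaces(namespace_id),
    parent_folder_id INTEGER REFERENCES files(file_id),   -- NULL for root
    name             TEXT    NOT NULL,
    is_folder        INTEGER NOT NULL DEFAULT 0,
    size_bytes       INTEGER NOT NULL DEFAULT 0,
    latest_version   INTEGER NOT NULL DEFAULT 0,
    created_at       TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    modified_at      TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_files_folder ON files (namespace_id, parent_folder_id);
CREATE TABLE file_versions (
    file_id    INTEGER NOT NULL REFERENCES files(file_id),
    version    INTEGER NOT NULL,
    block_list TEXT    NOT NULL,   -- JSON array of block hashes, in order
    created_by INTEGER NOT NULL,
    created_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (file_id, version)
);
CREATE TABLE blocks (
    block_hash   TEXT PRIMARY KEY,  -- SHA-256 hex digest
    size_bytes   INTEGER NOT NULL,
    storage_path TEXT    NOT NULL,
    ref_count    INTEGER NOT NULL DEFAULT 0
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO namespaces VALUES (1, 42, 15000000000)")
conn.execute(
    "INSERT INTO files (file_id, namespace_id, parent_folder_id, name, is_folder) "
    "VALUES (1, 1, NULL, 'docs', 1)"
)
conn.execute(
    "INSERT INTO files (file_id, namespace_id, parent_folder_id, name) "
    "VALUES (2, 1, 1, 'notes.txt')"
)

# Folder listing: served by the (namespace_id, parent_folder_id) index.
listing = conn.execute(
    "SELECT name FROM files WHERE namespace_id = ? AND parent_folder_id = ? ORDER BY name",
    (1, 1),
).fetchall()
print(listing)
```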

Why Separate Blocks from Files?

| Design | Storage | Dedup | Query Pattern |
|---|---|---|---|
| File blob | Wasteful | None | Simple |
| Block references | Efficient | Yes | More complex |

A file is an ordered list of block references, and the same block can be referenced by many files.

Folder Structure

Two approaches:

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Path-based | Store full path /a/b/c/file.txt | Simple queries | Rename folder = update all children |
| Parent pointer | Each file points to parent | Rename is O(1) | Need recursive queries |

Recommendation: Parent pointer with path caching
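
The recommendation could be sketched as follows: paths are derived by walking parent pointers and memoized, and a rename touches a single record. `FolderTree` is a hypothetical in-memory model, and clearing the whole cache on rename is a deliberately coarse invalidation choice.

```python
from typing import Dict, Optional, Tuple


class FolderTree:
    """Parent-pointer tree with a derived-path cache (illustrative only)."""

    def __init__(self) -> None:
        # file_id -> (parent_folder_id, name); parent is None for root entries
        self.nodes: Dict[int, Tuple[Optional[int], str]] = {}
        self._path_cache: Dict[int, str] = {}

    def add(self, file_id: int, parent_id: Optional[int], name: str) -> None:
        self.nodes[file_id] = (parent_id, name)

    def path(self, file_id: int) -> str:
        """Resolve the full path by walking up the parent pointers, caching results."""
        if file_id in self._path_cache:
            return self._path_cache[file_id]
        parent_id, name = self.nodes[file_id]
        prefix = "" if parent_id is None else self.path(parent_id)
        full = f"{prefix}/{name}"
        self._path_cache[file_id] = full
        return full

    def rename(self, file_id: int, new_name: str) -> None:
        """One record changes; children keep pointing at the same parent id."""
        parent_id, _ = self.nodes[file_id]
        self.nodes[file_id] = (parent_id, new_name)
        self._path_cache.clear()  # cached paths are derived data, safe to rebuild


tree = FolderTree()
tree.add(1, None, "a")
tree.add(2, 1, "b")
tree.add(3, 2, "file.txt")
before = tree.path(3)
tree.rename(1, "x")
after = tree.path(3)
print(before, after)
```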

Step 6: Conflict Resolution

When Conflicts Happen

Conflict Strategies

| Strategy | How It Works | Used By / Notes |
|---|---|---|
| Last write wins | Latest timestamp wins | Simple but lossy |
| Create conflict copy | file (conflicted copy).txt | Dropbox |
| Auto-merge | Merge changes if possible | Google Docs |
| User chooses | Show both versions, user picks | Git |

Dropbox-Style Resolution
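
A minimal sketch of the conflict-copy idea, assuming each commit carries the version it was based on. The classes and the exact copy-naming rule are hypothetical and only loosely modeled on the "file (conflicted copy).txt" pattern mentioned above; this is not Dropbox's actual protocol.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ServerFile:
    version: int = 0
    content: bytes = b""


def conflict_copy_name(name: str) -> str:
    """'report.txt' -> 'report (conflicted copy).txt'."""
    stem, _, ext = name.rpartition(".")
    if stem:
        return f"{stem} (conflicted copy).{ext}"
    return f"{name} (conflicted copy)"


class Namespace:
    def __init__(self) -> None:
        self.files: Dict[str, ServerFile] = {}

    def commit(self, name: str, base_version: int, content: bytes) -> str:
        """Apply an edit made on top of base_version; return the name it was saved under."""
        current = self.files.setdefault(name, ServerFile())
        if base_version == current.version:
            current.version += 1
            current.content = content
            return name
        # Stale base: keep the server's copy untouched and save the edit beside it.
        copy_name = conflict_copy_name(name)
        self.files[copy_name] = ServerFile(version=1, content=content)
        return copy_name


ns = Namespace()
ns.commit("report.txt", 0, b"original")               # becomes version 1
laptop = ns.commit("report.txt", 1, b"laptop edit")   # becomes version 2
phone = ns.commit("report.txt", 1, b"phone edit")     # based on stale version 1
print(laptop, "|", phone)
```

Neither edit is lost: the first writer updates the file, and the second writer's changes land in a separate copy for the user to reconcile.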

Reducing Conflicts

| Technique | Description |
|---|---|
| Sync frequently | More sync = smaller windows for conflict |
| Lock files | Prevent concurrent editing (poor UX) |
| Real-time collaboration | Google Docs approach (different design) |

Step 7: Notification System

Push Notifications

Clients need to learn quickly when files change. Each connected client keeps a long-polling request or a WebSocket open so the server can push change events to it.

Notification Payload

Keep the payload small, containing just enough information to trigger sync. A typical notification includes the event type (such as "file_changed"), file ID, namespace ID, new version number, and timestamp. The client then compares this against its local state to decide whether to download the updated file.
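
For illustration, a payload with those fields might be modeled as below. The field names, ID formats, and the `needs_sync` helper are assumptions; the text lists only which pieces of information the payload carries.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ChangeNotification:
    event_type: str
    file_id: str
    namespace_id: str
    new_version: int
    timestamp: float


def needs_sync(note: ChangeNotification, local_versions: Dict[str, int]) -> bool:
    """Download only if the local copy is older than the announced version."""
    return local_versions.get(note.file_id, 0) < note.new_version


# Server side: serialize a small JSON message.
raw = json.dumps(asdict(ChangeNotification("file_changed", "f-123", "ns-1", 8, 1700000000.0)))

# Client side: parse it and compare against local state.
note = ChangeNotification(**json.loads(raw))
print(needs_sync(note, {"f-123": 7}), needs_sync(note, {"f-123": 8}))
```

The message carries no file content, so it stays small regardless of file size; the client fetches changed blocks separately.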

Step 8: Storage Architecture

Block Storage Layer

Storage Durability

| Component | Durability | How |
|---|---|---|
| S3 Standard | 99.999999999% | 3+ copies across AZs |
| Cross-region | Survives region failure | Async replication |
| Versioning | Recover deleted files | Keep old versions |
| Checksums | Detect corruption | SHA256 on blocks |

Encryption

| Layer | Encryption | Key Management |
|---|---|---|
| In transit | TLS 1.3 | Standard |
| At rest | AES-256 | Per-user keys |
| Client-side (optional) | User-managed | Zero-knowledge |

Step 9: Sharing and Permissions

Permission Model

The permissions table controls access with the following columns:

  • file_id: The file or folder being shared
  • grantee_type: Who receives access (user, group, or anyone)
  • grantee_id: The specific user or group ID (null for "anyone" type)
  • permission: Access level granted (view, edit, or owner)

The composite primary key ensures each file can have multiple permission entries but only one entry per grantee.
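
A permission check over such a table could be sketched like this. The dictionary keyed by (file_id, grantee_type, grantee_id), the level ordering view < edit < owner, and the function names are assumptions for illustration; inheritance from parent folders is left out.

```python
from typing import Dict, Iterable, Optional, Tuple

LEVELS = {"view": 1, "edit": 2, "owner": 3}
GrantKey = Tuple[str, str, Optional[str]]   # (file_id, grantee_type, grantee_id)


def effective_permission(
    grants: Dict[GrantKey, str],
    file_id: str,
    user_id: str,
    group_ids: Iterable[str] = (),
) -> Optional[str]:
    """Highest level granted to the user directly, via a group, or via 'anyone'."""
    keys = [(file_id, "user", user_id), (file_id, "anyone", None)]
    keys += [(file_id, "group", group_id) for group_id in group_ids]
    held = [grants[key] for key in keys if key in grants]
    return max(held, key=LEVELS.__getitem__) if held else None


def can(
    grants: Dict[GrantKey, str],
    file_id: str,
    user_id: str,
    required: str,
    group_ids: Iterable[str] = (),
) -> bool:
    level = effective_permission(grants, file_id, user_id, group_ids)
    return level is not None and LEVELS[level] >= LEVELS[required]


grants = {
    ("f1", "user", "alice"): "owner",
    ("f1", "group", "eng"): "edit",
    ("f1", "anyone", None): "view",
}
print(can(grants, "f1", "bob", "edit", ["eng"]), can(grants, "f1", "carol", "edit"))
```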

Sharing Flow

Public links that do not require authentication:

| Link Type | Access | Expiration |
|---|---|---|
| View only | Read file/folder | Optional |
| Edit | Full access | Optional |
| Download | Direct download | Time-limited |

Step 10: Handling Edge Cases

Large Files (>1 GB)

| Challenge | Solution |
|---|---|
| Upload timeout | Chunked upload with resume |
| Memory | Stream blocks, do not load entire file |
| Processing | Background workers for large files |

Many Small Files

| Challenge | Solution |
|---|---|
| Metadata overhead | Batch metadata operations |
| Block storage overhead | Pack small files together |
| Sync latency | Batch sync for bulk operations |

Offline Sync

| State | Behavior |
|---|---|
| Online | Real-time sync |
| Offline | Queue changes locally |
| Syncing | Upload queued changes |
| Conflict | Resolve and retry |
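
The four states above can be sketched as a small client-side queue. The `upload` callback (returning False to signal a conflict) and all class and method names are hypothetical; a real client would also persist the queue to disk.

```python
from collections import deque
from enum import Enum
from typing import Callable, Deque, List


class SyncState(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    SYNCING = "syncing"
    CONFLICT = "conflict"


class OfflineQueue:
    def __init__(self, upload: Callable[[str], bool]) -> None:
        self.upload = upload                  # returns False to signal a conflict
        self.state = SyncState.ONLINE
        self.pending: Deque[str] = deque()

    def go_offline(self) -> None:
        self.state = SyncState.OFFLINE

    def record(self, change: str) -> None:
        """Queue a local change; push it immediately when online."""
        self.pending.append(change)
        if self.state is SyncState.ONLINE:
            self.flush()

    def flush(self) -> None:
        """Upload queued changes in order; stop at the first conflict."""
        self.state = SyncState.SYNCING
        while self.pending:
            if not self.upload(self.pending[0]):
                self.state = SyncState.CONFLICT   # caller resolves, then calls flush() again
                return
            self.pending.popleft()
        self.state = SyncState.ONLINE


uploaded: List[str] = []


def fake_upload(change: str) -> bool:
    uploaded.append(change)
    return True


queue = OfflineQueue(fake_upload)
queue.go_offline()
queue.record("edit a.txt")
queue.record("edit b.txt")
queued_while_offline = list(queue.pending)
queue.flush()                                 # connectivity restored
print(uploaded, queue.state)
```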

Real-World Systems

| Company | Notable Design Choice |
|---|---|
| Dropbox | 4 MB blocks, custom sync protocol, extensive dedup |
| Google Drive | Integrated with Docs for real-time collab |
| OneDrive | Deep Windows/Office integration, differential sync |
| iCloud | Aggressive battery optimization for mobile |
| Box | Enterprise focus, granular permissions, compliance |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
|---|---|---|
| File storage | Whole file, Blocks | 4 MB blocks |
| Sync approach | Polling, Push | Push with polling fallback |
| Conflict resolution | Last write wins, Conflict copy | Conflict copy |
| Deduplication | None, Block-level | Block-level SHA256 |
| Notification | Polling, WebSocket, Long poll | WebSocket + long poll fallback |
| Metadata DB | Document, Relational | Relational (MySQL/Postgres) |