Design Google Drive

Design a cloud file storage and synchronization service that lets users store files, sync across devices, and collaborate.

Related Concepts: Block-Level Storage, Deduplication, Delta Sync, Conflict Resolution, Versioning, Chunking, Metadata Database, Long Polling/WebSocket, Offline Queue, Content-Addressable Storage

Step 1: Requirements and Scope

Functional Requirements

Upload and download files
Sync files across multiple devices
File versioning (view and restore previous versions)
Share files with other users
Notifications when files change
Support offline editing (sync when back online)

Non-Functional Requirements

Requirement	Target	Rationale
Availability	99.99%	Users must not lose access to files
Durability	99.999999999% (11 9s)	Never lose user data
Latency	Sync < 5s for small files	Must feel instant
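Consistency	Eventual for file content, strong for metadata	Content may reach other devices after a short delay; metadata (names, versions, permissions) must not diverge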

Scale Estimation

500 million users
100 million daily active users
Average user has 200 files, 2 GB total
Average file size: 10 MB
2 million files uploaded per minute at peak

Storage:

500M users x 2 GB = 1 EB (exabyte) of raw storage
With replication (3x): 3 EB

Upload bandwidth:

2M files/min x 10 MB = 20 TB/min at peak
~333 GB/second
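
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes); the variable names are illustrative only.

```python
# Back-of-the-envelope capacity math, using decimal units (1 GB = 10**9 bytes).
MB, GB, TB, EB = 10**6, 10**9, 10**12, 10**18

users = 500_000_000
storage_per_user = 2 * GB
replication_factor = 3

raw_storage = users * storage_per_user
replicated_storage = raw_storage * replication_factor

peak_files_per_minute = 2_000_000
average_file_size = 10 * MB

peak_bytes_per_minute = peak_files_per_minute * average_file_size
peak_bytes_per_second = peak_bytes_per_minute / 60

print(f"raw storage:        {raw_storage / EB:.0f} EB")
print(f"with replication:   {replicated_storage / EB:.0f} EB")
print(f"peak upload volume: {peak_bytes_per_minute / TB:.0f} TB/min")
print(f"peak upload rate:   {peak_bytes_per_second / GB:.0f} GB/s")
```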

Step 2: High-Level Architecture

Key Components:

Metadata Service: Tracks files, folders, versions, permissions
Block Storage: Stores actual file content (chunked)
Sync Service: Coordinates changes across devices
Notification Service: Pushes updates to connected clients

Step 3: File Upload Design

The Block Approach

Files are broken into blocks rather than stored as single blobs.

Benefits of blocks:

Benefit	Description
Incremental sync	Only upload changed blocks
Deduplication	Identical blocks stored once
Parallel upload	Upload multiple blocks simultaneously
Resume	Network failure does not restart entire file

Block Size Trade-offs

Block Size	Pros	Cons
Small (256 KB)	Better dedup, granular sync	More metadata overhead
Medium (4 MB)	Balanced	-
Large (64 MB)	Less overhead	Poor incremental sync

Recommendation: 4 MB blocks (Dropbox uses 4 MB)
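
To illustrate why blocks enable incremental sync, here is a minimal sketch of fixed-size chunking with a SHA-256 hash per block. The function names `iter_blocks` and `block_list` are hypothetical, and the tiny block size in the usage example is only there to keep the demonstration readable.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, List, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, the block size recommended above


def iter_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[Tuple[str, bytes]]:
    """Yield (sha256_hex, block_bytes) for each fixed-size block of a buffered stream."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield hashlib.sha256(block).hexdigest(), block


def block_list(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> List[str]:
    """Return the ordered list of block hashes that describes the stream."""
    return [digest for digest, _ in iter_blocks(stream, block_size)]


# Usage: editing the tail of a file changes only the last block's hash.
before = block_list(io.BytesIO(b"a" * 8 + b"b" * 8 + b"c" * 4), block_size=8)
after = block_list(io.BytesIO(b"a" * 8 + b"b" * 8 + b"c" * 3 + b"d"), block_size=8)
changed = [i for i, (x, y) in enumerate(zip(before, after)) if x != y]
print(changed)
```

Only the blocks whose hashes differ need to be uploaded. Note that fixed-size blocks shift when bytes are inserted near the start of a file, which is one reason some systems use content-defined chunking instead.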

Upload Flow

Deduplication

When a file is uploaded, each block is hashed. If that hash already exists in the system (from any user), it is not stored again.
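
The check described above can be sketched as a content-addressable store keyed by block hash. This is an in-memory illustration, not a production design: `BlockStore`, `put`, `release`, and `stored_bytes` are hypothetical names, and the reference counting follows the blocks table described in the metadata section.

```python
import hashlib
from typing import Dict, Tuple


class BlockStore:
    """In-memory stand-in for content-addressable block storage."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}
        self._ref_counts: Dict[str, int] = {}

    def put(self, block: bytes) -> Tuple[str, bool]:
        """Store a block if its hash is unseen. Returns (hash, was_newly_stored)."""
        digest = hashlib.sha256(block).hexdigest()
        is_new = digest not in self._blocks
        if is_new:
            self._blocks[digest] = block
        self._ref_counts[digest] = self._ref_counts.get(digest, 0) + 1
        return digest, is_new

    def release(self, digest: str) -> None:
        """Drop one reference; delete the block when nothing references it."""
        self._ref_counts[digest] -= 1
        if self._ref_counts[digest] == 0:
            del self._ref_counts[digest]
            del self._blocks[digest]

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self._blocks.values())


store = BlockStore()
first_hash, first_is_new = store.put(b"x" * 100)    # stored
second_hash, second_is_new = store.put(b"x" * 100)  # deduplicated, not stored again
print(first_is_new, second_is_new, store.stored_bytes())
```

In practice a client would send the hashes first and upload only the blocks the server reports as missing, so duplicate blocks never cross the network.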

Typical deduplication rates (illustrative; actual savings depend on the workload):

Average: 30-50% storage savings
Code repositories: 70-80% savings (many similar files)
Media files: 10-20% savings (high entropy)

Step 4: Sync Protocol

Sync Challenges

Challenge	Example	Solution
Multiple devices	Edit on laptop and phone	Server is source of truth
Offline editing	Edit without internet	Queue changes, sync later
Conflicts	Two people edit same file	Conflict resolution
Large folders	100K files in folder	Incremental sync

Change Detection

Two approaches:

Approach	How It Works	Pros	Cons
Polling	Client asks "what changed?" periodically	Simple	Wastes bandwidth, slow
Push	Server notifies clients of changes	Real-time	Requires persistent connection

Recommendation: Push for real-time, polling as fallback

Sync Flow

Version Vectors

Each device tracks what it has synced using version vectors. A version vector maps each participant (server and devices) to the last known version number. For example, Device 1 might know that the server is at version 15 and Device 1 itself is at version 3. Device 2 knows the server is at version 15 and Device 2 is at version 7. The server tracks all participants: server at version 20, Device 1 at version 3, and Device 2 at version 7.

When syncing, devices compare version vectors to determine what changes are new and need to be transferred.
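
The comparison step can be sketched with the standard version-vector rule: one vector is behind another if none of its entries is larger and at least one is smaller, and two vectors are concurrent if each is ahead on some entry. The function names are hypothetical; the sample values come from the example above.

```python
from typing import Dict

VersionVector = Dict[str, int]


def compare(a: VersionVector, b: VersionVector) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    participants = set(a) | set(b)
    # A participant missing from a vector is treated as version 0.
    a_ahead = any(a.get(p, 0) > b.get(p, 0) for p in participants)
    b_ahead = any(b.get(p, 0) > a.get(p, 0) for p in participants)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Element-wise maximum: the state after both sides have exchanged changes."""
    return {p: max(a.get(p, 0), b.get(p, 0)) for p in set(a) | set(b)}


device_1 = {"server": 15, "device_1": 3}
device_2 = {"server": 15, "device_2": 7}
server = {"server": 20, "device_1": 3, "device_2": 7}

print(compare(device_1, server))    # device_1 is behind the server
print(compare(device_1, device_2))  # neither device has seen the other's changes
```

A "before" result means the device only needs to pull changes, while "concurrent" signals that both sides have changes the other lacks and conflict handling may be needed.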

Step 5: Metadata Design

File Metadata Schema

The namespaces table represents workspaces or user accounts, storing a namespace ID, owner user ID, and storage quota in bytes.

The files table stores file and folder metadata:

file_id: Unique identifier (primary key)
namespace_id: Which workspace this belongs to
parent_folder_id: The containing folder (null for root)
name: File or folder name
is_folder: Boolean distinguishing files from folders
size_bytes: Total file size
latest_version: Current version number
created_at and modified_at: Timestamps

An index on namespace_id and parent_folder_id enables efficient folder listing queries.

The file_versions table tracks version history with a composite primary key of file_id and version number. Each version stores a JSON block_list (ordered list of block hashes that compose the file), the user who created this version, and a timestamp.

The blocks table uses the SHA256 hash as the primary key for content-addressed storage. It stores the block size, storage path, and a reference count tracking how many file versions reference this block (enabling garbage collection when count reaches zero).
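
The four tables described above could look roughly like this. The sketch uses SQLite through Python's standard library purely for illustration. Column types, the column names `quota_bytes`, `block_list`, `created_by`, `block_hash`, `storage_path`, and `ref_count`, and the sample rows are assumptions; a production deployment would use MySQL or Postgres with its own types.

```python
import sqlite3

SCHEMA = """
CREATE TABLE namespaces (
    namespace_id   INTEGER PRIMARY KEY,
    owner_user_id  INTEGER NOT NULL,
    quota_bytes    INTEGER NOT NULL
);
CREATE TABLE files (
    file_id          INTEGER PRIMARY KEY,
    namespace_id     INTEGER NOT NULL REFERENCES namespaces(namespace_id),
    parent_folder_id INTEGER REFERENCES files(file_id),  -- NULL for root
    name             TEXT NOT NULL,
    is_folder        INTEGER NOT NULL DEFAULT 0,
    size_bytes       INTEGER NOT NULL DEFAULT 0,
    latest_version   INTEGER NOT NULL DEFAULT 0,
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    modified_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_files_folder ON files (namespace_id, parent_folder_id);
CREATE TABLE file_versions (
    file_id    INTEGER NOT NULL REFERENCES files(file_id),
    version    INTEGER NOT NULL,
    block_list TEXT NOT NULL,  -- JSON array of block hashes, in order
    created_by INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (file_id, version)
);
CREATE TABLE blocks (
    block_hash   TEXT PRIMARY KEY,  -- SHA-256 hex digest
    size_bytes   INTEGER NOT NULL,
    storage_path TEXT NOT NULL,
    ref_count    INTEGER NOT NULL DEFAULT 0
);
"""


def list_folder(conn, namespace_id, folder_id):
    """List names in a folder; folder_id=None lists the namespace root."""
    rows = conn.execute(
        "SELECT name FROM files "
        "WHERE namespace_id = ? AND parent_folder_id IS ? ORDER BY name",
        (namespace_id, folder_id),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO namespaces VALUES (1, 42, ?)", (2 * 10**9,))
conn.execute(
    "INSERT INTO files (file_id, namespace_id, parent_folder_id, name, is_folder) "
    "VALUES (1, 1, NULL, 'docs', 1)"
)
conn.execute(
    "INSERT INTO files (file_id, namespace_id, parent_folder_id, name, is_folder) "
    "VALUES (2, 1, 1, 'notes.txt', 0)"
)
print(list_folder(conn, 1, None), list_folder(conn, 1, 1))
```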

Why Separate Blocks from Files?

Design	Storage	Dedup	Query Pattern
File blob	Wasteful	None	Simple
Block references	Efficient	Yes	More complex

A file is an ordered list of block references, and the same block can be referenced by many files.

Folder Structure

Two approaches:

Approach	How It Works	Pros	Cons
Path-based	Store full path `/a/b/c/file.txt`	Simple queries	Rename folder = update all children
Parent pointer	Each file points to parent	Rename is O(1)	Need recursive queries

Recommendation: Parent pointer with path caching
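
One way to picture the recommended combination is a parent-pointer map with a cache of resolved paths. This is a simplified in-memory sketch with hypothetical names (`FolderTree`, `add`, `path`, `rename`); it clears the whole cache on rename, whereas a real system would invalidate only the affected subtree.

```python
from typing import Dict, Optional, Tuple


class FolderTree:
    """Parent-pointer tree with a cache of resolved full paths."""

    def __init__(self) -> None:
        # file_id -> (parent_folder_id, name); parent is None at the root level
        self._nodes: Dict[int, Tuple[Optional[int], str]] = {}
        self._path_cache: Dict[int, str] = {}

    def add(self, file_id: int, parent_id: Optional[int], name: str) -> None:
        self._nodes[file_id] = (parent_id, name)

    def path(self, file_id: int) -> str:
        """Resolve the full path by walking parent pointers, caching each result."""
        cached = self._path_cache.get(file_id)
        if cached is not None:
            return cached
        parent_id, name = self._nodes[file_id]
        prefix = "" if parent_id is None else self.path(parent_id)
        full_path = f"{prefix}/{name}"
        self._path_cache[file_id] = full_path
        return full_path

    def rename(self, file_id: int, new_name: str) -> None:
        """Rename touches one record; children are not rewritten."""
        parent_id, _ = self._nodes[file_id]
        self._nodes[file_id] = (parent_id, new_name)
        self._path_cache.clear()  # coarse invalidation, for simplicity


tree = FolderTree()
tree.add(1, None, "a")
tree.add(2, 1, "b")
tree.add(3, 2, "c")
tree.add(4, 3, "file.txt")
print(tree.path(4))
tree.rename(1, "x")
print(tree.path(4))
```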

Step 6: Conflict Resolution

When Conflicts Happen

Conflict Strategies

Strategy	How It Works	Used By / Notes
Last write wins	Latest timestamp wins	Simple but lossy
Create conflict copy	`file (conflicted copy).txt`	Dropbox
Auto-merge	Merge changes if possible	Google Docs
User chooses	Show both versions, user picks	Git
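
The "create conflict copy" strategy can be sketched as an optimistic version check on the server: a commit is accepted only if it was based on the latest version, otherwise it is saved under a conflict-copy name. This is a simplified illustration, not Dropbox's actual algorithm; `FileRecord`, `commit`, and `conflicted_name` are hypothetical.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileRecord:
    name: str
    latest_version: int = 0
    versions: Dict[int, List[str]] = field(default_factory=dict)  # version -> block hashes


def conflicted_name(name: str) -> str:
    """'file.txt' -> 'file (conflicted copy).txt'"""
    stem, ext = os.path.splitext(name)
    return f"{stem} (conflicted copy){ext}"


def commit(files: Dict[str, FileRecord], name: str, base_version: int,
           block_hashes: List[str]) -> str:
    """Apply an edit made on top of base_version. Returns the name it was saved under."""
    record = files.setdefault(name, FileRecord(name))
    if base_version != record.latest_version:
        # The edit was based on a stale version: keep both by writing a copy.
        copy_name = conflicted_name(name)
        record = files.setdefault(copy_name, FileRecord(copy_name))
    record.latest_version += 1
    record.versions[record.latest_version] = block_hashes
    return record.name


files: Dict[str, FileRecord] = {}
commit(files, "file.txt", 0, ["h1"])                # initial upload, becomes version 1
laptop = commit(files, "file.txt", 1, ["h2"])       # based on version 1, accepted
phone = commit(files, "file.txt", 1, ["h3"])        # also based on version 1, now stale
print(laptop, "|", phone)
```

No edit is lost: the first writer's change becomes the new version, and the second writer's change is preserved next to it for the user to reconcile.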

Dropbox-Style Resolution

Reducing Conflicts

Technique	Description
Sync frequently	Shorter sync intervals shrink the window in which conflicting edits can occur
Lock files	Prevent concurrent editing (poor UX)
Real-time collaboration	Google Docs approach (different design)

Step 7: Notification System

Push Notifications

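Clients need to learn about file changes quickly. Each client keeps a long-polling request or a WebSocket connection open so the server can push change events as they happen.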

Notification Payload

Keep the payload small, containing just enough information to trigger sync. A typical notification includes the event type (such as "file_changed"), file ID, namespace ID, new version number, and timestamp. The client then compares this against its local state to decide whether to download the updated file.
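
As an illustration, a payload and the client-side check might look like the sketch below. The JSON field names, the ID formats, and the `needs_download` helper are assumptions made for this example rather than a defined wire format.

```python
import json
from typing import Dict

# Example notification as the server might send it (field names are illustrative).
notification = json.dumps({
    "event": "file_changed",
    "file_id": "f_123",
    "namespace_id": "ns_1",
    "version": 8,
    "timestamp": "2024-01-01T12:00:00Z",
})


def needs_download(local_versions: Dict[str, int], raw_notification: str) -> bool:
    """Return True if the notification announces a version newer than the local one."""
    event = json.loads(raw_notification)
    if event["event"] != "file_changed":
        return False
    local_version = local_versions.get(event["file_id"], 0)  # 0 = file not present locally
    return event["version"] > local_version


print(needs_download({"f_123": 7}, notification))  # local copy is older
print(needs_download({"f_123": 8}, notification))  # already up to date
```

Because the payload carries no file content, a client that is already up to date (for example, the device that made the change) can ignore it at almost no cost.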

Step 8: Storage Architecture

Block Storage Layer

Storage Durability

Component	Durability	How
S3 Standard	99.999999999%	3+ copies across AZs
Cross-region	Survives region failure	Async replication
Versioning	Recover deleted files	Keep old versions
Checksums	Detect corruption	SHA256 on blocks

Encryption

Layer	Encryption	Key Management
In transit	TLS 1.3	Standard
At rest	AES-256	Per-user keys
Client-side (optional)	User-managed	Zero-knowledge

Step 9: Sharing and Permissions

Permission Model

The permissions table controls access with the following columns:

file_id: The file or folder being shared
grantee_type: Who receives access (user, group, or anyone)
grantee_id: The specific user or group ID (null for "anyone" type)
permission: Access level granted (view, edit, or owner)

A composite primary key on (file_id, grantee_type, grantee_id) lets each file have multiple permission entries but only one entry per grantee.
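
A permission check against such a table can be sketched as follows. The rule that the highest matching level wins is an assumption of this example, as are the function names; the sketch also ignores inheritance from parent folders, which a real system would need to handle.

```python
from typing import Iterable, Optional, Set, Tuple

# (file_id, grantee_type, grantee_id, permission)
PermissionRow = Tuple[int, str, Optional[str], str]

LEVELS = {"view": 1, "edit": 2, "owner": 3}


def effective_permission(rows: Iterable[PermissionRow], file_id: int,
                         user_id: str, group_ids: Set[str]) -> Optional[str]:
    """Return the highest permission that applies to the user, or None."""
    best: Optional[str] = None
    for row_file_id, grantee_type, grantee_id, permission in rows:
        if row_file_id != file_id:
            continue
        applies = (
            grantee_type == "anyone"
            or (grantee_type == "user" and grantee_id == user_id)
            or (grantee_type == "group" and grantee_id in group_ids)
        )
        if applies and (best is None or LEVELS[permission] > LEVELS[best]):
            best = permission
    return best


def can(rows: Iterable[PermissionRow], file_id: int, user_id: str,
        group_ids: Set[str], required: str) -> bool:
    granted = effective_permission(rows, file_id, user_id, group_ids)
    return granted is not None and LEVELS[granted] >= LEVELS[required]


rows = [
    (10, "user", "alice", "owner"),
    (10, "group", "design-team", "edit"),
    (10, "anyone", None, "view"),
]
print(effective_permission(rows, 10, "bob", {"design-team"}))
print(can(rows, 10, "carol", set(), "edit"))
```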

Sharing Flow

Link Sharing

Public links that do not require authentication:

Link Type	Access	Expiration
View only	Read file/folder	Optional
Edit	Full access	Optional
Download	Direct download	Time-limited
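
One common way to implement a time-limited link without a database lookup is to sign the file ID and expiry with a server-side secret. The sketch below uses HMAC-SHA256 from Python's standard library; the token layout, the `sign_link` and `verify_link` names, and the placeholder secret are assumptions for illustration, not something the design above prescribes.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder; load from secure config in practice


def _signature(file_id: str, expires_at: int) -> str:
    message = f"{file_id}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_link(file_id: str, expires_at: int) -> str:
    """Build a token of the form '<file_id>:<expiry unix time>:<signature>'."""
    return f"{file_id}:{expires_at}:{_signature(file_id, expires_at)}"


def verify_link(token: str, now: Optional[float] = None) -> bool:
    """Accept the token only if the signature matches and it has not expired."""
    try:
        file_id, expires_raw, signature = token.rsplit(":", 2)
        expires_at = int(expires_raw)
    except ValueError:
        return False  # malformed token
    if not hmac.compare_digest(signature, _signature(file_id, expires_at)):
        return False  # tampered file ID or expiry
    current = time.time() if now is None else now
    return current < expires_at


token = sign_link("f_123", expires_at=1000)
print(verify_link(token, now=999), verify_link(token, now=1000))
```

Changing the expiry or file ID in the token invalidates the signature, so a recipient cannot extend the link on their own.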

Step 10: Handling Edge Cases

Large Files (>1 GB)

Challenge	Solution
Upload timeout	Chunked upload with resume
Memory	Stream blocks, do not load entire file
Processing	Background workers for large files
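
Resume falls out naturally from the block design: after a failure, the client skips every block the server already has. The sketch below simulates this with a set of acknowledged hashes and a sender that fails once. `resume_upload` and `send_block` are hypothetical names, and a real client would read each block from disk on demand rather than hold the file in memory.

```python
from typing import Callable, List, Set


def resume_upload(block_hashes: List[str], uploaded: Set[str],
                  send_block: Callable[[int], None]) -> List[int]:
    """Send every block not yet acknowledged. Returns the indexes sent in this attempt.

    `uploaded` is updated as blocks succeed, so a later call continues where
    a failed one stopped.
    """
    sent: List[int] = []
    for index, digest in enumerate(block_hashes):
        if digest in uploaded:
            continue  # already stored from an earlier attempt (or a duplicate block)
        send_block(index)  # may raise on network failure
        uploaded.add(digest)
        sent.append(index)
    return sent


# Usage: the third send fails once, then the upload resumes from block 2.
calls = {"count": 0}


def flaky_send(index: int) -> None:
    calls["count"] += 1
    if calls["count"] == 3:
        raise ConnectionError("simulated network drop")


hashes = ["h0", "h1", "h2", "h3"]
acknowledged: Set[str] = set()
try:
    resume_upload(hashes, acknowledged, flaky_send)
except ConnectionError:
    pass  # blocks 0 and 1 are already acknowledged
second_attempt = resume_upload(hashes, acknowledged, flaky_send)
print(second_attempt)
```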

Many Small Files

Challenge	Solution
Metadata overhead	Batch metadata operations
Block storage overhead	Pack small files together
Sync latency	Batch sync for bulk operations

Offline Sync

State	Behavior
Online	Real-time sync
Offline	Queue changes locally
Syncing	Upload queued changes
Conflict	Resolve and retry
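
The states in the table can be sketched as a small client-side state machine with a local queue. `SyncClient` and its methods are hypothetical, and for brevity the sketch only collects conflicting changes instead of resolving and retrying them.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Change = Dict[str, str]


class SyncClient:
    """Queues edits while offline and replays them in order on reconnect."""

    def __init__(self, upload: Callable[[Change], bool]) -> None:
        self._upload = upload  # returns True if accepted, False on conflict
        self.state = "offline"
        self.queue: Deque[Change] = deque()
        self.conflicts: List[Change] = []

    def edit(self, change: Change) -> None:
        if self.state == "online":
            self._send(change)         # real-time sync
        else:
            self.queue.append(change)  # queue changes locally

    def reconnect(self) -> None:
        self.state = "syncing"
        while self.queue:
            self._send(self.queue.popleft())  # upload queued changes in order
        self.state = "online"

    def disconnect(self) -> None:
        self.state = "offline"

    def _send(self, change: Change) -> None:
        if not self._upload(change):
            self.conflicts.append(change)  # left for conflict resolution


uploaded: List[str] = []


def fake_server(change: Change) -> bool:
    """Stand-in server that rejects edits to b.txt as conflicting."""
    if change["path"] == "b.txt":
        return False
    uploaded.append(change["path"])
    return True


client = SyncClient(fake_server)
client.edit({"path": "a.txt"})
client.edit({"path": "b.txt"})
client.reconnect()
client.edit({"path": "c.txt"})
print(client.state, uploaded, client.conflicts)
```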

Real-World Systems

Company	Notable Design Choice
Dropbox	4 MB blocks, custom sync protocol, extensive dedup
Google Drive	Integrated with Docs for real-time collab
OneDrive	Deep Windows/Office integration, differential sync
iCloud	Aggressive battery optimization for mobile
Box	Enterprise focus, granular permissions, compliance

Summary: Key Design Decisions

Decision	Options	Recommendation
File storage	Whole file, Blocks	4 MB blocks
Sync approach	Polling, Push	Push with polling fallback
Conflict resolution	Last write wins, Conflict copy	Conflict copy
Deduplication	None, Block-level	Block-level SHA256
Notification	Polling, WebSocket, Long poll	WebSocket + long poll fallback
Metadata DB	Document, Relational	Relational (MySQL/Postgres)
