Design a Distributed Task Scheduler
Build a system that schedules and executes millions of tasks, including one-time jobs, recurring cron jobs, and complex workflows with dependencies.
Related Concepts: Message Queues, Availability Patterns, Database Scaling
Step 1: Requirements and Scope
Functional Requirements
| Requirement | Description |
|---|---|
| One-time tasks | Schedule a task to run at a specific time |
| Recurring tasks | Cron-like scheduling (e.g., every day at 3 AM) |
| Task dependencies | Task B runs after Task A completes |
| Retries | Automatic retry with backoff on failure |
| Priority | High-priority tasks run before low-priority |
| Task cancellation | Cancel scheduled or running tasks |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Reliability | No lost tasks, no duplicate effects | Duplicate or missed execution corrupts downstream data |
| Latency | < 1 second scheduling precision | Some tasks are time-sensitive |
| Scale | 100M+ scheduled tasks | Enterprise workloads are large |
| Availability | 99.99% | Missed tasks break downstream systems |
Scale Estimation
- Scheduled tasks: 100 million active tasks
- Task executions: 10 million tasks per hour (~2,800 TPS on average; peaks will be higher)
- Task metadata size: 1KB average
- Storage: 100GB for task metadata
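A quick arithmetic check of these figures (assuming decimal units, 1 KB = 1,000 bytes and 1 GB = 10^9 bytes):

```python
# Back-of-the-envelope check for the scale estimates above.
executions_per_hour = 10_000_000
avg_tps = executions_per_hour / 3600          # average throughput
active_tasks = 100_000_000
metadata_bytes = 1_000                        # 1 KB average per task
storage_gb = active_tasks * metadata_bytes / 10**9

print(f"~{avg_tps:,.0f} TPS average")   # ~2,778 TPS
print(f"~{storage_gb:,.0f} GB metadata")  # ~100 GB
```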
Step 2: High-Level Architecture
Key Components:
- Task API: CRUD operations for tasks
- Scheduler Service: Finds tasks due for execution
- Task Queues: Buffer between scheduler and workers
- Worker Pool: Executes tasks, reports results
- Task Database: Persistent storage for all task metadata
Step 3: Task Storage
Schema Design
The tasks table stores the following columns:
- task_id: UUID primary key
- user_id: Owner of the task
- task_type: Identifies the handler to execute
- payload: JSON data with task-specific parameters
- scheduled_time: When the task should run (with timezone)
- status: Current state (PENDING, RUNNING, COMPLETED, FAILED)
- priority: Numeric priority for ordering
- max_retries: Maximum retry attempts allowed
- retry_count: Current number of retries attempted
- created_at and updated_at: Timestamps for auditing
A partial index on scheduled_time, status, and priority (filtered to PENDING status only) is critical for query performance. This index is much smaller than a full index since it only includes pending tasks.
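A minimal demonstration of the idea, using SQLite's partial-index support for a self-contained example (PostgreSQL syntax is nearly identical); the table and index names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (
    task_id        TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    priority       INTEGER NOT NULL,
    scheduled_time TEXT NOT NULL
);
-- Partial index: only PENDING rows are indexed (status lives in the
-- predicate), so the index stays small even when most rows are
-- COMPLETED or FAILED.
CREATE INDEX idx_due_tasks
    ON tasks (scheduled_time, priority)
    WHERE status = 'PENDING';
""")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT task_id FROM tasks
WHERE status = 'PENDING' AND scheduled_time <= '2024-01-01T00:00:00'
ORDER BY scheduled_time
""").fetchall()
# The plan should report a search using idx_due_tasks rather than a
# full table scan, e.g. "SEARCH tasks USING INDEX idx_due_tasks ..."
print(plan[0][-1])
```

Because the query's `status = 'PENDING'` condition matches the index predicate exactly, the planner can use the partial index for the polling query.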
Time-Based Partitioning
With 100M+ rows, a single table becomes slow to query and maintain. Partition the table by scheduled time, for example one partition per month. Old partitions can be archived or dropped wholesale, and the polling query only touches the partitions covering the current time window.
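A hypothetical helper that emits PostgreSQL-style monthly partition DDL (the parent table would be declared with `PARTITION BY RANGE (scheduled_time)`; table and column names are illustrative):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Generate DDL for one monthly range partition of the tasks table."""
    start = date(year, month, 1)
    # A range partition's upper bound is exclusive: the first day of
    # the following month (rolling over the year in December).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"tasks_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF tasks "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(monthly_partition_ddl(2024, 3))
```

An operations job can run this ahead of time to pre-create upcoming partitions and drop expired ones.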
Step 4: Scheduler Design
Polling vs Push
| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Polling | Periodically query DB for due tasks | Simple, reliable | Slight delay, DB load |
| Time-wheel | In-memory timer wheel | Very precise | Memory overhead, recovery complexity |
| Push | External trigger when time arrives | Real-time | Requires external timer service |
Recommendation: Polling at a short interval (every 1 second). Simple and reliable, and the partial index keeps the polling query fast.
Handling Multiple Schedulers
Running multiple scheduler instances for availability introduces a coordination problem: two schedulers must not pick up the same task.
SELECT ... FOR UPDATE SKIP LOCKED (supported by PostgreSQL and MySQL 8.0+) lets multiple schedulers poll simultaneously: each transaction locks the rows it selects and skips rows already locked by another scheduler, so no one blocks.
Scheduler Algorithm
Every second, the scheduler queries for tasks where the scheduled time has passed and status is PENDING. Results are ordered by priority (highest first) then scheduled time (earliest first), limited to 1000 tasks per batch. The query uses row-level locking with skip-locked behavior so multiple schedulers can poll simultaneously without conflicts.
For each retrieved task, the scheduler updates its status to QUEUED and enqueues it to the appropriate priority queue for worker processing.
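The claim step can be sketched as follows, using SQLite so the example is self-contained. SQLite serializes writers, so wrapping select-and-update in one transaction is enough here; on PostgreSQL the SELECT would add `FOR UPDATE SKIP LOCKED` so concurrent schedulers skip each other's rows. Table and column names are illustrative:

```python
import sqlite3

def claim_due_tasks(conn, now, batch=1000):
    """Atomically claim a batch of due tasks: select them ordered by
    priority and time, then flip their status to QUEUED."""
    with conn:  # one transaction: select + status update
        rows = conn.execute(
            """SELECT task_id FROM tasks
               WHERE status = 'PENDING' AND scheduled_time <= ?
               ORDER BY priority DESC, scheduled_time ASC
               LIMIT ?""",
            (now, batch),
        ).fetchall()
        ids = [r[0] for r in rows]
        conn.executemany(
            "UPDATE tasks SET status = 'QUEUED' WHERE task_id = ?",
            [(i,) for i in ids],
        )
    return ids

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, "
             "priority INTEGER, scheduled_time TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [("a", "PENDING", 1, "2024-01-01"),
     ("b", "PENDING", 9, "2024-01-02"),
     ("c", "PENDING", 5, "2099-01-01")],  # not yet due
)
claimed = claim_due_tasks(conn, "2024-06-01")
print(claimed)  # ['b', 'a'] — highest priority first; 'c' is not due
```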
Step 5: Task Execution
Worker Design
Workers pull tasks from queues and execute them.
Workers should:
- Check high-priority queue first
- Acknowledge task only after completion
- Heartbeat while executing long tasks
- Report success or failure back
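The priority-first pull can be sketched with in-process queues standing in for the real message broker (names are illustrative; heartbeating and acknowledgement are elided):

```python
import queue

high_q = queue.Queue()  # stand-in for the high-priority broker queue
low_q = queue.Queue()   # stand-in for the low-priority broker queue

def next_task():
    """Drain the high-priority queue before touching the low-priority one."""
    try:
        return high_q.get_nowait()
    except queue.Empty:
        pass
    try:
        return low_q.get_nowait()
    except queue.Empty:
        return None  # nothing to do; a real worker would block or sleep

low_q.put("low-1")
high_q.put("high-1")
print(next_task())  # high-1, even though low-1 was enqueued first
```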
Exactly-Once Execution
Ensuring tasks run exactly once (not zero times, not twice) is the primary challenge. A worker can crash after doing the work but before acknowledging it, so the system cannot tell "never ran" apart from "ran but failed to report".
Practical approach: at-least-once delivery combined with idempotent tasks. Most tasks can be made idempotent by checking whether the work is already done before doing it, for example by deduplicating on the task ID.
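A minimal sketch of an idempotent handler, using an in-memory set as the deduplication store (in production this would be a unique key in the database, committed atomically with the side effect; names are illustrative):

```python
# Record completed task_ids so a redelivered task becomes a no-op
# instead of a duplicate side effect.
completed = set()
charges = []  # stands in for the real side effect (e.g., billing)

def charge_customer(task_id, amount):
    if task_id in completed:   # redelivery under at-least-once: skip
        return
    charges.append(amount)     # perform the side effect
    completed.add(task_id)     # mark done

charge_customer("t1", 100)
charge_customer("t1", 100)     # duplicate delivery of the same task
print(charges)                 # [100] — the customer is charged once
```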
Retry Strategy
Exponential backoff with jitter: The delay equals the base delay multiplied by 2 raised to the power of the attempt number, plus a random jitter value, capped at a maximum delay. For example, delays might progress as 1 second, 2 seconds, 4 seconds, 8 seconds, and so on, up to a maximum of 1 hour.
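The formula described above can be written directly (a sketch; the base, cap, and jitter range are tunable assumptions):

```python
import random

def backoff_delay(attempt, base=1.0, cap=3600.0, rng=random):
    """delay = min(base * 2**attempt + jitter, cap).
    Jitter spreads out retries so failed tasks don't all retry at once."""
    jitter = rng.uniform(0, base)
    return min(base * 2 ** attempt + jitter, cap)

for attempt in range(4):
    print(round(backoff_delay(attempt), 1))  # ~1s, ~2s, ~4s, ~8s (plus jitter)
```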
Step 6: Recurring Tasks (Cron)
Cron jobs require special handling. Storing infinite future instances is not feasible.
Cron Scheduling
When a cron task completes, immediately compute and schedule the next occurrence from the cron expression. Deriving each run from the expression, rather than adding a fixed interval to the last run time, prevents drift; comparing next_run against the current time also makes missed runs easy to detect and catch up.
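For the simple "0 3 * * *" case, computing the next occurrence looks like this (a deliberately simplified stand-in for a full cron-expression parser; function name and fixed-time assumption are illustrative):

```python
from datetime import datetime, time, timedelta

def next_daily_run(now, at=time(3, 0)):
    """Next occurrence of a fixed daily time (the "0 3 * * *" case).
    Derived from the schedule itself, not from last_run + 24h,
    so the schedule cannot drift."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:            # today's slot already passed
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2024, 3, 10, 4, 30)))  # 2024-03-11 03:00:00
```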
Cron Storage
The cron_jobs table stores recurring job definitions with the following columns:
- cron_id: UUID primary key
- cron_expression: The cron schedule string (like "0 3 * * *" for daily at 3 AM)
- task_type: Handler to execute
- payload: Task parameters as JSON
- timezone: User's timezone for schedule interpretation
- last_run: When the job last executed
- next_run: Computed next execution time
- enabled: Whether the job is active
A separate process looks for cron jobs where next_run has passed and creates task instances for execution.
Step 7: Task Dependencies (DAGs)
Complex workflows have dependencies: Task C runs after both A and B complete.
Dependency Tracking
The task_dependencies table tracks relationships between tasks. Each row contains a task_id and the task it depends_on, with a composite primary key ensuring no duplicate dependencies.
When a task completes:
- Find all tasks that depend on it
- For each dependent, check if all its dependencies are complete
- If yes, move the dependent task to PENDING
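The completion hook above can be sketched as follows (deps maps each task to its prerequisites; names are illustrative and state would live in the task_dependencies table in production):

```python
# Workflow: C depends on A and B; D depends on C.
deps = {"C": {"A", "B"}, "D": {"C"}}
done = set()

def on_complete(task):
    """Mark task done; return dependents whose prerequisites are all met
    (these would be moved to PENDING)."""
    done.add(task)
    return [t for t, prereqs in deps.items()
            if task in prereqs and prereqs <= done]

print(on_complete("A"))  # [] — C still waits on B
print(on_complete("B"))  # ['C'] — both of C's prerequisites are done
```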
DAG Execution
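Driving a whole DAG amounts to repeatedly dispatching every task whose dependencies are satisfied. A sketch using Python's standard-library graphlib (in a real system each ready batch would be dispatched to workers and progress persisted between steps):

```python
from graphlib import TopologicalSorter

# Workflow: A and B fan in to C, which feeds D.
dag = {"C": {"A", "B"}, "D": {"C"}}

ts = TopologicalSorter(dag)
ts.prepare()                      # also raises on dependency cycles
order = []
while ts.is_active():
    ready = ts.get_ready()        # tasks whose dependencies are satisfied
    for task in sorted(ready):    # dispatch the batch (here: run inline)
        order.append(task)
        ts.done(task)             # unblocks downstream tasks
print(order)  # ['A', 'B', 'C', 'D']
```

A useful property of this loop is that independent tasks (A and B) surface in the same ready batch, so they can run in parallel on different workers.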
Step 8: Failure Handling
Stuck Tasks
Tasks can get stuck when a worker crashes mid-execution or a network partition occurs.
A reaper process finds tasks stuck in RUNNING for too long and resets them to PENDING for retry.
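The reaper is a single periodic UPDATE, sketched here with SQLite and a hypothetical last_heartbeat column (timeout value and names are illustrative; note the reset re-delivers the task, which is why handlers must be idempotent):

```python
import sqlite3

STUCK_AFTER = 300  # seconds without a heartbeat before a task is stuck

def reap(conn, now):
    """Reset tasks stuck in RUNNING whose worker stopped heartbeating."""
    cur = conn.execute(
        """UPDATE tasks SET status = 'PENDING'
           WHERE status = 'RUNNING' AND last_heartbeat < ?""",
        (now - STUCK_AFTER,),
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT, status TEXT, last_heartbeat REAL)")
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)",
                 [("t1", "RUNNING", 1000.0),    # stale heartbeat
                  ("t2", "RUNNING", 1999.0)])   # recent heartbeat
n = reap(conn, now=2000.0)
print(n)  # 1 — only t1 is reset to PENDING
```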
Worker Health
Workers should heartbeat. If a worker misses heartbeats:
- Mark worker as dead
- Reset all its in-progress tasks
- Do not route new tasks to it
Production Examples
| System | Notable Design Choice |
|---|---|
| Airflow | DAG-based workflows, Python operators, web UI |
| Celery | Task queues with Redis/RabbitMQ, Python-native |
| Temporal | Durable execution, workflow as code, replays |
| AWS Step Functions | Serverless, state machines, visual workflows |
| Kubernetes CronJobs | Container-native, uses k8s scheduling |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Task discovery | Polling, time-wheel, push | Polling with partial index |
| Multi-scheduler | Leader election, locking | SKIP LOCKED for parallel polling |
| Delivery guarantee | At-most-once, at-least-once, exactly-once | At-least-once + idempotent tasks |
| Retry strategy | Fixed, exponential, custom | Exponential backoff with jitter |
| Cron handling | Store all instances, generate on-demand | Generate next instance on completion |