Design a Distributed Task Scheduler

Build a system that schedules and executes millions of tasks including one-time jobs, recurring cron jobs, and complex workflows with dependencies.

Related Concepts: Message Queues, Availability Patterns, Database Scaling

Step 1: Requirements and Scope

Functional Requirements

| Requirement | Description |
|---|---|
| One-time tasks | Schedule a task to run at a specific time |
| Recurring tasks | Cron-like scheduling (every day at 3am) |
| Task dependencies | Task B runs after Task A completes |
| Retries | Automatic retry with backoff on failure |
| Priority | High-priority tasks run before low-priority |
| Task cancellation | Cancel scheduled or running tasks |

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Reliability | Tasks execute exactly once | Duplicate execution causes data issues |
| Latency | < 1 second scheduling precision | Some tasks are time-sensitive |
| Scale | 100M+ scheduled tasks | Enterprise workloads are large |
| Availability | 99.99% | Missed tasks break downstream systems |

Scale Estimation

  • Scheduled tasks: 100 million active tasks
  • Task executions: 10 million tasks per hour (2,800 TPS)
  • Task metadata size: 1KB average
  • Storage: 100GB for task metadata
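
A quick back-of-envelope check of these figures:

```python
# Back-of-envelope check of the scale estimates above.
executions_per_hour = 10_000_000
tps = executions_per_hour / 3600                    # ~2,778, call it 2,800 TPS

active_tasks = 100_000_000
metadata_bytes = 1_000                              # ~1 KB per task
storage_gb = active_tasks * metadata_bytes / 1e9    # ~100 GB

print(f"{tps:,.0f} TPS, {storage_gb:,.0f} GB of task metadata")
```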

Step 2: High-Level Architecture


Key Components:

  • Task API: CRUD operations for tasks
  • Scheduler Service: Finds tasks due for execution
  • Task Queues: Buffer between scheduler and workers
  • Worker Pool: Executes tasks, reports results
  • Task Database: Persistent storage for all task metadata

Step 3: Task Storage

Schema Design

The tasks table stores the following columns:

  • task_id: UUID primary key
  • user_id: Owner of the task
  • task_type: Identifies the handler to execute
  • payload: JSON data with task-specific parameters
  • scheduled_time: When the task should run (with timezone)
  • status: Current state (PENDING, QUEUED, RUNNING, COMPLETED, FAILED)
  • priority: Numeric priority for ordering
  • max_retries: Maximum retry attempts allowed
  • retry_count: Current number of retries attempted
  • created_at and updated_at: Timestamps for auditing

A partial index on scheduled_time, status, and priority (filtered to PENDING status only) is critical for query performance. This index is much smaller than a full index since it only includes pending tasks.
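
A PostgreSQL-flavored sketch of one possible layout for this table and index (types, defaults, and the connection string are assumptions, not a prescribed schema):

```python
import psycopg2

DDL = """
CREATE TABLE tasks (
    task_id        UUID PRIMARY KEY,
    user_id        UUID NOT NULL,
    task_type      TEXT NOT NULL,                 -- identifies the handler
    payload        JSONB NOT NULL,                -- task-specific parameters
    scheduled_time TIMESTAMPTZ NOT NULL,
    status         TEXT NOT NULL DEFAULT 'PENDING',
    priority       INT  NOT NULL DEFAULT 0,
    max_retries    INT  NOT NULL DEFAULT 3,
    retry_count    INT  NOT NULL DEFAULT 0,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Partial index for the scheduler's poll query; the PENDING filter lives in
-- the WHERE clause, so the index stays small as tasks complete.
CREATE INDEX idx_tasks_due
    ON tasks (scheduled_time, priority)
    WHERE status = 'PENDING';
"""

with psycopg2.connect("dbname=scheduler") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```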

Time-Based Partitioning

With millions of tasks, a single table becomes slow. Partition by scheduled time:


Old partitions can be archived or dropped. Queries only touch relevant partitions.
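
With PostgreSQL declarative partitioning, for example, the parent table is declared PARTITION BY RANGE (scheduled_time) and a maintenance job creates future partitions ahead of time. A sketch of generating the monthly DDL (names and the monthly granularity are illustrative):

```python
from datetime import date

def monthly_partition_ddl(first_of_month: date) -> str:
    """DDL for one monthly partition of a tasks table declared
    PARTITION BY RANGE (scheduled_time)."""
    nxt = date(first_of_month.year + (first_of_month.month == 12),
               first_of_month.month % 12 + 1, 1)
    name = f"tasks_{first_of_month:%Y_%m}"
    return (f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF tasks "
            f"FOR VALUES FROM ('{first_of_month}') TO ('{nxt}');")

print(monthly_partition_ddl(date(2024, 6, 1)))
# CREATE TABLE IF NOT EXISTS tasks_2024_06 PARTITION OF tasks FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
```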

Step 4: Scheduler Design

Polling vs Push

| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Polling | Periodically query DB for due tasks | Simple, reliable | Slight delay, DB load |
| Time-wheel | In-memory timer wheel | Very precise | Memory overhead, recovery complexity |
| Push | External trigger when time arrives | Real-time | Requires external timer service |

Recommendation: Polling with short intervals (1 second). Simple and reliable. The database partial index makes queries fast.

Handling Multiple Schedulers

Running multiple scheduler instances for availability creates a coordination problem: two schedulers must not pick up the same task.


FOR UPDATE SKIP LOCKED allows multiple schedulers to poll simultaneously. Each gets exclusive rows without blocking others.

Scheduler Algorithm

Every second, the scheduler queries for tasks where the scheduled time has passed and status is PENDING. Results are ordered by priority (highest first) then scheduled time (earliest first), limited to 1000 tasks per batch. The query uses row-level locking with skip-locked behavior so multiple schedulers can poll simultaneously without conflicts.

For each retrieved task, the scheduler updates its status to QUEUED and enqueues it to the appropriate priority queue for worker processing.
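
A sketch of one polling tick, assuming a psycopg2-style connection and a placeholder priority-queue client with an enqueue method:

```python
CLAIM_DUE_TASKS = """
    SELECT task_id, task_type, payload, priority
      FROM tasks
     WHERE status = 'PENDING'
       AND scheduled_time <= now()
     ORDER BY priority DESC, scheduled_time ASC
     LIMIT 1000
       FOR UPDATE SKIP LOCKED
"""

def poll_once(conn, queues):
    """One scheduler tick: claim due tasks and hand them to worker queues."""
    with conn, conn.cursor() as cur:              # one transaction per tick
        cur.execute(CLAIM_DUE_TASKS)
        rows = cur.fetchall()
        for task_id, task_type, payload, priority in rows:
            cur.execute(
                "UPDATE tasks SET status = 'QUEUED', updated_at = now() "
                "WHERE task_id = %s", (task_id,))
            # Enqueueing inside the transaction keeps the sketch short; at-least-once
            # delivery plus idempotent tasks (Step 5) covers the failure cases.
            queues.enqueue(priority, (task_id, task_type, payload))
    return len(rows)
```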

Step 5: Task Execution

Worker Design

Workers pull tasks from queues and execute them.


Workers should (see the loop sketch after this list):

  1. Check high-priority queue first
  2. Acknowledge task only after completion
  3. Heartbeat while executing long tasks
  4. Report success or failure back
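
A minimal worker loop following these rules (the queue client, handler registry, and result reporter are placeholders):

```python
def worker_loop(queues, handlers, report):
    """Pull from the highest-priority non-empty queue and execute tasks."""
    while True:
        task = queues.pull_highest_priority(timeout=1)   # high-priority queue first
        if task is None:
            continue
        task_id, task_type, payload = task
        try:
            # Long-running handlers should also heartbeat from a background
            # thread so the reaper (Step 8) does not reclaim the task.
            handlers[task_type](payload)
            report.success(task_id)
        except Exception as exc:
            report.failure(task_id, error=str(exc))
        finally:
            queues.ack(task_id)                          # acknowledge only after completion
```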

Exactly-Once Execution

Ensuring tasks run exactly once (not zero times, not twice) is the primary challenge.


Practical approach: At-least-once delivery with idempotent tasks. Most tasks can be made idempotent (check if work is already done before doing it).
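
For example, an invoice-email task can record a completion marker keyed by task_id and check it before doing the work, so a redelivered task becomes a no-op (the marker store and send_email callable are placeholders):

```python
def run_send_invoice(task_id, payload, store, send_email):
    """Idempotent handler: a duplicate delivery finds the marker and does nothing."""
    marker = f"invoice-sent:{task_id}"
    if store.exists(marker):                 # work already performed?
        return
    send_email(payload["address"], payload["invoice_id"])
    store.set(marker, "1")                   # remember completion for future deliveries
    # A crash between the send and the marker still allows one duplicate;
    # true exactly-once needs the side effect and the marker in one transaction.
```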

Retry Strategy


Exponential backoff with jitter: The delay equals the base delay multiplied by 2 raised to the power of the attempt number, plus a random jitter value, capped at a maximum delay. For example, delays might progress as 1 second, 2 seconds, 4 seconds, 8 seconds, and so on, up to a maximum of 1 hour.
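
A minimal sketch of the delay calculation:

```python
import random

def retry_delay(attempt, base=1.0, cap=3600.0, jitter=1.0):
    """Exponential backoff with jitter: base * 2^attempt + random jitter, capped."""
    return min(base * (2 ** attempt) + random.uniform(0, jitter), cap)

# attempt 0, 1, 2, 3, ... gives roughly 1s, 2s, 4s, 8s, ... up to the 1-hour cap
```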

Step 6: Recurring Tasks (Cron)

Cron jobs require special handling. Storing infinite future instances is not feasible.

Cron Scheduling


When a cron task completes, immediately schedule the next occurrence. This keeps only one future instance per job in the database and handles both drift and catch-up for missed runs.

Cron Storage

The cron_jobs table stores recurring job definitions with the following columns:

  • cron_id: UUID primary key
  • cron_expression: The cron schedule string (like "0 3 * * *" for daily at 3 AM)
  • task_type: Handler to execute
  • payload: Task parameters as JSON
  • timezone: User's timezone for schedule interpretation
  • last_run: When the job last executed
  • next_run: Computed next execution time
  • enabled: Whether the job is active

A separate process looks for cron jobs where next_run has passed and creates task instances for execution.
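
A sketch of that dispatcher step using the croniter package to compute the next occurrence (croniter, the placeholder create_task callback, and the psycopg2-style connection are assumptions):

```python
from datetime import datetime
from zoneinfo import ZoneInfo
from croniter import croniter

def dispatch_due_cron_jobs(conn, create_task):
    """Create task instances for cron jobs whose next_run has passed."""
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT cron_id, cron_expression, task_type, payload, timezone
              FROM cron_jobs
             WHERE enabled AND next_run <= now()
               FOR UPDATE SKIP LOCKED
        """)
        for cron_id, expr, task_type, payload, tz in cur.fetchall():
            create_task(task_type, payload)            # one concrete instance
            now_local = datetime.now(ZoneInfo(tz))     # interpret the schedule in the job's timezone
            next_run = croniter(expr, now_local).get_next(datetime)
            cur.execute(
                "UPDATE cron_jobs SET last_run = now(), next_run = %s "
                "WHERE cron_id = %s", (next_run, cron_id))
```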

Step 7: Task Dependencies (DAGs)

Complex workflows have dependencies: Task C runs after both A and B complete.


Dependency Tracking

The task_dependencies table tracks relationships between tasks. Each row contains a task_id and the task it depends_on, with a composite primary key ensuring no duplicate dependencies.

When a task completes:

  1. Find all tasks that depend on it
  2. For each dependent, check if all its dependencies are complete
  3. If yes, move the dependent task to PENDING

DAG Execution

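Execution follows topological order: only tasks whose dependencies have all completed become eligible to run, and each completion can unlock further tasks. A minimal sketch of that unlock step, assuming a psycopg2-style connection and the tables above (dependent tasks are assumed to be held out of PENDING until they unlock):

```python
def on_task_completed(conn, completed_task_id):
    """When a task completes, move dependents whose dependencies are all done to PENDING."""
    with conn, conn.cursor() as cur:
        # 1. Find every task that depends on the completed task.
        cur.execute(
            "SELECT task_id FROM task_dependencies WHERE depends_on = %s",
            (completed_task_id,))
        for (dependent_id,) in cur.fetchall():
            # 2. Does the dependent still have any incomplete dependency?
            cur.execute("""
                SELECT count(*)
                  FROM task_dependencies d
                  JOIN tasks t ON t.task_id = d.depends_on
                 WHERE d.task_id = %s AND t.status <> 'COMPLETED'
            """, (dependent_id,))
            (remaining,) = cur.fetchone()
            # 3. If none remain, the dependent becomes eligible for scheduling.
            if remaining == 0:
                cur.execute(
                    "UPDATE tasks SET status = 'PENDING', updated_at = now() "
                    "WHERE task_id = %s", (dependent_id,))
```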

Step 8: Failure Handling

Stuck Tasks

Tasks can get stuck when a worker crashes mid-execution or a network partition occurs.


A reaper process finds tasks stuck in RUNNING for too long and resets them to PENDING for retry.
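
A sketch of the reaper's core query, assuming worker heartbeats refresh updated_at and that a 15-minute timeout is acceptable (both are illustrative choices):

```python
REAP_STUCK_TASKS = """
    UPDATE tasks
       SET status = 'PENDING',
           retry_count = retry_count + 1,
           updated_at = now()
     WHERE status = 'RUNNING'
       AND updated_at < now() - interval '15 minutes'
       AND retry_count < max_retries
"""

def reap(conn):
    """Reset tasks stuck in RUNNING (e.g. after a worker crash) so they retry."""
    with conn, conn.cursor() as cur:
        cur.execute(REAP_STUCK_TASKS)
        return cur.rowcount
```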

Worker Health

Workers should send periodic heartbeats. If a worker misses heartbeats (see the sketch after this list):

  1. Mark worker as dead
  2. Reset all its in-progress tasks
  3. Do not route new tasks to it
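
A sketch of that health check, assuming a workers table with alive and last_heartbeat columns and a worker_id column on tasks (none of which appear in the schema above):

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=30)

def check_worker_health(conn):
    """Mark silent workers dead and release their in-progress tasks."""
    cutoff = datetime.now(timezone.utc) - HEARTBEAT_TIMEOUT
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE workers SET alive = false "
            "WHERE alive AND last_heartbeat < %s RETURNING worker_id", (cutoff,))
        dead = [row[0] for row in cur.fetchall()]
        if dead:
            # Release the dead workers' tasks so the scheduler can reassign them;
            # the dispatcher should also stop routing new tasks to these workers.
            cur.execute(
                "UPDATE tasks SET status = 'PENDING', updated_at = now() "
                "WHERE status = 'RUNNING' AND worker_id = ANY(%s)", (dead,))
    return dead
```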

Production Examples

| System | Notable Design Choice |
|---|---|
| Airflow | DAG-based workflows, Python operators, web UI |
| Celery | Task queues with Redis/RabbitMQ, Python-native |
| Temporal | Durable execution, workflow as code, replays |
| AWS Step Functions | Serverless, state machines, visual workflows |
| Kubernetes CronJobs | Container-native, uses k8s scheduling |

Summary: Key Design Decisions

| Decision | Options | Recommendation |
|---|---|---|
| Task discovery | Polling, time-wheel, push | Polling with partial index |
| Multi-scheduler | Leader election, locking | SKIP LOCKED for parallel polling |
| Delivery guarantee | At-most-once, at-least-once, exactly-once | At-least-once + idempotent tasks |
| Retry strategy | Fixed, exponential, custom | Exponential backoff with jitter |
| Cron handling | Store all instances, generate on-demand | Generate next instance on completion |