Design a Distributed Task Scheduler
Build a system that schedules and executes millions of tasks, including one-time jobs, recurring cron jobs, and complex workflows with dependencies.
Related Concepts: Message Queues, Availability Patterns, Database Scaling
Step 1: Requirements and Scope
Functional Requirements
| Requirement | Description |
|---|---|
| One-time tasks | Schedule a task to run at a specific time |
| Recurring tasks | Cron-like scheduling (e.g., every day at 3 AM) |
| Task dependencies | Task B runs after Task A completes |
| Retries | Automatic retry with backoff on failure |
| Priority | High-priority tasks run before low-priority |
| Task cancellation | Cancel scheduled or running tasks |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Reliability | No lost tasks, no duplicate effects | Duplicate or missed execution corrupts downstream data |
| Latency | < 1 second scheduling precision | Some tasks are time-sensitive |
| Scale | 100M+ scheduled tasks | Enterprise workloads are large |
| Availability | 99.99% | Missed tasks break downstream systems |
Scale Estimation
- Scheduled tasks: 100 million active tasks
- Task executions: 10 million tasks per hour (~2,800 TPS on average; peaks will be higher)
- Task metadata size: 1KB average
- Storage: 100GB for task metadata
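A quick arithmetic check of these figures (assuming decimal units, 1 KB = 1,000 bytes and 1 GB = 10^9 bytes):

```python
# Back-of-the-envelope check for the scale estimates above.
executions_per_hour = 10_000_000
avg_tps = executions_per_hour / 3600          # average throughput
active_tasks = 100_000_000
metadata_bytes = 1_000                        # 1 KB average per task
storage_gb = active_tasks * metadata_bytes / 10**9

print(f"~{avg_tps:,.0f} TPS average")   # ~2,778 TPS
print(f"~{storage_gb:,.0f} GB metadata")  # ~100 GB
```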
Step 2: High-Level Architecture
Key Components:
- Task API: CRUD operations for tasks
- Scheduler Service: Finds tasks due for execution
- Task Queues: Buffer between scheduler and workers
- Worker Pool: Executes tasks, reports results
- Task Database: Persistent storage for all task metadata
Step 3: Task Storage
Schema Design
The tasks table stores the following columns:
- task_id: UUID primary key
- user_id: Owner of the task
- task_type: Identifies the handler to execute
- payload: JSON data with task-specific parameters
- scheduled_time: When the task should run (with timezone)
- status: Current state (PENDING, RUNNING, COMPLETED, FAILED)
- priority: Numeric priority for ordering
- max_retries: Maximum retry attempts allowed
- retry_count: Current number of retries attempted
- created_at and updated_at: Timestamps for auditing
A partial index on scheduled_time, status, and priority (filtered to PENDING status only) is critical for query performance. This index is much smaller than a full index since it only includes pending tasks.
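A minimal demonstration of the idea, using SQLite's partial-index support for a self-contained example (PostgreSQL syntax is nearly identical); the table and index names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (
    task_id        TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    priority       INTEGER NOT NULL,
    scheduled_time TEXT NOT NULL
);
-- Partial index: only PENDING rows are indexed (status lives in the
-- predicate), so the index stays small even when most rows are
-- COMPLETED or FAILED.
CREATE INDEX idx_due_tasks
    ON tasks (scheduled_time, priority)
    WHERE status = 'PENDING';
""")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT task_id FROM tasks
WHERE status = 'PENDING' AND scheduled_time <= '2024-01-01T00:00:00'
ORDER BY scheduled_time
""").fetchall()
# The plan should report a search using idx_due_tasks rather than a
# full table scan, e.g. "SEARCH tasks USING INDEX idx_due_tasks ..."
print(plan[0][-1])
```

Because the query's `status = 'PENDING'` condition matches the index predicate exactly, the planner can use the partial index for the polling query.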
Time-Based Partitioning
With 100M+ rows, a single table becomes slow to query and maintain. Partition the table by scheduled time, for example one partition per month. Old partitions can be archived or dropped wholesale, and the polling query only touches the partitions covering the current time window.
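A hypothetical helper that emits PostgreSQL-style monthly partition DDL (the parent table would be declared with `PARTITION BY RANGE (scheduled_time)`; table and column names are illustrative):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Generate DDL for one monthly range partition of the tasks table."""
    start = date(year, month, 1)
    # A range partition's upper bound is exclusive: the first day of
    # the following month (rolling over the year in December).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"tasks_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF tasks "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(monthly_partition_ddl(2024, 3))
```

An operations job can run this ahead of time to pre-create upcoming partitions and drop expired ones.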
Step 4: Scheduler Design
Polling vs Push
| Approach | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Polling | Periodically query DB for due tasks | Simple, reliable | Slight delay, DB load |
| Time-wheel | In-memory timer wheel | Very precise | Memory overhead, recovery complexity |
| Push | External trigger when time arrives | Real-time | Requires external timer service |
Recommendation: Polling at a short interval (every 1 second). Simple and reliable, and the partial index keeps the polling query fast.
Handling Multiple Schedulers
Running multiple scheduler instances for availability introduces a coordination problem: two schedulers must not pick up the same task.
SELECT ... FOR UPDATE SKIP LOCKED (supported by PostgreSQL and MySQL 8.0+) lets multiple schedulers poll simultaneously: each transaction locks the rows it selects and skips rows already locked by another scheduler, so no one blocks.
Scheduler Algorithm
Every second, the scheduler queries for tasks where the scheduled time has passed and status is PENDING. Results are ordered by priority (highest first) then scheduled time (earliest first), limited to 1000 tasks per batch. The query uses row-level locking with skip-locked behavior so multiple schedulers can poll simultaneously without conflicts.
For each retrieved task, the scheduler updates its status to QUEUED and enqueues it to the appropriate priority queue for worker processing.
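The claim step can be sketched as follows, using SQLite so the example is self-contained. SQLite serializes writers, so wrapping select-and-update in one transaction is enough here; on PostgreSQL the SELECT would add `FOR UPDATE SKIP LOCKED` so concurrent schedulers skip each other's rows. Table and column names are illustrative:

```python
import sqlite3

def claim_due_tasks(conn, now, batch=1000):
    """Atomically claim a batch of due tasks: select them ordered by
    priority and time, then flip their status to QUEUED."""
    with conn:  # one transaction: select + status update
        rows = conn.execute(
            """SELECT task_id FROM tasks
               WHERE status = 'PENDING' AND scheduled_time <= ?
               ORDER BY priority DESC, scheduled_time ASC
               LIMIT ?""",
            (now, batch),
        ).fetchall()
        ids = [r[0] for r in rows]
        conn.executemany(
            "UPDATE tasks SET status = 'QUEUED' WHERE task_id = ?",
            [(i,) for i in ids],
        )
    return ids

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, "
             "priority INTEGER, scheduled_time TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [("a", "PENDING", 1, "2024-01-01"),
     ("b", "PENDING", 9, "2024-01-02"),
     ("c", "PENDING", 5, "2099-01-01")],  # not yet due
)
claimed = claim_due_tasks(conn, "2024-06-01")
print(claimed)  # ['b', 'a'] — highest priority first; 'c' is not due
```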
Step 5: Task Execution
Worker Design
Workers pull tasks from queues and execute them.
Workers should:
- Check high-priority queue first
- Acknowledge task only after completion
- Heartbeat while executing long tasks
- Report success or failure back
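The priority-first pull can be sketched with in-process queues standing in for the real message broker (names are illustrative; heartbeating and acknowledgement are elided):

```python
import queue

high_q = queue.Queue()  # stand-in for the high-priority broker queue
low_q = queue.Queue()   # stand-in for the low-priority broker queue

def next_task():
    """Drain the high-priority queue before touching the low-priority one."""
    try:
        return high_q.get_nowait()
    except queue.Empty:
        pass
    try:
        return low_q.get_nowait()
    except queue.Empty:
        return None  # nothing to do; a real worker would block or sleep

low_q.put("low-1")
high_q.put("high-1")
print(next_task())  # high-1, even though low-1 was enqueued first
```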
Exactly-Once Execution
Ensuring tasks run exactly once (not zero times, not twice) is the primary challenge. A worker can crash after doing the work but before acknowledging it, so the system cannot tell "never ran" apart from "ran but failed to report".
Practical approach: at-least-once delivery combined with idempotent tasks. Most tasks can be made idempotent by checking whether the work is already done before doing it, for example by deduplicating on the task ID.
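A minimal sketch of an idempotent handler, using an in-memory set as the deduplication store (in production this would be a unique key in the database, committed atomically with the side effect; names are illustrative):

```python
# Record completed task_ids so a redelivered task becomes a no-op
# instead of a duplicate side effect.
completed = set()
charges = []  # stands in for the real side effect (e.g., billing)

def charge_customer(task_id, amount):
    if task_id in completed:   # redelivery under at-least-once: skip
        return
    charges.append(amount)     # perform the side effect
    completed.add(task_id)     # mark done

charge_customer("t1", 100)
charge_customer("t1", 100)     # duplicate delivery of the same task
print(charges)                 # [100] — the customer is charged once
```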
Retry Strategy
Exponential backoff with jitter: The delay equals the base delay multiplied by 2 raised to the power of the attempt number, plus a random jitter value, capped at a maximum delay. For example, delays might progress as 1 second, 2 seconds, 4 seconds, 8 seconds, and so on, up to a maximum of 1 hour.
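The formula described above can be written directly (a sketch; the base, cap, and jitter range are tunable assumptions):

```python
import random

def backoff_delay(attempt, base=1.0, cap=3600.0, rng=random):
    """delay = min(base * 2**attempt + jitter, cap).
    Jitter spreads out retries so failed tasks don't all retry at once."""
    jitter = rng.uniform(0, base)
    return min(base * 2 ** attempt + jitter, cap)

for attempt in range(4):
    print(round(backoff_delay(attempt), 1))  # ~1s, ~2s, ~4s, ~8s (plus jitter)
```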
Step 6: Recurring Tasks (Cron)
Cron jobs require special handling. Storing infinite future instances is not feasible.
Cron Scheduling
When a cron task completes, immediately compute and schedule the next occurrence from the cron expression. Deriving each run from the expression, rather than adding a fixed interval to the last run time, prevents drift; comparing next_run against the current time also makes missed runs easy to detect and catch up.
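For the simple "0 3 * * *" case, computing the next occurrence looks like this (a deliberately simplified stand-in for a full cron-expression parser; function name and fixed-time assumption are illustrative):

```python
from datetime import datetime, time, timedelta

def next_daily_run(now, at=time(3, 0)):
    """Next occurrence of a fixed daily time (the "0 3 * * *" case).
    Derived from the schedule itself, not from last_run + 24h,
    so the schedule cannot drift."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:            # today's slot already passed
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2024, 3, 10, 4, 30)))  # 2024-03-11 03:00:00
```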
Cron Storage
The cron_jobs table stores recurring job definitions with the following columns:
- cron_id: UUID primary key
- cron_expression: The cron schedule string (like "0 3 * * *" for daily at 3 AM)
- task_type: Handler to execute
- payload: Task parameters as JSON
- timezone: User's timezone for schedule interpretation
- last_run: When the job last executed
- next_run: Computed next execution time
- enabled: Whether the job is active
A separate process looks for cron jobs where next_run has passed and creates task instances for execution.
Step 7: Task Dependencies (DAGs)
Complex workflows have dependencies: Task C runs after both A and B complete.
Dependency Tracking
The task_dependencies table tracks relationships between tasks. Each row contains a task_id and the task it depends_on, with a composite primary key ensuring no duplicate dependencies.
When a task completes:
- Find all tasks that depend on it
- For each dependent, check if all its dependencies are complete
- If yes, move the dependent task to PENDING
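The completion hook above can be sketched as follows (deps maps each task to its prerequisites; names are illustrative and state would live in the task_dependencies table in production):

```python
# Workflow: C depends on A and B; D depends on C.
deps = {"C": {"A", "B"}, "D": {"C"}}
done = set()

def on_complete(task):
    """Mark task done; return dependents whose prerequisites are all met
    (these would be moved to PENDING)."""
    done.add(task)
    return [t for t, prereqs in deps.items()
            if task in prereqs and prereqs <= done]

print(on_complete("A"))  # [] — C still waits on B
print(on_complete("B"))  # ['C'] — both of C's prerequisites are done
```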
DAG Execution
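Driving a whole DAG amounts to repeatedly dispatching every task whose dependencies are satisfied. A sketch using Python's standard-library graphlib (in a real system each ready batch would be dispatched to workers and progress persisted between steps):

```python
from graphlib import TopologicalSorter

# Workflow: A and B fan in to C, which feeds D.
dag = {"C": {"A", "B"}, "D": {"C"}}

ts = TopologicalSorter(dag)
ts.prepare()                      # also raises on dependency cycles
order = []
while ts.is_active():
    ready = ts.get_ready()        # tasks whose dependencies are satisfied
    for task in sorted(ready):    # dispatch the batch (here: run inline)
        order.append(task)
        ts.done(task)             # unblocks downstream tasks
print(order)  # ['A', 'B', 'C', 'D']
```

A useful property of this loop is that independent tasks (A and B) surface in the same ready batch, so they can run in parallel on different workers.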
Step 8: Failure Handling
Stuck Tasks
Tasks can get stuck when a worker crashes mid-execution or a network partition occurs.
A reaper process finds tasks stuck in RUNNING for too long and resets them to PENDING for retry.
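The reaper is a single periodic UPDATE, sketched here with SQLite and a hypothetical last_heartbeat column (timeout value and names are illustrative; note the reset re-delivers the task, which is why handlers must be idempotent):

```python
import sqlite3

STUCK_AFTER = 300  # seconds without a heartbeat before a task is stuck

def reap(conn, now):
    """Reset tasks stuck in RUNNING whose worker stopped heartbeating."""
    cur = conn.execute(
        """UPDATE tasks SET status = 'PENDING'
           WHERE status = 'RUNNING' AND last_heartbeat < ?""",
        (now - STUCK_AFTER,),
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT, status TEXT, last_heartbeat REAL)")
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)",
                 [("t1", "RUNNING", 1000.0),    # stale heartbeat
                  ("t2", "RUNNING", 1999.0)])   # recent heartbeat
n = reap(conn, now=2000.0)
print(n)  # 1 — only t1 is reset to PENDING
```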
Worker Health
Workers should heartbeat. If a worker misses heartbeats:
- Mark worker as dead
- Reset all its in-progress tasks
- Do not route new tasks to it
Production Examples
| System | Notable Design Choice |
|---|---|
| Airflow | DAG-based workflows, Python operators, web UI |
| Celery | Task queues with Redis/RabbitMQ, Python-native |
| Temporal | Durable execution, workflow as code, replays |
| AWS Step Functions | Serverless, state machines, visual workflows |
| Kubernetes CronJobs | Container-native, uses k8s scheduling |
Summary: Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Task discovery | Polling, time-wheel, push | Polling with partial index |
| Multi-scheduler | Leader election, locking | SKIP LOCKED for parallel polling |
| Delivery guarantee | At-most-once, at-least-once, exactly-once | At-least-once + idempotent tasks |
| Retry strategy | Fixed, exponential, custom | Exponential backoff with jitter |
| Cron handling | Store all instances, generate on-demand | Generate next instance on completion |