Preemption
SLURM version 1.2 and earlier supported dedication of resources to jobs based on a simple "first come, first served" policy with backfill. Beginning in SLURM version 1.3, priority-based preemption is supported. Preemption is the act of suspending one or more "low-priority" jobs to let a "high-priority" job run uninterrupted until it completes. Preemption provides the ability to prioritize the workload on a cluster.
The SLURM version 1.3.1 sched/gang plugin supports preemption. When configured, the plugin monitors each of the partitions in SLURM. If a new job in a high-priority partition is allocated resources that have already been allocated to one or more existing jobs from lower-priority partitions, the plugin respects the partition priority and suspends the low-priority job(s). The low-priority job(s) remain suspended until the job from the high-priority partition completes, at which point they are resumed.
Configuration
There are several important configuration parameters relating to preemption:
- SelectType: The SLURM sched/gang plugin supports nodes allocated by the select/linear plugin and socket/core/CPU resources allocated by the select/cons_res plugin. See Future Work below for more information on "preemption with consumable resources".
- SelectTypeParameter: Since resources will be overallocated with jobs (a preempted job remains in memory), the resource selection plugin should be configured to track the amount of memory used by each job to ensure that memory page swapping does not occur. When select/linear is chosen, we recommend setting SelectTypeParameter=CR_Memory. When select/cons_res is chosen, we recommend including Memory as a resource (e.g. SelectTypeParameter=CR_Core_Memory).
- DefMemPerCPU: Since job requests may not explicitly specify a memory requirement, we also recommend configuring DefMemPerCPU (default memory per allocated CPU) or DefMemPerNode (default memory per allocated node). It may also be desirable to configure MaxMemPerCPU (maximum memory per allocated CPU) or MaxMemPerNode (maximum memory per allocated node) in slurm.conf. Users can use the --mem or --mem-per-cpu option at job submission time to specify their memory requirements.
- JobAcctGatherType and JobAcctGatherFrequency: If you wish to enforce memory limits, accounting must be enabled using the JobAcctGatherType and JobAcctGatherFrequency parameters. If accounting is enabled and a job exceeds its configured memory limits, it will be canceled in order to prevent it from adversely affecting other jobs sharing the same resources.
- SchedulerType: Configure the sched/gang plugin by setting SchedulerType=sched/gang in slurm.conf.
- Priority: Configure the partition's Priority setting relative to other partitions to control the preemptive behavior. If two jobs from two different partitions are allocated to the same resources, the job in the partition with the greater Priority value will preempt the job in the partition with the lesser Priority value. If the Priority values of the two partitions are equal then no preemption will occur, and the two jobs will run simultaneously on the same resources. The default Priority value is 1.
- Shared: Configure the partition's Shared setting to FORCE for all partitions that will preempt or that will be preempted. The FORCE setting is required to enable the select plugins to overallocate resources. Jobs submitted to a partition that does not share its resources will not preempt other jobs, nor will those jobs be preempted; instead, those jobs will wait until the resources are free for non-shared use by each job. The FORCE option supports an additional parameter that controls how many jobs can share a resource within the partition (FORCE[:max_share]). By default the max_share value is 4. To disable timeslicing within a partition but enable preemption by other partitions, set Shared=FORCE:1.
- SchedulerTimeSlice: The default timeslice interval is 30 seconds. To change this duration, set SchedulerTimeSlice to the desired interval (in seconds) in slurm.conf. For example, to set the timeslice interval to one minute, set SchedulerTimeSlice=60. Short values can increase the overhead of gang scheduling. This parameter is only relevant if timeslicing within a partition is configured. Preemption and timeslicing can occur at the same time.
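Taken together, the settings above might appear in slurm.conf roughly as follows. This is an illustrative sketch only: the node names, memory values, and accounting plugin choice are examples, not recommendations for any particular site.

```
# Illustrative slurm.conf excerpt for priority-based preemption
SelectType=select/linear
SelectTypeParameter=CR_Memory
DefMemPerCPU=1024
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SchedulerType=sched/gang
SchedulerTimeSlice=30
PartitionName=active Priority=1 Default=YES Shared=FORCE:1 Nodes=n[12-16]
PartitionName=hipri  Priority=2 Shared=FORCE:1 Nodes=n[12-16]
```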
To enable preemption after making the configuration changes described above, restart SLURM if it is already running. Any change to the plugin settings in SLURM requires a full restart of the daemons. If you just change the partition Priority or Shared setting, this can be updated with scontrol reconfig.
Preemption Design and Operation
When enabled, the sched/gang plugin keeps track of the resources allocated to all jobs. For each partition an "active bitmap" is maintained that tracks all concurrently running jobs in the SLURM cluster. Each partition also maintains a job list for that partition, and a list of "shadow" jobs. These "shadow" jobs are running jobs from higher priority partitions that "cast shadows" on the active bitmaps of the lower priority partitions.
Each time a new job is allocated resources in a partition and begins running, the sched/gang plugin adds a "shadow" of this job to all lower priority partitions. The active bitmaps of these lower priority partitions are then rebuilt, with the shadow jobs added first. Any existing jobs that were replaced by one or more "shadow" jobs are suspended (preempted). Conversely, when a high-priority running job completes, its "shadow" goes away and the active bitmaps of the lower priority partitions are rebuilt to see if any suspended jobs can be resumed.
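The rebuild step described above can be sketched as a small model. This is a simplified illustration of the shadow/bitmap idea, not SLURM's internal code; the class and function names here are hypothetical, and the "bitmap" is modeled as a plain set of node indices.

```python
# Hypothetical model of the sched/gang "shadow" logic (not SLURM internals).
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: int
    nodes: frozenset          # node indices allocated to this job
    state: str = "RUNNING"

@dataclass
class Partition:
    name: str
    priority: int
    jobs: list = field(default_factory=list)

def rebuild(partitions):
    """Rebuild each partition's active bitmap: place "shadow" jobs
    (running jobs from higher-priority partitions) first, then suspend
    any local job whose nodes collide with the bitmap."""
    ordered = sorted(partitions, key=lambda p: p.priority, reverse=True)
    for i, part in enumerate(ordered):
        # Shadows cast by all higher-priority partitions.
        shadow_nodes = set()
        for higher in ordered[:i]:
            for job in higher.jobs:
                if job.state == "RUNNING":
                    shadow_nodes |= job.nodes
        active = set(shadow_nodes)    # the partition's "active bitmap"
        for job in part.jobs:
            if job.nodes & active:
                job.state = "SUSPENDED"   # preempted by a shadow
            else:
                job.state = "RUNNING"     # still running, or resumed
                active |= job.nodes
```

In this model, removing a completed high-priority job and calling rebuild again resumes any low-priority jobs its shadow had displaced, mirroring the behavior described above.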
The gang scheduler plugin is primarily designed to be reactive to the resource allocation decisions made by the select plugins. This is why Shared=FORCE is required in each partition. The Shared=FORCE setting enables the select/linear and select/cons_res plugins to overallocate the resources between partitions. This keeps all of the node placement logic in the select plugins, and leaves the gang scheduler in charge of controlling which jobs should run on the overallocated resources.
The sched/gang plugin suspends jobs via the same internal functions that support scontrol suspend and scontrol resume. A good way to observe the act of preemption is by running watch squeue in a terminal window.
A Simple Example
The following example is configured with select/linear, sched/gang, and Shared=FORCE:1. This example takes place on a cluster of 5 nodes:
[user@n16 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     5   idle n[12-16]
hipri        up   infinite     5   idle n[12-16]
Here are the Partition settings:
[user@n16 ~]$ grep PartitionName /shared/slurm/slurm.conf
PartitionName=active Priority=1 Default=YES Shared=FORCE:1 Nodes=n[12-16]
PartitionName=hipri Priority=2 Shared=FORCE:1 Nodes=n[12-16]
The runit.pl script launches a simple load-generating app that runs for the given number of seconds. Submit 5 single-node runit.pl jobs to run on all nodes:
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 485
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 486
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 487
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 488
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 489
[user@n16 ~]$ squeue
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST
  485    active runit.pl  user  R  0:06     1 n12
  486    active runit.pl  user  R  0:06     1 n13
  487    active runit.pl  user  R  0:05     1 n14
  488    active runit.pl  user  R  0:05     1 n15
  489    active runit.pl  user  R  0:04     1 n16
Now submit a short-running 3-node job to the hipri partition:
[user@n16 ~]$ sbatch -N3 -p hipri ./runit.pl 30
sbatch: Submitted batch job 490
[user@n16 ~]$ squeue
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST
  488    active runit.pl  user  R  0:29     1 n15
  489    active runit.pl  user  R  0:28     1 n16
  485    active runit.pl  user  S  0:27     1 n12
  486    active runit.pl  user  S  0:27     1 n13
  487    active runit.pl  user  S  0:26     1 n14
  490     hipri runit.pl  user  R  0:03     3 n[12-14]
Job 490 in the hipri partition preempted jobs 485, 486, and 487 from the active partition. Jobs 488 and 489 in the active partition remained running.
This state persisted until job 490 completed, at which point the preempted jobs were resumed:
[user@n16 ~]$ squeue
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST
  485    active runit.pl  user  R  0:30     1 n12
  486    active runit.pl  user  R  0:30     1 n13
  487    active runit.pl  user  R  0:29     1 n14
  488    active runit.pl  user  R  0:59     1 n15
  489    active runit.pl  user  R  0:58     1 n16
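The runit.pl script itself is not listed in this document. Assuming it is simply a timed load generator, as its usage above suggests, a rough Python stand-in might look like this (the behavior is an assumption; the original Perl script may differ):

```python
#!/usr/bin/env python3
# Hypothetical stand-in for the runit.pl load generator: busy-loop
# for the number of seconds given as the first command-line argument.
import sys
import time

def generate_load(seconds: float) -> float:
    """Spin the CPU for roughly `seconds` seconds; return elapsed time."""
    start = time.monotonic()
    x = 0
    while time.monotonic() - start < seconds:
        x += 1  # meaningless work to keep the CPU busy
    return time.monotonic() - start

if __name__ == "__main__" and len(sys.argv) > 1:
    elapsed = generate_load(float(sys.argv[1]))
    print(f"ran for {elapsed:.1f} seconds")
```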
Future Work
Preemption with consumable resources: This implementation of preemption relies on intelligent job placement by the select plugins. As of SLURM 1.3.1, the consumable resource select/cons_res plugin still needs additional enhancements to the job placement algorithm before its preemption support can be considered "competent". The mechanics of preemption work, but the placement of preemptive jobs relative to any low-priority jobs may not be optimal. The work to improve the placement of preemptive jobs relative to existing jobs is currently in progress.
Requeue a preempted job: In some situations it may be desirable to requeue a low-priority job rather than suspend it. A suspended job remains in memory, while requeuing a job involves terminating it and resubmitting it again. This will be investigated at some point in the future. Requeuing a preempted job may make the most sense with Shared=NO partitions.
Last modified 7 July 2008