Quick Start User Guide
Overview
The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.
Architecture
As depicted in Figure 1, SLURM consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: salloc, sattach, sbatch, sbcast, scancel, sinfo, srun, smap, squeue, and scontrol. All of the commands can run anywhere in the cluster.

Figure 1. SLURM components
The entities managed by these SLURM daemons, shown in Figure 2, include nodes (the compute resource in SLURM), partitions (which group nodes into logical sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job). The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation.

Figure 2. SLURM entities
Commands
Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive.
salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
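For example, a minimal interactive session might look like the following (the processor count and the commands run inside the shell are illustrative):
$ salloc -n8 sh              # allocates 8 processors and spawns shell for job
> srun hostname              # launches a job step across the allocation
> exit                       # exits shell and releases the allocation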
sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
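For example, assuming a running job with ID 473 (the job and step IDs are illustrative), one might attach to its first job step using the jobid.stepid notation:
$ sattach 473.0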
sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
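For example, a batch script might copy its executable to node-local storage before launching it; the script name and the /tmp destination path here are illustrative:
$ cat my.bcast.sh
#!/bin/sh
sbcast a.out /tmp/a.out      # copy a.out to local disk on every allocated node
srun /tmp/a.out              # launch the local copies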
scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
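For example, to display detailed information about a particular job or about the overall SLURM configuration (the job ID is illustrative):
$ scontrol show job 473
$ scontrol show config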
sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
Examples
Execute /bin/hostname on four nodes (-N4). Include task numbers on the output (-l). The default partition will be used. One task per node will be used by default.
adev0: srun -N4 -l /bin/hostname
0: adev9
1: adev10
2: adev11
3: adev12
Execute /bin/hostname in four tasks (-n4). Include task numbers on the output (-l). The default partition will be used. One processor per task will be used by default (note that we don't specify a node count).
adev0: srun -n4 -l /bin/hostname
0: adev9
1: adev9
2: adev10
3: adev10
Submit the script my.script for later execution. Explicitly use the nodes adev9 and adev10 ("-w adev[9-10]", note the use of a node range expression). We also explicitly state that the subsequent job steps will spawn four tasks each, which will ensure that our allocation contains at least four processors (one processor per task to be launched). The output will appear in the file my.stdout ("-o my.stdout"). This script contains a time limit for the job embedded within itself. Other options can be supplied as desired by using a prefix of "#SBATCH" followed by the option at the beginning of the script (before any commands to be executed in the script). Options supplied on the command line override any options specified within the script. Note that my.script contains the command /bin/hostname, which is executed on the first node in the allocation (where the script runs), plus two job steps initiated using the srun command and executed sequentially.
adev0: cat my.script
#!/bin/sh
#SBATCH --time=1
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd

adev0: sbatch -n4 -w "adev[9-10]" -o my.stdout my.script
sbatch: Submitted batch job 469

adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette
Submit a job, get its status, and cancel it.
adev0: sbatch my.sleeper
srun: jobid 473 submitted

adev0: squeue
  JOBID PARTITION     NAME   USER  ST   TIME  NODES NODELIST(REASON)
    473     batch my.sleep  jette   R  00:00      1 adev9

adev0: scancel 473

adev0: squeue
  JOBID PARTITION     NAME   USER  ST   TIME  NODES NODELIST(REASON)
Get the SLURM partition and node status.
adev0: sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug        up   00:30:00     8   idle adev[0-7]
batch        up   12:00:00     1   down adev8
                  12:00:00     7   idle adev[9-15]
MPI
MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.
- SLURM directly launches the tasks and performs initialization of communications (Quadrics MPI, MPICH2, MPICH-GM, MPICH-MX, MVAPICH, MVAPICH2 and some MPICH1 modes).
- SLURM creates a resource allocation for the job and then mpirun launches tasks using SLURM's infrastructure (OpenMPI, LAM/MPI and HP-MPI).
- SLURM creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than SLURM, such as SSH or RSH (BlueGene MPI and some MPICH1 modes). These tasks are initiated outside of SLURM's monitoring or control. SLURM's epilog should be configured to purge these tasks when the job's allocation is relinquished; a minimal sketch follows this list.
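As a rough illustration only, the epilog sketch below kills any remaining processes owned by the job's user on a node. The script path, the UID threshold, and the availability of the SLURM_JOB_UID environment variable are assumptions; consult the prolog/epilog documentation for your SLURM version and adapt this to your site (for instance, it is too aggressive if a user may have more than one job on a node).
$ cat /etc/slurm/epilog      # illustrative path, referenced by Epilog= in slurm.conf
#!/bin/sh
# Assumes SLURM_JOB_UID is set in the epilog environment.
if [ "$SLURM_JOB_UID" -ge 1000 ]; then   # skip system accounts
    pkill -9 -U "$SLURM_JOB_UID"         # purge the job owner's leftover processes
fi
exit 0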
Instructions for using several varieties of MPI with SLURM are provided below.
Open MPI relies upon SLURM to allocate resources for the job and then mpirun to initiate the tasks. When using the salloc command, mpirun's -nolocal option is recommended. For example:
$ salloc -n4 sh              # allocates 4 processors and spawns shell for job
> mpirun -np 4 -nolocal a.out
> exit                       # exits shell spawned by initial salloc command
Note that any direct use of srun will only launch one task per node when the LAM/MPI plugin is used. To launch more than one task per node using the srun command, the --mpi=none option will be required to explicitly disable the LAM/MPI plugin.
Quadrics MPI relies upon SLURM to allocate resources for the job and srun to initiate the tasks. One would build the MPI program in the normal manner, then initiate it using a command line of this sort:
$ srun [options] <program> [program args]
LAM/MPI relies upon the SLURM salloc or sbatch command to allocate resources. In either case, specify the maximum number of tasks required for the job. Then execute the lamboot command to start lamd daemons. lamboot utilizes SLURM's srun command to launch these daemons. Do not directly execute the srun command to launch LAM/MPI tasks. For example:
$ salloc -n16 sh             # allocates 16 processors and spawns shell for job
> lamboot
> mpirun -np 16 foo args
1234 foo running on adev0 (o)
2345 foo running on adev1
etc.
> lamclean
> lamhalt
> exit                       # exits shell spawned by initial salloc command
Note that any direct use of srun will only launch one task per node when the LAM/MPI plugin is configured as the default plugin. To launch more than one task per node using the srun command, the --mpi=none option would be required to explicitly disable the LAM/MPI plugin if that is the system default.
HP-MPI uses the mpirun command with the -srun option to launch jobs. For example:
$MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out
MPICH2 jobs are launched using the srun command. Just link your program with SLURM's implementation of the PMI library so that tasks can communicate host and port information at startup. (The system administrator can add these options to the mpicc and mpif77 commands directly, so the user will not need to bother.) For example:
$ mpicc -L<path_to_slurm_lib> -lpmi ...
$ srun -n20 a.out

NOTES:
- Some MPICH2 functions are not currently supported by the PMI library integrated with SLURM
- Set the environment variable PMI_DEBUG to a numeric value of 1 or higher for the PMI library to print debugging information
MPICH-GM jobs can be launched directly by the srun command. SLURM's mpichgm MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the SLURM configuration parameter MpiDefault=mpichgm in slurm.conf or by using srun's --mpi=mpichgm option.
$ mpicc ...
$ srun -n16 --mpi=mpichgm a.out
MPICH-MX jobs can be launched directly by the srun command. SLURM's mpichmx MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the SLURM configuration parameter MpiDefault=mpichmx in slurm.conf or by using srun's --mpi=mpichmx option.
$ mpicc ...
$ srun -n16 --mpi=mpichmx a.out
MVAPICH jobs can be launched directly by the srun command. SLURM's mvapich MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the SLURM configuration parameter MpiDefault=mvapich in slurm.conf or by using srun's --mpi=mvapich option.
$ mpicc ...
$ srun -n16 --mpi=mvapich a.out

NOTE: If MVAPICH is used in the shared memory model, with all tasks running on a single node, then use the mpich1_shmem MPI plugin instead.
NOTE (for system administrators): Configure PropagateResourceLimitsExcept=MEMLOCK in slurm.conf and start the slurmd daemons with an unlimited locked memory limit. For more details, see MVAPICH documentation for "CQ or QP Creation failure".
MVAPICH2 jobs can be launched directly by the srun command. SLURM's none MPI plugin must be used to establish communications between the launched tasks. This can be accomplished either by using the SLURM configuration parameter MpiDefault=none in slurm.conf or by using srun's --mpi=none option. The program must also be linked with SLURM's implementation of the PMI library so that tasks can communicate host and port information at startup. (The system administrator can add these options to the mpicc and mpif77 commands directly, so the user will not need to bother.) Do not use SLURM's MVAPICH plugin for MVAPICH2.
$ mpicc -L<path_to_slurm_lib> -lpmi ...
$ srun -n16 --mpi=none a.out
BlueGene MPI relies upon SLURM to create the resource allocation and then uses the native mpirun command to launch tasks. Build a job script containing one or more invocations of the mpirun command. Then submit the script to SLURM using sbatch. For example:
$ sbatch -N512 my.script
Note that the node count specified with the -N option indicates the base partition count. See the BlueGene User and Administrator Guide for more information.
MPICH1 development ceased in 2005. It is recommended that you convert to MPICH2 or some other MPI implementation. If you still want to use MPICH1, note that it has several different programming models. If you are using the shared memory model (DEFAULT_DEVICE=ch_shmem in the mpirun script), then initiate the tasks using the srun command with the --mpi=mpich1_shmem option.
$ srun -n16 --mpi=mpich1_shmem a.out
If you are using MPICH P4 (DEFAULT_DEVICE=ch_p4 in the mpirun script) and SLURM version 1.2.11 or newer, then it is recommended that you apply the patch in the SLURM distribution's file contribs/mpich1.slurm.patch. Follow directions within the file to rebuild MPICH. Applications must be relinked with the new library. Initiate tasks using the srun command with the --mpi=mpich1_p4 option.
$ srun -n16 --mpi=mpich1_p4 a.out
Note that SLURM launches one task per node and the MPICH library linked into your application launches the other tasks, with shared memory used for communications between them. The only real anomaly is that all output from all spawned tasks on a node appears to SLURM as coming from the one task that it launched. If the srun --label option is used, the task ID labels will be misleading.
Other MPICH1 programming models currently rely upon the SLURM salloc or sbatch command to allocate resources. In either case, specify the maximum number of tasks required for the job. You may then need to build a list of hosts to be used and use that as an argument to the mpirun command. For example:
$ cat mpich.sh
#!/bin/bash
srun hostname -s | sort -u >slurm.hosts
mpirun [options] -machinefile slurm.hosts a.out
rm -f slurm.hosts

$ sbatch -n16 mpich.sh
sbatch: Submitted batch job 1234
Note that in this example, mpirun uses the rsh command to launch tasks. These tasks are not managed by SLURM since they are launched outside of its control.
Last modified 14 August 2007