Overview
The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
SLURM has been developed through the collaborative efforts of Lawrence Livermore National Laboratory (LLNL), Hewlett-Packard, Bull, Linux NetworX and many other contributors. HP distributes and supports SLURM as a component in their XC System Software.
Architecture
SLURM has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. The slurmd daemons provide fault-tolerant hierarchical communications. There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record accounting information for multiple Slurm-managed clusters in a single database. User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs, and sacct to get information about jobs and job steps that are running or have completed. The smap and sview commands graphically report system and job status, including network topology. There is also an administrative tool, scontrol, available to monitor and/or modify configuration and state information. APIs are available for all functions.
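As a brief illustration of these tools (a sketch only; the node count and job ID are made up), a typical command-line session might look like the following:

# Report the status of partitions and nodes
sinfo
# Launch a three-node parallel job that prints each task's hostname
srun -N3 -l /bin/hostname
# List pending and running jobs, then cancel job 1234
squeue
scancel 1234
# Examine the detailed job state maintained by slurmctld
scontrol show job 1234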

Figure 1. SLURM components
SLURM has a general-purpose plugin mechanism available to easily support various infrastructures. This permits a wide variety of SLURM configurations using a building block approach (see the configuration sketch after this list). These plugins presently include:
- Authentication of communications: authd, munge, or none (default).
- Checkpoint: AIX, OpenMPI, XLCH, or none.
- Cryptography: Munge or OpenSSL.
- Job Accounting Gather: AIX, Linux, or none (default).
- Accounting Storage: text file (default if jobacct_gather != none), MySQL, PGSQL, SlurmDBD (Slurm Database Daemon), or none.
- Job completion logging: text file, arbitrary script, MySQL, PGSQL, SlurmDBD, or none (default).
- MPI: LAM, MPICH1-P4, MPICH1-shmem, MPICH-GM, MPICH-MX, MVAPICH, OpenMPI and none (default, for most other versions of MPI including MPICH2 and MVAPICH2).
- Node selection: Bluegene (a 3-D torus interconnect for BGL or BGP), consumable resources (to allocate individual processors and memory), or linear (to dedicate entire nodes).
- Process tracking (for signaling): AIX (using a kernel extension), Linux process tree hierarchy, process group ID, RMS (Quadrics Linux kernel patch), and SGI's Process Aggregates (PAGG).
- Scheduler: FIFO (First In First Out, default), backfill, gang (time-slicing for parallel jobs), the Maui Scheduler, and Moab Cluster Suite.
- Switch or interconnect: Quadrics (Elan3 or Elan4), Federation (IBM High Performance Switch), or none (meaning no special handling is required, as with Ethernet or Myrinet; default).
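As a sketch of how these building blocks fit together, each plugin type is selected by a parameter in slurm.conf; the values below are illustrative choices only, not defaults or requirements:

# Illustrative plugin selections, one per plugin type;
# any supported value may be substituted.
AuthType=auth/munge
CryptoType=crypto/munge
CheckpointType=checkpoint/none
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
MpiDefault=none
SelectType=select/linear
ProctrackType=proctrack/pgid
SchedulerType=sched/backfill
SwitchType=switch/none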
The entities managed by these SLURM daemons, shown in Figure 2, include nodes, the compute resource in SLURM; partitions, which group nodes into logical sets; jobs, or allocations of resources assigned to a user for a specified amount of time; and job steps, which are sets of (possibly parallel) tasks within a job. The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation. SLURM provides resource management for the processors allocated to a job, so that multiple job steps can be simultaneously submitted and queued until there are available resources within the job's allocation.

Figure 2. SLURM entities
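To make the relationship between a job and its job steps concrete, the following batch script is a minimal sketch (the program names app_a and app_b are hypothetical, and it assumes submission with the sbatch command): it requests a four-node allocation and runs two job steps concurrently, each on half of the nodes.

#!/bin/sh
#SBATCH -N4              # request an allocation of four nodes
srun -N2 -n2 ./app_a &   # job step 1 runs on two of the allocated nodes
srun -N2 -n2 ./app_b &   # job step 2 runs on the other two nodes
wait                     # the job ends when both job steps complete

Both steps are launched within the job's existing allocation, so no additional nodes are requested from slurmctld.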
Configurability
The node state monitored includes: count of processors, size of real memory, size of temporary disk space, and state (UP, DOWN, etc.). Additional node information includes weight (preference in being allocated work) and features (arbitrary information such as processor speed or type). Nodes are grouped into partitions, which may contain overlapping nodes, so they are best thought of as job queues. Partition information includes: name, list of associated nodes, state (UP or DOWN), maximum job time limit, maximum node count per job, group access list, priority (important if nodes are in multiple partitions), and shared node access policy with optional over-subscription level for gang scheduling (e.g. YES, NO, or FORCE:2). Bit maps are used to represent nodes, and scheduling decisions can be made by performing a small number of comparisons and a series of fast bit map manipulations. A sample (partial) SLURM configuration file follows.
#
# Sample /etc/slurm.conf
#
ControlMachine=linux0001
BackupController=linux0002
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
SwitchType=switch/elan
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT Procs=4 TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] RealMemory=2048 Weight=2
NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096
PartitionName=batch Nodes=lx[0041-9999]
Last modified 11 March 2008