Large Cluster Administration Guide
This document contains SLURM administrator information specifically for clusters containing 1,024 nodes or more. Virtually all SLURM components have been validated (through emulation) for clusters containing up to 16,384 compute nodes. Getting good performance at that scale does require some tuning, and this document should help you get off to a good start. A working knowledge of SLURM is a prerequisite for this material.
Node Selection Plugin (SelectType)
While allocating individual processors within a node works well for smaller clusters, keeping track of the individual processors and memory within each node adds significant overhead. For best scalability, avoid the consumable resource plugin (select/cons_res).
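As an illustration, a minimal slurm.conf fragment might look like the following. It assumes that the whole-node plugin select/linear, which is not named above, is the alternative in use at your site.

```
# Assumption: select/linear (whole-node allocation) replaces select/cons_res.
# Nodes are then allocated as a unit, so per-processor state is not tracked.
SelectType=select/linear
```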
Job Accounting Plugin (JobAcctType)
Job accounting relies upon the slurmstepd daemon on each compute node periodically sampling data. This data collection takes compute cycles away from the application, inducing what is known as system noise. For large parallel applications, this system noise can detract from application scalability. For optimal application performance, disable job accounting (jobacct/none). Consider using job completion records (JobCompType) for accounting purposes, as this entails far less overhead. If job accounting is required, set the sampling interval to a relatively large value (e.g. JobAcctFrequency=300). Some experimentation may also be required to deal with collisions on data transmission.
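The two approaches described above could be sketched in slurm.conf roughly as follows. The jobcomp/filetxt plugin, the JobCompLoc path, and the jobacct/linux plugin are assumptions chosen for illustration; check which plugins your SLURM version provides before using them.

```
# Option 1: no periodic sampling; rely on job completion records instead.
JobAcctType=jobacct/none
JobCompType=jobcomp/filetxt                  # assumed completion-record plugin
JobCompLoc=/var/log/slurm/job_completions    # hypothetical file location

# Option 2 (use instead of Option 1 if job accounting is required):
# sample only every 300 seconds to limit system noise.
#JobAcctType=jobacct/linux                   # assumed accounting plugin
#JobAcctFrequency=300
```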
Node Configuration
While SLURM can track the amount of memory and disk space actually found on each compute node and use it for scheduling purposes, this entails extra overhead. Optimize performance by specifying the expected configuration using the available parameters (RealMemory, Procs, and TmpDisk). If a node is found to contain fewer resources than configured, it will be marked DOWN and not used. Also set the FastSchedule parameter. While SLURM can easily handle a heterogeneous cluster, configuring the nodes using the minimal number of lines in slurm.conf makes for both easier administration and better performance.
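For a homogeneous cluster, the node description can often be reduced to a single line, as in the sketch below. The node names, the resource sizes, and the FastSchedule value of 1 are hypothetical; substitute the values that match your hardware and SLURM version.

```
# Hypothetical example: 1,024 identical nodes described on one line.
# RealMemory and TmpDisk are given in megabytes.
NodeName=tux[0-1023] Procs=2 RealMemory=2048 TmpDisk=16384 State=UNKNOWN

# Assumed value: schedule using the configured sizes above rather than
# the sizes each node reports.
FastSchedule=1
```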
Timers
The configuration parameter SlurmdTimeout determines the interval at which slurmctld routinely communicates with slurmd. Communications occur at half the SlurmdTimeout value. The purpose of this is to determine when a compute node fails and thus should not be allocated work. Longer intervals decrease system noise on compute nodes (we do synchronize these requests across the cluster, but there will be some impact upon applications). For very large clusters, SlurmdTimeout values of 120 seconds or more are reasonable.
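As a small worked example, the fragment below uses the 120-second figure mentioned above. The resulting contact interval in the comment follows from the half-of-timeout rule described in this section.

```
# With a 120-second timeout, slurmctld contacts each slurmd roughly
# every 60 seconds (half the SlurmdTimeout value).
SlurmdTimeout=120
```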
Last modified 28 January 2006