SLURM: A Highly Scalable Resource Manager
SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.
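These three functions correspond directly to SLURM's core user commands. The following is a brief sketch (the node count and time limit are example values, and running these commands assumes a configured SLURM cluster):

```shell
# Allocate 4 nodes for up to 30 minutes and launch a parallel
# job step across the allocation (here, simply "hostname"):
srun -N 4 -t 30 hostname

# Monitor the queue of pending and running jobs:
squeue

# Show the state of the cluster's nodes and partitions:
sinfo
```

When resources are unavailable, srun's request simply waits in the queue; SLURM arbitrates among competing requests rather than failing them outright.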
SLURM is not a sophisticated batch system, but it does provide an Application Programming Interface (API) for integration with external schedulers such as the Maui Scheduler and Moab Cluster Suite. While other resource managers do exist, SLURM is unique in several respects:
- Its source code is freely available under the GNU General Public License.
- It is designed to operate in a heterogeneous cluster with up to 65,536 nodes.
- It is portable: written in C with a GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets. A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc.
- SLURM is highly tolerant of system failures, including failure of the node executing its control functions.
- It is simple enough for the motivated end user to understand its source and add functionality.
SLURM provides resource management on about 1000 computers worldwide, including many of the most powerful computers in the world:
- BlueGene/L at LLNL with 65,536 dual-processor compute nodes
- ASC Purple, an IBM SP/AIX cluster at LLNL with 12,208 Power5 processors and a Federation switch
- MareNostrum, a Linux cluster at the Barcelona Supercomputing Center with 10,240 PowerPC processors and a Myrinet switch
- Peloton, with 1,152 nodes, each having four sockets of dual-core Opteron processors, and an InfiniBand switch
- An IBM HPC Server at the University of Kentucky. This is a heterogeneous cluster with 128 Power5+ processors and 340 HS21 blades, each with two dual-core Intel Woodcrest processors, for a total of 1,488 cores connected by an InfiniBand switch
There are about 200 downloads of SLURM per month from LLNL's FTP server and SourceForge.net. As of March 2007, SLURM has been downloaded over 5,000 times by over 500 distinct sites in 41 countries. SLURM is also being actively developed, distributed, and supported by Hewlett-Packard and Bull.
Last modified 4 June 2007