Slurm is a combined batch scheduler and resource manager that allows users to run their jobs on Livermore Computing’s (LC) high performance computing (HPC) clusters. This document describes the process for submitting and running jobs under the Slurm Workload Manager.
An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory and GPUs. The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).
The processing units on nodes are the cores. With the advent of Simultaneous Multithreading (SMT) architectures, a single core can have multiple hardware threads (sometimes known as hyper-threads). A processing element is generically called a CPU. For systems without SMT, a CPU is a core. For systems with SMT available and enabled, a CPU is a hardware thread.
The Batch Scheduler and Resource Manager
The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time. When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources. This is also known as “running the job”.
The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job. Other conditions include the availability of a specific license or access to a specific file system.
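As a minimal sketch of the job-completion condition, the snippet below chains two submissions with the sbatch --dependency option. The script names step1.sh and step2.sh and the helper dependency_opt are hypothetical. The afterany condition releases the second job once the first ends in any state, while afterok would require a successful exit.

```shell
#!/bin/bash
# Build the --dependency option for a condition (afterany, afterok, ...) and a job ID.
dependency_opt() {
    printf '%s\n' "--dependency=$1:$2"
}

# Submit only where Slurm is installed; step1.sh and step2.sh are hypothetical scripts.
if command -v sbatch >/dev/null 2>&1 && [ -f step1.sh ] && [ -f step2.sh ]; then
    # --parsable prints the bare job ID (followed by ";cluster" on multi-cluster sites).
    first=$(sbatch --parsable step1.sh)
    first=${first%%;*}
    sbatch "$(dependency_opt afterany "$first")" step2.sh
fi
```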
Anatomy of a Batch Job
A batch job requests computing resources and specifies the application(s) to launch on those resources along with any input data/options and output directives. The user submits the job, usually in the form of a batch job script, to the batch scheduler.
The batch job script is composed of four main components:
- The interpreter used to execute the script.
- “#” directives that convey default submission options.
- The setting of environment and/or script variables (if necessary).
- The application(s) to execute along with their input arguments and options.
Here is an example of a batch script that requests 8 nodes under the “science” charge account and launches 32 tasks of myApp across the 8 allocated nodes:
    #!/bin/bash
    #SBATCH -N 8
    #SBATCH -A science
    srun -N 8 -n 32 myApp
When the job is scheduled to run, the resource manager will execute the batch job script on the first node of the allocation.
The sbatch command is used to submit a batch script to Slurm. It is designed to reject the job at submission time if there are requests or constraints that Slurm cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections.
An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. The salloc command is used to submit an interactive job to Slurm. When the job runs, a command line prompt will appear and the user can launch their application(s) across the computing resources which have been allocated to the job.
An xterm job is a job that launches an xterm window when the job runs. The sxterm command is used to submit an xterm job to Slurm. When the job runs, an xterm window appears on the desktop of the user who invoked sxterm. At that point, the user can launch their application(s) from the xterm window across the computing resources which have been allocated to the job.
Note: sxterm is a utility available on LLNL clusters that creates and submits a batch script that launches an xterm window. It is for the convenience of our users and is not part of the Slurm distribution.
For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit). sbatch and salloc provide the --export option to convey specific environment variables to the execution environment. sbatch and salloc provide the --propagate option to convey specific shell limits to the execution environment.
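The two options might be combined as in the sketch below. The variable OMP_NUM_THREADS=4 and the script name myjob.sh are illustrative assumptions, and the snippet only prints the command when sbatch is not installed.

```shell
#!/bin/bash
# OMP_NUM_THREADS=4 and myjob.sh are hypothetical, illustrative values.
# --export=ALL,NAME=value keeps the submitting environment and adds one variable.
# --propagate=STACK carries the submitting shell's stack-size limit (ulimit -s).
export_opt="--export=ALL,OMP_NUM_THREADS=4"
propagate_opt="--propagate=STACK"

if command -v sbatch >/dev/null 2>&1 && [ -f myjob.sh ]; then
    sbatch "$export_opt" "$propagate_opt" myjob.sh
else
    echo "sbatch $export_opt $propagate_opt myjob.sh"   # dry run without Slurm
fi
```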
Slurm recognizes and provides a number of environment variables.
The first category of environment variables are those that Slurm inserts into the job's execution environment. These convey to the job script and application information such as job ID (SLURM_JOB_ID) and task ID (SLURM_PROCID). For the complete list, see the "OUTPUT ENVIRONMENT VARIABLES" section under the sbatch, salloc, and srun man pages.
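A small sketch of reading these variables from a script follows. The function name describe_task is hypothetical, and the fallback value "none" is only there so the script also runs outside of a Slurm job.

```shell
#!/bin/bash
# Print the job and task IDs that Slurm places in the environment.
# Outside a Slurm job the variables are unset, so "none" is printed instead.
describe_task() {
    echo "job=${SLURM_JOB_ID:-none} task=${SLURM_PROCID:-none}"
}
describe_task
```

If this script were launched with srun and several tasks, each task would be expected to report its own task ID under the same job ID.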
The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see the "INPUT ENVIRONMENT VARIABLES" section under the sbatch, salloc, and srun man pages.
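For instance, defaults for sbatch could be set from a shell startup file along the lines of this sketch. The 30-minute limit and the "science" account are illustrative values only; options given on the command line still take precedence over these variables.

```shell
#!/bin/bash
# Defaults read by sbatch from the environment (values are illustrative).
export SBATCH_TIMELIMIT=30       # default wall clock limit, same as sbatch -t 30
export SBATCH_ACCOUNT=science    # default charge account, same as sbatch -A science
```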
Finally, Slurm allows the user to customize the behavior and output of some commands using environment variables. For example, the fields that the squeue command displays can be chosen by setting the SQUEUE_FORMAT variable in the environment from which squeue is invoked.
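A possible SQUEUE_FORMAT setting is sketched below. The selection of fields and their widths are arbitrary choices for illustration; the squeue man page lists every available field specifier.

```shell
#!/bin/bash
# Fields: %i job ID, %P partition, %j job name, %u user, %t compact state,
#         %M time used, %D node count, %R reason (pending) or node list (running).
# A number after "%." sets a right-justified field width.
export SQUEUE_FORMAT="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
```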
Slurm merges the job's error and output by default and saves it to an output file with a name that includes the job ID (slurm-jobid.out). You can specify your own output and error files to the sbatch command using the -o and -e options respectively. Slurm will append the job's output to the specified file(s). If you want the output to overwrite any existing files, add the --open-mode=truncate option.
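The sketch below shows how these options might appear as directives in a batch script. The file names myApp-%j.out and myApp-%j.err and the function report are hypothetical; %j in a file name is replaced by the job ID.

```shell
#!/bin/bash
# Separate stdout and stderr files; %j in a file name expands to the job ID.
#SBATCH -N 1
#SBATCH -o myApp-%j.out
#SBATCH -e myApp-%j.err
#SBATCH --open-mode=truncate

report() {
    echo "progress message"        # standard output: lands in the -o file
    echo "warning message" >&2     # standard error: lands in the -e file
}
report
```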
Serial vs. Parallel jobs
Parallel jobs launch applications that are composed of many processes (aka tasks) that communicate with each other, typically over a high speed switch. Serial jobs launch one or more tasks that work independently on separate problems.
Parallel applications must be launched by the srun command. Serial applications can also be launched with srun, but this is not required in one-node allocations.
LC dedicates at least one cluster per security zone to running serial jobs only. On these clusters, jobs can be allocated a minimum of one CPU and at most one node. Multiple jobs are allowed to run on one node.
Note: LC policy is to schedule no more than one job per core. So even when SMT is enabled and multiple tasks can run on a core, there will never be different jobs running on the same core.
LC provides many parallel clusters dedicated to running parallel jobs. On these clusters, the minimum allocation a job can request is one node, and only one job is permitted to run on a node at any given time.
Jobs and Job Steps
The job requests computing resources and when it runs, the scheduler selects and allocates those resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive and xterm jobs.
When an application is launched using srun, it is called a “job step”. The srun command causes the simultaneous launching of multiple tasks of a single application. Arguments to srun specify the number of tasks to launch as well as the number of nodes (and CPUs and memory) on which to launch the tasks.
Within a job, multiple srun commands can be invoked sequentially or in parallel (by backgrounding them). Furthermore, the number of nodes specified to srun (the -N option) can be less than, but not more than, the number of nodes allocated to the job; the same holds for CPUs and memory.
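A sketch of a batch script that mixes parallel and sequential job steps is shown below. The application names appA, appB, and appC and the wrapper function launch are hypothetical; the wrapper prints the command instead of running it when srun is not installed, so the script can be dry-run on a workstation. How concurrent steps share an allocation can depend on the Slurm version and site configuration.

```shell
#!/bin/bash
#SBATCH -N 8

# Run srun when available; otherwise print the command (dry run without Slurm).
launch() {
    if command -v srun >/dev/null 2>&1; then
        srun "$@"
    else
        echo "srun $*"
    fi
}

# Two job steps side by side, each on half of the 8 allocated nodes.
launch -N 4 -n 16 appA &
launch -N 4 -n 16 appB &
wait    # block until both background steps finish

# One sequential step across the whole allocation.
launch -N 8 -n 32 appC
```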
srun can also be invoked directly at the command line (outside of a job allocation). Doing so will submit a job to the batch scheduler and srun will block until that job is scheduled to run. When the srun job runs, a single job step will be created. The job will complete when that job step terminates.
A cluster is usually busy running jobs and will probably not be able to run a new job at the moment it is submitted, so the job is placed in a queue. Specific compute node resources are defined for every job queue. A Slurm node partition is synonymous with a queue.
Each queue can be configured with a set of limits which specify the requirements for every job that can run in that queue. These limits include job size, wall clock limits, and the users who are allowed to run in that queue.
An LC convention is to have the following two queues on every cluster:
- pbatch - the production queue for running production jobs.
- pdebug - the debug queue providing quick turnaround for shorter and smaller jobs.
The sinfo command lists all the queues currently configured. scontrol show partition provides details about each queue.
The squeue command lists all the jobs currently in the system, one line per job.
Quality of Service (QoS)
Users can request a quality of service (QoS) for each job they submit (sbatch|salloc|srun --qos=qos); otherwise the job receives a default QoS. The standard QoS’s defined for LC clusters are the following:
- normal (nominal priority and standard job size and wall clock time limits)
- expedite (higher job priority and an exemption from job size and wall clock time limits)
- exempt (normal job priority and an exemption from job size and wall clock time limits)
- standby (below normal job priority and an exemption from job size and wall clock time limits)
Only certain users are granted the permission to submit jobs with exempt and expedite QoS’s. Users are typically granted normal and standby privileges.
Users must request a charge (aka bank) account for each job they submit or have a valid charge account assigned by default. If the user is not assigned to any charge accounts, they cannot submit a job to the batch system. Computing resources allocated to a job are tracked and charged to the job’s specified charge account.
Jobs are ordered in the queue of pending jobs based on a number of factors. The scheduler always looks to schedule the job at the top of the queue. The scheduler is also configured to schedule jobs lower in the queue if doing so does not delay the start of any higher priority job. This is known as conservative backfill.
The active factors that contribute to a job’s priority can be seen by invoking the sprio command. These factors include:
- Fair-share: a number derived from the difference between the shares of the cluster that have been allotted to a user for a specific charge account and the usage accrued to the user and charge account, as well as any parent charge accounts. For a more detailed description of the algorithms used to calculate the fair-share component of the job priority, see Fair Tree.
- Job size: a number proportional to the quantity of computing resources the job has requested.
- Age: a number proportional to the period of time that has elapsed since the job was submitted to the queue. Note: time that a queued job spends in a held state does not contribute to the age factor.
For a more detailed description of the algorithms for calculating job priority, see Multi-factor Priority.
Most of a job’s specifications can be seen by invoking scontrol show job jobid. More details about the job, including the job script, can be seen by adding the -d flag. A user cannot see another user's job script.
Slurm captures and reports the exit code of the job script (sbatch jobs) and, when a signal terminated the job, the signal responsible.
A job’s record remains in Slurm’s memory for 5 minutes after it completes. scontrol show job will return “Invalid job id specified” for a job that completed more than 5 minutes ago. At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
Note: the sacct command requires going off-cluster to access the Slurm database. In an effort to keep the compute nodes as noise-free as possible, LC policy restricts the use of the sacct command to the login nodes only. sacct will not work if invoked from a compute node.
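As a sketch of retrieving a finished job's outcome, the helper below interprets the ExitCode field that sacct reports in the form "exitcode:signal". The function name explain_exit and the job ID 1234 are hypothetical.

```shell
#!/bin/bash
# Interpret sacct's ExitCode field, which has the form "exitcode:signal".
explain_exit() {
    code=${1%%:*}
    sig=${1##*:}
    if [ "$sig" -ne 0 ]; then
        echo "terminated by signal $sig"
    elif [ "$code" -ne 0 ]; then
        echo "exited with status $code"
    else
        echo "completed successfully"
    fi
}

# On a login node, for a hypothetical job 1234:
#   explain_exit "$(sacct -j 1234 -X -n -P -o ExitCode)"
# -X limits output to the job allocation, -n drops the header, -P removes padding.
explain_exit "0:9"
```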
Modifying a Batch Job
Many of the batch job specifications can be modified after a batch job is submitted and before it runs. Typical fields that can be modified include the job size (number of nodes), queue (partition), and wall clock limit. Job specifications cannot be modified by the user once the job enters the Running state.
Besides displaying a job's specifications, the scontrol command is used to modify them. For example:
- scontrol show job jobid displays all of a job's characteristics
- scontrol -d show job jobid displays all of a job's characteristics, including the batch script
- scontrol update JobId=jobid Account=science changes the job's account to the science account
- scontrol update JobId=jobid Partition=pbatch changes the job's queue to the pbatch queue
Holding and Releasing a Batch Job
If a user's job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the scontrol hold jobid command to place the job into a Held state. Jobs in the held state do not accrue any job priority based on queue wait time. Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the scontrol release jobid command.
Signaling and Cancelling a Batch Job
Pending jobs can be cancelled (withdrawn from the queue) using the scancel command (scancel jobid). The scancel command can also be used to terminate a running job. The default behavior is to issue the job a SIGTERM, wait 30 seconds, and, if processes from the job are still running, issue a SIGKILL.
The -s option of the scancel command (scancel -s signal jobid) allows the user to issue any signal to a running job.
The basic job states are these:
- Pending - the job is in the queue, waiting to be scheduled
- Held - the job was submitted, but was put in the held state (ineligible to run)
- Running - the job has been granted an allocation. If it’s a batch job, the batch script has been run
- Complete - the job has completed successfully
- Timeout - the job was terminated for running longer than its wall clock limit
- Preempted - the running job was terminated to reassign its resources to a higher QoS job
- Failed - the job terminated with a non-zero status
- Node Fail - the job terminated after a compute node reported a problem
For the complete list, see the "JOB STATE CODES" section under the squeue man page.
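The states above appear in squeue output as compact codes. The sketch below maps the codes for the states listed here back to their names; the function name state_name is hypothetical. A held job has no code of its own: it is shown as pending, with a reason such as JobHeldUser.

```shell
#!/bin/bash
# Map squeue's compact state codes (the %t field) to the state names used above.
state_name() {
    case $1 in
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        CD) echo "Complete" ;;
        TO) echo "Timeout" ;;
        PR) echo "Preempted" ;;
        F)  echo "Failed" ;;
        NF) echo "Node Fail" ;;
        *)  echo "other ($1)" ;;
    esac
}

# List the current user's jobs with readable state names (only where Slurm exists).
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$(id -un)" -o "%i %t" | while read -r jobid code; do
        echo "$jobid: $(state_name "$code")"
    done
fi
```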
A pending job can remain pending for a number of reasons:
- Dependency - the pending job is waiting for another job to complete
- Priority - the job is not high enough in the queue
- Resources - the job is high in the queue, but there are not enough resources to satisfy the job’s request
- Partition Down - the queue is currently closed to running any new jobs
For the complete list, see the "JOB REASON CODES" section under the squeue man page.
Displaying Computing Resources
As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs. The resources of each compute node can be seen by running the scontrol show node command. The characteristics of each queue can be seen by running the scontrol show partition command. Finally, a load summary report for each queue can be seen by running sinfo.
User Permissions and Limits
The charge accounts each user is permitted to use can be seen by running the sshare command. In addition, the limits associated with the use of those accounts can be seen by invoking sacctmgr show user user_name WithAssoc.
There are basically three layers of Slurm limits. The bottom and most fundamental set of limits are applied at the Slurm partition (queue) level.
On top of this are more targeted limits that can be applied at the association and Quality of Service (QoS) levels. Here it is possible to define limits that are more restrictive than the basal limits the partition imposes. By adding QoS flags, it is possible to allow jobs running under specific QoS’s to escape the limits imposed by the partition.
| Limit | Partition | Association | Quality of Service |
|---|---|---|---|
| Maximum number of nodes per job | MaxNodes | MaxNodes | MaxNodes |
| Minimum number of nodes per job | MinNodes | | |
| Maximum number of nodes across all jobs running by user | | | MaxNodesPerUser |
| Maximum number of nodes across all jobs running under association/QoS | | GrpNodes | GrpNodes |
| Maximum number of CPUs job can be allocated on any node | MaxCPUsPerNode | | |
| Maximum number of CPUs per job | | MaxCPUs | MaxCPUs |
| Maximum number of CPUs across all jobs running by user | | | MaxCPUsPerUser |
| Maximum number of CPUs across all jobs running under association/QoS | | GrpCPUs | GrpCPUs |
| Maximum memory job can be allocated on any CPU or node | MaxMemPerCPU/Node | | |
| Maximum length of time user's job can run | MaxTime | MaxWall | MaxWall |
| Maximum combined time for all jobs running under association/QoS | | GrpWall | GrpWall |
| Maximum CPU*minutes user's job can run | | MaxCPUMins | MaxCPUMins |
| Maximum combined CPU*minutes for all jobs running under association/QoS | | GrpCPUMins | GrpCPUMins |
| Maximum number of submitted jobs by user | | MaxSubmitJobs | MaxSubmitJobs |
| Maximum number of submitted jobs under association/QoS | | GrpSubmitJobs | GrpSubmitJobs |
| Maximum number of all jobs running by user | | MaxJobs | MaxJobs |
| Maximum number of all jobs running under association/QoS | | GrpJobs | GrpJobs |
Job Statistics and Accounting
The sreport command provides aggregated usage reports by user and account over a specified period.
Time Remaining in an Allocation
If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, applications have two means of discovering the time remaining in the allocation.
The first means is to use the sbatch --signal=sig_num[@sig_time] option to request a signal (like USR1 or USR2) at sig_time number of seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.
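The signal mechanism can be sketched in a batch script as follows. Here a shell loop stands in for the application, and the B: prefix (an addition to the form shown above) asks Slurm to deliver the signal to the batch shell rather than to the job steps. The 60-minute limit, the 120-second lead time, the three work steps, and the names request_stop and stop_requested are all illustrative assumptions.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 60
# Ask Slurm to send USR1 to the batch shell (B:) 120 seconds before the limit.
#SBATCH --signal=B:USR1@120

stop_requested=0
request_stop() { stop_requested=1; }
trap request_stop USR1            # register the handler for the requested signal

step=0
total_steps=3                     # hypothetical amount of work
while [ "$step" -lt "$total_steps" ] && [ "$stop_requested" -eq 0 ]; do
    step=$((step + 1))
    # one unit of work would run here
done

if [ "$stop_requested" -eq 1 ]; then
    echo "signal received after step $step: write checkpoint and exit"
else
    echo "all $step steps finished"
fi
```

The shell runs a trap handler only after the current foreground command returns, so a long-running command in such a script is usually started in the background and followed by wait.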
The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.
Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.
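A shell script cannot call slurm_get_rem_time() directly. As a stand-in, the sketch below reads the time-left field (%L) from squeue and converts it to seconds. The function name to_seconds and the 300-second threshold are hypothetical, and frequent squeue polling adds load to the scheduler, so a check like this is best made sparingly.

```shell
#!/bin/bash
# Convert a Slurm time string ([D-][HH:]MM:SS) to seconds; prints -1 for
# non-numeric values such as UNLIMITED.
# The body runs in a subshell, so its variables do not leak into the caller.
to_seconds() (
    t=$1
    case $t in
        *[!0-9:-]*|"") echo "-1"; return ;;
    esac
    d=0 h=0 m=0 s=0
    case $t in
        *-*) d=${t%%-*}; t=${t#*-} ;;
    esac
    case $t in
        *:*:*) h=${t%%:*}; t=${t#*:}; m=${t%%:*}; s=${t#*:} ;;
        *:*)   m=${t%%:*}; s=${t#*:} ;;
        *)     s=$t ;;
    esac
    # Strip one leading zero so values like "08" are not read as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    echo $(( d * 86400 + ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
)

# Inside a job: compare the remaining time with a hypothetical 300-second margin.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v squeue >/dev/null 2>&1; then
    left=$(to_seconds "$(squeue -h -j "$SLURM_JOB_ID" -o %L)")
    if [ "$left" -ge 0 ] && [ "$left" -lt 300 ]; then
        echo "time is short: checkpoint now"
    fi
fi
```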