Slurm srun versus IBM CSM jsrun
Job Step Launch under Slurm and Cluster System Management (CSM)
Both Slurm and IBM's CSM provide a way to launch tasks (that is, Linux processes) of a user's application in parallel across the resources allocated to the job. For Slurm, the command is srun; for CSM, the command is jsrun. This page presents their similarities and differences.
The Beta 1 version of jsrun is currently installed on the CORAL Early Access systems. The enhancements that come in the Beta 2 version are marked in gold.
A job submission to the batch system (Slurm's sbatch or LSF's bsub) results in a resource allocation when the job runs. Both srun and jsrun rely on the batch system to allocate the computing resources for the job. (srun can also perform both resource allocation and job launch itself, a difference discussed below.) Both commands accept the file name of the user's application to launch as well as the number of tasks to launch. In response to an srun or jsrun invocation, both systems select the resources (nodes, cores, memory, GPUs) on which to launch the tasks based on the arguments to the respective commands, and then spawn the requested number of tasks. A successful launch by either command is known as a job step.
The first difference is in terms. What Slurm refers to as "nodes", LSF refers to as "hosts". The discussion below will reflect that: srun launches tasks across nodes, jsrun launches tasks across hosts. In addition, Slurm defines the term CPU to generically refer to cores or hardware threads, depending on the node's configuration. Where Simultaneous Multithreading (SMT) is not available or disabled, "CPU" refers to a core. Where SMT is available and enabled, "CPU" refers to a hardware thread. There is no such distinction with jsrun; the processing element is the CPU.
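As a rough illustration of how Slurm's notion of a CPU depends on SMT, the sketch below counts the CPUs a single node would present. The function name and the example core and thread counts are hypothetical and chosen only for illustration; on a real system the value comes from the node's Slurm configuration.

```python
def slurm_cpu_count(cores: int, threads_per_core: int, smt_enabled: bool) -> int:
    """Count the "CPUs" Slurm would see on one node (illustrative model only)."""
    if smt_enabled and threads_per_core > 1:
        # SMT available and enabled: each hardware thread counts as a CPU.
        return cores * threads_per_core
    # SMT unavailable or disabled: each core counts as a CPU.
    return cores


# Hypothetical node with 20 cores and 4 hardware threads per core.
print(slurm_cpu_count(20, 4, smt_enabled=True))   # 80
print(slurm_cpu_count(20, 4, smt_enabled=False))  # 20
```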
The next difference is how the resources are selected. srun provides the (-N, --nodes) option to specify a number of nodes. The number of CPUs srun selects will by default be equal to the number of requested tasks. With those two essential specifications, srun will find the CPUs and nodes on which to launch the tasks. Other options to srun allow memory and GPUs to be independently specified.
jsrun's fundamental computing resource unit is known as the resource set. jsrun options allow the user to define the contents of a resource set (CPUs, memory and GPUs) and the system will find and select the requested number of resource sets from the hosts allocated to the job. The default mapping is one task per resource set.
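The resource set idea can be sketched in a few lines of Python. This is a simplified model, not jsrun's actual placement algorithm: the Host class, the select_resource_sets function, and the greedy first-fit strategy are all assumptions made for illustration. The parameter names mirror the jsrun options --nrs, --cpus_per_rs, --gpus_per_rs, --memory_per_rs, and --rs_per_host.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical description of one host in the job's allocation."""
    name: str
    cpus: int
    gpus: int
    memory_mb: int


def select_resource_sets(hosts: List[Host], nrs: int, cpus_per_rs: int = 1,
                         gpus_per_rs: int = 0, memory_per_rs: int = 0,
                         rs_per_host: Optional[int] = None) -> List[str]:
    """Carve nrs resource sets out of hosts; return one host name per set.

    Greedy first-fit is an assumption of this sketch, not documented behavior.
    """
    placement: List[str] = []
    for host in hosts:
        free_cpus, free_gpus, free_mem = host.cpus, host.gpus, host.memory_mb
        placed_here = 0
        while len(placement) < nrs:
            if rs_per_host is not None and placed_here >= rs_per_host:
                break  # per-host limit reached
            if (free_cpus < cpus_per_rs or free_gpus < gpus_per_rs
                    or free_mem < memory_per_rs):
                break  # host cannot hold another resource set
            free_cpus -= cpus_per_rs
            free_gpus -= gpus_per_rs
            free_mem -= memory_per_rs
            placement.append(host.name)
            placed_here += 1
    if len(placement) < nrs:
        raise ValueError("allocation cannot satisfy the requested resource sets")
    return placement


# Two hypothetical hosts, each with 20 CPUs and 4 GPUs.
hosts = [Host("h0", 20, 4, 256000), Host("h1", 20, 4, 256000)]
print(select_resource_sets(hosts, nrs=8, cpus_per_rs=5, gpus_per_rs=1))
# -> ['h0', 'h0', 'h0', 'h0', 'h1', 'h1', 'h1', 'h1']
```

With the default of one task per resource set, each entry in the returned list would correspond to one launched task.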
Both commands provide a means to bind a task (process) to the CPU(s), memory, and GPU(s) selected for it. srun offers the --cpu_bind, --mem_bind, and --accel-bind options to do this, using a number of optional patterns. jsrun always confines tasks to their selected resource sets and offers the --bind option to bind tasks to CPUs.
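To make the notion of CPU binding concrete, the sketch below restricts the calling process to a set of CPUs using the Linux scheduler affinity calls exposed by Python's os module. It only shows the effect of binding from the task's point of view and is not necessarily how either launcher implements it. The bind_to_cpus helper is a hypothetical name, and the affinity calls are available on Linux only.

```python
import os


def bind_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids.

    Returns the resulting affinity set, or None where the platform
    does not expose the affinity calls (for example, macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)   # pid 0 means "this process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    first = {min(allowed)}
    print(bind_to_cpus(first))          # now confined to a single CPU
    os.sched_setaffinity(0, allowed)    # restore the original mask
```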
Similarly, both commands provide the ability to specify the task to resource mapping pattern: block, cyclic or plane.
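The three mapping patterns can be illustrated with an idealized model in which every node can hold an equal share of the tasks. The distribute function and its arguments are hypothetical; real launchers also account for the CPUs actually available on each node, so treat this as a sketch of the patterns rather than of either command's behavior.

```python
def distribute(ntasks: int, nnodes: int, pattern: str = "block",
               plane_size: int = 1) -> list:
    """Return a list whose entry i is the node index assigned to task i."""
    if pattern == "block":
        # Consecutive tasks share a node; earlier nodes absorb any remainder.
        base, extra = divmod(ntasks, nnodes)
        mapping = []
        for node in range(nnodes):
            mapping.extend([node] * (base + (1 if node < extra else 0)))
        return mapping
    if pattern == "cyclic":
        # Tasks are dealt out round-robin, one per node at a time.
        return [task % nnodes for task in range(ntasks)]
    if pattern == "plane":
        # Blocks of plane_size tasks are dealt out round-robin.
        return [(task // plane_size) % nnodes for task in range(ntasks)]
    raise ValueError(f"unknown pattern: {pattern}")


print(distribute(8, 2, "block"))                 # [0, 0, 0, 0, 1, 1, 1, 1]
print(distribute(8, 2, "cyclic"))                # [0, 1, 0, 1, 0, 1, 0, 1]
print(distribute(8, 2, "plane", plane_size=2))   # [0, 0, 1, 1, 0, 0, 1, 1]
```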
As an aside, Livermore Computing offers the mpibind SPANK plugin on their Linux clusters in place of srun's native --cpu_bind option. This offers a default mapping and binding selection that considers the OpenMP thread count and makes optimal mapping/binding decisions by default for most cases. It is unknown at this time whether mpibind will need to be ported to the jsrun environment or if jsrun will by default mimic the behavior of mpibind.
Typically, srun is invoked from within a batch script. The script is submitted to the Slurm scheduler using the sbatch command. Alternatively, users can invoke salloc. The salloc command will block until the job runs, and then return a prompt. From that prompt, users can invoke srun one or multiple times within the same resource allocation. This form of job submission is known as an interactive job.
Slurm allows the srun command to be invoked directly from the command line, i.e., outside of an sbatch or salloc job allocation. When that happens, srun will both request and receive a resource allocation and subsequently launch one job step on the allocated resources and return. The jsrun command does not offer this ability.
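A script can tell which of the two situations it is in by checking for the SLURM_JOB_ID environment variable, which Slurm sets inside an sbatch or salloc allocation. The sketch below is only a model of the distinction described above; the srun_mode function and its return strings are hypothetical and do not reflect srun's internal logic.

```python
import os


def srun_mode(environ=None) -> str:
    """Describe what an srun invocation would do in the given environment."""
    env = os.environ if environ is None else environ
    if env.get("SLURM_JOB_ID"):
        # Inside an existing allocation: srun only launches a job step.
        return "launch_step"
    # Outside any allocation: srun requests resources, then launches one step.
    return "allocate_then_launch"


print(srun_mode({"SLURM_JOB_ID": "1234"}))  # launch_step
print(srun_mode({}))                        # allocate_then_launch
```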
The following table presents the equivalent options for both commands. This is a curated list for comparison purposes. For the complete list, see the srun and jsrun man pages.
|Description||srun option||jsrun option|
|Number of tasks||-n, --ntasks||-p, --np|
|Allow more than one task per core||-O, --overcommit|
|Number of CPUs per task (default is 1)||-c, --cpus-per-task||(defined with resource set)|
|Number of nodes||-N, --nodes||--rs_per_host 1 --nrs <x>|
|Required memory per job step||--mem||(defined with resource set)|
|Required memory per CPU||--mem-per-cpu||(defined with resource set. see --memory_per_rs)|
|Number of generic resources (e.g. GPUs)||--gres|
|Number of CPUs per resource set||N/A||-c, --cpus_per_rs|
|Amount of memory per resource set||N/A||-m, --memory_per_rs|
|Number of GPUs per resource set||N/A||-g, --gpus_per_rs|
|Number of resource sets||N/A||-n, --nrs|
|Number of resource sets per host||N/A||-r, --rs_per_host|
|Number of tasks per resource set||N/A||-a, --tasks_per_rs|
|Specify the task to resource mapping pattern (block, cyclic, plane or file-defined)||-m, --distribution||--launch_distribution|
|Bind tasks to allocated CPUs||--cpu_bind||-b, --bind|
|Bind tasks to allocated memory||--mem_bind||(confined to resource set)|
|GPU/MIC/NIC binding preference||--accel-bind||(confined to resource set)|
|Performance binding preference||--hint||-l, --latency_priority|
|Only select nodes with the specified features||-C, --constraint||N/A|
|Standard input redirection||-i, --input||-t, --stdio-input|
|Standard output redirection||-o, --output||-o, --stdio-stdout|
|Output distribution (one file per task or one file per job step)||filename format specifiers||-e, --stdio-mode|
|Standard error redirection||-e, --error||-k, --stdio-stderr|
|Kill entire job step if any task has non-zero exit||-K, --kill-on-bad-exit||-X, --exit_on_error|
|Wait specified time after first task exits before killing entire job step||-W, --wait|
|Do not kill job step when a node fails||-k, --no-kill||TBD|
|Run a prolog and/or epilog before/after every job step||--prolog, --epilog||-P, --pre_post_exec=<script info>|
|Do not run more than one task on its resources||--exclusive||--tasks_per_rs 1|
|Allow up to specified number of job steps to share resources||-s, --oversubscribe||-s, --shared|
|Do not launch job step if resources are unavailable within specified time limit||-I, --immediate||-i, --immediate|
|Associate a job name with this job step||-J, --job-name|
|Prepend a task ID to every line of output||-l, --label||--stdio-mode prepend|
|Run different applications under one job step||--multi-prog||-f, --appfile=<file>|
|Attach a pseudo terminal to task 0||--pty|
|Terminate job step with a single ^C (instead of two in succession)||-q, --quit-on-interrupt|
|Run job step on specific nodes within allocation||-w, --nodelist||-U, --use_resources=<filename>|
|Save specific resources of last job step to a file|| ||-S, --save_resources=<filename>|
|Do not run job step on specific nodes||-x, --exclude||-U, --use_resources=<filename>|
|Start job step at a specific node offset within allocation||-r, --relative||-U, --use_resources=<filename>|
|Impose a time limit on the job step's run time||-t, --time|
|Turn off libc's buffering to standard out||-u, --unbuffered|
|Launch step in a different directory from where invoked||-D, --chdir||-h, --chdir|
|Pass environment variables to the execution environment||--export||-E, --env|
|Evaluate and set environment variables just before exec|| ||-F, --env_eval=<var=val>|
|Enable debug info to TotalView|| ||-u, --debug|
|Use Spindle|| ||-L, --use_spindle|
|Display version information and exit||-V, --version||-V, --version|
|Display help info||-h, --help||-?, --help|
|Display usage summary||--usage||--usage|
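A few rows of the table can be turned into a small lookup that rewrites an srun option list as a jsrun one. This is a sketch under stated assumptions: the translate function and the SRUN_TO_JSRUN dictionary are hypothetical, only options whose values are plain numbers or paths are included, and values are passed through unchanged even though the two commands may interpret them differently. Check the man pages before relying on any translation.

```python
# Long-option pairs copied from the table (srun option -> jsrun option).
SRUN_TO_JSRUN = {
    "--ntasks": "--np",
    "--input": "--stdio-input",
    "--output": "--stdio-stdout",
    "--error": "--stdio-stderr",
    "--chdir": "--chdir",
}


def translate(srun_args):
    """Translate (option, value) pairs for srun into a jsrun argument list."""
    out = ["jsrun"]
    for opt, value in srun_args:
        if opt == "--nodes":
            # Per the table: one resource set per host, x resource sets.
            out += ["--rs_per_host", "1", "--nrs", str(value)]
        elif opt in SRUN_TO_JSRUN:
            out.append(SRUN_TO_JSRUN[opt])
            if value is not None:
                out.append(str(value))
        else:
            raise KeyError(f"no direct jsrun equivalent in this sketch: {opt}")
    return out


print(" ".join(translate([("--nodes", 2), ("--ntasks", 2),
                          ("--output", "out.txt")])))
# jsrun --rs_per_host 1 --nrs 2 --np 2 --stdio-stdout out.txt
```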
Environment variables are only propagated if they begin with JSM_, OMPI_, or OPAL_.
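The prefix rule can be expressed as a simple filter. The propagated function below is a hypothetical helper that only models the stated rule; it does not reproduce how the launcher actually builds the task environment.

```python
PROPAGATED_PREFIXES = ("JSM_", "OMPI_", "OPAL_")


def propagated(environ: dict) -> dict:
    """Keep only the variables whose names start with a propagated prefix."""
    return {name: value for name, value in environ.items()
            if name.startswith(PROPAGATED_PREFIXES)}


# Hypothetical environment for illustration.
example = {"OMPI_MCA_btl": "self", "JSM_FOO": "1", "PATH": "/usr/bin"}
print(propagated(example))  # {'OMPI_MCA_btl': 'self', 'JSM_FOO': '1'}
```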