Slurm srun Versus IBM CSM jsrun
Simple Job Step Launch on CORAL systems
Not all users will need the detailed job step / task management options provided by CSM's jsrun. LLNL has developed a jsrun wrapper, lrun, that simplifies job step launch and provides good task / thread placement and binding for most cases. lrun also accepts jsrun options when finer control is needed. Use lrun as follows:
% lrun -T<ntasks_per_node> | -p<ntasks> [-N<nnodes>] [--nolbind] [<jsrun_options>] <app> [<app_args>]
Note that no spaces are allowed within lrun options; for example, use '-p4', not '-p 4'. If -N is not given, all nodes in the allocation (from bsub) will be used. The --nolbind option disables the built-in lrun mpibind task / thread affinity.
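As a minimal sketch of the no-spaces rule, the shell helper below assembles an lrun command line from a tasks-per-node count and an optional node count, and prints it rather than running it. The helper name `lrun_cmd` is hypothetical (it is not part of lrun), and `./my_app` and `input.dat` are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of lrun): print an lrun command line
# with option values attached directly to the flags, e.g. -T4 not -T 4.
# Usage: lrun_cmd <ntasks_per_node> <nnodes-or-empty> <app> [<app_args>...]
lrun_cmd() {
    tasks_per_node=$1
    nnodes=$2
    shift 2
    cmd="lrun -T${tasks_per_node}"
    # -N is optional; without it lrun uses every node in the allocation.
    if [ -n "$nnodes" ]; then
        cmd="$cmd -N${nnodes}"
    fi
    echo "$cmd $*"
}

lrun_cmd 4 2 ./my_app input.dat   # prints: lrun -T4 -N2 ./my_app input.dat
lrun_cmd 4 "" ./my_app            # prints: lrun -T4 ./my_app
```

Printing the command first makes it easy to confirm that no option contains a space before the line is pasted into a batch script.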
Job Step Launch under Slurm and Cluster System Management (CSM)
Both Slurm and IBM's CSM provide a way to launch tasks (aka, Linux processes) of a user's application in parallel across resources allocated to the job. For Slurm, the command is srun; for CSM, the command is jsrun. This page presents their similarities and their differences.
A job submission to the batch system (Slurm's sbatch or LSF's bsub) results in a resource allocation when the job runs. Both srun and jsrun rely on the batch system to allocate the computing resources for the job. (srun can also perform both resource allocation and job launch itself; that difference is discussed below.) Both commands accept the file name of the user's application to launch as well as the number of tasks to launch. In response to an srun or jsrun invocation, both systems select the resources (nodes, cores, memory, GPUs) on which to launch the tasks, based on the arguments to the respective commands, and then spawn the requested number of tasks. A successful launch by either command is known as a job step.
The first difference is in terms. What Slurm refers to as "nodes", LSF refers to as "hosts". The discussion below will reflect that: srun launches tasks across nodes, jsrun launches tasks across hosts. In addition, Slurm defines the term CPU to generically refer to cores or hardware threads, depending on the node's configuration. Where Simultaneous Multithreading (SMT) is not available or disabled, "CPU" refers to a core. Where SMT is available and enabled, "CPU" refers to a hardware thread. There is no such distinction with jsrun; the processing element is the CPU.
The next difference is how the resources are selected. srun provides the (-N, --nodes) option to specify a number of nodes. The number of CPUs srun selects will by default be equal to the number of requested tasks. With those two essential specifications, srun will find the CPUs and nodes on which to launch the tasks. Other options to srun allow memory and GPUs to be independently specified.
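The default CPU count can be illustrated with a little shell arithmetic. This sketch assumes the count is simply the number of tasks multiplied by the CPUs per task (-c, --cpus-per-task, default 1). The helper `srun_cpus` is hypothetical, `./my_app` is a placeholder, and the srun lines are printed for illustration only, not executed.

```shell
#!/bin/sh
# Hypothetical helper: CPUs srun would select for a step, assuming
# CPUs = ntasks * cpus_per_task, with cpus_per_task defaulting to 1.
srun_cpus() {
    ntasks=$1
    cpus_per_task=${2:-1}
    echo $((ntasks * cpus_per_task))
}

# 8 tasks across 2 nodes, default of one CPU per task.
echo "srun -N2 -n8 ./my_app      -> $(srun_cpus 8) CPUs"
# Same step with 4 CPUs per task (for example, 4 threads per task).
echo "srun -N2 -n8 -c4 ./my_app  -> $(srun_cpus 8 4) CPUs"
```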
jsrun's fundamental computing resource unit is known as the resource set. jsrun options allow the user to define the contents of a resource set (CPUs, memory and GPUs) and the system will find and select the requested number of resource sets from the hosts allocated to the job. The default mapping is one task per resource set.
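A resource-set request can be sketched as follows. The helper `jsrun_cmd` is hypothetical and only prints the command. The flags are the long-form jsrun options from the comparison table on this page, and the host, CPU, and GPU counts are made-up values rather than a description of any particular machine.

```shell
#!/bin/sh
# Hypothetical helper: print a jsrun command that requests
# <hosts> * <rs_per_host> resource sets, each with the given CPUs and GPUs,
# and one task per resource set (the default mapping).
jsrun_cmd() {
    hosts=$1; rs_per_host=$2; cpus_per_rs=$3; gpus_per_rs=$4
    shift 4
    nrs=$((hosts * rs_per_host))
    echo "jsrun --nrs $nrs --rs_per_host $rs_per_host" \
         "--cpus_per_rs $cpus_per_rs --gpus_per_rs $gpus_per_rs" \
         "--tasks_per_rs 1 $*"
}

# 2 hosts, 4 resource sets per host, 10 CPUs and 1 GPU per resource set.
jsrun_cmd 2 4 10 1 ./my_app
```

With one task per resource set, the number of tasks launched equals the number of resource sets, which is 8 in this example.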
Both commands provide a means to bind a task (process) to the CPU(s), memory, and GPU(s) selected for it. srun offers the --cpu_bind, --mem_bind, and --accel-bind options to do this, each supporting a number of optional patterns. jsrun always confines tasks to their selected resource sets and offers the --bind option to bind tasks to CPUs.
Similarly, both commands provide the ability to specify the task to resource mapping pattern: block, cyclic or plane.
As an aside, Livermore Computing offers the mpibind SPANK plugin on their Linux clusters in place of srun's native --cpu_bind option. This offers a default mapping and binding selection that considers the OpenMP thread count and makes optimal mapping/binding decisions by default for most cases. The lrun command provides mpibind bindings on the IBM CORAL systems.
Typically, srun is invoked from within a batch script. The script is submitted to the Slurm scheduler using the sbatch command. Alternatively, users can invoke salloc. The salloc command will block until the job runs, and then return a prompt. From that prompt, users can invoke srun one or multiple times within the same resource allocation. This form of job submission is known as an interactive job.
Slurm allows the srun command to be invoked directly from the command line, i.e., outside of an sbatch or salloc job allocation. When that happens, srun will both request and receive a resource allocation and subsequently launch one job step on the allocated resources and return. The jsrun command does not offer this ability.
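The batch workflow can be sketched by generating a small script and printing the submit command. Everything specific here is a placeholder assumption: the file name `job.sh`, the node and task counts, the time limit, and the `./my_app` and `./postprocess` executables. Nothing is submitted by this example.

```shell
#!/bin/sh
# Write a minimal Slurm batch script (placeholder values throughout).
script="${TMPDIR:-/tmp}/job.sh"
cat > "$script" <<'EOF'
#!/bin/sh
#SBATCH -N 2
#SBATCH -t 00:10:00
# Each srun below is one job step inside the same allocation.
srun -N2 -n8 ./my_app
srun -N1 -n1 ./postprocess
EOF

# Submit from a login node with:
echo "sbatch $script"
```

The same two srun lines could instead be typed at the prompt returned by salloc, which gives the interactive form of the job described above.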
The following table presents the equivalent options for srun, jsrun, and lrun. This is a curated list for comparison purposes. For the complete list see the srun and jsrun man pages.
|Description||srun||jsrun||lrun|
|Number of tasks||-n, --ntasks||-p, --np||-n|
|Number of tasks per node||--ntasks-per-node|| ||-T|
|Allow more than one task per core||-O, --overcommit|| || |
|Number of CPUs per task (default is 1)||-c, --cpus-per-task||(defined with resource set)|| |
|Number of nodes||-N, --nodes||--rs_per_host 1 --nrs <x>||-N|
|Required memory per job step||--mem||(defined with resource set)|| |
|Required memory per CPU||--mem-per-cpu||(defined with resource set; see --memory_per_rs)|| |
|Number of generic resources (e.g. GPUs)||--gres|| || |
|Number of CPUs per resource set||N/A||-c, --cpus_per_rs|| |
|Amount of memory per resource set||N/A||-m, --memory_per_rs|| |
|Number of GPUs per resource set||N/A||-g, --gpus_per_rs|| |
|Number of resource sets||N/A||-n, --nrs|| |
|Number of resource sets per host||N/A||-r, --rs_per_host|| |
|Number of tasks per resource set|| ||-a, --tasks_per_rs|| |
|Specify the task to resource mapping pattern (block, cyclic, plane or file-defined)||-m, --distribution||--launch_distribution||(use jsrun option)|
|Bind tasks to allocated CPUs||--cpu_bind||-b, --bind||(use jsrun option)|
|Bind tasks to allocated memory||--mem_bind||(confined to resource set)|| |
|GPU/MIC/NIC binding preference||--accel-bind||(confined to resource set)|| |
|Performance binding preference||--hint||-l, --latency_priority||(use jsrun option)|
|Only select nodes with the specified features||-C, --constraint||N/A|| |
|Standard input redirection||-i, --input||-t, --stdio-input||(use jsrun option)|
|Standard output redirection||-o, --output||-o, --stdio-stdout||(use jsrun option)|
|Output distribution (one file per task or one file per job step)||filename format specifiers||-e, --stdio-mode||(use jsrun option)|
|Standard error redirection||-e, --error||-k, --stdio-stderr||(use jsrun option)|
|Kill entire job step if any task has non-zero exit||-K, --kill-on-bad-exit||-X, --exit_on_error||(use jsrun option)|
|Wait specified time after first task exits before killing entire job step||-W, --wait|| || |
|Do not kill job step when a node fails||-k, --no-kill||TBD|| |
|Run a prolog and/or epilog before/after every job step||--prolog, --epilog||-P, --pre_post_exec=<script info>||(use jsrun option)|
|Do not run more than one task on its resources||--exclusive||--tasks_per_rs 1|| |
|Allow up to specified number of job steps to share resources||-s, --oversubscribe||-s, --shared||(use jsrun option)|
|Do not launch job step if resources are unavailable within specified time limit||-I, --immediate||-i, --immediate||(use jsrun option)|
|Associate a job name with this job step||-J, --job-name|| || |
|Prepend a task ID to every line of output||-l, --label||--stdio-mode prepend||(use jsrun option)|
|Run different applications under one job step||--multi-prog||-f, --appfile=<file>|| |
|Attach a pseudo terminal to task 0||--pty|| || |
|Terminate job step with a single ^C (instead of two in succession)||-q, --quit-on-interrupt|| || |
|Run job step on specific nodes within allocation||-w, --nodelist||-U, --use_resources=<filename>||(use jsrun option)|
|Save specific resources of last job step to a file|| ||-S, --save_resources=<filename>||(use jsrun option)|
|Do not run job step on specific nodes||-x, --exclude||-U, --use_resources=<filename>||(use jsrun option)|
|Start job step at a specific node offset within allocation||-r, --relative||-U, --use_resources=<filename>||(use jsrun option)|
|Impose a time limit on the job step's run time||-t, --time|| || |
|Turn off libc's buffering to standard out||-u, --unbuffered|| || |
|Launch step in a different directory from where invoked||-D, --chdir||-h, --chdir||(use jsrun option)|
|Pass environment variables to the execution environment||--export||-E, --env||(use jsrun option)|
|Evaluate and set environment variables just before exec|| ||-F, --env_eval=<var=val>||(use jsrun option)|
|Enable debug info to TotalView|| ||-u, --debug||(use jsrun option)|
|Use Spindle|| ||-L, --use_spindle||(use jsrun option)|
|Display version information and exit||-V, --version||-V, --version|| |
|Display help info||-h, --help||-?, --help||-h, --help|
|Display usage summary||--usage||--usage||--usage|
Environment variables are only propagated if they begin with JSM_, OMPI_, or OPAL_.
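The prefix rule can be expressed as a simple pattern match. The helper `is_propagated` below is hypothetical: it only mirrors the rule as stated in the preceding note and does not query jsrun. The variable names in the loop are examples chosen for illustration.

```shell
#!/bin/sh
# Hypothetical check: succeed if a variable name starts with one of the
# prefixes that the preceding note says are propagated.
is_propagated() {
    case $1 in
        JSM_*|OMPI_*|OPAL_*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in OMPI_MCA_btl JSM_NAMESPACE_RANK OMP_NUM_THREADS; do
    if is_propagated "$name"; then
        echo "$name: matches a propagated prefix"
    else
        echo "$name: does not match a propagated prefix"
    fi
done
```

Note that a name such as OMP_NUM_THREADS does not match, because OMP_ is not the same prefix as OMPI_.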