Slurm srun versus IBM CSM jsrun

Job Step Launch under Slurm and Cluster System Management (CSM)

Both Slurm and IBM's CSM provide a way to launch the tasks (that is, the Linux processes) of a user's application in parallel across the resources allocated to the job.  For Slurm, the command is srun; for CSM, the command is jsrun.  This page presents their similarities and differences.

The Beta 1 version of jsrun is currently installed on the CORAL Early Access systems.  The enhancements that come in the Beta 2 version are marked in gold.

Similarities

A job submission to the batch system (Slurm's sbatch or LSF's bsub) results in a resource allocation when the job runs.  Both srun and jsrun rely on the batch system to allocate the computing resources for the job.  (srun can also perform the resource allocation itself, a difference discussed below.)  Both commands accept the file name of the user's application to launch as well as the number of tasks to launch.  In response to an srun or jsrun invocation, both systems select the resources (nodes, cores, memory, GPUs) on which to launch the tasks based on the arguments to the respective commands, and then spawn the requested number of tasks.  A successful launch by either command is known as a job step.
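As an illustrative sketch (the script and application names are hypothetical), a batch script submitted to either system would end with a parallel launch of the application as a job step:

    sbatch my_batch_script.sh       # script's last line:  srun -n 8 ./my_app
    bsub < my_batch_script.lsf      # script's last line:  jsrun --np 8 ./my_app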

Differences

The first difference is one of terminology.  What Slurm refers to as "nodes", LSF refers to as "hosts".  The discussion below reflects that distinction:  srun launches tasks across nodes, jsrun launches tasks across hosts.  In addition, Slurm uses the term CPU generically to refer to cores or hardware threads, depending on the node's configuration.  Where Simultaneous Multithreading (SMT) is not available or is disabled, "CPU" refers to a core; where SMT is available and enabled, "CPU" refers to a hardware thread.  There is no such distinction with jsrun; its processing element is the CPU.

The next difference is how the resources are selected.  srun provides the (-N, --nodes) option to specify a number of nodes.  The number of CPUs srun selects will by default equal the number of requested tasks.  With those two essential specifications, srun will find the CPUs and nodes on which to launch the tasks.  Other options to srun allow memory and GPUs to be specified independently.
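For example (the application name and counts are illustrative):

    # 8 tasks across 2 nodes; by default srun selects one CPU per task
    srun -N 2 -n 8 ./my_app

    # 8 tasks across 2 nodes, 4 CPUs per task (e.g., for 4 OpenMP threads per task)
    srun -N 2 -n 8 -c 4 ./my_app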

jsrun's fundamental computing resource unit is known as the resource set.  jsrun options allow the user to define the contents of a resource set (CPUs, memory and GPUs) and the system will find and select the requested number of resource sets from the hosts allocated to the job.  The default mapping is one task per resource set.
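As an illustrative sketch (the resource-set sizes and application name are hypothetical), the following jsrun invocation requests 8 resource sets, each containing 4 CPUs and 1 GPU, with the default of one task per resource set:

    jsrun --nrs 8 --cpus_per_rs 4 --gpus_per_rs 1 ./my_app

    # equivalent short-option form
    jsrun -n 8 -c 4 -g 1 ./my_app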

Both commands provide a means to bind a task (process) to the CPU(s), memory, and GPU(s) selected for it.  srun offers the --cpu_bind, --mem_bind, and --accel-bind options to do this, using a number of optional binding patterns.  jsrun always confines tasks to their selected resource sets and offers the --bind option to bind tasks to CPUs.
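A hedged example (the exact --bind values accepted by the Beta versions may differ; the packed:<#CPUs> form is assumed here):

    # srun: bind each task to the cores selected for it
    srun -n 8 --cpu_bind=cores ./my_app

    # jsrun: confine each task to its resource set and bind it to 4 CPUs within it
    jsrun -n 8 -c 4 --bind packed:4 ./my_app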

Similarly, both commands provide the ability to specify the task-to-resource mapping pattern:  block, cyclic, or plane.
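For instance, a cyclic distribution could be requested as follows (a sketch; the plane and file-defined forms take additional arguments not shown here):

    srun -N 2 -n 8 -m cyclic ./my_app
    jsrun --nrs 8 --launch_distribution cyclic ./my_app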

As an aside, Livermore Computing offers the mpibind SPANK plugin on its Linux clusters in place of srun's native --cpu_bind option.  mpibind provides a default mapping and binding selection that takes the OpenMP thread count into account and makes good mapping/binding decisions for most cases.  It is unknown at this time whether mpibind will need to be ported to the jsrun environment or whether jsrun will mimic the behavior of mpibind by default.

Typically, srun is invoked from within a batch script.  The script is submitted to the Slurm scheduler using the sbatch command.  Alternatively, users can invoke salloc, which blocks until the job runs and then returns a prompt.  From that prompt, users can invoke srun one or more times within the same resource allocation.  This form of job submission is known as an interactive job.
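A minimal batch script might look like the following (node count, time limit, and application names are illustrative):

    #!/bin/bash
    #SBATCH -N 2
    #SBATCH -t 30
    srun -n 8 ./my_app          # first job step
    srun -n 2 ./my_post_proc    # second job step in the same allocation

The interactive alternative:

    salloc -N 2                 # blocks until the allocation is granted, then returns a prompt
    srun -n 8 ./my_app          # job steps are then launched from that prompt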

Slurm allows the srun command to be invoked directly from the command line, i.e., outside of an sbatch or salloc job allocation.  When that happens, srun requests a resource allocation and, once it is granted, launches a single job step on the allocated resources before returning.  The jsrun command does not offer this ability.
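For example, invoked outside any allocation, the single command below (application name hypothetical) allocates two nodes, runs one job step, and releases the allocation when the step completes:

    srun -N 2 -n 8 ./my_app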

The following table presents the equivalent options for both commands.  This is a curated list for comparison purposes; for the complete list, see the srun and jsrun man pages.

Option | srun | jsrun
Number of tasks | -n, --ntasks | -p, --np
Allow more than one task per core | -O, --overcommit |
Number of CPUs per task (default is 1) | -c, --cpus-per-task | (defined with resource set)
Number of nodes | -N, --nodes | --rs_per_host 1 --nrs <x>
Required memory per job step | --mem | (defined with resource set)
Required memory per CPU | --mem-per-cpu | (defined with resource set; see --memory_per_rs)
Number of generic resources (e.g., GPUs) | --gres |
Number of CPUs per resource set | N/A | -c, --cpus_per_rs
Amount of memory per resource set | N/A | -m, --memory_per_rs
Number of GPUs per resource set | N/A | -g, --gpus_per_rs
Number of resource sets | N/A | -n, --nrs
Number of resource sets per host | N/A | -r, --rs_per_host
Number of tasks per resource set | | -a, --tasks_per_rs
Specify the task-to-resource mapping pattern (block, cyclic, plane, or file-defined) | -m, --distribution | --launch_distribution
Bind tasks to allocated CPUs | --cpu_bind | -b, --bind
Bind tasks to allocated memory | --mem_bind | (confined to resource set)
GPU/MIC/NIC binding preference | --accel-bind | (confined to resource set)
Performance binding preference | --hint | -l, --latency_priority
Only select nodes with the specified features | -C, --constraint | N/A
Standard input redirection | -i, --input | -t, --stdio-input
Standard output redirection | -o, --output | -o, --stdio-stdout
Output distribution (one file per task or one file per job step) | filename format specifiers | -e, --stdio-mode
Standard error redirection | -e, --error | -k, --stdio-stderr
Kill entire job step if any task has a non-zero exit | -K, --kill-on-bad-exit | -X, --exit_on_error
Wait specified time after first task exits before killing entire job step | -W, --wait |
Do not kill job step when a node fails | -k, --no-kill | TBD
Run a prolog and/or epilog before/after every job step | --prolog, --epilog | -P, --pre_post_exec=<script info>
Do not run more than one task on its resources | --exclusive | --tasks_per_rs 1
Allow up to specified number of job steps to share resources | -s, --oversubscribe | -s, --shared
Do not launch job step if resources are unavailable within specified time limit | -I, --immediate | -i, --immediate
Associate a job name with this job step | -J, --job-name |
Prepend a task ID to every line of output | -l, --label | --stdio-mode prepend
Run different applications under one job step | --multi-prog | -f, --appfile=<file>
Attach a pseudo terminal to task 0 | --pty |
Terminate job step with a single ^C (instead of two in succession) | -q, --quit-on-interrupt |
Run job step on specific nodes within allocation | -w, --nodelist | -U, --use_resources=<filename>
Save specific resources of last job step to a file | | -S, --save_resources=<filename>
Do not run job step on specific nodes | -x, --exclude | -U, --use_resources=<filename>
Start job step at a specific node offset within allocation | -r, --relative | -U, --use_resources=<filename>
Impose a time limit on the job step's run time | -t, --time |
Turn off libc's buffering to standard out | -u, --unbuffered |
Launch step in a different directory from where invoked | -D, --chdir | -h, --chdir
Pass environment variables to the execution environment | --export | -E, --env
Evaluate and set environment variables just before exec | | -F, --env_eval=<var=val>
Enable debug info for TotalView | | -u, --debug
Use Spindle | | -L, --use_spindle
Display version information and exit | -V, --version | -V, --version
Display help info | -h, --help | -?, --help
Display usage summary | --usage | --usage

Note that jsrun only propagates environment variables by default if they begin with JSM_, OMPI_, or OPAL_.
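As a hedged illustration (the variable name is hypothetical, and the -E form assumes the --env syntax listed in the table above), a variable without one of those prefixes can be passed explicitly:

    export MY_VAR=1
    srun -n 8 --export=ALL ./my_app        # srun exports the caller's environment
    jsrun --np 8 -E MY_VAR=1 ./my_app      # jsrun is told to pass the variable explicitly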