IBM Spectrum LSF is a batch scheduler that allows users to run their jobs on Livermore Computing’s (LC) Sierra (CORAL) high performance computing (HPC) clusters. IBM Cluster System Management (CSM) is the resource manager for the Sierra systems.
This document presents the basics of Spectrum LSF. For the complete guide to using LSF, see the online user manual.
Computing Resources
An HPC cluster is made up of a number of compute hosts, each with a complement of processors, memory, GPUs, and burst buffers (burst buffer technology). Spectrum LSF refers to compute nodes as hosts. The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).
With the advent of Simultaneous Multithreading (SMT) architectures, single cores can have multiple hardware threads (sometimes known as hyper-threads). In this document, these processing elements are generically called cores.
The Batch Scheduler and Resource Manager
The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time. When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job's allocated resources. This is also known as “running the job”.
The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job. Other conditions include the availability of a specific license or access to a specific file system.
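As an illustration of a completion-based condition, the following sketch submits a second job that waits for the first to finish. It assumes that bsub prints a confirmation line of the form "Job <12345> is submitted to queue <pbatch>." and relies on the bsub -w dependency option with the ended() condition. The helper jobid_from_bsub and the file names first.script and second.script are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of bsub's confirmation
# line, assumed to look like "Job <12345> is submitted to queue <pbatch>."
jobid_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p' | head -n 1
}

# Only attempt a submission where bsub is actually available.
if command -v bsub >/dev/null 2>&1; then
    first=$(bsub < first.script | jobid_from_bsub)
    if [ -n "$first" ]; then
        # ended(): start once the first job finishes, successfully or not.
        # done() would require success; exit() would require failure.
        bsub -w "ended(${first})" < second.script
    fi
fi
```

If a site's submit filter changes the confirmation text, the sed pattern would need adjusting.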
Anatomy of a Batch Job
A batch job requests computing resources and specifies the application(s) to launch on those resources along with any input data/options and output directives. The user submits the job, usually in the form of a batch job script, to the batch scheduler.
LSF provides the #BSUB directives which, when placed at the top of a job script, will convey any of the bsub command line options. When the same option is specified on the command line and as a #BSUB directive in the job script, the bsub command line option will take precedence.
The batch job script is composed of four main components:
- The interpreter used to execute the script
- #BSUB directives that convey default submission options
- The setting of environment and/or script variables (if necessary)
- The application(s) to execute along with their input arguments and options
Here is an example of a batch script that requests 8 hosts under the “science” charge account and launches 32 tasks of myApp across the 8 allocated hosts:
#!/bin/bash
#BSUB -nnodes 8
#BSUB -G science
jsrun --nrs 32 --rs_per_host 4 --np 32 myApp
When the job is scheduled to run, the resource manager will execute the batch job script on a shared launch node. Please make sure all computationally demanding commands are run under jsrun or lrun. For details on how to write job scripts, see this page.
Batch Jobs
The bsub command is used to submit a batch script to LSF. It is designed to reject the job at submission time if there are requests or constraints that LSF cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections.
You can invoke LSF's bsub submission utility by just giving it a command as an argument, i.e.:
bsub bsub_flags command
It is much more common to collect your bsub flags as lines starting with #BSUB and commands into a job script that bsub can parse. The job script can be read from the standard input as:
bsub < job.script
or given as an argument to the bsub command:
bsub job.script
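A minimal sketch of script-based submission follows. It writes out the earlier example script with one added, purely illustrative directive (#BSUB -q pbatch), then submits it while naming a different queue on the command line, so that the command line value takes precedence over the directive. The file name job.script and the choice of queues are assumptions made for the example.

```shell
#!/bin/bash
# Write the example job script; the quoted 'EOF' keeps the text literal.
cat > job.script <<'EOF'
#!/bin/bash
#BSUB -nnodes 8
#BSUB -G science
#BSUB -q pbatch
jsrun --nrs 32 --rs_per_host 4 --np 32 myApp
EOF

# Only attempt a submission where bsub is actually available.
if command -v bsub >/dev/null 2>&1; then
    # -q on the command line overrides the "#BSUB -q pbatch" directive.
    bsub -q pdebug < job.script
fi
```

Whether an 8-host job fits within a given queue's limits depends on the cluster's configuration.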
Interactive Jobs
An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. The bsub -Is [bash|csh] command is used to submit an interactive job to LSF. When the job runs, a command line prompt will appear, and the user can launch their application(s) across the computing resources which have been allocated to the job. The current LLNL default will put your shell on the first compute node in your allocation. You can then launch parallel job steps with jsrun or lrun. See the Jobs and Job Steps section for more information on jsrun and lrun.
You may also use the lalloc wrapper script to quickly get an interactive session on a compute node. This is especially useful for getting a dedicated node for parallel compiles or debugging. The basic syntax is lalloc <#_of_nodes>. lalloc -h will give you more detailed options.
Xterm Jobs
An xterm job is a job that launches an xterm window when the job runs. The bsub -XF xterm command is used to submit an xterm job to LSF. When the job runs, an xterm window appears on the desktop of the user who invoked bsub -XF xterm. At that point, the user can launch their application(s) from the xterm window across the computing resources which have been allocated to the job. As with the interactive jobs described above, the LLNL default will start your xterm on the first compute node of your allocation.
Execution Environment
For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit).
Use the bsub -env option to convey specific environment variables to the execution environment.
By default, LSF imposes a set of user shell soft limits in the job's execution environment. The bsub -ul option overrides this behavior and passes along the limits in effect in the shell where bsub is invoked. A number of bsub options also insert individual soft limits into the execution environment. A configuration variable in effect makes MB the default unit when specifying such limits. For example, bsub -M 1 will yield a ulimit of "max memory size (kbytes, -m) 1024" in the execution environment.
Here is the list of bsub options that can insert individual process limits into the execution environment. Bear in mind that any limit conveyed must be less than or equal to the system's hard limits.
- Core file size: bsub -C
- CPU time, given as [hour:]minute: bsub -c
- Data size: bsub -D
- Maximum file size: bsub -F
- Resident set size: bsub -M
- Stack size: bsub -S
- Virtual (swap) memory: bsub -v
Environment Variables
LSF recognizes and provides a number of environment variables.
The first category of environment variables are those that LSF inserts into the job's execution environment. These convey to the job script and application information such as job ID (LSB_JOBID) and task ID (LS_JOBPID). For the complete list, see Environment variables set for job execution.
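As a small sketch of how a job script might use one of these variables, the fragment below builds a per-job output directory from LSB_JOBID. The directory naming scheme and the nojob fallback (used when the script is run outside of LSF) are assumptions made for illustration.

```shell
#!/bin/bash
# LSB_JOBID is set by LSF inside a job; fall back to a placeholder so the
# fragment can also be tried outside of a job.
jobid="${LSB_JOBID:-nojob}"

# Hypothetical naming scheme: one results directory per job.
outdir="run_${jobid}"
mkdir -p "$outdir"
echo "job ${jobid} writing results to ${outdir}"
```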
The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see Environment variable reference.
Finally, LSF allows the user to customize the output of the bquery command by setting the LSB_BQUERY_FORMAT variable in the environment from which you invoke bquery. For more info on customizing the output of bquery, see the bquery -o info page.
Job Output
LSF merges the job's error and output by default and inserts job report information into the job's output. This information includes the submitting user and host, the execution host, the CPU time (user plus system time) used by the job, and the exit status. When the job completes, the standard LSF behavior is to email the job's output to the user.
LC installs a job submit filter to automatically generate an output file by default instead of sending email. You can always specify an output and error file to the bsub command using the -o and -e options respectively. LSF will append the job's output to the specified file(s). If you want the output to overwrite any existing files, use the -oo and -ee options instead.
If the final character of the -o and -e options is a slash (/), LSF will create a directory with that name and write the job's output and error files to that directory using the jobid.out and jobid.err naming formats.
Serial vs. Parallel jobs
Parallel jobs launch applications that consist of many processes (aka tasks) that communicate with each other, typically over a high speed switch. Serial jobs launch one or more tasks that work independently on separate problems.
Parallel and serial applications must be launched by the jsrun or lrun command. LLNL wrapper scripts cause commands that are not run under jsrun or lrun to run on the first compute node of the allocation.
As ATS clusters, the Sierra systems are ideal for running parallel jobs.
Jobs and Job Steps
The job requests computing resources and when it runs, the scheduler selects and allocates those resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive and xterm jobs.
When an application is launched using jsrun, it is called a “job step”. The jsrun command causes the simultaneous launching of multiple tasks of a single application. Arguments to jsrun specify the number of tasks to launch as well as the number of resource sets (cores, memory and GPUs) on which to launch the tasks.
Multiple jsrun commands can be invoked sequentially or in parallel (by running them in the background). Furthermore, the number of resource sets requested by jsrun (the --nrs option) can be less than, but not more than, the number of resource sets (cores, memory and GPUs) available in the job allocation.
You may also use the lrun wrapper script to launch job steps. lrun allows you to use a more familiar, srun-like syntax to specify tasks, tasks per node, etc.
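The sequential and concurrent use of jsrun described above can be sketched as follows, assuming an 8-host allocation that offers 32 resource sets, as in the earlier example script. The application names are hypothetical, and the stub at the top is only there so the fragment can be tried on a machine without jsrun, where it prints the commands instead of launching them.

```shell
#!/bin/bash
# Off-cluster fallback (an assumption for illustration): when jsrun is not
# installed, define a stub that prints the command instead of running it.
if ! command -v jsrun >/dev/null 2>&1; then
    jsrun() { echo "[dry run] jsrun $*"; }
fi

# Sequential job steps: the second starts only after the first returns.
jsrun --nrs 32 --rs_per_host 4 --np 32 ./preprocess
jsrun --nrs 32 --rs_per_host 4 --np 32 ./solve

# Concurrent job steps: each asks for 16 of the 32 resource sets, so the
# two together do not exceed what the allocation provides.
jsrun --nrs 16 --rs_per_host 2 --np 16 ./analysisA &
jsrun --nrs 16 --rs_per_host 2 --np 16 ./analysisB &

# Keep the job script alive until both background steps have finished.
wait
```

Without the final wait, the script could exit while the background steps are still running.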
Job Queues
A cluster is usually busy running jobs and will probably not be able to run a job as soon as it is submitted. Instead, the job is placed in a queue. Specific compute host resources are defined for every job queue.
Each queue can be configured with a set of limits which specify the requirements for every job that can run in that queue. These limits include job size, wall clock limits, and the users who are allowed to run in that queue.
An LC convention is to have the following two queues on every cluster:
- pbatch - the production queue for running production jobs.
- pdebug - the debug queue providing quick turnaround for shorter and smaller jobs.
The bqueues command lists all the queues currently configured. bqueues -l provides details about each queue.
The bquery -u all command lists all the jobs currently in the system, one line per job.
Quality of Service (QoS)
LSF does not support the QoS model used by Slurm. Instead, additional queues have been added to deliver QoS:
- pbatch (nominal priority and standard job size and wall clock time limits)
- expedite (higher job priority and an exemption from job size and wall clock time limits)
- exempt (normal job priority and an exemption from job size and wall clock time limits)
- standby (below normal job priority and an exemption from job size and wall clock time limits)
Only certain users are granted the permission to submit jobs to the exempt and expedite queues. Users are typically granted access to the pbatch and standby queues.
Charge Accounts (Banks)
Users must request a charge (aka bank) account for each job they submit or have a valid charge account assigned by default. If the user is not assigned to any charge accounts, they cannot submit a job to the batch system. Computing resources allocated to a job are tracked and charged to the job’s specified charge account.
The user group serves as the charge account in LSF. One specifies a user group using the bsub -G option. LSF provides the LSB_DEFAULT_USERGROUP environment variable to convey a user group by default at bsub submission time. If LSB_DEFAULT_USERGROUP is not set in the shell where bsub is invoked, users must specify a user group either using the bsub -G user_group option or adding a #BSUB -G user_group directive to the start of their job script.
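For example, a default user group could be set once in the shell, after which jobs can be submitted without -G. In this sketch the science group is reused from the earlier example and my_job.script is a hypothetical file name.

```shell
#!/bin/bash
# Often placed in a shell startup file so it applies to every session.
export LSB_DEFAULT_USERGROUP=science

# Only attempt a submission where bsub is actually available.
if command -v bsub >/dev/null 2>&1; then
    # No -G option here; the user group comes from the environment.
    bsub < my_job.script
fi
```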
As with other LC systems, shares are assigned to each user group (bank account) and each user. Users are nominally assigned one share of each user group they have membership in. The bugroup -l command provides a listing of all the defined user groups, their assigned shares and the users who belong to each user group.
Job Priority
Jobs will be ordered in the queue of pending jobs based on a number of factors. The scheduler will always be looking to schedule the job that is at the top of the queue. The scheduler is also configured to schedule jobs lower in the queue if doing so does not delay the start of any higher priority job. This is known as conservative backfill.
The active factors that contribute to a job’s priority can be seen by invoking the bquery -aps command. These factors include:
- FS (fairshare): a number derived from the difference between the shares of the cluster that have been allotted to a user for a specific charge account and the usage accrued to the user and charge account, as well as any parent charge accounts.
- JPRIORITY: job specific factors. This is dominated by the job's age, a number proportional to the period of time that has elapsed since the job was submitted to the queue. Note: time that a queued job spends in a held state does not contribute to the age factor.
- QPRIORITY: a baseline priority for the queue that the job was submitted to. The expedite queue has a much higher QPRIORITY than the normal pbatch queue, which in turn has a higher QPRIORITY than the standby queue.
For a more detailed description of the algorithms for calculating job priority, see LSF Job Priority.
Job Status
Most of a job’s characteristics can be seen by invoking bquery -l jobid. LSF captures and reports the exit code of the job script (bsub jobs) and, when a signal terminated the job, the signal responsible.
A job’s record remains in LSF’s memory for 5 minutes after it completes. bquery -l jobid will return “Job jobid is not found” for a job that completed more than 5 minutes ago. At that point, one must invoke the bhist command to retrieve the job’s record from the LSF database. Normal users can only see detailed information and history for their own jobs.
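The lookup order described above could be wrapped in a small helper, sketched below. The function name job_record is hypothetical, and the sketch assumes that the "is not found" text appears in the output of bquery -l when the record has aged out. It then falls back to bhist -l for the same job ID.

```shell
#!/bin/bash
# Hypothetical helper: show a job's record from bquery if LSF still holds
# it in memory, otherwise retrieve it from the history with bhist.
job_record() {
    local jobid="$1" out
    out=$(bquery -l "$jobid" 2>&1)
    case "$out" in
        *"is not found"*) bhist -l "$jobid" ;;     # record has aged out
        *)                printf '%s\n' "$out" ;;  # record still available
    esac
}
```

It would be called as job_record jobid.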
Modifying a Batch Job
Many of the batch job specifications can be modified after a batch job is submitted and before it runs. Typical fields that can be modified include the job size (number of nodes), queue, and wall clock limit. Job specifications cannot be modified by the user once the job enters the RUN state.
The bmod command is used to modify a job's specifications. For example:
- bquery -l jobid displays all of a job's characteristics
- bmod -G science jobid changes the job's account to the science account
- bmod -q pbatch jobid changes the job's queue to the pbatch queue
Holding and Releasing a Batch Job
If a user's job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the bstop jobid command to place the job into a PSUSP state. Jobs in the held state do not accrue any job priority based on queue wait time. Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the bresume jobid command.
Signaling and Cancelling a Batch Job
Pending jobs can be cancelled (withdrawn from the queue) using the bkill command (bkill jobid). The bkill command can also be used to terminate a running job. The default behavior is to issue the job a SIGTERM, wait 30 seconds, and if processes from the job continue to run, issue a SIGKILL.
The -s option of the bkill command (bkill -s signal jobid) allows the user to issue any signal to a running job.
Job States
The basic job states are these:
- PEND - the job is in the queue, waiting to be scheduled
- PSUSP / HELD - the job was submitted, but was put in the suspended state (ineligible to run)
- DEPEND - the job is waiting for a dependency condition to be met (generally for a job to complete)
- DEPENDF - the job's dependency condition cannot be met due to the exit state of the job it depends on
- RUN - the job has been granted an allocation. If it’s a batch job, the batch script has been started
- DONE - the job has completed successfully
For the complete list, see About job states.
Pending Reasons
LSF attempts to explain why a job will not start by listing the reasons that it cannot start on different sets of compute nodes. You can get a complete view of why a job is not starting on all compute hosts with bquery -p3 jobid.
Displaying Computing Resources
As stated above, computing resources are hosts, cores, memory, and GPUs. The resources of each compute host can be seen by running the bhosts and lshosts commands.
The characteristics of each queue can be seen by running the bqueues command. Finally, a load summary report for each host can be seen by running lsload.
User Permissions and Limits
The charge accounts each user is permitted to use can be seen by running the bugroup command. In addition, the limits associated with the use of each queue can be seen by running bqueues -l.
Job Statistics and Accounting
The bhist command displays historical information about jobs.
Time Remaining in an Allocation
If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, LSF can send a signal to the job shortly before its allocation is due to expire.
Use the bsub -wa signal -wt rem_time options to request a signal (like USR1 or USR2) rem_time before the allocation expires, where rem_time is given in LSF's [hour:]minute format. The application must register a signal handler for the requested signal in order to receive it. The handler should take the necessary steps to write a checkpoint file and terminate gracefully.
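The handler pattern can be sketched in bash as follows. This is an illustration under stated assumptions, not a recipe for any particular application: the #BSUB values are examples only, the loop stands in for an application's main loop, and checkpoint.txt is a hypothetical checkpoint file. A compiled application would register its handler through its own language's signal facilities. How the signal reaches tasks launched with jsrun is not covered here.

```shell
#!/bin/bash
# Illustrative directives: request USR1 ahead of the wall clock limit.
# LSF documents the -W and -wt values as [hour:]minute.
#BSUB -nnodes 1
#BSUB -W 60
#BSUB -wa USR1
#BSUB -wt 5

# The handler only records that the warning arrived; the main loop checks
# the flag at a safe point between units of work.
checkpoint_requested=0
trap 'checkpoint_requested=1' USR1

step=0
total_steps=5
while [ "$step" -lt "$total_steps" ] && [ "$checkpoint_requested" -eq 0 ]; do
    step=$((step + 1))
    :   # placeholder for one unit of real work
done

if [ "$checkpoint_requested" -eq 1 ]; then
    # Hypothetical checkpoint: save just enough state to resume later.
    echo "step=${step}" > checkpoint.txt
    echo "warning signal received; checkpoint written at step ${step}"
else
    echo "completed all ${total_steps} steps"
fi
```

Setting a flag in the handler and acting on it in the main loop keeps the handler short and avoids writing a checkpoint in the middle of a unit of work.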