The following is a brief overview of batch system concepts. It is provided as a foundation for understanding the system for running jobs on the Sierra compute clusters.

What is a job?

Everyone would love to log into a High Performance Computing (HPC) system (cluster), run their application immediately and view the results as soon as possible.

HPC clusters support software applications that involve running multiple processes (or tasks) simultaneously. Typically those tasks communicate with one another to generate the desired results. An HPC cluster provides the computing resources to run those parallel tasks simultaneously and deliver the output to the user.

Given their popularity and the investment they represent, HPC clusters do not sit idle waiting for the next user to run an application. Instead, they maintain a wait list of applications and run each one as computing resources become available.

The user's application is not simply the name of an executable. It has input data and parameters, environment variables, descriptions of computing resources needed to run the application, and output directives. All of these specifications collectively are called a job. The user submits the job to an HPC cluster, usually in the form of a job script.

The job script contains a formal specification that requests computing resources, identifies an application to run along with its input data and environment variables, and describes how best to deliver the output data.
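
As a rough sketch of what such a specification can look like, the job script below uses the LSF directive style found on the Sierra clusters (described at the end of this overview). The queue name, resource counts, paths, and the executable my_app are placeholders rather than site-mandated values; a Slurm job script would express the same requests with #SBATCH directives.

    #!/bin/bash
    ### Resource request: 4 nodes for 30 minutes in the pbatch queue (placeholder values)
    #BSUB -nnodes 4
    #BSUB -W 30
    #BSUB -q pbatch
    #BSUB -J my_job
    ### Output directive: write stdout/stderr to a file named with the job ID
    #BSUB -o my_job.%J.out

    ### Environment variables and input parameters for the application
    export OMP_NUM_THREADS=4
    cd /path/to/run/directory

    ### Launch the parallel application across the allocated resources
    ### (16 resource sets, each with 1 task, 4 CPU cores, and 1 GPU)
    jsrun -n 16 -a 1 -c 4 -g 1 ./my_app input.dat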

Batch System

The batch system is responsible for receiving the job script. If the job cannot run immediately, the job script is added to a queue. The job waits in the queue until its requested resources are available, and the batch system then runs it. This process is called scheduling, and the component within the batch system that identifies jobs to run, selects the resources for them, and decides when to run them is called the scheduler (also known as the workload manager).
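
As a hedged example, submitting a job script to LSF and checking its place in the queue might look like the following (the script name is a placeholder; Slurm's sbatch and squeue commands play the same roles):

    # Submit the job script to the batch system; LSF reads the script from stdin
    bsub < my_job.lsf

    # Display this user's jobs; a job still waiting in the queue shows the PEND state
    bjobs -u $USER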

The Scheduler

If the scheduler runs each job in the queue in the order in which it was received, it is called a first come, first served (FCFS) scheduler. Each newly submitted job is added to the bottom of the queue. The rule for an FCFS scheduler is that no other job in the queue will run before the job at the top of the queue. That top job waits in the queue until enough running jobs finish to free up the resources it needs.

If the scheduler has the intelligence to launch jobs lower in the queue on resources that are currently idle, it is called a back-fill scheduler. While this might appear to be an efficient way to utilize idle resources, if yours were the top priority job in the queue, you would not be happy if launching lower priority jobs delayed the start of your job.

Hence, a back-fill scheduler follows a strict rule: a lower priority job is launched on idle resources only if doing so will not delay the start of the top priority job. If the rule is extended so that a lower priority job is launched only when it will not delay the start of any higher priority job, the scheduler is called a conservative back-fill scheduler.
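
The scheduler's back-fill decisions can usually be inspected from the command line, though the exact commands and output depend on the batch system and site configuration. As a sketch:

    # Slurm: show the start times the back-fill scheduler currently estimates for pending jobs
    squeue --start -u $USER

    # LSF: show the slots and time windows currently available for back-filling
    bslots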

Job Priority

Most schedulers have another capability that further enhances their utility. Running jobs on a strictly FCFS basis may not be fair to all users: if one user submits 100 jobs early in the morning, other users will have to wait for that first user's jobs to complete before their own jobs can run.

Hence, modern schedulers provide a fairness component that re-prioritizes the queued jobs, giving an advantage to under-served users by moving their jobs higher in the queue. In addition, other priority components consider factors such as the size of the job (the number of nodes it requests) or the job's importance.
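
On a Slurm-managed cluster, for example, the components that make up each pending job's priority (fair-share, age, job size, and so on) and a user's fair-share standing can be inspected with the commands below; LSF exposes similar information through its queue and job query commands.

    # Show the weighted factors (age, fair-share, job size, ...) behind each pending job's priority
    sprio -l

    # Show fair-share usage and the resulting share standing for your accounts
    sshare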

Queues

While a single job queue could service an entire cluster, an HPC cluster is typically partitioned into pools of node resources, each with its own job queue. While most of a cluster's node resources are dedicated to running production jobs, some nodes are typically set aside for debugging applications. These two uses, production and debugging, are at cross purposes: production runs tend to be full sized jobs that last multiple hours, while debugging sessions are typically smaller and more short lived. Hence the batch queue is configured with wall clock and job size limits that favor production jobs, and the debug queue has shorter time and smaller size limits that ensure more immediate access to resources for smaller, quick running jobs. Users specify the appropriate queue when the job is submitted.
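
As an illustration, on LC clusters the production and debug pools are conventionally named pbatch and pdebug; the names and limits vary by cluster, so treat the values below as placeholders.

    # LSF: list the queues, then show the wall clock and size limits on the debug queue
    bqueues
    bqueues -l pdebug

    # LSF: submit a short test job to the debug queue
    bsub -q pdebug < my_job.lsf

    # Slurm equivalents: list partitions, then submit to the debug partition
    sinfo -s
    sbatch -p pdebug my_job.sh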

Policy

The scheduler enforces policy. Each HPC computing center establishes policies, or rules, governing which jobs can run and under what conditions: there are limits on the size of a job, the length of time it is allowed to run, which running jobs can be preempted and for what reasons, and so on. The scheduler commonly provides commands for displaying the limits, access permissions, and service agreements it enforces.
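
Which commands display those policies depends on the batch system; the following are representative examples, with partition and queue names as placeholders.

    # Slurm: show the limits configured on a partition and the defined quality-of-service levels
    scontrol show partition pbatch
    sacctmgr show qos

    # LSF: show a queue's policies in detail, plus any configured resource allocation limits
    bqueues -l pbatch
    blimits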

Running a Job

When a job is scheduled to run, the requested compute resources are allocated to it, and no other job will run on those resources for the duration of that job's run. The job script is copied to, and executed on, the first compute node in the allocation or, in the case of the CORAL systems, a shared launch node. The application is then launched across the allocated resources as specified by the commands in the job script. Typically, multiple tasks are launched across multiple compute cores, GPUs, and nodes, all confined to that job's allocation.
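
For example, the launch line inside the job script typically uses srun on a Slurm cluster and jsrun on the CORAL systems; both spread the requested tasks across the job's allocation. The task counts and application name below are placeholders.

    # Slurm: launch 32 MPI tasks spread over the job's 4 allocated nodes
    srun -N 4 -n 32 ./my_app

    # LSF/CORAL: launch 4 resource sets (one per node), each with 4 tasks, 16 CPU cores, and 4 GPUs
    jsrun -n 4 -a 4 -c 16 -g 4 ./my_app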

Resource Manager

The agent of the batch system that launches the job's application works in concert with the resource manager. The resource manager provides the infrastructure to control and monitor the job and to collect statistics from all of the processes running the job's tasks. These statistics are gathered, aggregated, and ultimately saved to a database that contains a record of that job's run.

When a user's job script is crafted to launch multiple applications and/or multiple invocations of an application, each such launch is termed a job step. The statistics for each job step are also captured and recorded.
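
With Slurm, for example, the recorded statistics for a job and each of its job steps can be queried from the accounting database after (or during) the run; the job ID below is illustrative, and the fields shown are a small subset of what is recorded.

    # Show each step of job 123456 with its elapsed time, exit code, and peak memory use
    sacct -j 123456 --format=JobID,JobName,Elapsed,ExitCode,MaxRSS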

Batch System Commands

The batch system provides a collection of commands for users to interact with the scheduler and resource manager. There are commands to submit a job, display the job queue, and see the status and details of the job itself. In addition, there are commands that provide information on the computing resources, showing which are allocated, which are idle, and which are off-line or down.
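
A few representative commands from the two batch systems discussed below are listed here as a sketch; each has many more options than shown, and job script names and job IDs are placeholders.

    sbatch my_job.sh     # Slurm: submit a job script             (LSF: bsub < my_job.lsf)
    squeue               # Slurm: display the job queue           (LSF: bjobs)
    sinfo                # Slurm: show node/partition status      (LSF: bhosts, bqueues)
    scancel <jobid>      # Slurm: cancel a queued or running job  (LSF: bkill <jobid>)
    sacct -j <jobid>     # Slurm: job accounting records          (LSF: bhist -l <jobid>)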

Connecting the Dots

On Livermore Computing (LC) CTS clusters, Slurm serves as both the scheduler and resource manager. On our Sierra clusters, IBM Spectrum LSF is the job scheduler and IBM's Cluster System Management (CSM) is the resource manager.