Running Jobs

IMPORTANT: All CORAL systems currently use ssh for task launch, so you must have passwordless ssh keys set up before your jobs will run successfully. Instructions for setting up ssh keys can be found on this confluence page or by searching the web.

 

Get a dedicated compute node for running parallel compiles, debugging, etc.

$ lalloc 1

The lalloc wrapper script gets an allocation and drops you at a shell prompt on the first compute node in that allocation. Run lalloc -h for details on other options. In particular, if you plan to launch multiple job steps (jsrun / lrun) interactively, we recommend the --shared-launch option: a single failed or cancelled job step can kill the jsmd on the compute node, which prevents you from launching any further job steps from that node.

Submit a batch script to run one or more job steps on one or more compute nodes

$ cat tennode.bsub
#!/bin/bash
#BSUB -nnodes 10
#BSUB -q pbatch

lrun -T4 myapp input1

$ bsub tennode.bsub

The lrun wrapper script provides a simple syntax for launching job steps. In this example, lrun -T4 myapp ... tells lrun to launch myapp with 4 tasks on each node in the allocation. lrun also accepts -n<ntasks> and/or -N<nnodes>, and it takes jsrun options for more detailed task layout. You may also use jsrun directly to launch job steps. See the srun vs jsrun page for more details on jsrun options.
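As a quick check of the layout arithmetic: -T sets the number of tasks on every node, so the total task count is the node count times the tasks per node. A minimal sketch for the tennode.bsub example follows; the variable names are illustrative and are not part of lrun.

```shell
# Values taken from the tennode.bsub example above.
nnodes=10          # from "#BSUB -nnodes 10"
tasks_per_node=4   # from "lrun -T4"

# -T gives tasks per node, so the job step starts nnodes * tasks_per_node tasks.
total_tasks=$(( nnodes * tasks_per_node ))
echo "lrun -T${tasks_per_node} on ${nnodes} nodes starts ${total_tasks} tasks"
```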

You can also launch multiple job steps serially or in parallel within a batch script. For example:

$ cat twosteps.bsub
#!/bin/bash
#BSUB -nnodes 10
#BSUB -q pbatch

lrun -N5 -T4 myapp input1 &
lrun -N5 -T4 myapp input2 &
wait

$ bsub twosteps.bsub

This gets a 10-node allocation, then runs myapp with input1 on 5 of those nodes and myapp with input2 on the remaining 5 nodes at the same time. The wait keeps the batch script running until both background job steps have finished.
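A bare wait returns 0 regardless of how the background job steps exited, so a failed step can go unnoticed. One way to track each step's exit status, using only standard shell features, might look like the sketch below. The function name run_two_steps is hypothetical; the lrun lines are the ones from twosteps.bsub.

```shell
#!/bin/bash
#BSUB -nnodes 10
#BSUB -q pbatch

# Hypothetical helper: launch both job steps in parallel, then wait for each
# one by PID so that its individual exit status is visible.
run_two_steps() {
    lrun -N5 -T4 myapp input1 &
    pid1=$!
    lrun -N5 -T4 myapp input2 &
    pid2=$!

    status=0
    wait "$pid1" || status=1   # non-zero if the first step failed
    wait "$pid2" || status=1   # non-zero if the second step failed
    return "$status"
}
```

Calling run_two_steps || exit 1 as the last line of the batch script makes the script's own exit status reflect a failed job step.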

Querying the Queue

The following commands are useful for querying the queue on all LSF systems.

Get a summary of all jobs and partitions on an LSF system

$ lsfjobs

See only your jobs in the queue

$ bquery

See all the jobs in the queue

$ bquery -u all

List queued jobs displaying the fields that are important to you

$ man bquery

and scroll to the "Output fields for bquery" list under the -o option. Then set the LSB_BQUERY_FORMAT environment variable to the fields you would like to see.

For example, for bash:

$ export LSB_BQUERY_FORMAT="id:- user:-8 user_group:- queue:- nexec_host:- stat: start_time: run_time: finish_time: priority: exec_host:32"

and for csh:

$ setenv LSB_BQUERY_FORMAT "id:- user:-8 user_group:- queue:- nexec_host:- stat: start_time: run_time: finish_time: priority: exec_host:32"

Now run bquery again, this time adding the -u all option to see all users' jobs:

$ bquery -u all
  JOBID     USER      USER_GROUP      QUEUE NEXEC_HOST STAT  START_TIME   RUN_TIME           FINISH_TIME    JOB_PRIORITY EXEC_HOST
   4136   arnold          guests     exempt          1 RUN   Dec 11 16:16 2765200 second(s)  Feb  9 16:16 L 515          20*ray44
   6109     mike          guests     pbatch          1 RUN   Jan 12 15:57 1545 second(s)     Jan 12 16:27 L 512          2*ray51
   6115    susan          guests     pbatch          1 RUN   Jan 12 16:13 596 second(s)      Jan 12 16:43 L 512          ray28
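
Because this format puts one job per line with whitespace-separated leading columns, the output is easy to post-process. As a sketch, the hypothetical helper below counts jobs per queue. It assumes the LSB_BQUERY_FORMAT shown above, in which QUEUE is the 4th column, and it skips the header line.

```shell
# Hypothetical helper: read bquery output on stdin and print "<queue> <count>".
# Assumes QUEUE is the 4th whitespace-separated column, as in the format above.
jobs_per_queue() {
    awk 'NR > 1 && NF >= 4 { count[$4]++ }
         END { for (q in count) print q, count[q] }' | sort
}

# Usage on a system with bquery available:
#   bquery -u all | jobs_per_queue
```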

Display details about a specific job

$ bquery -l jobid
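
The jobid used here and in the commands that follow is the number reported when the job is submitted. A minimal sketch for capturing it in a script follows. It assumes bsub prints the usual LSF confirmation line of the form "Job <N> is submitted to queue <Q>."; the function name extract_jobid is hypothetical.

```shell
# Hypothetical helper: pull the numeric job ID out of bsub's confirmation line.
# Assumes a line such as: Job <6109> is submitted to queue <pbatch>.
extract_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Usage on a system with bsub and bquery available:
#   jobid=$(bsub tennode.bsub | extract_jobid)
#   bquery -l "$jobid"
```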

Display the job script for one of your jobs

$ cat jobid.out

LSF inserts your batch script into your job's output file.

Show all the jobs you have run today

$ bhist -d

List the charge accounts you are permitted to use (bsub -G option)

$ lshare -u username

Display the factors contributing to a pending job's assigned priority

$ bquery -prio jobid

Cancel a job, whether it is pending in the queue or running

$ bkill jobid

Send a signal to a running job

For example, send SIGUSR1:

$ bkill -s USR1 jobid
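
A signal is only useful if something in the job handles it. The sketch below shows one way a batch script could catch SIGUSR1 with the shell's trap builtin. It assumes the signal sent by bkill reaches the batch script's shell, which is worth verifying on your system; the handler name on_usr1 and what it does are hypothetical.

```shell
#!/bin/bash
# Sketch only: assumes the signal sent by "bkill -s USR1" reaches this shell.
got_usr1=0

on_usr1() {
    # Hypothetical handler: record that the signal arrived. A real script
    # might ask the application to write a checkpoint here.
    got_usr1=1
    echo "received SIGUSR1" >&2
}

# Run on_usr1 whenever this shell receives SIGUSR1.
trap on_usr1 USR1
```

The shell runs a trap only after the current foreground command returns, so a script that needs to react promptly would start its job step in the background (lrun ... &) and then wait for it. wait returns early when a trapped signal arrives, so call wait again if the job step should keep running.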

Display the queues available

$ bqueues

Display details about all the queues

$ bqueues -l