The El Capitan systems are managed by the Flux resource scheduler. When a set of parallel processes is launched on a node, LC uses the MPIBind tool to map those processes to the underlying hardware resources.

Transitioning from Slurm to Flux

Many users will not need to take any action to transition from Slurm to Flux; instead, they can interact with the scheduler through the flux_wrappers package.

By default, the flux_wrappers package is loaded when users log in. These wrappers provide "slurm-like" commands that wrap underlying Flux commands. Available commands include ‘srun’, ‘sbatch’, ‘salloc’, ‘sxterm’, ‘scancel’, ‘squeue’, ‘showq’, and ‘sinfo’. You can add a ‘-v’ flag to most of these commands to see the Flux command that is being executed.

Demo of flux_wrappers

# flux_wrappers provide a slurm-like interface to flux commands
$ which sinfo
/usr/global/tools/flux_wrappers/bin/sinfo
 
# show underlying flux command with `-v`
$ sinfo -v
#running : flux resource list
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
...

Learn Flux

Flux is built from the ground up for complex and fine-grained scheduling needs. It is particularly useful for regression testing or UQ-style (uncertainty quantification) pipelines.

Flux resources:

  • LC's Batch System Cross Reference Guide. This is a quick reference for translating between different batch scheduling systems.
  • Check out the Flux Cheatsheet
  • Convert existing Slurm sbatch scripts to Flux format using the slurm2flux utility on LC systems (see the sketch after this list). This utility makes the following translations:
    • #SBATCH options become #flux options
    • srun commands become flux run commands.
  • Flux Tutorial for LC users
  • LC user meeting presentation on Current State of Flux in LC, July 26, 2022
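
As a concrete illustration of the slurm2flux translation rules, here is a minimal sketch. The option values and the 30m time unit are illustrative assumptions; run slurm2flux on an LC system for the authoritative conversion.

# original Slurm script (illustrative options)
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 16
#SBATCH -t 30
srun ./a.out

# rough Flux equivalent
#!/bin/bash
#flux: -N 2
#flux: -n 16
#flux: -t 30m
flux run -n 16 ./a.out

The converted script can then be submitted with flux batch (or the sbatch wrapper).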

Get Help with Flux

Our Flux team supports users in a dedicated Mattermost channel under the DOE-wide HPC Mattermost team (invite link). After joining through the invite link, join the ‘flux’ channel.

MPIBind

LC uses MPIBind by default. It tries to do the "right" thing, dividing all GPUs and CPUs evenly across the tasks. You can disable MPIBind, but it is not recommended.
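
If you do need to disable the binding for a particular job, a minimal sketch uses the same --setopt mechanism shown in the binding example below (the off value is an assumption based on LC's mpibind plugin options):

$ flux run -N1 -n16 --setopt=mpibind=off ./a.out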

Learn MPIBind

Understanding MPIBind Bindings

The following examples use a simple test program to demonstrate how a program is launched and bound to the hardware. The simple MPI program is written in C and prints out a message from each MPI rank.
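
A minimal sketch of such a program follows; the exact source is an assumption, but the file name simple.c matches the compile line below and the message format matches the output shown later.

#include <mpi.h>
#include <stdio.h>

/* Each MPI rank reports the total task count and its own rank. */
int main(int argc, char *argv[])
{
    int ntasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Number of tasks= %d My rank= %d\n", ntasks, rank);

    MPI_Finalize();
    return 0;
}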

This can be compiled with

$ cc simple.c

Flux Binding Example

The key when using Flux is to add the --exclusive flag. This flag ensures that the job gets access to ALL resources on the allocated nodes. Without it, users may find that they are given access to only a subset of the available resources.

$ flux run -n 16 --verbose --exclusive --nodes=1 --setopt=mpibind=verbose:1 ./a.out 
jobid: fA2p8hQE3
0.064s: flux-shell[0]: mpibind:
mpibind: task  0 nths  4 gpus 4 cpus 0-3
mpibind: task  1 nths  4 gpus 4 cpus 4-7
mpibind: task  2 nths  4 gpus 5 cpus 8-11
mpibind: task  3 nths  4 gpus 5 cpus 12-15
mpibind: task  4 nths  4 gpus 2 cpus 16-19
mpibind: task  5 nths  4 gpus 2 cpus 20-23
mpibind: task  6 nths  4 gpus 3 cpus 24-27
mpibind: task  7 nths  4 gpus 3 cpus 28-31
mpibind: task  8 nths  4 gpus 6 cpus 32-35
mpibind: task  9 nths  4 gpus 6 cpus 36-39
mpibind: task 10 nths  4 gpus 7 cpus 40-43
mpibind: task 11 nths  4 gpus 7 cpus 44-47
mpibind: task 12 nths  4 gpus 0 cpus 48-51
mpibind: task 13 nths  4 gpus 0 cpus 52-55
mpibind: task 14 nths  4 gpus 1 cpus 56-59
mpibind: task 15 nths  4 gpus 1 cpus 60-63
Number of tasks= 16 My rank= 9
Number of tasks= 16 My rank= 8
Number of tasks= 16 My rank= 15
Number of tasks= 16 My rank= 14
Number of tasks= 16 My rank= 13
Number of tasks= 16 My rank= 12
Number of tasks= 16 My rank= 11
Number of tasks= 16 My rank= 10
Number of tasks= 16 My rank= 7
Number of tasks= 16 My rank= 6
Number of tasks= 16 My rank= 4
Number of tasks= 16 My rank= 1
Number of tasks= 16 My rank= 0
Number of tasks= 16 My rank= 3
Number of tasks= 16 My rank= 2
Number of tasks= 16 My rank= 5
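
In this layout each of the 16 tasks received 4 CPUs and 4 threads (nths 4), and each of the node's 8 GPUs is shared by two tasks, so CPUs 0-63 and all GPUs are divided evenly across the job, as described above.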