The El Capitan systems are subject to LC policies. Users are encouraged to visit our pages on Allocations and Priority as well as the individual job limits policies for each system (Tuolumne, RZAdams, El Capitan). For requests outside of these limits, please use the ASC Dedicated Access Time request form.
The El Capitan systems are managed by the Flux resource scheduler. When a set of parallel processes is started on a specific node, LC uses the mpibind tool to map processes to the underlying hardware resources.
Banks
Users must have a bank (also known as "project" or "charge account") in order to run jobs.
Users and PIs can view reports of bank usage with the following commands:
- flux account view-user <username> — View detailed flux accounting information for a specific user.
- bankinfo — Similar to mshare, this utility displays bank usage and related information in the full tree structure.
- quickreport — Similar to lreport, this utility can show historical bank usage information, including per-user usage.
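For example, a user might check their own record or the bank tree directly from the command line (illustrative invocations only; output is omitted and jdoe is a placeholder username):

# detailed flux-accounting record for one user
$ flux account view-user jdoe

# bank usage across the full bank tree (LC utility, similar to mshare)
$ bankinfo

# historical bank usage, including per-user breakdowns (LC utility, similar to lreport)
$ quickreport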
Standby
The El Capitan systems have implemented a special bank, standby, which is available to users. This bank provides a Quality of Service mode similar to Slurm's standby QOS: jobs run with the --bank=standby flag are given the lowest priority and will be preempted by any subsequently scheduled jobs.
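For example, a long-running but interruption-tolerant job could be submitted against the standby bank as follows (an illustrative invocation; the script name, node count, and time limit are placeholders):

# lowest priority; may be preempted by jobs submitted later under other banks
$ flux batch --bank=standby -N 4 -t 8h ./standby_job.sh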
Transitioning from Slurm to Flux
Many users may not need to take any action to transition from Slurm to Flux; instead, they can interact with the system through the flux_wrappers. These scripts allow you to get up and running quickly with simple jobs while you learn Flux.
By default, the flux_wrappers package is loaded when users log in. These wrappers provide "Slurm-like" commands which wrap underlying Flux commands. Available commands include `srun`, `sbatch`, `salloc`, `sxterm`, `scancel`, `squeue`, `showq`, and `sinfo`. You can add a `-v` flag to most of these commands to see the Flux command that is being executed.
Demo of flux_wrappers
The following demonstrates command-line use of the Slurm-like commands provided by flux_wrappers.
# flux_wrappers provide a slurm-like interface to flux commands
$ which sinfo
/usr/global/tools/flux_wrappers/bin/sinfo

# show underlying flux command with `-v`
$ sinfo -v
#running : flux resource list
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
...
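Once you are comfortable, the native Flux commands are just as short. The mapping below is a quick, informal reference based on standard Flux usage; see LC's Batch System Cross-Reference Guide (linked below) for the authoritative translation table:

# wrapper (Slurm-like)        native Flux equivalent
# srun -n 8 ./a.out       ->  flux run -n 8 ./a.out
# sbatch script.sh        ->  flux batch script.sh
# salloc -N 2             ->  flux alloc -N 2
# squeue                  ->  flux jobs
# scancel <jobid>         ->  flux cancel <jobid>
# sinfo                   ->  flux resource list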
Converting sbatch Scripts
LC provides the slurm2flux utility which will make the following translations:
- #SBATCH options will become #flux options
- srun commands will become flux run commands, as sketched below.
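As a sketch of what this translation looks like (illustrative only; the exact directives emitted by slurm2flux may differ):

# original Slurm batch script
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 30
srun -n 8 ./a.out

# translated Flux batch script
#!/bin/bash
#flux: -N 2
#flux: -t 30m
flux run -n 8 ./a.out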
Using Flux
Flux is built from the ground up for complex and fine-grained scheduling needs. It is particularly useful for regression testing or UQ-style (uncertainty quantification) pipelines.
New flux users are encouraged to visit these pages:
- Flux QuickStart Guide for LC users
- LC's Batch System Cross Reference Guide. This is a quick reference for translating between different batch scheduling systems.
- In-depth Flux Tutorial for LC users
Additional Resources
- Check out the Flux Cheatsheet
- Reference the flux-core manual pages, which include extensive documentation on the Flux Python API
- Manual pages are available on all of the LC TOSS4 systems using the man command
- LC user meeting presentation on Current State of Flux in LC, July 26, 2022
Get Help with Flux
Our Flux team supports users in a dedicated Mattermost channel under the DOE-wide HPC Mattermost team (invite link). Join the 'flux' channel after joining the team through the invite link.
mpibind
LC uses mpibind by default. It will try to do the "right" thing and evenly divide all GPUs and CPUs across the number of tasks. You can disable mpibind, but it is not recommended.
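If you do need to turn it off, for example to manage process and GPU affinity yourself, the mpibind shell plugin can be disabled per job. The invocation below is a sketch consistent with the --setopt=mpibind usage in the binding example later on this page:

# disable the mpibind plugin for a single job (not generally recommended)
$ flux run -N 1 -n 8 --setopt=mpibind=off ./a.out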
Learn mpibind
- mpibind for Flux users tutorial. Test programs which print out the mapping for MPI and OpenMP programs can be found in the mpibind repo affinity directory
- mpibind tutorial on Discovering node architecture topology
- mpibind tutorial on Flux affinity on the AMD MI300A APU
Understanding mpibind Bindings
The following examples use a simple test program to demonstrate how a program is launched and bound to the hardware. The simple MPI program is written in C and prints out a message from each MPI rank.
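The exact test program is not reproduced here; a minimal sketch of such a program (MPI only, matching the "Number of tasks= ... My rank= ..." output shown below) might look like:

/* simple.c: print a message from each MPI rank */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int ntasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Number of tasks= %d My rank= %d\n", ntasks, rank);

    MPI_Finalize();
    return 0;
}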
This can be compiled with
$ cc simple.c
Flux Binding Example
For workflows that aren't trying to run multiple jobs per node, the key when using Flux with mpibind is to add the --exclusive flag to your flux run or flux submit command. This flag ensures that the job gets access to ALL resources on the allocated nodes. Without this flag, Flux will give the job only the exact resources that it requested, which limits mpibind's ability to spread tasks across the node's resources.
$ flux run -n 16 --verbose --exclusive --nodes=1 --setopt=mpibind=verbose:1 ./a.out
jobid: fA2p8hQE3
0.064s: flux-shell[0]: mpibind: mpibind: task 0 nths 4 gpus 4 cpus 0-3
mpibind: task 1 nths 4 gpus 4 cpus 4-7
mpibind: task 2 nths 4 gpus 5 cpus 8-11
mpibind: task 3 nths 4 gpus 5 cpus 12-15
mpibind: task 4 nths 4 gpus 2 cpus 16-19
mpibind: task 5 nths 4 gpus 2 cpus 20-23
mpibind: task 6 nths 4 gpus 3 cpus 24-27
mpibind: task 7 nths 4 gpus 3 cpus 28-31
mpibind: task 8 nths 4 gpus 6 cpus 32-35
mpibind: task 9 nths 4 gpus 6 cpus 36-39
mpibind: task 10 nths 4 gpus 7 cpus 40-43
mpibind: task 11 nths 4 gpus 7 cpus 44-47
mpibind: task 12 nths 4 gpus 0 cpus 48-51
mpibind: task 13 nths 4 gpus 0 cpus 52-55
mpibind: task 14 nths 4 gpus 1 cpus 56-59
mpibind: task 15 nths 4 gpus 1 cpus 60-63
Number of tasks= 16 My rank= 9
Number of tasks= 16 My rank= 8
Number of tasks= 16 My rank= 15
Number of tasks= 16 My rank= 14
Number of tasks= 16 My rank= 13
Number of tasks= 16 My rank= 12
Number of tasks= 16 My rank= 11
Number of tasks= 16 My rank= 10
Number of tasks= 16 My rank= 7
Number of tasks= 16 My rank= 6
Number of tasks= 16 My rank= 4
Number of tasks= 16 My rank= 1
Number of tasks= 16 My rank= 0
Number of tasks= 16 My rank= 3
Number of tasks= 16 My rank= 2
Number of tasks= 16 My rank= 5