The El Capitan systems are subject to LC policies. Users are encouraged to visit our pages on Allocations and Priority as well as the individual job limits policies for each system (Tuolumne, RZAdams, El Capitan). For requests outside of these limits, please use the ASC Dedicated Access Time request form.
The El Capitan systems are managed by the Flux resource scheduler. When a set of parallel processes is started on a specific node, LC uses the mpibind tool to map processes to the underlying hardware resources.
Banks
Users must have a bank (also known as "project" or "charge account") in order to run jobs.
Users and PIs can view reports of bank usage with the following commands:
- flux account view-user <username> — View detailed flux accounting information for a specific user.
- bankinfo — Similar to mshare, this utility displays bank usage and related information in the full tree structure.
- quickreport — Similar to lreport, this utility can show historical bank usage information, including per-user usage.
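For example, a user might check their own record or the bank tree directly from the command line (illustrative invocations only; output is omitted and jdoe is a placeholder username):

# detailed flux-accounting record for one user
$ flux account view-user jdoe

# bank usage across the full bank tree (LC utility, similar to mshare)
$ bankinfo

# historical bank usage, including per-user breakdowns (LC utility, similar to lreport)
$ quickreport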
Standby
The El Capitan systems have implemented a special bank, standby, which is available to users. This bank provides a Quality of Service mode similar to Slurm's standby QOS: jobs run with the --bank=standby flag are given the lowest priority and will be preempted by any subsequently scheduled jobs.
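For example, a long-running but interruption-tolerant job could be submitted against the standby bank as follows (an illustrative invocation; the script name, node count, and time limit are placeholders):

# lowest priority; may be preempted by jobs submitted later under other banks
$ flux batch --bank=standby -N 4 -t 8h ./standby_job.sh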
Transitioning from Slurm to Flux
Many users may not need to take any action to transition from Slurm to Flux; instead, they can interact with the system through the flux_wrappers. These scripts allow you to get up and running quickly with simple jobs while you learn Flux.
By default, the flux_wrappers package is loaded when users log in. These wrappers provide "Slurm-like" commands which wrap underlying Flux commands. Available commands include `srun`, `sbatch`, `salloc`, `sxterm`, `scancel`, `squeue`, `showq`, and `sinfo`. You can add a `-v` flag to most of these commands to see the Flux command that is being executed.
Demo of flux_wrappers
The following demonstrates command-line use of the Slurm-like commands provided by flux_wrappers.
# flux_wrappers provide a slurm-like interface to flux commands
$ which sinfo
/usr/global/tools/flux_wrappers/bin/sinfo

# show underlying flux command with `-v`
$ sinfo -v
#running : flux resource list
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
...
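Once you are comfortable, the native Flux commands are just as short. The mapping below is a quick, informal reference based on standard Flux usage; see LC's Batch System Cross-Reference Guide (linked below) for the authoritative translation table:

# wrapper (Slurm-like)        native Flux equivalent
# srun -n 8 ./a.out       ->  flux run -n 8 ./a.out
# sbatch script.sh        ->  flux batch script.sh
# salloc -N 2             ->  flux alloc -N 2
# squeue                  ->  flux jobs
# scancel <jobid>         ->  flux cancel <jobid>
# sinfo                   ->  flux resource list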
Converting sbatch Scripts
LC provides the slurm2flux utility which will make the following translations:
- #SBATCH options will become #flux options
- srun commands will become flux run commands, as sketched below.
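As a sketch of what this translation looks like (illustrative only; the exact directives emitted by slurm2flux may differ):

# original Slurm batch script
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 30
srun -n 8 ./a.out

# translated Flux batch script
#!/bin/bash
#flux: -N 2
#flux: -t 30m
flux run -n 8 ./a.out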
Using Flux
Flux is built from the ground up for complex and fine-grained scheduling needs. It is particularly useful for regression testing or UQ-style (uncertainty quantification) pipelines.
New flux users are encouraged to visit these pages:
- Flux QuickStart Guide for LC users
- LC's Batch System Cross Reference Guide. This is a quick reference for translating between different batch scheduling systems.
- In-depth Flux Tutorial for LC users
Additional Resources
- Check out the Flux Cheatsheet
- Reference the flux-core manual pages, which include extensive documentation on the Flux Python API
- Manual pages are available on all of the LC TOSS4 systems using the man command
- LC user meeting presentation on Current State of Flux in LC, July 26, 2022
Get Help with Flux
Our Flux team supports users in a dedicated Mattermost channel under the DOE-wide HPC Mattermost team (invite link). Join the 'flux' channel after joining the team through the invite link.
mpibind
LC uses mpibind by default. It will try to do the "right" thing and evenly divide all GPUs and CPUs across the number of tasks. You can disable mpibind, but it is not recommended.
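If you do need to turn it off, for example to manage process and GPU affinity yourself, the mpibind shell plugin can be disabled per job. The invocation below is a sketch consistent with the --setopt=mpibind usage in the binding example later on this page:

# disable the mpibind plugin for a single job (not generally recommended)
$ flux run -N 1 -n 8 --setopt=mpibind=off ./a.out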
Learn mpibind
- mpibind for Flux users tutorial. Test programs which print out the mapping for MPI and OpenMP programs can be found in the mpibind repo affinity directory
- mpibind tutorial on Discovering node architecture topology
- mpibind tutorial on Flux affinity on the AMD MI300A APU
Understanding mpibind Bindings
The following examples use a simple test program to demonstrate how a program is launched and bound to the hardware. The simple MPI program is written in C and prints out a message from each MPI rank.
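The exact test program is not reproduced here; a minimal sketch of such a program (MPI only, matching the "Number of tasks= ... My rank= ..." output shown below) might look like:

/* simple.c: print a message from each MPI rank */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int ntasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Number of tasks= %d My rank= %d\n", ntasks, rank);

    MPI_Finalize();
    return 0;
}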
This can be compiled with
$ cc simple.c
Flux Binding Example
For workflows that aren't trying to run multiple jobs per node, the key when using Flux with mpibind is to add the --exclusive flag to your flux run or flux submit command. This flag ensures that the job gets access to ALL resources on the allocated nodes. Without this flag, Flux will give the job only the exact resources that it requested, which limits mpibind's ability to spread tasks across the node's resources.
$ flux run -n 16 --verbose --exclusive --nodes=1 --setopt=mpibind=verbose:1 ./a.out
jobid: fA2p8hQE3
0.064s: flux-shell[0]: mpibind: mpibind: task 0 nths 4 gpus 4 cpus 0-3
mpibind: task 1 nths 4 gpus 4 cpus 4-7
mpibind: task 2 nths 4 gpus 5 cpus 8-11
mpibind: task 3 nths 4 gpus 5 cpus 12-15
mpibind: task 4 nths 4 gpus 2 cpus 16-19
mpibind: task 5 nths 4 gpus 2 cpus 20-23
mpibind: task 6 nths 4 gpus 3 cpus 24-27
mpibind: task 7 nths 4 gpus 3 cpus 28-31
mpibind: task 8 nths 4 gpus 6 cpus 32-35
mpibind: task 9 nths 4 gpus 6 cpus 36-39
mpibind: task 10 nths 4 gpus 7 cpus 40-43
mpibind: task 11 nths 4 gpus 7 cpus 44-47
mpibind: task 12 nths 4 gpus 0 cpus 48-51
mpibind: task 13 nths 4 gpus 0 cpus 52-55
mpibind: task 14 nths 4 gpus 1 cpus 56-59
mpibind: task 15 nths 4 gpus 1 cpus 60-63
Number of tasks= 16 My rank= 9
Number of tasks= 16 My rank= 8
Number of tasks= 16 My rank= 15
Number of tasks= 16 My rank= 14
Number of tasks= 16 My rank= 13
Number of tasks= 16 My rank= 12
Number of tasks= 16 My rank= 11
Number of tasks= 16 My rank= 10
Number of tasks= 16 My rank= 7
Number of tasks= 16 My rank= 6
Number of tasks= 16 My rank= 4
Number of tasks= 16 My rank= 1
Number of tasks= 16 My rank= 0
Number of tasks= 16 My rank= 3
Number of tasks= 16 My rank= 2
Number of tasks= 16 My rank= 5