LBANN's hpc-launcher
We encourage you to try hpc-launcher on LC systems. It is especially valuable on our AMD GPU systems, where hpc-launcher can improve performance and even prevent hangs, particularly in multi-node jobs.
hpc-launcher is a tool provided by the LBANN project. It uses profiles for specific machines and architectures to perform base configuration, such as setting environment variables, which can yield performance gains. The degree of gain varies by machine and application. On MI300A systems (Tuo, RZAdams, El Cap) and MI250X systems (Tioga, Tenaya), some small tests have shown order-of-magnitude performance improvements.
Getting started
To use hpc-launcher, first install it into your Python virtual environment:
pip install hpc-launcher
If you are using WEAVE, hpc-launcher is already installed.
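If you are not using WEAVE, a typical setup might look like the following; the virtual environment name is a placeholder for illustration:

python3 -m venv hpc-launcher-env          # create a virtual environment (any name/location works)
source hpc-launcher-env/bin/activate      # activate it
pip install hpc-launcher                  # install hpc-launcher into the active environment
which launch                              # confirm the launch command is on your PATH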
Then, use the launch command in lieu of flux run or srun in your batch submission script.
If you are on a system running Flux (including all AMD GPU systems), the following two commands make the same request for resources and run the same script, with the second using hpc-launcher:
flux run -N 2 --tasks-per-node=2 --gpus-per-task=1 --exclusive python3 -u dist-train-flux.py
launch --scheduler flux -N 2 -n 2 --gpus-per-proc 1 --exclusive --comm-backend rccl -v python3 -u dist-train-flux.py
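As a minimal sketch, a Flux batch submission could wrap the launch command in a script; the script name and virtual environment path below are placeholders for illustration:

flux batch -N 2 --exclusive train-job.sh

where train-job.sh contains:

#!/bin/bash
# Activate the environment where hpc-launcher is installed (placeholder path)
source /path/to/my-venv/bin/activate
# Use launch in place of flux run inside the batch script
launch --scheduler flux -N 2 -n 2 --gpus-per-proc 1 --exclusive --comm-backend rccl -v python3 -u dist-train-flux.py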
If you are on a system running Slurm, the following two commands make the same request for resources and run the same script, with the latter using hpc-launcher:
srun --export=ALL -N 2 --ntasks-per-node=2 --gpus-per-task=1 --exclusive python3 -u dist-train-slurm.py
launch --scheduler slurm -N 2 -n 2 --gpus-per-proc 1 --exclusive --comm-backend nccl -v python3 -u dist-train-slurm.py
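Similarly, a minimal Slurm sketch (again with placeholder script name and virtual environment path) could be submitted with sbatch train-job.sh, where train-job.sh contains:

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive
# Activate the environment where hpc-launcher is installed (placeholder path)
source /path/to/my-venv/bin/activate
# Use launch in place of srun inside the batch script
launch --scheduler slurm -N 2 -n 2 --gpus-per-proc 1 --exclusive --comm-backend nccl -v python3 -u dist-train-slurm.py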
Example scripts will be provided soon.
