PyTorch on AMD GPU Systems Quickstart Guide

Note: The following instructions work only on these LC AMD systems: Tioga, Tuolumne, RZAdams, RZVernal, El Capitan, and Tenaya. Corona users, please see Corona's PyTorch Quickstart.

Quickstart

For our systems using the MI250X or MI300A, we now recommend you use the public wheels from PyTorch. A typical workflow looks like this:

$> module load python/3.13.2
$> virtualenv pytorch
$> source pytorch/bin/activate
$> pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2

These installations perform well on our systems, provided you set the appropriate environment variables and Flux options. An example is provided below.

Test that the installation worked:

To check whether PyTorch is installed and whether GPUs are visible, run the following command from the command line:

python -c 'import torch ; print(torch.rand(5, 3)) ; print("Torch Version", torch.__version__) ; print("GPU available:", torch.cuda.is_available())'

On a node with GPUs, output should look something like:

tensor([[0.0796, 0.2218, 0.8005],
        [0.7947, 0.3835, 0.9008],
        [0.8714, 0.7890, 0.6630],
        [0.6062, 0.7453, 0.7118],
        [0.7487, 0.2672, 0.4115]])
Torch Version 2.12.0+rocm7.2
GPU available: True

Running the same thing in a Python REPL gives you:

>>> import torch
>>> print(torch.rand(5, 3))
tensor([[0.2399, 0.4855, 0.4793],
        [0.0691, 0.5013, 0.8669],
        [0.9730, 0.7977, 0.2821],
        [0.1011, 0.7830, 0.1502],
        [0.6469, 0.7673, 0.8410]])
>>> print("Torch Version", torch.__version__)
Torch Version 2.12.0+rocm7.2
>>> print("GPU available:", torch.cuda.is_available())
GPU available: True
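
For a slightly more detailed check, the sketch below collects the same information into a dictionary and also reports `torch.version.hip` and the names of the visible devices. The function name `describe_torch` is illustrative and not part of PyTorch. The script reports a missing PyTorch install rather than failing.

```python
import importlib


def describe_torch():
    """Summarize the PyTorch install and the GPUs visible to this process."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"installed": False}

    gpu_ok = torch.cuda.is_available()
    count = torch.cuda.device_count() if gpu_ok else 0
    return {
        "installed": True,
        "version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds, None otherwise.
        "hip": getattr(torch.version, "hip", None),
        "gpu_available": gpu_ok,
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

On a login node without GPUs, `gpu_available` is expected to be False and `devices` empty; run the script inside a GPU allocation to see the devices.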

Using PyTorch on multiple nodes

Wrapper Script: Recommended multi-node config

To get up and running quickly, use this script to wrap your training for best performance. An example training script is provided here.

In the future, the LBANN team's hpc-launcher tool will be able to wrap PyTorch's torchrun executable to launch jobs on our CORAL2 systems. The tool already works for jobs that do not use torchrun.

Without the wrapper: important settings to consider

For good PyTorch performance across multiple nodes, run something like the following:

module load rocm/7.2.1 
module load rccl
pip install mpi4py==4.1.0.dev0+mpich.9.1.0 # if mpi4py is needed
export LD_LIBRARY_PATH=/collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/rocm-7.2.0/install/lib:$LD_LIBRARY_PATH

These commands are for PyTorch wheels built against ROCm 7.2. Spindle will be loaded for you automatically.

More details below.

Loading ROCm

To run distributed PyTorch, load the appropriate ROCm module at runtime. For example,

module load rocm/7.2.1

If everything is set up correctly, then running broadcast-ddp.py via

flux run -N 2 --tasks-per-node=4 -q debug -t 5m --exclusive python3 broadcast-ddp.py

will yield output like

flux-job: f3sLNcuLGTEX started
Rank 3/8: tensor after all_reduce = 28
Rank 2/8: tensor after all_reduce = 28
Rank 1/8: tensor after all_reduce = 28
Rank 0/8: tensor after all_reduce = 28
Rank 7/8: tensor after all_reduce = 28
Rank 6/8: tensor after all_reduce = 28
Rank 5/8: tensor after all_reduce = 28
Rank 4/8: tensor after all_reduce = 28
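
The contents of broadcast-ddp.py are not reproduced here. As a rough idea of what such a test does, the following is a minimal sketch, not the actual script: each rank contributes its own rank number and `all_reduce` sums them, which for 8 ranks gives 0 + 1 + ... + 7 = 28, matching the output above. It assumes that mpi4py is installed (used here to share the rank-0 hostname), that Flux exports `FLUX_TASK_RANK` and `FLUX_TASK_LOCAL_ID`, and that port 29500 is free. The helper names are hypothetical.

```python
import os
import socket


def expected_sum(world_size):
    """Sum of rank ids 0..world_size-1, the value every rank should print."""
    return world_size * (world_size - 1) // 2


def main():
    # Imported here so the helper above stays usable without GPUs or MPI.
    import torch
    import torch.distributed as dist
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, world_size = comm.Get_rank(), comm.Get_size()

    # Assumption: rank 0 hosts the rendezvous; its hostname is shared over MPI.
    os.environ["MASTER_ADDR"] = comm.bcast(socket.gethostname(), root=0)
    os.environ["MASTER_PORT"] = "29500"  # assumed free port

    # Assumption: Flux exports FLUX_TASK_LOCAL_ID as the per-node task index.
    local_rank = int(os.environ.get("FLUX_TASK_LOCAL_ID", "0"))
    torch.cuda.set_device(local_rank % torch.cuda.device_count())

    # On ROCm builds of PyTorch, the "nccl" backend is provided by RCCL.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    value = torch.tensor([rank], device="cuda")
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    print(f"Rank {rank}/{world_size}: tensor after all_reduce = {value.item()}")
    assert value.item() == expected_sum(world_size)

    dist.destroy_process_group()


# Only attempt the distributed run when launched by Flux.
if __name__ == "__main__" and "FLUX_TASK_RANK" in os.environ:
    main()
```

The guard on `FLUX_TASK_RANK` keeps the script from attempting a rendezvous when it is run outside `flux run`, for example on a login node.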

RCCL module & RCCL-OFI plug-in

When scaling PyTorch across multiple nodes over the Cray Slingshot network, you must load the RCCL module:

module load rccl

This defaults to rccl/working-env. Loading rccl/fast-env-slows-mpi instead gives additional RCCL performance at the cost of degraded MPI performance.

Separately, good multi-node performance requires a plugin that lets RCCL use the libfabric library. These plugin libraries are available in /collab/usr/global/tools/rccl. Adding the directory that matches your ROCm version to LD_LIBRARY_PATH enables the plugin when you run PyTorch.
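
To confirm that a plugin directory is actually on the search path before launching a job, a small check like the following can help. It is a sketch: the library file name `librccl-net.so` is an assumption about how the plugin is named, and `dirs_providing` is a hypothetical helper, so adjust both to match the directory you selected under /collab/usr/global/tools/rccl.

```python
import os


def dirs_providing(filename, search_path=None):
    """Return the LD_LIBRARY_PATH entries that contain `filename`, in order."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        # Empty entries (from leading or doubled colons) are skipped.
        if entry and os.path.isfile(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Assumption: the RCCL libfabric plugin is named librccl-net.so.
    found = dirs_providing("librccl-net.so")
    print("plugin found in:", found if found else "no LD_LIBRARY_PATH entry")
```

If more than one directory is listed, the first entry is the one the dynamic loader will normally pick up.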

MPI4Py users

We recommend that MPI4Py users install one of our wheels provided here. Wheels compatible with your Python version appear in the output of `pip index versions --pre mpi4py` with a version string that includes `dev0`.

For example,

pip install mpi4py==4.1.0.dev0+mpich.9.1.0

Spindle

For multi-node jobs, LC highly recommends using Spindle to accelerate Python library loading. Spindle is on by default on El Capitan and Tuolumne, but must be enabled manually on other systems, including RZAdams, RZVernal, and Tioga.

Using PyTorch from within a Jupyter notebook

Follow the Orbit and Jupyter notebooks documentation to create a Jupyter kernel from your Python virtual environment. In particular, after creating your virtual environment as described above, you will need to:

1. Install `ipykernel` to your virtual environment.
2. Install your custom kernel to `~/.local`.
3. Manually update LD_LIBRARY_PATH in `kernel.json`.

pip install ipykernel
python -m ipykernel install --prefix=$HOME/.local --name 'mytorchenv' --display-name 'mytorchenv'
echo $LD_LIBRARY_PATH

Use the output of `echo $LD_LIBRARY_PATH` to update `$HOME/.local/share/jupyter/kernels/<yourKernelName>/kernel.json` as shown in the "Custom Kernel ENV" section of Orbit and Jupyter notebooks. Your definition for "env" in kernel.json might look like this:

"env": {
  "LD_LIBRARY_PATH": "/opt/cray/pe/lib64:/opt/cray/lib64:/opt/cray/pe/papi/7.2.0.2/lib64:/opt/cray/libfabric/2.1/lib64:${LD_LIBRARY_PATH}"
},
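
Editing kernel.json by hand is error-prone, so the update can also be scripted. The sketch below is one possible approach, not an LC-provided tool: `set_kernel_ld_library_path` is a hypothetical helper that loads the file, sets LD_LIBRARY_PATH under "env", and writes the file back. It assumes the kernel was installed under the name 'mytorchenv' as in the commands above, and it should be run from a shell where LD_LIBRARY_PATH already has the value you want the kernel to use.

```python
import json
import os
from pathlib import Path


def set_kernel_ld_library_path(kernel_json, ld_library_path):
    """Write ld_library_path into the "env" block of a Jupyter kernel.json."""
    path = Path(kernel_json)
    spec = json.loads(path.read_text())
    # Keep any other variables already defined under "env".
    spec.setdefault("env", {})["LD_LIBRARY_PATH"] = ld_library_path
    path.write_text(json.dumps(spec, indent=2) + "\n")
    return spec


if __name__ == "__main__":
    # Assumption: kernel name 'mytorchenv', installed with --prefix=$HOME/.local.
    target = Path.home() / ".local/share/jupyter/kernels/mytorchenv/kernel.json"
    if target.exists():
        set_kernel_ld_library_path(target, os.environ.get("LD_LIBRARY_PATH", ""))
        print("updated", target)
    else:
        print("kernel.json not found at", target)
```

The script writes the fully expanded path rather than a `${LD_LIBRARY_PATH}` reference, so rerun it if your module environment changes.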