Corona is LC's only system with AMD MI50 GPUs, and the PyTorch instructions for Corona differ from those for our other AMD GPU systems. For more info on MI250X or MI300A systems (Tioga, Tuolumne, RZAdams, RZVernal, El Capitan, and Tenaya), please see our PyTorch on AMD GPU Systems Quickstart Guide.

Installing PyTorch

We recommend that you install PyTorch into a virtual environment on Corona.

Recommended steps:

  1. Load either the python/3.11.5 or python/3.12.2 module.
  2. Create and activate a virtual environment based on this module:
    1. Use python3 -m venv <directory>
    2. Do not use --system-site-packages
  3. Install torch using the --index-url https://download.pytorch.org/whl/rocm6.3 flag, for example:
module load python/3.12.2
python3 -m venv env
source env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

Quick checks

To check whether PyTorch is installed and whether GPUs are visible, run the following command from the command line:

python -c 'import torch ; print(torch.rand(5, 3)) ; print("Torch Version", torch.__version__) ; print("GPU available:", torch.cuda.is_available())'

On a node with GPUs, output should look something like:

tensor([[0.4425, 0.5630, 0.2808],
        [0.9111, 0.6407, 0.5350],
        [0.2498, 0.1461, 0.2608],
        [0.2932, 0.2515, 0.8291],
        [0.5880, 0.4187, 0.7626]])
Torch Version 2.9.1+rocm6.3
GPU available: True
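
As an additional check, the standard torch.cuda query functions work on ROCm builds of PyTorch as well; for example, you can confirm how many GPUs PyTorch sees and what the first one is:

python -c 'import torch ; print("Device count:", torch.cuda.device_count()) ; print("Device name:", torch.cuda.get_device_name(0))'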

Multi-node tests

You can test your ability to run PyTorch on multiple nodes at once by grabbing a multi-node allocation:

flux alloc -N2 -n8 # You can add other flags like `-t10` or `--queue=pdebug`

Within the allocation, run broadcast-no-mpi.py via

MASTER_ADDR=$(flux hostlist --nth=0 --expand instance)
MASTER_PORT=23457   # any free port on MASTER_ADDR
flux run -N 2 --tasks-per-node=4 --exclusive \
 --env=MASTER_ADDR=$MASTER_ADDR \
 --env=MASTER_PORT=$MASTER_PORT \
 python3 -u broadcast-no-mpi.py

This should yield output like:

Rank 0 on node corona171
Rank 4/8 on corona174: tensor = 28
Rank 7/8 on corona174: tensor = 28
Rank 6/8 on corona174: tensor = 28
Rank 5/8 on corona174: tensor = 28
Rank 3/8 on corona171: tensor = 28
Rank 2/8 on corona171: tensor = 28
Rank 1/8 on corona171: tensor = 28
Rank 0/8 on corona171: tensor = 28
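
The contents of broadcast-no-mpi.py are not reproduced here, but as a rough sketch (not the exact script), a program producing output like the above could read its rank and world size from the environment variables Flux sets for each task (FLUX_TASK_RANK, FLUX_JOB_SIZE, FLUX_TASK_LOCAL_ID) and broadcast a tensor from rank 0 over the "nccl" backend (which maps to RCCL on ROCm):

# Hypothetical sketch of broadcast-no-mpi.py; the real script may differ.
import os
import socket
import torch
import torch.distributed as dist

# Flux sets these per task; MASTER_ADDR/MASTER_PORT were exported via flux run --env.
rank = int(os.environ["FLUX_TASK_RANK"])
world_size = int(os.environ["FLUX_JOB_SIZE"])
local_rank = int(os.environ["FLUX_TASK_LOCAL_ID"])

dist.init_process_group("nccl", init_method="env://",
                        rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

if rank == 0:
    print(f"Rank 0 on node {socket.gethostname()}")

# Rank 0 fills the tensor; all other ranks receive its value via broadcast.
t = torch.tensor([28 if rank == 0 else 0], device="cuda")
dist.broadcast(t, src=0)
print(f"Rank {rank}/{world_size} on {socket.gethostname()}: tensor = {t.item()}")

dist.destroy_process_group()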

Next, in the same allocation, try running dist-train-flux-no-mpi.py via

flux run -N 2 --tasks-per-node=2 --exclusive \
 --env=MASTER_ADDR=$MASTER_ADDR \
 --env=MASTER_PORT=$MASTER_PORT \
 python3 -u dist-train-flux-no-mpi.py

which should yield output something like:

Rank 3 of 4 has been initialized.
Rank 2 of 4 has been initialized.
Rank 1 of 4 has been initialized.
Rank 0 of 4 has been initialized.
ngpus per node 4
ngpus per node 4
Rank: 2 on host corona174 has local_rank: 2 cuda device 2
Rank: 3 on host corona174 has local_rank: 3 cuda device 3
ngpus per node 4
Rank 0 on node corona171
ngpus per node 4
Rank: 0 on host corona171 has local_rank: 0 cuda device 0
Rank: 1 on host corona171 has local_rank: 1 cuda device 1
Starting Training Loop...
Epoch 000 Loss 0.8557 Time 11.0630
Epoch 001 Loss 1.0896 Time 8.7586
Epoch 002 Loss 1.1438 Time 8.6855
Epoch 003 Loss 1.0489 Time 9.5175
Epoch 004 Loss 1.0703 Time 10.0351
Epoch 005 Loss 1.0977 Time 11.4351
Epoch 006 Loss 1.1734 Time 11.0008
Epoch 007 Loss 1.1849 Time 10.5212
Epoch 008 Loss 1.3266 Time 9.6513
Epoch 009 Loss 1.3028 Time 10.0834
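
Again, the actual script is not reproduced here; a minimal sketch of the same pattern (rank setup from Flux environment variables, then a DistributedDataParallel training loop) might look like the following, where the model, data, and hyperparameters are placeholders rather than what dist-train-flux-no-mpi.py really uses:

# Hypothetical sketch of dist-train-flux-no-mpi.py; model and data are placeholders.
import os
import socket
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ["FLUX_TASK_RANK"])
world_size = int(os.environ["FLUX_JOB_SIZE"])
local_rank = int(os.environ["FLUX_TASK_LOCAL_ID"])

dist.init_process_group("nccl", init_method="env://",
                        rank=rank, world_size=world_size)
print(f"Rank {rank} of {world_size} has been initialized.")
print("ngpus per node", torch.cuda.device_count())
torch.cuda.set_device(local_rank)
print(f"Rank: {rank} on host {socket.gethostname()} "
      f"has local_rank: {local_rank} cuda device {local_rank}")

# Wrap a toy model in DDP; gradients are all-reduced across ranks in backward().
model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

if rank == 0:
    print("Starting Training Loop...")
for epoch in range(10):
    start = time.time()
    for _ in range(100):  # synthetic batches standing in for a real dataset
        x = torch.randn(64, 1024, device="cuda")
        y = torch.randn(64, 1024, device="cuda")
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if rank == 0:
        print(f"Epoch {epoch:03d} Loss {loss.item():.4f} Time {time.time() - start:.4f}")

dist.destroy_process_group()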

Launching a Jupyter Notebook

To make your virtual environment (with torch installed) available via a Jupyter Notebook, first install ipykernel via

pip install ipykernel

Then, run the following command:

python -m ipykernel install --prefix=$HOME/.local/ --name 'coronatorchkernel' --display-name 'Corona Torch kernel'

The --name and --display-name arguments can be customized to meet your own needs. For most personal-use installations, these are the only commands you should need to run to make the torch kernel available to a Jupyter Notebook.
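
To confirm the kernel was registered, you can list the installed kernelspecs (the jupyter command is pulled in by ipykernel's dependencies) and look for your kernel name:

jupyter kernelspec list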

Additional details and setup information can be found on the Orbit and Jupyter Notebooks page.