Corona is LC's only system with AMD MI50 GPUs, and PyTorch instructions for Corona differ from those for our other AMD GPU systems. For more info on MI250X or MI300A systems (Tioga, Tuolumne, RZAdams, RZVernal, El Capitan, and Tenaya), please see our PyTorch on AMD GPU Systems Quickstart Guide.
Installing PyTorch
We recommend that you install PyTorch to a virtual environment on Corona.
Recommended steps:
- Load either the python/3.11.5 or python/3.12.2 module.
- Create and activate a virtual environment based on this module:
- Use python3 -m venv <directory>
- Do not use --system-site-packages
- Install torch using the --index-url https://download.pytorch.org/whl/rocm6.3 flag
module load python/3.12.2
python3 -m venv env
source env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
Quick checks
To check that PyTorch is installed and that GPUs are visible, run the following command from the command line:
python -c 'import torch ; print(torch.rand(5, 3)) ; print("Torch Version", torch.__version__) ; print("GPU available:", torch.cuda.is_available())'
On a node with GPUs, output should look something like:
tensor([[0.4425, 0.5630, 0.2808],
[0.9111, 0.6407, 0.5350],
[0.2498, 0.1461, 0.2608],
[0.2932, 0.2515, 0.8291],
[0.5880, 0.4187, 0.7626]])
Torch Version 2.9.1+rocm6.3
GPU available: True
Multi-node tests
You can test your ability to run PyTorch on multiple nodes at once by grabbing a multi-node allocation:
flux alloc -N2 -n8 # You can add other flags like `-t10` or `--queue=pdebug`
Within the allocation, run broadcast-no-mpi.py via
MASTER_ADDR=$(flux hostlist --nth=0 --expand instance)
MASTER_PORT=23457 # any free port on MASTER_ADDR
flux run -N 2 --tasks-per-node=4 --exclusive \
    --env=MASTER_ADDR=$MASTER_ADDR \
    --env=MASTER_PORT=$MASTER_PORT \
    python3 -u broadcast-no-mpi.py
This should yield output like
Rank 0 on node corona171
Rank 4/8 on corona174: tensor = 28
Rank 7/8 on corona174: tensor = 28
Rank 6/8 on corona174: tensor = 28
Rank 5/8 on corona174: tensor = 28
Rank 3/8 on corona171: tensor = 28
Rank 2/8 on corona171: tensor = 28
Rank 1/8 on corona171: tensor = 28
Rank 0/8 on corona171: tensor = 28
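The broadcast-no-mpi.py script is not reproduced here, but a minimal sketch of what such a script might contain follows. This is an illustration rather than the script itself; it assumes Flux exports FLUX_TASK_RANK and FLUX_JOB_SIZE to each task, and it uses the gloo backend so that no MPI (and no GPU) is required for the broadcast:

import os
import socket
import torch
import torch.distributed as dist

# Rank and world size come from Flux's per-task environment variables (assumed)
rank = int(os.environ["FLUX_TASK_RANK"])
world_size = int(os.environ["FLUX_JOB_SIZE"])

# "env://" reads the MASTER_ADDR and MASTER_PORT set by flux run above;
# gloo broadcasts CPU tensors and needs neither MPI nor a GPU
dist.init_process_group("gloo", init_method="env://", rank=rank, world_size=world_size)

host = socket.gethostname()
if rank == 0:
    print(f"Rank 0 on node {host}")

# Rank 0 fills the tensor; the broadcast copies it to every other rank
t = torch.zeros(1)
if rank == 0:
    t += 28
dist.broadcast(t, src=0)
print(f"Rank {rank}/{world_size} on {host}: tensor = {int(t.item())}")

dist.destroy_process_group()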
Next, in the same allocation, try running dist-train-flux-no-mpi.py via
flux run -N 2 --tasks-per-node=2 --exclusive \
    --env=MASTER_ADDR=$MASTER_ADDR \
    --env=MASTER_PORT=$MASTER_PORT \
    python3 -u dist-train-flux-no-mpi.py
which should yield results something like
Rank 3 of 4 has been initialized.
Rank 2 of 4 has been initialized.
Rank 1 of 4 has been initialized.
Rank 0 of 4 has been initialized.
ngpus per node 4
ngpus per node 4
Rank: 2 on host corona174 has local_rank: 2 cuda device 2
Rank: 3 on host corona174 has local_rank: 3 cuda device 3
ngpus per node 4
Rank 0 on node corona171
ngpus per node 4
Rank: 0 on host corona171 has local_rank: 0 cuda device 0
Rank: 1 on host corona171 has local_rank: 1 cuda device 1
Starting Training Loop...
Epoch 000 Loss 0.8557 Time 11.0630
Epoch 001 Loss 1.0896 Time 8.7586
Epoch 002 Loss 1.1438 Time 8.6855
Epoch 003 Loss 1.0489 Time 9.5175
Epoch 004 Loss 1.0703 Time 10.0351
Epoch 005 Loss 1.0977 Time 11.4351
Epoch 006 Loss 1.1734 Time 11.0008
Epoch 007 Loss 1.1849 Time 10.5212
Epoch 008 Loss 1.3266 Time 9.6513
Epoch 009 Loss 1.3028 Time 10.0834
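Likewise, dist-train-flux-no-mpi.py is not reproduced here. The sketch below illustrates the general pattern the output above suggests: initialize a process group from Flux's environment, map each rank to a GPU, and train a DistributedDataParallel model. The rank-to-device mapping, the toy model, and the synthetic data are assumptions made for illustration, not the contents of the actual script:

import os
import socket
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ["FLUX_TASK_RANK"])
world_size = int(os.environ["FLUX_JOB_SIZE"])

# On ROCm builds of PyTorch, the "nccl" backend is provided by RCCL
dist.init_process_group("nccl", init_method="env://", rank=rank, world_size=world_size)
print(f"Rank {rank} of {world_size} has been initialized.")

ngpus = torch.cuda.device_count()
print("ngpus per node", ngpus)
local_rank = rank % ngpus  # one plausible rank-to-GPU mapping (assumed)
torch.cuda.set_device(local_rank)
print(f"Rank: {rank} on host {socket.gethostname()} has local_rank: {local_rank} cuda device {local_rank}")

# Toy model and synthetic data stand in for a real training problem
model = DDP(torch.nn.Linear(32, 1).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

if rank == 0:
    print("Starting Training Loop...")
for epoch in range(10):
    start = time.time()
    x = torch.randn(64, 32, device="cuda")
    y = torch.randn(64, 1, device="cuda")
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # backward() below also syncs gradients across ranks
    loss.backward()
    opt.step()
    if rank == 0:
        print(f"Epoch {epoch:03d} Loss {loss.item():.4f} Time {time.time() - start:.4f}")

dist.destroy_process_group()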
Launching a Jupyter Notebook
To make your virtual environment (with torch installed) available via a Jupyter Notebook, first install ipykernel via
pip install ipykernel
Then, run the following command:
python -m ipykernel install --prefix=$HOME/.local/ --name 'coronatorchkernel' --display-name 'Corona Torch kernel'
The --name and --display-name arguments can be customized to meet your own needs. For most personal-use installations, these are the only commands you should need to run to make the torch kernel available to a Jupyter Notebook.
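Once the kernel is registered, a quick notebook cell can confirm that it is using the torch build from your virtual environment. This is an optional sanity check; the exact paths and version string will depend on your install:

import sys
import torch
print(sys.executable)             # should point into your virtual environment
print(torch.__version__)          # e.g. 2.9.1+rocm6.3
print(torch.cuda.is_available())  # True only on a node with GPUs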
Additional details and setup information can be found on the Orbit and Jupyter Notebooks page.
