These examples walk through building a Fortran code that uses OpenMP offloading to target AMD GPUs. The examples are based on a Rush Larsen algorithm, an algorithm for solving differential equations on a GPU. We walk through both a basic serial example and an MPI-enabled version of the same code.
Note that HIP kernels (i.e., AMD GPU kernels) cannot be written in Fortran. However, the rest of the HIP API (memory allocation, launching kernels, etc.) can be called from Fortran code.
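As a minimal sketch (not part of the Rush Larsen example), the HIP C API can be bound from Fortran with ISO_C_BINDING. The interfaces below assume the C signatures hipError_t hipMalloc(void**, size_t) and hipError_t hipFree(void*); in practice the hipfort package provides ready-made Fortran interfaces like these, and linking requires the HIP runtime library (e.g. -lamdhip64, visible in the compile expansion shown later on this page).
program hip_from_fortran
  use iso_c_binding
  implicit none

  interface
     ! hipError_t hipMalloc(void **ptr, size_t sizeBytes)
     function hipMalloc(ptr, sizeBytes) bind(c, name="hipMalloc") result(ierr)
       use iso_c_binding, only: c_ptr, c_size_t, c_int
       type(c_ptr), intent(out) :: ptr
       integer(c_size_t), value :: sizeBytes
       integer(c_int) :: ierr
     end function hipMalloc

     ! hipError_t hipFree(void *ptr)
     function hipFree(ptr) bind(c, name="hipFree") result(ierr)
       use iso_c_binding, only: c_ptr, c_int
       type(c_ptr), value :: ptr
       integer(c_int) :: ierr
     end function hipFree
  end interface

  type(c_ptr) :: d_buf
  integer(c_int) :: ierr

  ! Allocate room for 1024 doubles on the default GPU, then free it
  ierr = hipMalloc(d_buf, int(1024, c_size_t) * 8_c_size_t)
  if (ierr /= 0) stop "hipMalloc failed"
  ierr = hipFree(d_buf)
end program hip_from_fortran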
Source Files
You can download the source files and try these steps out for yourself. The code here is taken from the LLNL Goulash Project.
Programming for AMD GPUs with OpenMP
Although the HIP API can be called from Fortran codes, all HIP kernels must be written in C++. Thus, OpenMP offloading is the way to effectively use AMD GPUs from Fortran codes.
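For orientation, here is a minimal, self-contained sketch of Fortran OpenMP offloading (not taken from the Rush Larsen source; the array names and size are illustrative). The map clauses move the data to the GPU and the target teams distribute parallel do construct runs the loop there.
program omp_offload_demo
  implicit none
  integer, parameter :: n = 100000
  real(8) :: x(n), y(n)
  real(8) :: a
  integer :: i

  a = 2.0d0
  x = 1.0d0
  y = 3.0d0

  ! Copy x and y to the GPU, run the loop on the device, copy y back
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
     y(i) = a*x(i) + y(i)
  end do
  !$omp end target teams distribute parallel do

  print *, "y(1) = ", y(1)   ! expect 5.0
end program omp_offload_demo
The Rush Larsen example below follows the same pattern but keeps its arrays resident on the GPU across iterations using target enter data and target exit data, as shown in the following sections.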
Verify GPU works with OpenMP
rush_larsen_gpu_omp_fort.F90:554
! If using OpenMP offloading, make sure GPU works before doing test
subroutine verify_gpu_openmp(gpu_id)
  use omp_lib
  integer, intent(in) :: gpu_id
  character(50) :: mpi_desc=""

  ! If using GPU, make sure GPU OpenMP gpu offloading works before doing test
  integer :: runningOnGPU

  if (rank == 0) then
     call get_timestamp_string(timestamp)
     print '(a," Selecting GPU ",i0, " as default device",a)', trim(timestamp), gpu_id, trim(mpi_desc)
     flush(stdout)
  end if

  ! Pick GPU to use to exercise selection call
  call omp_set_default_device(gpu_id)

  if (rank == 0) then
     call get_timestamp_string(timestamp)
     print '(a," Launching OpenMP GPU test kernel",a)', trim(timestamp), trim(mpi_desc)
     flush(stdout)
  end if

  ! Test if GPU is available using OpenMP4.5 legal code
  runningOnGPU = 0
  !$omp target map(from:runningOnGPU)
  if (.not. omp_is_initial_device()) then
     runningOnGPU = 1
  else
     runningOnGPU = 2
  end if
  !$omp end target

  ! If still running on CPU, GPU must not be available, punt
  if (runningOnGPU .ne. 1) then
     call get_timestamp_string(timestamp)
     print '(a," ", a, i0," ",a)', trim(timestamp), &
          & "ERROR: OpenMP GPU test kernel did NOT run on GPU ", gpu_id, trim(variant_desc)
     flush(stdout)
     call die()
  end if

  if (rank == 0) then
     call get_timestamp_string(timestamp)
     print '(a," Verified OpenMP target test kernel ran on GPU",a)', trim(timestamp), trim(mpi_desc)
     flush(stdout)
  end if
end subroutine verify_gpu_openmp
Map CPU data to GPUs with OpenMP
rush_larsen_gpu_omp_fort.F90:284
!$omp target enter data map(to: m_gate(0:nCells-1))
!$omp target enter data map(to: Vm(0:nCells-1))
!$omp target enter data map(to: Mhu_a(0:14))
!$omp target enter data map(to: Tau_a(0:18))
OpenMP Kernel Execution
rush_larsen_gpu_omp_fort.F90:333
! Target GPU with OpenMP, data already mapped to GPU
!$omp target teams distribute parallel do simd private(ii,x,sum1,j,sum2,k,mhu,tauR)
do ii=0,nCells-1
   x = Vm(ii)

   sum1 = 0.0
   do j = Mhu_m-1, 0, -1
      sum1 = Mhu_a(j) + x*sum1
   end do

   sum2 = 0.0
   k = Mhu_m + Mhu_l - 1
   do j = k, Mhu_m, -1
      sum2 = Mhu_a(j) + x * sum2
   end do
   mhu = sum1/sum2

   sum1 = 0.0
   do j = Tau_m-1, 0, -1
      sum1 = Tau_a(j) + x*sum1
   end do
   tauR = sum1

   m_gate(ii) = m_gate(ii) + (mhu - m_gate(ii))*(1-exp(-tauR))
end do
! End Target GPU with OpenMP, data already mapped to GPU
!$omp end target teams distribute parallel do simd
Free GPU Memory
rush_larsen_gpu_omp_fort.F90:400
! Free kernel GPU memory
!$omp target exit data map(delete: m_gate(0:nCells-1))
!$omp target exit data map(delete: Vm(0:nCells-1))
!$omp target exit data map(delete: Mhu_a(0:14))
!$omp target exit data map(delete: Tau_a(0:18))
Compiling
It is highly recommended that users working with GPUs do so on a backend (aka compute) node. There are known issues with running GPU codes that are most easily fixed by rebooting a node, and rebooting a compute node is much easier than rebooting a login node.
You can easily get your own compute node, reserved for 2 hours, with:
flux alloc -N 1 -t 2h -q pdev
or using the Slurm wrapper:
salloc -N 1 -t 120 -p pdev
Using crayftn magic module
This example relies on the compiler as provided by the LC magic modules (see LC Magic Modules Guide). The GPU target passed via -haccel depends on the underlying GPU: for El Cap (MI300A), use amd_gfx942, and for the EAS3 systems (MI250X), use amd_gfx90a.
$ module load cce/18.0.0-magic
$ crayftn -O3 -g -fopenmp -haccel=amd_gfx942 rush_larsen_gpu_omp_fort.F90 -o rush_larsen_gpu_omp_fort
$ readelf -a rush_larsen_gpu_omp_fort | grep PATH
 0x000000000000000f (RPATH)            Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib]
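On the EAS3 (MI250X) systems, the same build should work with the gfx90a target instead; a hedged variant assuming the identical module environment:
$ module load cce/18.0.0-magic
$ crayftn -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_fort.F90 -o rush_larsen_gpu_omp_fort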
Using MPI with crayftn magic module
This example relies on the compiler as provided by the LC magic modules (see LC Magic Modules Guide). We pass all the same flags as above.
$ module load cce/18.0.0-magic
$ module load cray-mpich   # Usually already loaded by default
$ mpicrayftn -O3 -g -fopenmp -haccel=amd_gfx942 rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort
$ readelf -a rush_larsen_gpu_omp_mpi_fort | grep PATH
 0x000000000000000f (RPATH)            Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib:/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib:/opt/cray/pe/mpich/8.1.30/gtl/lib:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib]
Recommended Use of XPMEM and GTL Libraries
As of August 2024, we are recommending that users always link their application with -lxpmem and the GTL library. These recommended link modifications are done automatically with the -magic wrappers for cray-mpich/8.1.30 (and later), but can be turned off.
See additional details and documentation on the known issues page.
A compile of the above example MPI program with the magic wrappers for 8.1.30 now expands to the following (-vvvv shows this), including adding the necessary GPU libraries since the GTL library needs them:
mpicrayftn -vvvv -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort
+ exec /opt/cray/pe/cce/18.0.0/bin/crayftn '-DCOMPILERID="cce-18.0.0"' -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -Wl,-rpath,/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib -lxpmem -L/opt/cray/pe/mpich/8.1.30/gtl/lib -lmpi_gtl_hsa -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/gtl/lib -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -L/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -lmpifort_cray -lmpi_cray -Wl,--disable-new-dtags --craype-prepend-opt=-Wl,-rpath,/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64 -L/opt/rocm-6.1.2/hip/lib -L/opt/rocm-6.1.2/lib -L/opt/rocm-6.1.2/lib64 -Wl,-rpath,/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64 -lamdhip64 -lhsakmt -lhsa-runtime64 -lamd_comgr
