These examples walk through building a Fortran code that uses OpenMP to offload work to AMD GPUs. The examples are based on the Rush Larsen algorithm, a method for solving differential equations, here run on the GPU. We walk through both a basic serial example and an MPI-enabled version of the same code.
Note that HIP kernels (i.e., AMD GPU kernels) cannot be written in Fortran. However, the rest of the HIP API (memory allocation, kernel launches, etc.) can be called from Fortran code.
Source Files
You can download the source files and try these steps out for yourself. The code here is taken from the LLNL Goulash Project.
Programming for AMD GPUs with OpenMP
Although the HIP API can be called from Fortran code, all HIP kernels must be written in C++. Thus, OpenMP target offload is the most effective way to use the GPUs from Fortran code.
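If you do need to call the HIP API directly, for example to allocate device memory, it can be reached from Fortran through iso_c_binding interfaces (or through the hipfort bindings shipped with ROCm). Below is a minimal sketch, not part of the Goulash example, with hand-written interface declarations; it simply allocates and frees a device buffer.

! Minimal sketch (not part of the Goulash example): calling the HIP API from
! Fortran via hand-written iso_c_binding interfaces.  ROCm's hipfort package
! provides ready-made bindings if you prefer not to write these yourself.
program hip_from_fortran
   use iso_c_binding
   implicit none

   interface
      ! hipError_t hipMalloc(void **ptr, size_t size)
      function hipMalloc(ptr, nbytes) bind(c, name="hipMalloc") result(ierr)
         use iso_c_binding
         type(c_ptr), intent(out) :: ptr      ! receives the device pointer
         integer(c_size_t), value :: nbytes
         integer(c_int)           :: ierr
      end function hipMalloc

      ! hipError_t hipFree(void *ptr)
      function hipFree(ptr) bind(c, name="hipFree") result(ierr)
         use iso_c_binding
         type(c_ptr), value :: ptr
         integer(c_int)     :: ierr
      end function hipFree
   end interface

   type(c_ptr)    :: d_buf
   integer(c_int) :: ierr

   ! Allocate 1 MiB of device memory, then release it
   ierr = hipMalloc(d_buf, 1048576_c_size_t)
   if (ierr /= 0) stop "hipMalloc failed"
   ierr = hipFree(d_buf)
   if (ierr /= 0) stop "hipFree failed"
end program hip_from_fortran

When building something like this, link against the HIP runtime (for example, -lamdhip64 from the ROCm installation). The rest of this page uses OpenMP offload only.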
Verify GPU works with OpenMP
rush_larsen_gpu_omp_fort.F90:554
! If using OpenMP offloading, make sure GPU works before doing test
subroutine verify_gpu_openmp(gpu_id)
   use omp_lib
   integer, intent(in) :: gpu_id
   character(50) :: mpi_desc=""
   ! If using GPU, make sure GPU OpenMP gpu offloading works before doing test
   integer :: runningOnGPU

   if (rank == 0) then
      call get_timestamp_string(timestamp)
      print '(a," Selecting GPU ",i0, " as default device",a)', trim(timestamp), gpu_id, trim(mpi_desc)
      flush(stdout)
   end if

   ! Pick GPU to use to exercise selection call
   call omp_set_default_device(gpu_id)

   if (rank == 0) then
      call get_timestamp_string(timestamp)
      print '(a," Launching OpenMP GPU test kernel",a)', trim(timestamp), trim(mpi_desc)
      flush(stdout)
   end if

   ! Test if GPU is available using OpenMP4.5 legal code
   runningOnGPU = 0
   !$omp target map(from:runningOnGPU)
   if (.not. omp_is_initial_device()) then
      runningOnGPU = 1
   else
      runningOnGPU = 2
   end if
   !$omp end target

   ! If still running on CPU, GPU must not be available, punt
   if (runningOnGPU .ne. 1) then
      call get_timestamp_string(timestamp)
      print '(a," ", a, i0," ",a)', trim(timestamp), &
           & "ERROR: OpenMP GPU test kernel did NOT run on GPU ", gpu_id, trim(variant_desc)
      flush(stdout)
      call die()
   end if

   if (rank == 0) then
      call get_timestamp_string(timestamp)
      print '(a," Verified OpenMP target test kernel ran on GPU",a)', trim(timestamp), trim(mpi_desc)
      flush(stdout)
   end if
end subroutine verify_gpu_openmp
Map CPU data to GPUs with OpenMP
rush_larsen_gpu_omp_fort.F90:284
!$omp target enter data map(to: m_gate(0:nCells-1))
!$omp target enter data map(to: Vm(0:nCells-1))
!$omp target enter data map(to: Mhu_a(0:14))
!$omp target enter data map(to: Tau_a(0:18))
OpenMP Kernel Execution
rush_larsen_gpu_omp_fort.F90:333
! Target GPU with OpenMP, data already mapped to GPU
!$omp target teams distribute parallel do simd private(ii,x,sum1,j,sum2,k,mhu,tauR)
do ii=0,nCells-1
   x = Vm(ii)

   sum1 = 0.0
   do j = Mhu_m-1, 0, -1
      sum1 = Mhu_a(j) + x*sum1
   end do

   sum2 = 0.0
   k = Mhu_m + Mhu_l - 1
   do j = k, Mhu_m, -1
      sum2 = Mhu_a(j) + x * sum2
   end do
   mhu = sum1/sum2

   sum1 = 0.0
   do j = Tau_m-1, 0, -1
      sum1 = Tau_a(j) + x*sum1
   end do
   tauR = sum1

   m_gate(ii) = m_gate(ii) + (mhu - m_gate(ii))*(1-exp(-tauR))
end do
! End Target GPU with OpenMP, data already mapped to GPU
!$omp end target teams distribute parallel do simd
Free GPU Memory
rush_larsen_gpu_omp_fort.F90:400
! Free kernel GPU memory
!$omp target exit data map(delete: m_gate(0:nCells-1))
!$omp target exit data map(delete: Vm(0:nCells-1))
!$omp target exit data map(delete: Mhu_a(0:14))
!$omp target exit data map(delete: Tau_a(0:18))
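Because the benchmark only times the kernel, it deletes the device copies without reading the results back. In a real application you would typically copy the data you need back to the host first. A minimal sketch (not part of the Goulash source), using map(from:) so the device values are transferred back before the device storage is released:

! Sketch only: map(from:) on target exit data copies the device values of
! m_gate back to the host array and then releases the device storage.
! (An equivalent approach is "!$omp target update from(m_gate(0:nCells-1))"
! followed by the map(delete:) form shown above.)
!$omp target exit data map(from: m_gate(0:nCells-1))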
Compiling
It is highly recommended that users working with GPUs do so on a backend (aka compute) node. There are known issues with running GPU codes that are most easily fixed by rebooting the node, and rebooting a compute node is much easier than rebooting a login node.
You can easily get your own compute node, reserved for 2 hours, with:
salloc -N 1 -t 120 -p pdev
or, using Flux:
flux --parent alloc --nodes=1 --queue=pdev --time-limit=7200s
Using crayftn magic module
This example relies on the compiler as provided by the LC magic modules (see TODO LINK LC Magic Modules Guide). The offload architecture flag (the amd_gfx* value passed to -haccel) depends on the underlying GPU: for El Capitan (MI300A) use gfx942, and for the EAS3 systems (MI250X) use gfx90a.
$ module load cce/18.0.0-magic
$ crayftn '-DCOMPILERID="cce-18.0.0"' -O3 -g -fopenmp -haccel=amd_gfx942 rush_larsen_gpu_omp_fort.F90 -o rush_larsen_gpu_omp_fort
$ readelf -a rush_larsen_gpu_omp_fort | grep PATH
 0x000000000000000f (RPATH)             Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib]
Using MPI with crayftn magic module
This example relies on the compiler as provided by the LC magic modules (see TODO LINK LC Magic Modules Guide). We pass all the same flags as above.
$ module load cce/18.0.0-magic
$ module load cray-mpich   # Usually already loaded by default
$ mpicrayftn '-DCOMPILERID="cce-18.0.0"' -O3 -g -fopenmp -haccel=amd_gfx942 rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort
$ readelf -a rush_larsen_gpu_omp_mpi_fort | grep PATH
 0x000000000000000f (RPATH)             Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib:/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib:/opt/cray/pe/mpich/8.1.30/gtl/lib:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib]
Recommended Use of XPMEM and GTL Libraries
As of August 2024, we recommend that users always link their application with -lxpmem and the GTL library. These link modifications are done automatically by the -magic wrappers for cray-mpich/8.1.30 (and later), but can be turned off.
See additional details and documentation on the known issues page.
Compiling the above example MPI program with the magic wrappers for cray-mpich/8.1.30 now expands to the following (-vvvv shows this), including the GPU libraries, since the GTL library needs them:
$ mpicrayftn -vvvv '-DCOMPILERID="cce-18.0.0"' -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort
+ exec /opt/cray/pe/cce/18.0.0/bin/crayftn '-DCOMPILERID="cce-18.0.0"' -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -Wl,-rpath,/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib -lxpmem -L/opt/cray/pe/mpich/8.1.30/gtl/lib -lmpi_gtl_hsa -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/gtl/lib -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -L/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -lmpifort_cray -lmpi_cray -Wl,--disable-new-dtags --craype-prepend-opt=-Wl,-rpath,/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64 -L/opt/rocm-6.1.2/hip/lib -L/opt/rocm-6.1.2/lib -L/opt/rocm-6.1.2/lib64 -Wl,-rpath,/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64 -lamdhip64 -lhsakmt -lhsa-runtime64 -lamd_comgr