These examples walk through building a Fortran code that uses OpenMP offloading to target AMD GPUs. The examples here are based on the Rush Larsen algorithm, an algorithm for solving differential equations, run here on a GPU. We walk through both a basic serial example and an MPI-enabled version of the same code.

Note that HIP kernels (aka AMD GPU kernels) cannot be written in Fortran. However, the rest of the HIP API (memory allocation, kernel launches, etc.) can be called from Fortran code.
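
As a rough illustration (not part of the example code), a device allocation can be made from Fortran by binding to the HIP C API through ISO_C_BINDING. The interface block below is a hand-written sketch; projects such as hipfort provide ready-made bindings, and linking against the HIP runtime (e.g., -lamdhip64) is required.

! Hypothetical sketch: calling hipMalloc/hipFree from Fortran via ISO_C_BINDING
program hip_from_fortran
  use iso_c_binding
  implicit none

  interface
     ! hipError_t hipMalloc(void** ptr, size_t size)
     function hipMalloc(ptr, nbytes) bind(c, name="hipMalloc") result(ierr)
       use iso_c_binding
       type(c_ptr)              :: ptr     ! void**: passed by reference
       integer(c_size_t), value :: nbytes
       integer(c_int)           :: ierr
     end function hipMalloc

     ! hipError_t hipFree(void* ptr)
     function hipFree(ptr) bind(c, name="hipFree") result(ierr)
       use iso_c_binding
       type(c_ptr), value :: ptr           ! void*: passed by value
       integer(c_int)     :: ierr
     end function hipFree
  end interface

  type(c_ptr)    :: d_buf = c_null_ptr
  integer(c_int) :: ierr

  ! Allocate room for 1024 double precision values on the default GPU
  ierr = hipMalloc(d_buf, int(1024, c_size_t) * c_sizeof(0.0_c_double))
  if (ierr /= 0) stop 'hipMalloc failed'
  ierr = hipFree(d_buf)
end program hip_from_fortran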

Source Files

You can download the source files and try these steps out for yourself. The code here is taken from the LLNL Goulash Project.

Programming for AMD GPUs with OpenMP

Although the HIP API can be called from Fortran codes, all HIP kernels must be written in C++. OpenMP target offloading is therefore the way to effectively use GPUs from Fortran codes.
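
As a minimal, self-contained sketch of the pattern used throughout this example (this program is not taken from the Goulash source), a loop can be offloaded with a combined target teams distribute parallel do construct, letting map clauses handle the data movement:

! Minimal sketch: offload a simple loop to the GPU with OpenMP
program omp_offload_sketch
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real(8) :: a
  real(8), allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  a = 2.0d0
  x = 1.0d0
  y = 0.0d0

  ! Copy x to the GPU, run the loop there, and copy y back when done
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
     y(i) = y(i) + a*x(i)
  end do
  !$omp end target teams distribute parallel do

  print *, 'y(1) =', y(1)   ! expect 2.0
end program omp_offload_sketch

The Rush Larsen example below instead maps its arrays once with target enter data and frees them later with target exit data, keeping them resident on the GPU in between.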

Verify GPU works with OpenMP

rush_larsen_gpu_omp_fort.F90:554

! If using OpenMP offloading, make sure GPU works before doing test
subroutine verify_gpu_openmp(gpu_id)
  use omp_lib
  integer, intent(in) :: gpu_id
 
  character(50) :: mpi_desc=""
 
  ! If using GPU, make sure OpenMP GPU offloading works before doing test
  integer:: runningOnGPU
 
  if (rank == 0) then
     call get_timestamp_string(timestamp)
     print '(a," Selecting GPU ",i0, " as default device",a)', trim(timestamp), gpu_id, trim(mpi_desc)
     flush(stdout)
  end if
 
  ! Pick GPU to use to exercise selection call
  call omp_set_default_device(gpu_id)
 
  if (rank == 0) then
     call get_timestamp_string(timestamp)
     print '(a," Launching OpenMP GPU test kernel",a)', trim(timestamp), trim(mpi_desc)
     flush(stdout)
  end if
 
  ! Test if GPU is available using OpenMP 4.5 legal code
  runningOnGPU = 0
  !$omp target map(from:runningOnGPU)
  if (.not. omp_is_initial_device()) then
     runningOnGPU = 1
  else
     runningOnGPU = 2
  end if
  !$omp end target
 
  ! If still running on CPU, GPU must not be available, punt
  if (runningOnGPU .ne. 1) then
     call get_timestamp_string(timestamp)
     print '(a," ", a, i0," ",a)', trim(timestamp), &
          & "ERROR: OpenMP GPU test kernel did NOT run on GPU ", gpu_id, trim(variant_desc)
     flush(stdout)
     call die()
  end if
 
  if (rank == 0) then
     call get_timestamp_string(timestamp)
     print '(a," Verified OpenMP target test kernel ran on GPU",a)', trim(timestamp), trim(mpi_desc)
     flush(stdout)
  end if
end subroutine verify_gpu_openmp

Map CPU data to GPUs with OpenMP

rush_larsen_gpu_omp_fort.F90:284

!$omp target enter data map(to: m_gate(0:nCells-1))
!$omp target enter data map(to: Vm(0:nCells-1))
!$omp target enter data map(to: Mhu_a(0:14))
!$omp target enter data map(to: Tau_a(0:18))

OpenMP Kernel Execution

rush_larsen_gpu_omp_fort.F90:333

! Target GPU with OpenMP, data already mapped to GPU
!$omp target teams distribute parallel do simd private(ii,x,sum1,j,sum2,k,mhu,tauR)
do ii=0,nCells-1
   x = Vm(ii)
   ! Evaluate the numerator polynomial of mhu using Horner's method
   sum1 = 0.0
   do j = Mhu_m-1, 0, -1
      sum1 = Mhu_a(j) + x*sum1
   end do

   ! Evaluate the denominator polynomial of mhu using Horner's method
   sum2 = 0.0
   k = Mhu_m + Mhu_l - 1
   do j = k, Mhu_m, -1
      sum2 = Mhu_a(j) + x * sum2
   end do
   mhu = sum1/sum2

   ! Evaluate the tauR polynomial using Horner's method
   sum1 = 0.0
   do j = Tau_m-1, 0, -1
      sum1 = Tau_a(j) + x*sum1
   end do
   tauR = sum1

   ! Rush Larsen update of the gate variable
   m_gate(ii) = m_gate(ii) + (mhu - m_gate(ii))*(1-exp(-tauR))
end do
! End Target GPU with OpenMP, data already mapped to GPU
!$omp end target teams distribute parallel do simd

Free GPU Memory

rush_larsen_gpu_omp_fort.F90:400

! Free kernel GPU memory
!$omp target exit data map(delete: m_gate(0:nCells-1))
!$omp target exit data map(delete: Vm(0:nCells-1))
!$omp target exit data map(delete: Mhu_a(0:14))
!$omp target exit data map(delete: Tau_a(0:18))
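
Putting the mapping, kernel, and cleanup pieces together, here is a minimal self-contained sketch (not from the example code) of the same unstructured data-mapping lifecycle. Note that map(delete:) frees the device storage without copying results back, so this sketch fetches the result with a target update first:

! Minimal sketch: enter data / compute / exit data lifecycle for one array
program enter_exit_data_sketch
  implicit none
  integer, parameter :: n = 1024
  integer :: i
  real(8), allocatable :: a(:)

  allocate(a(n))
  a = 1.0d0

  ! Allocate device storage and copy the host data to the GPU
  !$omp target enter data map(to: a(1:n))

  ! Kernel operates on the already-mapped device copy
  !$omp target teams distribute parallel do
  do i = 1, n
     a(i) = 2.0d0*a(i)
  end do
  !$omp end target teams distribute parallel do

  ! Copy the result back to the host, then free the device storage
  !$omp target update from(a(1:n))
  !$omp target exit data map(delete: a(1:n))

  print *, 'a(1) =', a(1)   ! expect 2.0
end program enter_exit_data_sketch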

Compiling

It is highly recommended that users working with GPUs do so on a backend (aka compute) node. There are known issues with running GPU codes that are most easily fixed by rebooting a node, which is much easier to do with a compute node than with a login node.

You can easily get your own compute node, reserved for 2 hours, with:

salloc -N 1 -t 120 -p pdev

or, using Flux:

flux --parent alloc --nodes=1  --queue=pdev  --time-limit=7200s

Using crayftn magic module

This example relies on the compiler as provided by the LC magic modules (see TODO LINK LC Magic Modules Guide). The accelerator target flag depends on the underlying GPU: for El Capitan (MI300A), use -haccel=amd_gfx942, and for the EAS3 systems (MI250X), use -haccel=amd_gfx90a.

$ module load cce/18.0.0-magic
$ crayftn  '-DCOMPILERID="cce-18.0.0"'  -O3 -g -fopenmp -haccel=amd_gfx942 rush_larsen_gpu_omp_fort.F90   -o rush_larsen_gpu_omp_fort
$ readelf -a rush_larsen_gpu_omp_fort | grep PATH
 0x000000000000000f (RPATH)              Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib]

Using MPI with crayftn magic module

This example relies on the compiler as provided by the LC magic modules (see TODO LINK LC Magic Modules Guide). We pass all the same flags as above.

$ module load cce/18.0.0-magic
$ module load cray-mpich # Usually already loaded by default
$ mpicrayftn  '-DCOMPILERID="cce-18.0.0"'  -O3 -g -fopenmp -haccel=amd_gfx942 rush_larsen_gpu_omp_mpi_fort.F90   -o rush_larsen_gpu_omp_mpi_fort
$ readelf -a rush_larsen_gpu_omp_mpi_fort | grep PATH
 0x000000000000000f (RPATH)              Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib:/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib:/opt/cray/pe/mpich/8.1.30/gtl/lib:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib]

Recommended Use of XPMEM and GTL Libraries

As of August 2024, we are recommending that users always link their application with -lxpmem and the GTL library. These recommended link modifications are done automatically with the -magic wrappers for cray-mpich/8.1.30 (and later), but can be turned off.

See additional details and documentation on the known issues page.

Compiling the above example MPI program with the magic wrappers for 8.1.30 now expands to the following (shown with -vvvv), including the necessary GPU libraries, since the GTL library needs them:

mpicrayftn -vvvv '-DCOMPILERID="cce-18.0.0"'  -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_mpi_fort.F90   -o rush_larsen_gpu_omp_mpi_fort
 
+ exec /opt/cray/pe/cce/18.0.0/bin/crayftn '-DCOMPILERID="cce-18.0.0"' -O3 -g -fopenmp -haccel=amd_gfx90a rush_larsen_gpu_omp_mpi_fort.F90 -o rush_larsen_gpu_omp_mpi_fort -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -Wl,-rpath,/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib -lxpmem -L/opt/cray/pe/mpich/8.1.30/gtl/lib -lmpi_gtl_hsa -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/gtl/lib -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -L/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -lmpifort_cray -lmpi_cray -Wl,--disable-new-dtags --craype-prepend-opt=-Wl,-rpath,/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64 -L/opt/rocm-6.1.2/hip/lib -L/opt/rocm-6.1.2/lib -L/opt/rocm-6.1.2/lib64 -Wl,-rpath,/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64 -lamdhip64 -lhsakmt -lhsa-runtime64 -lamd_comgr