These examples walk through building a C++ code that uses HIP to target AMD GPUs. The examples are based on a Rush Larsen kernel, an algorithm for solving differential equations, ported to run on a GPU. We walk through both a basic serial example and an MPI-enabled version of the same code.

Source Files

You can download the source files and try these steps out for yourself. The code here is taken from the LLNL Goulash Project.

Programming for AMD GPUs with HIP

HIP is a high-level GPU programming framework. The same HIP source can be compiled for both AMD (ROCm) and NVIDIA (CUDA) devices.
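
The HIP runtime API closely mirrors the CUDA runtime API (hipMalloc vs. cudaMalloc, and so on). As a minimal, self-contained sketch (not part of the Goulash code), the following program lists the GPUs visible to the HIP runtime:

#include <cstdio>
#include <hip/hip_runtime.h>

int main()
{
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        fprintf(stderr, "No HIP devices visible\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess)
            printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}

This small test compiles with any of the toolchains shown in the Compiling section below.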

Rush Larsen Kernel

Some useful error checking (rush_larsen_gpu_hip.cc:126):

#define HIPCHECK(x) {hipError_t err=x; if (err != hipSuccess) punt("ERROR HIP failed: " rushxstr(x) ": %s", hipGetErrorString(err));}
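
punt() and rushxstr() are helpers defined elsewhere in the Goulash source (punt() prints a message and aborts; rushxstr() stringifies the macro argument). For readers adapting this pattern to their own code, a self-contained sketch using only standard facilities might look like:

/* Hypothetical stand-alone variant of HIPCHECK; not the Goulash version */
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

#define HIP_CHECK(call)                                                      \
    do {                                                                     \
        hipError_t err_ = (call);                                            \
        if (err_ != hipSuccess) {                                            \
            fprintf(stderr, "ERROR HIP failed: %s at %s:%d: %s\n",           \
                    #call, __FILE__, __LINE__, hipGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)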

HIP kernels can only be written in C++ (rush_larsen_gpu_hip.cc:457):

STATIC __global__ void rush_larsen_gpu_kernel(double* m_gate, const long nCells, const double* Vm)
{
    long ii = blockIdx.x*blockDim.x + threadIdx.x;
    if (ii >= nCells) { return; }  /* guard: threads past the end of the array do nothing */
 
    /* Identical contents to the loop body below */
    double sum1,sum2;
    const double x = Vm[ii];
    const int Mhu_l = 10;
    const int Mhu_m = 5;
    const double Mhu_a[] = { 9.9632117206253790e-01,  4.0825738726469545e-02,  6.3401613233199589e-04,
                             4.4158436861700431e-06,  1.1622058324043520e-08,  1.0000000000000000e+00,
                             4.0568375699663400e-02,  6.4216825832642788e-04,  4.2661664422410096e-06,
                             1.3559930396321903e-08, -1.3573468728873069e-11, -4.2594802366702580e-13,
                             7.6779952208246166e-15,  1.4260675804433780e-16, -2.6656212072499249e-18};
 
    sum1 = 0;
    for (int j = Mhu_m-1; j >= 0; j--)
        sum1 = Mhu_a[j] + x*sum1;
    sum2 = 0;
    int k = Mhu_m + Mhu_l - 1;
    for (int j = k; j >= Mhu_m; j--)
        sum2 = Mhu_a[j] + x * sum2;
    double mhu = sum1/sum2;
 
    const int Tau_m = 18;
    const double Tau_a[] = {1.7765862602413648e+01*0.02,  5.0010202770602419e-02*0.02, -7.8002064070783474e-04*0.02,
                           -6.9399661775931530e-05*0.02,  1.6936588308244311e-06*0.02,  5.4629017090963798e-07*0.02,
                           -1.3805420990037933e-08*0.02, -8.0678945216155694e-10*0.02,  1.6209833004622630e-11*0.02,
                            6.5130101230170358e-13*0.02, -6.9931705949674988e-15*0.02, -3.1161210504114690e-16*0.02,
                            5.0166191902609083e-19*0.02,  7.8608831661430381e-20*0.02,  4.3936315597226053e-22*0.02,
                           -7.0535966258003289e-24*0.02, -9.0473475495087118e-26*0.02, -2.9878427692323621e-28*0.02,
                            1.0000000000000000e+00};
 
    sum1 = 0;
    for (int j = Tau_m-1; j >= 0; j--)
        sum1 = Tau_a[j] + x*sum1;
    double tauR = sum1;
    m_gate[ii] += (mhu - m_gate[ii])*(1-exp(-tauR));
}
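
The "loop body below" mentioned in the kernel comment refers to the CPU version of this computation in the full source, which is not reproduced here. Conceptually, the kernel replaces a serial loop of this shape, with one GPU thread per iteration (sketch only):

/* Conceptual CPU equivalent of the kernel launch (illustrative sketch) */
for (long ii = 0; ii < nCells; ii++) {
    /* ...same body as rush_larsen_gpu_kernel: compute mhu and tauR from
       Vm[ii] and update m_gate[ii]... */
}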

Allocating Memory and Moving Data to the GPU

rush_larsen_gpu_hip.cc:570

double *gpu_m_gate, *gpu_Vm;
sync_starttime("Starting hipMalloc of GPU arrays\n");
HIPCHECK(hipMalloc(&gpu_m_gate, sizeof(double)*nCells));
HIPCHECK(hipMalloc(&gpu_Vm, sizeof(double)*nCells));
sync_endtime("Finished hipMalloc of GPU arrays\n");
 
transfer_starttime=sync_starttime("Starting hipMemcpy of CPU arrays to GPU arrays\n");
HIPCHECK(hipMemcpy(gpu_m_gate, m_gate, sizeof(double)*nCells, hipMemcpyHostToDevice));
HIPCHECK(hipMemcpy(gpu_Vm, Vm, sizeof(double)*nCells, hipMemcpyHostToDevice));
transfer_endtime=sync_endtime("Finished hipMemcpy of CPU arrays to GPU arrays\n");
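
The host arrays m_gate and Vm, and the cell count nCells, are allocated and initialized earlier in the source. A minimal host-side setup of the kind this snippet assumes might look like the following (illustrative sizes and initialization only, not the Goulash defaults):

/* Illustrative host-side setup; the Goulash source does its own
 * sizing and initialization of these arrays. */
#include <cstdlib>

long nCells = 1000000;
double *m_gate = (double *)calloc(nCells, sizeof(double)); /* gate state, zero-initialized */
double *Vm     = (double *)calloc(nCells, sizeof(double)); /* membrane voltage, zero-initialized */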

Launching a Kernel

rush_larsen_gpu_hip.cc:618

hipLaunchKernelGGL(rush_larsen_gpu_kernel, dim3(gridSize), dim3(blockSize), 0, 0, gpu_m_gate, nCells, gpu_Vm);
HIPCHECK(hipDeviceSynchronize());
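
The variables gridSize and blockSize are computed earlier in the source. A typical calculation, shown here with illustrative values rather than the exact ones the Goulash code uses, plus an optional launch-error check, would be:

/* Illustrative sizing; the Goulash source picks its own block size. */
int blockSize = 256;                                         /* threads per block */
int gridSize  = (int)((nCells + blockSize - 1) / blockSize); /* blocks needed to cover nCells */

hipLaunchKernelGGL(rush_larsen_gpu_kernel, dim3(gridSize), dim3(blockSize),
                   0 /* dynamic shared memory bytes */, 0 /* default stream */,
                   gpu_m_gate, nCells, gpu_Vm);
HIPCHECK(hipGetLastError());       /* catches errors from the launch itself */
HIPCHECK(hipDeviceSynchronize());  /* catches errors raised while the kernel runs */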

Moving Data from the GPU and Releasing GPU Memory

rush_larsen_gpu_hip.cc:636

/* Transfer GPU m_gate kernel memory to CPU kernel memory for data checks */
rank0_printf_timestamp("Starting hipMemcpy of GPU result array to CPU array\n");
HIPCHECK(hipMemcpy(m_gate, gpu_m_gate, sizeof(double)*nCells, hipMemcpyDeviceToHost));
rank0_printf_timestamp("Finished hipMemcpy of GPU result array to CPU array\n");
 
/* Do sanity and consistency checks on all of m_gate. Including cross-rank if in MPI mode.
 * Prints PASS or FAIL based on data check results
 * Returns fail count so can be returned to caller.
 */
fail_count = data_check(m_gate, iterations, kernel_mem_used, nCells);
 
/* Free kernel GPU Memory */
HIPCHECK(hipFree(gpu_Vm));
HIPCHECK(hipFree(gpu_m_gate));
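
data_check() is defined in the Goulash source and, in MPI mode, also performs cross-rank consistency checks. Purely as an illustration of what a local sanity check might look like (not the Goulash implementation):

/* Illustrative local sanity check only; not the Goulash data_check() */
#include <cmath>

long local_fail_count = 0;
for (long i = 0; i < nCells; i++) {
    if (!std::isfinite(m_gate[i]))   /* NaN or Inf indicates a broken kernel run */
        local_fail_count++;
}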

Compiling

It is highly recommended that users working with HIP do so on a backend (aka compute) node. There are known issues with HIP and the hipcc compiler that are most easily fixed by rebooting a node, which is much easier to do on a compute node than on a login node.

You can easily get your own compute node, reserved for 2 hours, with:

salloc -N 1 -t 120 -p pdev

or, using Flux:

flux --parent alloc --nodes=1  --queue=pdev  --time-limit=7200s

Using hipcc (part of ROCm)

hipcc is a Perl script, provided by ROCm, that eventually calls amdclang under the covers. CMake build systems are known to have issues with hipcc.

The following shows a basic compilation. First, the rocm module must be loaded to get access to hipcc. Note that the created executable (a.out) uses RUNPATH instead of RPATH. This can have unexpected consequences, particularly when different modules are loaded at run time.

$ module load rocm
$ hipcc rush_larsen_gpu_hip.cc
$ readelf -a ./a.out | grep PATH
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/rocm-6.2.0/lib/llvm/bin/../../../lib]

Using amdclang++ directly

First, users must load a rocmcc module to get access to the amdclang++ compiler. Here, an LC magic compiler is specified (see the LC Magic Modules Guide). The following flags and options must be passed to amdclang++ to get the desired behavior. The --offload-arch flag depends on the underlying GPU: El Capitan systems (MI300A) use gfx942, and the EAS3 systems (MI250X) use gfx90a.

$ ml rocmcc/6.1.2-magic
$ amdclang++  -DCOMPILERID=rocmcc-6.1.2 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.1.2/include -O3 -g --offload-arch=gfx942 -std=c++11 --rocm-path=/opt/rocm-6.1.2 -x hip -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -fhip-new-launch-api --driver-mode=g++ rush_larsen_gpu_hip.cc   -o rush_larsen_gpu_hip
$ readelf -a ./rush_larsen_gpu_hip | grep PATH
 0x000000000000000f (RPATH)              Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/bin/gcc-12/lib64:/opt/rocm-6.1.2/lib]

MPI using mpiamdclang++

Again, this uses the LC magic compilers. We first load the rocmcc module to get amdclang++, then ensure that the cray-mpich module is loaded for the mpiamdclang++ compiler. We pass all the same flags as above.

$ ml rocmcc/6.1.2-magic
$ module load cray-mpich
$ mpiamdclang++  -DCOMPILERID=rocmcc-6.1.2 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.1.2/include -O3 -g --offload-arch=gfx942 -std=c++11 --rocm-path=/opt/rocm-6.1.2 -x hip -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -fhip-new-launch-api --driver-mode=g++ rush_larsen_gpu_hip_mpi.cc   -o rush_larsen_gpu_hip_mpi
$ readelf -a  ./rush_larsen_gpu_hip_mpi | grep PATH
 0x000000000000000f (RPATH)              Library rpath: [/opt/cray/pe/mpich/8.1.30/ofi/amd/6.0/lib:/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib:/opt/rocm-6.1.2/llvm/lib:/opt/cray/pe/mpich/8.1.30/gtl/lib:/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/bin/gcc-12/lib64:/opt/rocm-6.1.2/lib]
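
When running the MPI-enabled example, each rank needs to end up on its own GPU, either through the job launcher's GPU binding or in the code itself. Purely as an illustrative sketch (not necessarily how rush_larsen_gpu_hip_mpi.cc does it), a common in-code pattern is:

/* Illustrative rank-to-GPU binding sketch; not taken from the Goulash source */
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (hipGetDeviceCount(&ndev) == hipSuccess && ndev > 0)
        hipSetDevice(rank % ndev);   /* round-robin ranks onto visible GPUs */

    /* ...allocate, launch, and check as in the serial example... */

    MPI_Finalize();
    return 0;
}

On multi-node jobs, the node-local rank (obtained, for example, via MPI_Comm_split_type with MPI_COMM_TYPE_SHARED) is usually the right index to use instead of the global rank.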

Using the Cray CCE compiler

The CCE compilers lag behind rocmcc and amdclang in features and performance. For the best-performing HIP code, we recommend using amdclang++ to compile C++ HIP code. The GPU architecture flag (--cuda-gpu-arch here) depends on the underlying GPU: El Capitan systems (MI300A) use gfx942, and the EAS3 systems (MI250X) use gfx90a.

$ ml cce/18.0.0-magic
$ crayCC  -DCOMPILERID=cce-18.0.0 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.1.2/include -O3 -g --cuda-gpu-arch=gfx942 -std=c++11 --rocm-path=/opt/rocm-6.1.2 -x hip rush_larsen_gpu_hip.cc   -o rush_larsen_gpu_hip

MPI Using mpicrayCC

Again, this uses the LC magic compilers. We first load the cce/*-magic module to get crayCC, then ensure that the cray-mpich module is loaded for the mpicrayCC compiler. We pass all the same flags as above.

$ ml cce/18.0.0-magic
$ module load cray-mpich
$ mpicrayCC  -DCOMPILERID=cce-18.0.0 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.1.2/include -O3 -g --cuda-gpu-arch=gfx942 -std=c++11 --rocm-path=/opt/rocm-6.1.2 -x hip rush_larsen_gpu_hip_mpi.cc   -o rush_larsen_gpu_hip_mpi
$ readelf -a  ./rush_larsen_gpu_hip_mpi | grep PATH
 0x000000000000000f (RPATH)              Library rpath: [/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64:/opt/cray/pe/cce/18.0.0/cce/x86_64/lib:/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib:/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib:/opt/cray/pe/mpich/8.1.30/gtl/lib:/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64:/opt/cray/pe/cce/18.0.0/cce-clang/x86_64/lib:/opt/cray/pe/cce/18.0.0/cce-clang/x86_64/lib]

Recommended Use of XPMEM and GTL Libraries

As of August 2024, we are recommending that users always link their application with -lxpmem and the GTL library. These recommended link modifications are done automatically with the -magic wrappers for cray-mpich/8.1.30 (and later), but can be turned off.

See additional details and documentation on the known issues page.

Compiling the example MPI program above with the magic wrappers for cray-mpich/8.1.30 now expands to the following (shown with -vvvv), including the necessary GPU libraries, since the GTL library needs them:

mpicrayCC -vvvv -DCOMPILERID=cce-18.0.0 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.1.2/include -O3 -g --cuda-gpu-arch=gfx942 -std=c++11 --rocm-path=/opt/rocm-6.1.2 -x hip rush_larsen_gpu_hip_mpi.cc -o rush_larsen_gpu_hip_mpi 

+ exec /opt/cray/pe/cce/18.0.0/bin/crayCC --gcc-toolchain=/opt/rh/gcc-toolset-12/root/usr -DCOMPILERID=cce-18.0.0 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.1.2/include -O3 -g --cuda-gpu-arch=gfx942 -std=c++11 --rocm-path=/opt/rocm-6.1.2 -x hip rush_larsen_gpu_hip_mpi.cc -o rush_larsen_gpu_hip_mpi -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -Wl,-rpath,/opt/cray/libfabric/2.1/lib64:/opt/cray/pe/pmi/6.1.15.6/lib:/opt/cray/pe/pals/1.2.12/lib -lxpmem -L/opt/cray/pe/mpich/8.1.30/gtl/lib -lmpi_gtl_hsa -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/gtl/lib -I/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include -L/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/lib -lmpi_cray -Wl,--disable-new-dtags -lgcc_s --craype-prepend-opt=-Wl,-rpath,/opt/rh/gcc-toolset-12/root/usr/lib64:/usr/tce/packages/tce-wrapper-drivers/gcc-12/lib64 -L/opt/rocm-6.1.2/hip/lib -L/opt/rocm-6.1.2/lib -L/opt/rocm-6.1.2/lib64 -Wl,-rpath,/opt/rocm-6.1.2/hip/lib:/opt/rocm-6.1.2/lib:/opt/rocm-6.1.2/lib64 -lamdhip64 -lhsakmt -lhsa-runtime64 -lamd_comgr