Please see the Pro-tips page for guidance on working with the MI300A nodes and getting the best performance when managing memory allocations.

Terminology

ROCm is a software stack for programming on AMD GPUs. Essentially, ROCm is the software library for the GPU device. Users program in a higher-level programming framework (such as HIP or OpenMP), which a compiler then translates into a ROCm-compatible executable. Tools for working with ROCm-compatible devices/applications include rocprof, roctracer, and rocm-smi, as well as a debugging tool and an assembler/disassembler.

HIP is a high-level programming framework. It can be used to program for both ROCm and CUDA devices. ROCm provides a compiler wrapper called hipcc (similar to NVIDIA's nvcc).
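For orientation, the following is a minimal vector-add sketch in HIP (illustrative only; error checking is omitted for brevity), with the hipcc compile line shown as a comment:

// vecadd.hip.cpp -- compile with: hipcc vecadd.hip.cpp -o vecadd
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vecadd(const double* a, const double* b, double* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<double> ha(n, 1.0), hb(n, 2.0), hc(n, 0.0);

    // Allocate device buffers and copy the inputs over.
    double *da, *db, *dc;
    hipMalloc((void**)&da, n * sizeof(double));
    hipMalloc((void**)&db, n * sizeof(double));
    hipMalloc((void**)&dc, n * sizeof(double));
    hipMemcpy(da, ha.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(double), hipMemcpyHostToDevice);

    // Launch the kernel and copy the result back.
    const int threads = 256;
    vecadd<<<(n + threads - 1) / threads, threads>>>(da, db, dc, n);
    hipMemcpy(hc.data(), dc, n * sizeof(double), hipMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}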

Programming with HIP

Transitioning from CUDA

In general, users can replace the "cuda" prefix with "hip" in any CUDA runtime function to obtain the corresponding HIP function (for example, cudaMalloc becomes hipMalloc).
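The sketch below shows the HIP form of a few common CUDA runtime calls, with the CUDA equivalents noted in comments:

#include <hip/hip_runtime.h>   // CUDA equivalent: #include <cuda_runtime.h>

int main() {
    const size_t bytes = 1024 * sizeof(float);
    float* d_buf = nullptr;

    hipMalloc((void**)&d_buf, bytes);   // cudaMalloc((void**)&d_buf, bytes);
    hipMemset(d_buf, 0, bytes);         // cudaMemset(d_buf, 0, bytes);
    hipDeviceSynchronize();             // cudaDeviceSynchronize();
    hipFree(d_buf);                     // cudaFree(d_buf);
    return 0;
}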

Known Issue with hipGetLastError

You MUST Call hipGetLastError Immediately After Kernel Launches!

The current behavior of hipGetLastError and hipPeekAtLastError may lead to silent errors in user applications. Please see the known issue page for details on how this bug manifests and an example of correct code.
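The sketch below illustrates the recommended pattern (see the known issue page for the authoritative example): check hipGetLastError immediately after the launch, before making any other HIP runtime call that could overwrite the error state.

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    float* d_x = nullptr;
    hipMalloc((void**)&d_x, 64 * sizeof(float));

    scale<<<1, 64>>>(d_x);

    // Check for launch errors immediately, before any other HIP call.
    hipError_t err = hipGetLastError();
    if (err != hipSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", hipGetErrorString(err));
        return 1;
    }

    // Separately check for errors that occur during kernel execution.
    err = hipDeviceSynchronize();
    if (err != hipSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", hipGetErrorString(err));
        return 1;
    }

    hipFree(d_x);
    return 0;
}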

GPU-aware MPI

See the MPI section for instructions on enabling GPU-aware MPI. This allows users to pass GPU memory buffers directly to MPI calls.
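As a sketch, once GPU-aware MPI is enabled, a buffer allocated with hipMalloc can be handed straight to MPI (the example assumes 2 ranks and is built with the HIP- and MPI-aware compiler wrappers):

#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    double* d_buf = nullptr;
    hipMalloc((void**)&d_buf, n * sizeof(double));

    // The device pointer is passed directly; no staging through host memory.
    if (rank == 0) {
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(d_buf);
    MPI_Finalize();
    return 0;
}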

Additional Information

More information is available in the AMD ROCm Learning Center.

See the slides for Programming the AMD Instinct™ MI300 APU.

MI300A APU Nodes

Each node is divided into 4 sockets, each with:

  • 1 MI300A GPU,
  • 21 CPU cores available to users, with 3 cores reserved for the OS (2 hardware threads per core)
  • 128 GB of HBM3 memory (a single NUMA domain)

GPUs, CPUs, the OS, and RAM disks all share the same HBM3 memory in a unified address space. NOTE: Accessing memory on a different socket will negatively impact performance.
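Because the address space is unified, memory obtained from standard host allocators can in principle be passed directly to HIP kernels. The sketch below assumes that system-allocated-memory access is enabled on the node (this may require HSA_XNACK=1 depending on the configuration; see the Pro-tips page for guidance on managing memory allocations):

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale_in_place(double* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main() {
    const int n = 1 << 20;
    // Plain host allocation; on the MI300A's unified HBM3 the GPU can access
    // it directly (assumption: system-allocated memory access is enabled).
    std::vector<double> x(n, 1.0);

    scale_in_place<<<(n + 255) / 256, 256>>>(x.data(), n);
    hipDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    return 0;
}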

We recommend running 4 MPI ranks per node, with each rank bound to 1 GPU and to cores and memory from the same socket.

Partitioning Modes

The MI300A supports the following compute partitioning modes:

  • SPX (Single Partition X-celerator): All GPU XCDs are grouped as a single monolithic device.
  • TPX (Triple Partition X-celerator): The GPU complex is divided into three partitions, each containing two XCDs.
  • CPX (Core Partitioned X-celerator): Each of the six XCDs is treated as a separate logical device.

While SPX mode is the default for El Capitan systems, users can request TPX or CPX mode (with each APU presenting as 3 or 6 GPUs, respectively). Nodes can be requested in either of these modes using the flux options below:

#flux: --setattr=gpumode=TPX    #(or CPX)
#flux: --conf=resource.rediscover=true

The run command would be:

TPX mode: flux run -N 1 -x -n 12 -g1 ./example.out

CPX mode: flux run -N 1 -x -n 24 -g1 ./example.out

mpibind will ensure that each rank sees one of the partitions. rocm-smi can be used to confirm that the nodes are in the requested mode. 
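As an additional sanity check (a sketch), each rank can query the HIP runtime directly; with mpibind assigning one partition per rank, every rank should report a single visible device:

#include <hip/hip_runtime.h>
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    hipGetDeviceCount(&ndev);
    printf("rank %d sees %d HIP device(s)\n", rank, ndev);

    MPI_Finalize();
    return 0;
}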

More details can be found on the AMD MI300A documentation pages.