This page summarizes the known issues that users must be aware of when running on the El Capitan Systems.
- XPMEM and GTL Libraries required for performance
- Version compatibility issues with cce, cray-mpich, and rocm
- Incorrect behavior of hipGetLastError routine
- Known issues for debugging tools
Recommended Use of XPMEM and GTL Libraries
Issue known as of August 2024. Additional context can be found on the Compilers and User Environments and LC Magic Modules Guide pages.
As of August 2024, we recommend that users always link their applications with -lxpmem and with the GTL library, using something of the form -L/opt/cray/pe/mpich/8.1.30/gtl/lib -lmpi_gtl_hsa -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/gtl/lib (along with the appropriate rocm libraries for the GPU).
These recommended link modifications are done automatically with the -magic wrappers for cray-mpich/8.1.30 (and later), but they can be turned off by passing the --no-xpmem and --no-gtl flags to the magic MPI link lines.
These two libraries both accelerate MPI performance:
- The xpmem library greatly accelerates on-node MPI performance for CPU buffers.
- The mpi_gtl_hsa library accelerates on-node GPU-buffer MPI transfers when the MPI buffer is allocated with hipMalloc and the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The linked-in GTL library will not be used by MPI unless MPICH_GPU_SUPPORT_ENABLED is set to 1 (see the sketch below).
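For illustration, here is a minimal sketch of the GPU-aware MPI pattern described above: a buffer allocated with hipMalloc is passed directly to MPI calls. This is not taken from the Rush Larsen examples; it assumes the program was linked against xpmem and the GTL library as described above and is launched with MPICH_GPU_SUPPORT_ENABLED=1.

```cpp
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <vector>

// Minimal GPU-aware MPI sketch: rank 0 sends a hipMalloc'd buffer to rank 1.
// Run with MPICH_GPU_SUPPORT_ENABLED=1 so cray-mpich uses the GTL path.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    double *gpu_buf = nullptr;
    hipMalloc((void **)&gpu_buf, n * sizeof(double));  // device allocation -> GTL-eligible

    if (rank == 0) {
        std::vector<double> host(n, 1.0);
        hipMemcpy(gpu_buf, host.data(), n * sizeof(double), hipMemcpyHostToDevice);
        // The device pointer is passed directly to MPI.
        MPI_Send(gpu_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(gpu_buf);
    MPI_Finalize();
    return 0;
}
```

Even with the GTL library linked in, this pattern falls back to an unaccelerated path (or fails) unless MPICH_GPU_SUPPORT_ENABLED=1 is set in the environment at run time.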
Compile lines for the C++ Rush Larsen example MPI program and Fortran Rush Larsen example MPI program with the magic wrappers for 8.1.30 can be found on their respective pages.
Version Compatibility Issues with cce, cray-mpich, and rocm
Issue relevant as of August 2024. Additional context can be found on the Cray Modules Guide page.
HPE provides a very narrow statement of compatibility regarding which versions of cce, rocm, and cray-mpich are guaranteed to function and be performant. The following version combinations are officially supported:
| cce | cray-mpich | rocm |
|---|---|---|
| cce/18.0.0 | cray-mpich/8.1.30 | rocm/6.1.2, rocm/6.2.0 (expected compatibility) |
| cce/17.0.1 | cray-mpich/8.1.29 | rocm/6.0.3, rocm/6.1.2 (expected compatibility) |
Evidence of Incompatibility
Most often, incompatibility between versions will be seen in:
- Fortran compilation errors
- Poor performance of GPU-aware MPI
Fortran users must be particularly aware of the versions of cce and cray-mpich. With the wrong version combination, many errors appear at compile time, triggered by a 'use mpi' statement in Fortran code. For example:
```
    use mpi
        ^
ftn-1777 ftn: ERROR RUSH_LARSEN_GPU_OMP_MPI_FORT, File = rush_larsen_gpu_omp_mpi_fort.F90, Line = 127, Column = 7
  File "/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include/MPI.mod" contains modules and/or submodules.
  The compiler being used is older than the compiler that created this file.
  The file was created with version 119 from release 17.0.

... many many many more errors ...
```
Known Issue with hipGetLastError
You MUST Call hipGetLastError Immediately After Kernel Launches!
Issue relevant as of August 2024. Additional context can be found on the GPU Programming page.
The current behavior of hipGetLastError and hipPeekAtLastError is to check the status of only the immediately preceding HIP API call. This is in contrast to cudaGetLastError, which "returns the last error that has been produced by any of the runtime calls."
Example HIP behavior:
```
HIP call 1: hipSuccess
HIP call 2: hipError_t
HIP call 3: hipSuccess

hipGetLastError() returns hipSuccess
```
Silent errors may be occurring in many applications!
Example CUDA behavior:
```
CUDA call 1: cudaSuccess
CUDA call 2: cudaError
CUDA call 3: cudaSuccess

cudaGetLastError() returns cudaError
```
Thus, users are highly encouraged to call hipGetLastError immediately after every kernel launch.
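To illustrate the pitfall, the sketch below launches a kernel with an invalid configuration and then makes an unrelated HIP call before checking for errors; because hipGetLastError reflects only the immediately preceding call, the launch failure is lost. The kernel name is illustrative, and the specific error code reported for the bad launch may vary by ROCm version.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void demo_kernel() {}  // illustrative empty kernel

int main() {
    // Launch with far more threads per block than the device supports,
    // so the launch fails (e.g., with an "invalid configuration" error).
    demo_kernel<<<dim3(1), dim3(1 << 20)>>>();

    // An unrelated, successful HIP call made before checking for errors.
    int dev = 0;
    hipGetDevice(&dev);

    // Because hipGetLastError reflects only the immediately preceding call
    // (hipGetDevice above), the failed launch is reported as hipSuccess here.
    printf("status: %s\n", hipGetErrorString(hipGetLastError()));
    return 0;
}
```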
Correct Code Example
The following code snippet demonstrates the use of hipGetLastError immediately after a hipLaunchKernelGGL call.
```cpp
hipLaunchKernelGGL(rush_larsen_gpu_kernel, dim3(gridSize), dim3(blockSize),
                   0, 0, gpu_m_gate, nCells, gpu_Vm);
if (hipGetLastError() != hipSuccess) {
    ...
}
```
Known Issues for Debugging Tools
There are several known issues and workarounds documented on the debugging tools page. Current known issues are:
- Compatibility between HPE Cray debugging and flux
- Launching a gdb4hpc debugging session fails with ssh authentication failures
- Several LLVM Address Sanitizer (ASAN) issues