This page summarizes the known issues that users must be aware of when running on the El Capitan Systems.
- XPMEM and GTL Libraries required for performance
- Version compatibility issues with cce, cray-mpich, and rocm
- Incorrect behavior of hipGetLastError routine
- Known issues for debugging tools
Recommended Use of XPMEM and GTL Libraries
Issue known as of August 2024. Additional context can be found on the Compilers and User Environments and LC Magic Modules Guide pages.
As of August 2024, we recommend that users always link their applications with -lxpmem and with the GTL library, using something of the form -L/opt/cray/pe/mpich/8.1.30/gtl/lib -lmpi_gtl_hsa -Wl,-rpath,/opt/cray/pe/mpich/8.1.30/gtl/lib (along with the appropriate rocm libraries for the GPU).
These recommended link modifications are done automatically with the -magic wrappers for cray-mpich/8.1.30 (and later), but they can be turned off by passing the --no-xpmem and --no-gtl flags to the magic MPI link lines.
These two libraries both accelerate MPI performance:
- The xpmem library greatly accelerates on-node MPI performance for CPU buffers.
- The mpi_gtl_hsa library accelerates on-node GPU-buffer MPI transfers when the MPI buffer is allocated with hipMalloc and the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The linked-in GTL library will not be used by MPI unless MPICH_GPU_SUPPORT_ENABLED is set to 1 (see the sketch below).
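For illustration, here is a minimal sketch of the GPU-aware MPI pattern described above: a buffer allocated with hipMalloc is passed directly to MPI calls. This is not taken from the Rush Larsen examples; it assumes the program was linked against xpmem and the GTL library as described above and is launched with MPICH_GPU_SUPPORT_ENABLED=1.

```cpp
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <vector>

// Minimal GPU-aware MPI sketch: rank 0 sends a hipMalloc'd buffer to rank 1.
// Run with MPICH_GPU_SUPPORT_ENABLED=1 so cray-mpich uses the GTL path.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    double *gpu_buf = nullptr;
    hipMalloc((void **)&gpu_buf, n * sizeof(double));  // device allocation -> GTL-eligible

    if (rank == 0) {
        std::vector<double> host(n, 1.0);
        hipMemcpy(gpu_buf, host.data(), n * sizeof(double), hipMemcpyHostToDevice);
        // The device pointer is passed directly to MPI.
        MPI_Send(gpu_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(gpu_buf);
    MPI_Finalize();
    return 0;
}
```

Even with the GTL library linked in, this pattern falls back to an unaccelerated path (or fails) unless MPICH_GPU_SUPPORT_ENABLED=1 is set in the environment at run time.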
Compile lines for the C++ Rush Larsen example MPI program and Fortran Rush Larsen example MPI program with the magic wrappers for 8.1.30 can be found on their respective pages.
Version Compatibility Issues with cce, cray-mpich, and rocm
Issue relevant as of August 2024. Additional context can be found on the Cray Modules Guide page.
HPE provides a very narrow statement of compatibility regarding which versions of cce, rocm, and cray-mpich are guaranteed to function and be performant. The following version combinations are officially supported:
| cce | cray-mpich | rocm |
|---|---|---|
| cce/18.0.0 | cray-mpich/8.1.30 | rocm/6.1.2, rocm/6.2.0 (expected compatibility) |
| cce/17.0.1 | cray-mpich/8.1.29 | rocm/6.0.3, rocm/6.1.2 (expected compatibility) |
Evidence of Incompatibility
Most often, incompatibility between versions will be seen in:
- Fortran compilation errors
- Poor performance of GPU-aware MPI
Fortran users must be particularly aware of the versions of cce and cray-mpich. With the wrong version combination, many errors appear at compile time, triggered by a 'use mpi' statement in Fortran code. For example:
```
    use mpi
        ^
ftn-1777 ftn: ERROR RUSH_LARSEN_GPU_OMP_MPI_FORT, File = rush_larsen_gpu_omp_mpi_fort.F90, Line = 127, Column = 7
  File "/opt/cray/pe/mpich/8.1.30/ofi/cray/18.0/include/MPI.mod" contains modules and/or submodules.
  The compiler being used is older than the compiler that created this file.
  The file was created with version 119 from release 17.0.

... many many many more errors ...
```
Known Issue with hipGetLastError
You MUST Call hipGetLastError Immediately After Kernel Launches!
Issue relevant as of August 2024. Additional context can be found on the GPU Programming page.
The current behavior of hipGetLastError and hipPeekAtLastError is to check the status of only the immediately preceding HIP API call. This is in contrast to cudaGetLastError, which "returns the last error that has been produced by any of the runtime calls."
Example HIP behavior:
```
HIP call 1: hipSuccess
HIP call 2: hipError_t
HIP call 3: hipSuccess

hipGetLastError() returns hipSuccess
```
Silent errors may be occurring in many applications!
Example CUDA behavior:
```
CUDA call 1: cudaSuccess
CUDA call 2: cudaError
CUDA call 3: cudaSuccess

cudaGetLastError() returns cudaError
```
Thus, users are highly encouraged to call hipGetLastError immediately after every kernel launch.
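To illustrate the pitfall, the sketch below launches a kernel with an invalid configuration and then makes an unrelated HIP call before checking for errors; because hipGetLastError reflects only the immediately preceding call, the launch failure is lost. The kernel name is illustrative, and the specific error code reported for the bad launch may vary by ROCm version.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void demo_kernel() {}  // illustrative empty kernel

int main() {
    // Launch with far more threads per block than the device supports,
    // so the launch fails (e.g., with an "invalid configuration" error).
    demo_kernel<<<dim3(1), dim3(1 << 20)>>>();

    // An unrelated, successful HIP call made before checking for errors.
    int dev = 0;
    hipGetDevice(&dev);

    // Because hipGetLastError reflects only the immediately preceding call
    // (hipGetDevice above), the failed launch is reported as hipSuccess here.
    printf("status: %s\n", hipGetErrorString(hipGetLastError()));
    return 0;
}
```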
Correct Code Example
The following code snippet demonstrates the use of hipGetLastError immediately after a hipLaunchKernelGGL call.
```cpp
hipLaunchKernelGGL(rush_larsen_gpu_kernel, dim3(gridSize), dim3(blockSize),
                   0, 0, gpu_m_gate, nCells, gpu_Vm);
if (hipGetLastError() != hipSuccess) {
    ...
}
```
Known Issues for Debugging Tools
There are several known issues and workarounds documented on the debugging tools page. Current known issues are:
- Compatibility between HPE Cray debugging and flux
- Launching a gdb4hpc debugging session fails with ssh authentication failures
- Several LLVM Address Sanitizer (ASAN) issues