Using El Capitan Systems: Memory Management

There are many complications that can arise from incorrect memory allocations, particularly when sharing memory between CPU and GPU processes. The recommendations below have given most users the best performance outcomes.

Memory Limits

The MI300A APU consists of 128 GB of unified HBM3 to be used by OS services, runtimes, and user applications. The following are the recommended guidelines on how much memory is safe to use by user applications.

SPX mode (1 GPU / APU) - 118 GB / GPU
CPX mode (6 GPUs / APU) - 18 GB / GPU

Note that it is possible to exceed these soft limits by a few GB, but this is application-dependent, and doing so will increase the probability of triggering out-of-memory errors (OOM). Conversely, some applications may exhibit patterns where runtimes require more memory, and lower limits need to be considered.

Sharing CPU Memory Allocations

While CPUs and GPUs share a memory space, CPU-based memory allocations will not automatically map onto the GPU. By setting the environment variable HSA_XNACK=1, CPU pages will page-fault onto the GPU (with slight overhead).

Please note that the CPU defaults to a 4 KB page size, which can cause a 15% performance overhead on the GPU (due to the GPU TLB size).

Note also that for GPU-aware MPI to perform best in unified memory mode (HSA_XNACK=1), buffers passed to MPI, including in OpenMP offload clauses, should be allocated on the GPU with 2 MB huge pages.

Using Huge Pages

There are two main Linux kernel mechanisms for supporting allocations with huge pages:

Transparent Huge Pages (THP): The Linux kernel automatically and dynamically manages the use of huge pages for large, contiguous memory areas.
HugeTLB: A Linux kernel feature that requires allocators to explicitly use particular properties or flags to support static huge pages

Both THP and HugeTLB can be used together as they provide huge pages at different layers of the memory management unit.

GPU 2 MB Huge Pages

The MI300A GPUs perform best with 2 MB pages and require that pages touched by the GPU are mapped into the GPU. To maximize performance, applications should use device allocators to allocate memory and ensure that 2 MB pages are properly mapped.

The device allocator APIs for C/C++ are:

hipMalloc(&cptr, N * sizeof(double));
...
hipFree(cptr);
cptr = nullptr;

The device allocator APIs for Fortran with OpenMP (uses hipMalloc):

cptr = omp_target_alloc(N * C_SIZEOF(0.0_C_DOUBLE), omp_get_default_device());
CALL c_f_pointer(cptr, fptr, [N])
...
omp_target_free(cptr, omp_get_default_device())
NULLIFY(fptr)

CPU 2 MB Huge Pages via Transparent Huge Pages (THP)

Huge pages may provide performance improvements to HPC applications as fewer entries in memory caches and page tables can represent large memory regions, thus reducing memory lookups, TLB misses, and page faults. By enabling Transparent Huge Pages (THP), the Linux kernel will automatically and dynamically manage the use of 2 MB huge pages for large, contiguous memory allocations. To increase the probability of using huge pages, allocations should be larger than 2 MB and aligned to 2 MB. Users must enable THP at compute node allocation, with either of the following:

flux alloc --setattr=thp=always ...
salloc --thp=always ...

This is highly recommended if allocating memory on the CPU that will be accessed by the GPU (users must also use HSA_XNACK=1 as described above).

Note that THP will be enabled for all commands executed in the job allocation. Note also that this feature can coexist with HugeTLB huge pages as described below.

CPU 2 MB Huge Pages via HugeTLB

Huge pages may provide performance improvements to HPC applications as fewer entries in memory caches and page tables can represent large memory regions, thus reducing memory lookups, TLB misses, and page faults. By enabling HugeTLB and linking in the libhugetlbfs library, malloc calls will be intercepted and modified to request huge pages. This approach can map small CPU memory allocations, of several KB, into 2 MB huge pages. If an application makes use of many small, long-lived memory allocations, using the libhugetlbfs library can increase memory fragmentation. Several settings are needed to ensure HugeTLB support.

Link application with libhugetlbfs library, using either:
1. Linker flag -lhugetlbfs
2. Or by preloading the libhugetlbfs library for the application's command. It is discouraged to export LD_PRELOAD in the shell's environment because it will enable huge pages for all follow-up commands.
3. ```
env LD_PRELOAD=/lib64/libhugetlbfs.so <exe>
```
Request a compute node allocation with either of the following:
```
flux alloc --coral2-hugepages ...
salloc --hugepages=512GB ...
```

These options automatically set the following environment variables which are necessary to configure libhugetlbfs:

HUGETLB_DEFAULT_PAGE_SIZE=2M
HUGETLB_VERBOSE=1
HUGETLB_MORECORE=yes
HUGETLB_ELFMAP=no

WARNING Do not load any Cray hugepages modules (e.g., craype-hugepages2M) nor set other HUGETLB environment variables as they may interfere with this guidance.

This is highly recommended if allocating memory on the CPU that will be accessed by the GPU (users must also use HSA_XNACK=1 as described above).

Note that HugeTLB will only be enabled for commands run having the libhugetlbfs shared library already loaded by the dynamic linker. Note that this feature can coexist with THP as described above.