This page has some useful tips and essential information for users getting started on the El Capitan, Tuolumne, and RZAdams systems.

Allocate Interactive Nodes for Compiling, Testing, and Debugging

It is highly recommended that users always allocate a pdebug or pbatch compute node for compiling and testing their code. It is against LC good-neighbor policy to run large, parallel compilations or applications on the login nodes. Crashing or OOM-ing a login node can kill jobs running on the system and adversely affect the login node GPUs.

Working with MI300A APU Nodes

The following recommendations have worked for early users and yielded the best performance for most folks.

4 Sockets Per Node

Each node can be divided into 4 sockets, each with:

  • 1 MI300A GPU
  • 21 CPU cores available to user processes, with 3 cores reserved for the OS (2 hardware threads per core)
  • 128 GB of HBM3 memory (a single NUMA domain)

GPUs, CPUs, OS, and RAM disks all share the same HBM3 memory in a unified address space. NOTE Accessing memory on a different socket will negatively impact performance.

We recommend that users run 4 MPI ranks per node, each bound to 1 GPU and to the memory and cores of the same socket.
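
For example, a launch of a hypothetical application on 2 nodes following this recommendation might look like:

flux run -N 2 --tasks-per-node=4 --exclusive ./my_app

With the exclusive flag, mpibind divides the node resources so that each of the 4 ranks per node gets its own socket (1 GPU, 21 user cores, and the local HBM3 memory).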

84 Available CPU Cores

84 (out of a total of 96) CPU cores are dedicated to user processes. The remaining 12 cores are reserved for system and Lustre processes. The 84 cores available for binding are specified by the MPIBIND_RESTRICT environment variable, which defaults to:

MPIBIND_RESTRICT=1-7,9-15,17-23,25-31,33-39,41-47,49-55,57-63,65-71,73-79,81-87,89-95,97-103,105-111,113-119,121-127,129-135,137-143,145-151,153-159,161-167,169-175,177-183,185-191

By unsetting the MPIBIND_RESTRICT variable, users can access all 96 cores. This is not recommended, as it can lead to increased noise and performance jitter at scale.

Binding is Critical

Poor binding can cause a code to run 3x slower than expected. Make sure to use the exclusive flag for all flux run commands.

flux run --exclusive ...
flux run -x ...

Memory allocations should be made with the hipMalloc function, which automatically provides good memory binding and page-size settings.
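
As a minimal sketch (the size is illustrative and error handling is trimmed), a device-resident allocation with hipMalloc looks like the following; build it with hipcc or your usual compiler wrappers:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 26;   // illustrative element count
    double *buf = nullptr;

    // hipMalloc picks up the recommended binding and page-size
    // settings automatically (see above).
    if (hipMalloc((void **)&buf, n * sizeof(double)) != hipSuccess) {
        std::fprintf(stderr, "hipMalloc failed\n");
        return 1;
    }

    // ... launch kernels that read and write buf ...

    hipFree(buf);
    return 0;
}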

Multiple MPI Tasks per GPU

While some users have been able to run successfully with 2 or 4 MPI tasks per GPU (i.e., per socket), there is, at best, a 10% overhead. If things are not configured correctly, performance may fall off due to GPU context switching. Please contact the LC hotline for additional details and guidance on environment variable configuration.

Flux Tips

Flux is a new resource scheduler for the El Capitan systems. Here are some helpful tips for working with it, especially for experienced LC users accustomed to Slurm.

Use the Exclusive Flag

Most users will want to use the exclusive flag for any flux run commands, with either:

flux run --exclusive
flux run -x

The srun wrapper will automatically add this flag.

The exclusive flag indicates to the scheduler that nodes should be allocated exclusively to this job. It also tells flux to use mpibind to divide node resources optimally between tasks for best performance. In general, we recommend 4 (or multiples of 4) tasks per node, which are then bound to the 4 sockets.

Users who are running UQ- or regression-type pipelines may wish to omit this flag and utilize flux's advanced scheduling features.

Specifying Nodes and Tasks

Users should specify the number of nodes and tasks for their job with one of:

  • --nodes=# and --ntasks=# (short options -N # and -n #, respectively).
  • --tasks-per-node=#

Users should NOT specify --cores-per-task=# (-c #) nor --gpus-per-task=# (-g #), unless they are doing UQ- or regression-type testing. These options may yield poorly performing bindings.
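
For example (application name hypothetical), either of the following requests 8 nodes with 4 tasks per node:

flux run -N 8 -n 32 --exclusive ./my_app
flux run -N 8 --tasks-per-node=4 --exclusive ./my_app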

Avoiding Bad Nodes

If a particular compute node appears to have an issue, please report it to the LC hotline.

To indicate that a set of nodes should be avoided by flux when scheduling your job, you can use the following flag on your flux batch or flux alloc command:

--requires=-host:nodename,nodename...

NOTE There is a '-' (minus symbol) in front of the host keyword.
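
For example, to steer a 16-node batch job away from two (hypothetical) nodes:

flux batch -N 16 --requires=-host:nodename1,nodename2 myscript.sh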

It is not recommended that users specify a required set of nodes using the --requires flag with host: (rather than -host:). These constraints put a high load on the flux scheduler, and when using the inclusive requires format, users must specify every node they are requesting exactly.

As of January 2025, there is no way to require or exclude nodes via the salloc and sbatch wrappers.

Improving Performance

There are a number of libraries and configurations that can improve performance, particularly when managing memory allocations.

GPU-Aware MPI and xpmem

As of August 2024, we are recommending that users always link their application with -lxpmem and the GTL library. These recommended link modifications are done automatically with the -magic wrappers for cray-mpich/8.1.30 (and later), but can be turned off.

See additional details and documentation on the known issues page.

Memory Allocations

There are many complications that can arise from incorrect memory allocations, particularly when sharing memory between CPU and GPU processes. The recommendations below have given most users the best performance outcomes. 

2 MB Pages

The MI300A GPU performs best with 2 MB pages and requires that pages touched by the GPU be mapped to the GPU. To maximize performance, users should use hipMalloc to allocate memory, which ensures that 2 MB pages are properly mapped.

Sharing CPU Memory Allocations

While CPUs and GPUs share a memory space, CPU-based memory allocations will not automatically map onto the GPU. By setting the environment variable HSA_XNACK=1, CPU pages will page-fault onto the GPU (with a slight overhead).

Please note that the CPU defaults to a 4 KB page size, which can cause a 15% performance overhead on the GPU (due to the GPU TLB size). 
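
As a minimal sketch of the pattern this enables (the kernel and size are illustrative), a plain malloc'd buffer can be touched directly by a GPU kernel when the job runs with HSA_XNACK=1; the transparent huge page settings below reduce the 4 KB page-size overhead:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(double *x, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main() {
    const size_t n = 1 << 20;
    // Ordinary CPU allocation: not automatically mapped onto the GPU.
    double *x = (double *)std::malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;

    // With HSA_XNACK=1 in the environment, the GPU page-faults these CPU
    // pages in on first touch (with some overhead); without it, these
    // pages are not mapped onto the GPU.
    scale<<<(n + 255) / 256, 256>>>(x, n);
    hipDeviceSynchronize();

    std::printf("x[0] = %f\n", x[0]);   // expect 2.0
    std::free(x);
    return 0;
}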

Transparent Huge Pages

When transparent huge pages are enabled, most CPU memory allocations larger than 2 MB will be backed by 2 MB pages. Users must enable transparent huge pages at compute node allocation with either of the following:

flux alloc --setattr=thp=always ...
salloc --thp=always ...

This is highly recommended if allocating memory on the CPU that will be accessed by the GPU (users must also use HSA_XNACK=1 as described above).
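
Putting these pieces together (application name hypothetical), a job whose CPU allocations are accessed by the GPU might be run as:

flux alloc -N 1 --setattr=thp=always
export HSA_XNACK=1
flux run -N 1 --tasks-per-node=4 --exclusive ./my_app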

Ensure 2 MB Pages for Small Allocations

By linking in the libhugetlbfs library and enabling huge pages at compute node allocation, users can ensure that CPU memory allocations smaller than 2 MB are also backed by 2 MB pages. Users must:

  • Link applications with -lhugetlbfs
  • Request a compute node allocation with either
    flux alloc --setattr=hugepages=512GB ...
    salloc --hugepages=512GB ...
  • Set the environment variables HUGETLB_MORECORE=yes and HSA_XNACK=1

Note that this feature can coexist with transparent huge pages as described above.
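
For reference, the full recipe from the list above might look like the following (the compiler invocation and application name are illustrative; many users will instead link through their usual compiler wrappers):

hipcc -o my_app my_app.cpp -lhugetlbfs
flux alloc -N 1 --setattr=hugepages=512GB
export HUGETLB_MORECORE=yes
export HSA_XNACK=1
flux run -N 1 --tasks-per-node=4 --exclusive ./my_app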