As described on this page, LC provides a number of tools for debugging on the El Capitan systems:

HPE Cray Debugging Tools

The Cray Debugging Support Tool Suite is available for use on the El Capitan systems. These tools are tested with multiple workload managers, including flux, but the rapid development of flux on LC systems results in some functional issues.

Available tools

The Cray Programming Environment provides several tools:

  • gdb4hpc: A command-line parallel harness that drives gdb or rocgdb, providing a gdb-like interface with output aggregated from across your job. (See the example session sketched after this list.)

  • sanitizers4hpc: An aggregator for Address Sanitizer output in parallel jobs.

  • STAT: A stack trace analysis tool that allows you to attach to running parallel jobs, sample call stacks, and aggregate them into a meaningful parallel process tree. (An LLNL-developed tool, but bugs can be reported against cray-stat to HPE as well.)

  • CCDB: The Cray Comparative Debugger, a graphical tool that can debug two applications simultaneously and compare their output. (Do not use with the web server interface.) The most recent version extends gdbgui with the features of gdb4hpc, making it a full graphical parallel debugger.
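
A minimal sketch of a gdb4hpc session is shown below; the application name, rank count, and breakpoint location are placeholders, and the exact prompts and output may differ by version (on the El Capitan systems the session must be started inside a flux allocation, with the workarounds described below):

$ gdb4hpc
dbg all> launch $a{4} ./my_app
dbg all> break my_app.cpp:42
dbg all> continue
dbg all> backtrace
dbg all> quit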

Known Issues with Workarounds

Compatibility between Cray tools and flux

Several workarounds are required to allow the current Cray tools to work with the bleeding-edge flux on the El Capitan systems (a combined example follows the list):

  • Tools, in particular gdb4hpc, must be run inside a flux alloc session. Running outside a flux session is not supported at this time.

  • Set the environment variable CTI_FLUX_DEBUG=1 to tell the tools that they can launch even though the flux version on the system is much newer than they expect.

export CTI_FLUX_DEBUG=1
  • The Cray tools expect a different flux directory structure than the one currently on the LC systems. In particular, the tools expect part of the path to be lib instead of lib64. As a workaround, fake a directory structure for flux:

mkdir ${PWD}/my_dummy_flux
ln -s /usr/lib64 ${PWD}/my_dummy_flux/lib
export FLUX_INSTALL_DIR=${PWD}/my_dummy_flux
  • If you use RSA ssh keys (the normal default), you may encounter authentication errors on launch. See the issue below.
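
Putting these workarounds together, a minimal example session might look like the following (the single-node allocation and tool choice are placeholders):

flux alloc -N1
export CTI_FLUX_DEBUG=1
mkdir ${PWD}/my_dummy_flux
ln -s /usr/lib64 ${PWD}/my_dummy_flux/lib
export FLUX_INSTALL_DIR=${PWD}/my_dummy_flux
gdb4hpc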

Launching a gdb4hpc debugging session fails with ssh authentication failures

When launching a debugging session, you may see errors such as the following:

Network creation failed: cti_execToolDaemon:  Failed to authenticate using the username <your-name>, SSH public key file at <your-ssh-home>/id_rsa.pub and private key file at <your-ssh-home>/id_rsa . If these paths are not correct, they can be overridden by setting the environment variables CTI_SSH_PUBKEY_PATH and CTI_SSH_PRIKEY_PATH . If a passhrase is required to unlock the keys, it can be provided by setting the environment variable CTI_SSH_PASSPHRASE (Username/PublicKey combination invalid)

On flux, the current version of the Cray Common Tools Interface (CTI) 2.17.2 uses ssh to ship packages to the compute nodes for the tools to use. A future version of CTI will use native flux capabilities (for example, sbcast is used on slurm).

The library that CTI 2.17.2 uses to interact with ssh is not able to authenticate using RSA keys on TOSS4. Until CTI is updated, either to use a newer ssh library or to use flux to distribute tooling packages, ecdsa SSH keys are required for the debugging tools on flux:

ssh-keygen -t ecdsa
# Follow the prompts and put it wherever you want, just remember it
# I just had it go to the default place
cat ~/.ssh/id_ecdsa.pub >> ~/.ssh/authorized_keys
export CTI_SSH_PRIKEY_PATH=${HOME}/.ssh/id_ecdsa
export CTI_SSH_PUBKEY_PATH=${HOME}/.ssh/id_ecdsa.pub

AMD ROCm Tools

The AMD ROCm Platform is extensively documented, and El Capitan users may have particular interest in the ROCm Tools, Compilers, and runtimes. This page highlights and links to several of the available tools.

Tools and Documentation Links 

  • ROCdbgapi - The AMD Debugger API is a library that provides all the support necessary for a debugger and other tools to perform low level control of the execution and inspection of execution state of AMD’s commercially available GPU architectures.
  • ROCm Compilers - A Clang/LLVM-based compiler project. These compilers are optimized for high-performance computing on AMD GPUs and CPUs and support various heterogeneous programming models such as HIP, OpenMP, and OpenCL.
  • ROCgdb - The ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger. (A minimal session is sketched after this list.)
  • ROCprofiler - The ROC profiler library: a hardware-specific, low-level performance analysis interface for profiling GPU compute applications with hardware performance counters and derived (complex) performance metrics. The library supports GFX8/GFX9.
  • ROCtracer - A callback/activity library for performance tracing of AMD GPUs.
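
Because ROCgdb is based on GDB, a basic single-process session uses familiar gdb commands. A minimal sketch, in which the application name and kernel name are placeholders:

$ rocgdb ./my_hip_app
(gdb) break my_kernel
(gdb) run
(gdb) backtrace
(gdb) quit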

Address Sanitizer (ASAN) on the GPUs

The LLVM Address Sanitizer (ASAN) detects runtime addressing errors in applications and libraries. ASAN was originally designed to detect errors in CPU applications; ROCm has extended it to detect some errors on the GPU.

Compiling with ASAN

Using ASAN requires compiling the application with instrumentation. The tool works by instrumenting memory accesses and checking at runtime if an access violation occurred.

The following examples and settings were tested on Tioga with rocmcc/5.7.0beta1-magic.

Follow these steps to compile with ASAN:

  • Add the following options to the compiler and linker options:
    • '-fsanitize=address': enable instrumentation
    • '-shared-libsan': use shared version of runtime
    • '-g': add debug info for improved reporting
  • Explicitly use 'xnack+' in the offload architecture option. For example, '--offload-arch=gfx90a:xnack+'

Compilation Example 

$ hipcc -std=c++11 -O0 -g -fsanitize=address -shared-libsan --offload-arch=gfx90a:xnack+ -o invalid_access invalid_access.cpp

Running Applications

Follow these steps to run the instrumented code:

  • Enable XNACK using the environment 'HSA_XNACK=1'
  • Disable the Leak Sanitizer (LSAN) using the environment 'ASAN_OPTIONS=detect_leaks=0'. If this is not disabled, it will produce significant output from CPU code when the application exits, due to allocations made by the runtime that are not necessarily leaks.

Example:

$ HSA_XNACK=1 ASAN_OPTIONS=detect_leaks=0 ./application

Compilation Time

When ASAN is used, LLVM adds instrumentation code around memory operations. This can increase compilation times in some cases. If compilation times become unacceptable, here are a few options:

  • Reduce the instrumentation in the files with the worst compilation times: '-fsanitize-recover=address' can be added when compiling those files, which simplifies the added instrumentation (see the sketch after this list).
  • Disable instrumentation in functions that are responsible for large compilation times by adding '__attribute__((no_sanitize("address")))' to the function.
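
For example, assuming a hypothetical project in which slow_kernels.cpp dominates the build time, that one file could be compiled with the recover option while the rest keep the default instrumentation:

$ hipcc -std=c++11 -O0 -g -fsanitize=address -fsanitize-recover=address -shared-libsan --offload-arch=gfx90a:xnack+ -c slow_kernels.cpp
$ hipcc -std=c++11 -O0 -g -fsanitize=address -shared-libsan --offload-arch=gfx90a:xnack+ -c main.cpp
$ hipcc -g -fsanitize=address -shared-libsan --offload-arch=gfx90a:xnack+ -o my_app main.o slow_kernels.o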

Reports

This is a sample report:

==2172529==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x154d415a3e94
READ of size 4 in workgroup id (3,0,0)
  #0 0x154d415a3e94 in kernel_reduction_gpu(float*, float*, int) at /usr/workspace/wsa/laguna/src/overflow/overflow.cpp:30:12

Thread ids and accessed addresses:
224 : 0x155546a0a000 225 : 0x155546a0a004 226 : 0x155546a0a008 227 : 0x155546a0a00c 228 : 0x155546a0a010 229 : 0x155546a0a014 230 : 0x155546a0a018 231 : 0x155546a0a01c
...

Known Issues

We have found a few issues when testing ASAN on EAS systems:

  • Double free: ASAN detects double-free errors when free() is called more than once with the same memory address on the CPU; however, it doesn't detect double-free errors on the GPU with hipFree().
  • Memory copy: ASAN detects memcpy() errors when the CPU copies data using invalid addresses; however, it doesn't detect the same errors on the GPU with hipMemcpy().
  • Mismatched deallocation: if an array is allocated in C style (with malloc) and deallocated in C++ style (with delete) on the CPU, or vice versa, it is detected; however, mismatched deallocation in HIP doesn't seem to be detected; for example, allocating in C style and deallocating in HIP style using hipFree().

TotalView on AMD GPUs

TotalView 2024.3.10

The latest version of TotalView has support for AMD GPUs and is the default totalview on these systems. It should be in the default path, but if it is not, it can be found in

/usr/tce/bin/totalview

This is a wrapper to

/usr/global/tools/totalview/v/totalview.2024.3.10/bin/totalview  

This version allows you to set breakpoints in both the host and the GPU. It should also have support for watchpoints on the GPU, though this is only available in global memory.

Note that the following environment variables are needed to run an MPI or GPU job successfully. These are set in the default TotalView module wrapper, but if you invoke TotalView directly via an absolute path, you may still need them (see the example after this list).

  • TVD_DISABLE_CRAY=1
  • HSA_CU_MASK=0-7:0-95
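
For an interactive run inside a flux allocation, a minimal sketch (mirroring the batch example below) might look like the following; the application name and rank count are placeholders:

export TVD_DISABLE_CRAY=1
export HSA_CU_MASK=0-7:0-95
totalview -args flux run -n4 ./my_app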

With the update of the EAS3 systems to ROCm 6.2.1, TotalView can only debug GPU kernels built with ROCm 6.0.X. Only 2023 releases and beyond can debug GPU kernels on these systems. This is true of rocgdb as well. Older versions of TotalView will be able to debug on the host CPU side, but not the kernel.

Please report any issues to ServiceNow or tv-support@llnl.gov.

In TotalView releases prior to 2023, -no_rocm was the default. However, with 2023.1.6 and later, -rocm is the default; no additional parameters need to be included to debug HIP code or ROCm kernels in TotalView.

Currently one can:

  • Set breakpoints in the host and device code
  • Step or next in the device code 
  • View assembler code on the GPU device 
  • View registers on the device
  • View variables on the host and GPU. The GPU variables appear to be correct, but your mileage may vary.
  • Access multiple GPUs (on those systems that have them: tioga, rzvernal, etc.)

NOTES:

  • New versions will be installed as they become available. Anything with an 'X' in the version, such as 2024X.3.156, is considered an experimental version (i.e., prior to official release) and may have the latest development code included. It may also be a non-optimized build and slower than the official releases.

Using Flux 

Systems using flux scheduling

Tioga and corona have switched to flux as the default scheduler and others will follow. After a bit of testing, some minor bumps were resolved, and the correct settings for TotalView and flux are set in the global tvdrc file; they should be transparent to users of the default TotalView on the system. Here is an example of a small batch shell script that can be started with the flux batch command.

cat simple.csh

#! /bin/csh
setenv PATH /usr/global/tools/totalview/v/totalview.2024.3.10/bin:$PATH
setenv TVD_DISABLE_CRAY 1

setenv HSA_CU_MASK  0-7:0-95

date
pwd
echo $DISPLAY

totalview -args flux run  -n4 ./ALLc2
date

-----

 flux batch -N2 -n 4 ./simple.csh

STAT: Stack Trace Analysis Tool

See the STAT documentation page, which includes details for using the tool and running it on Flux-managed systems.