HPC Toolkit

HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the largest supercomputers. It uses low-overhead statistical sampling of timers and hardware performance counters to collect accurate measurements of a program's work, resource consumption, and inefficiency, and attributes them to the full calling context in which they occur.

HPCToolkit works with C/C++/Fortran applications that are either statically or dynamically linked. It supports measurement and analysis of serial codes, threaded codes (pthreads, OpenMP), MPI, and hybrid (MPI + threads) parallel codes.

HPCToolkit's primary components and their relationships are described below:

  • hpcrun: collects accurate and precise calling-context-sensitive performance measurements for unmodified fully optimized applications at very low overhead (1-5%). It uses asynchronous sampling triggered by system timers and performance monitoring unit events to drive collection of call path profiles and optionally traces.
  • hpcstruct: To associate calling-context-sensitive measurements with source code structure, hpcstruct analyzes fully optimized application binaries and recovers information about their relationship to source code. In particular, hpcstruct relates object code to source files, procedures, and loop nests, and identifies inlined code.
  • hpcprof: overlays call path profiles and traces with program structure computed by hpcstruct and correlates the result with source code. hpcprof/mpi handles thousands of profiles from a parallel execution by performing this correlation in parallel. hpcprof and hpcprof/mpi generate a performance database that can be explored using the hpcviewer and hpctraceviewer user interfaces.
  • hpcviewer: a graphical user interface that interactively presents performance data in three complementary code-centric views (top-down, bottom-up, and flat), as well as a graphical view that enables one to assess performance variability across threads and processes. hpcviewer is designed to facilitate rapid top-down analysis using derived metrics that highlight scalability losses and inefficiency rather than focusing exclusively on program hot spots.
  • hpctraceviewer: a graphical user interface that presents a hierarchical, time-centric view of a program execution. The tool can rapidly render graphical views of trace lines for thousands of processors for an execution tens of minutes long, even on a laptop. hpctraceviewer's hierarchical graphical presentation is quite different from that of other tools: it renders execution traces at multiple levels of abstraction by showing activity over time at different call stack depths.

Platforms and Locations

Platform        Location                      Notes
x86_64 CHAOS 5  /usr/global/tools/hpctoolkit  Multiple versions are available. Use Dotkit to load.
x86_64 TOSS 3   /usr/global/tools/hpctoolkit  Multiple versions are available. Use module to load.
BG/Q            /usr/global/tools/hpctoolkit  Multiple versions are available. Use Dotkit to load.

Quick Start

x86_64 Linux Systems

Using HPCToolkit involves following the workflow shown in the diagram below. Note that in the instructions that follow, only the steps for dynamically linked executables are shown. This is the default for LC's Linux systems. Statically linked applications follow a slightly different path - similar to that for BG/Q systems, in which executables are statically linked by default. See the HPCToolkit documentation for details.

1. First, determine which version of HPCToolkit you want to use, and then load the corresponding package (with module on TOSS 3 systems, or with Dotkit on CHAOS 5 systems). For example, using module:

% module avail hpctoolkit

------------------ /usr/tce/modulefiles/Compiler/intel/16.0.3 ------------------
   hpctoolkit/10102016

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

% module load hpctoolkit/10102016

2. Compile your application using -g and full optimization. The optimization flag(s) will vary between compilers, but something like -O3 is typical.

% mpiicc -g -O3 -o myapp myapp.c 

3. Decide which events you want to sample. The available events can be listed with the hpcrun -L command. For example:

% hpcrun -L
===========================================================================
Available itimer events
===========================================================================
Name  Description
---------------------------------------------------------------------------
WALLCLOCK Wall clock time used by the process in microseconds

===========================================================================
Available PAPI preset events
===========================================================================
Name         Profilable  Description
---------------------------------------------------------------------------
PAPI_L1_DCM  Yes         Level 1 data cache misses
PAPI_L1_ICM  Yes         Level 1 instruction cache misses
PAPI_L2_DCM  No          Level 2 data cache misses
PAPI_L2_ICM  Yes         Level 2 instruction cache misses
PAPI_L1_TCM  No          Level 1 cache misses

...
[ very long list truncated ]
...

===========================================================================
Available IO events
===========================================================================
Name  Description
---------------------------------------------------------------------------
IO  The number of bytes read and written per dynamic context

===========================================================================
Available Global Arrays events
===========================================================================
Name  Description
---------------------------------------------------------------------------
GA  Collect Global Arrays metrics
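When the event list is long, it can help to filter it programmatically. The following is a minimal Python sketch, not part of HPCToolkit, that picks out the PAPI preset events whose Profilable column reads Yes. It assumes each event line has the form "name, Yes/No flag, description" as in the excerpt above; the function name profilable_events is hypothetical.

```python
def profilable_events(listing):
    """Return names of PAPI preset events whose Profilable column is 'Yes'."""
    names = []
    for line in listing.splitlines():
        # Split into: event name, profilable flag, free-text description.
        fields = line.split(None, 2)
        if len(fields) >= 2 and fields[0].startswith("PAPI_") and fields[1] == "Yes":
            names.append(fields[0])
    return names


# Excerpt copied from the hpcrun -L output shown above.
sample = """\
PAPI_L1_DCM Yes Level 1 data cache misses
PAPI_L1_ICM Yes Level 1 instruction cache misses
PAPI_L2_DCM No Level 2 data cache misses
PAPI_L2_ICM Yes Level 2 instruction cache misses
PAPI_L1_TCM No Level 1 cache misses
"""

print(profilable_events(sample))
# ['PAPI_L1_DCM', 'PAPI_L1_ICM', 'PAPI_L2_ICM']
```

In practice the listing would come from saving the output of hpcrun -L to a file and reading it in.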

4. Run your application under the HPCToolkit hpcrun command. For MPI jobs, you will need to run hpcrun under the appropriate MPI launch command. At LC, this is usually srun. You will also need to specify the events (from above) and sampling period as arguments to the hpcrun command. For example:

% srun -n4 hpcrun -e PAPI_TOT_CYC@15000000 -e PAPI_L2_TCM@400000 myapp

To include event tracing, simply add the -t option to the hpcrun command:

% srun -n4 hpcrun -t -e PAPI_TOT_CYC@15000000 -e PAPI_L2_TCM@400000 myapp

A note on sampling period: using the example above, after 400000 PAPI_L2_TCM events, HPCToolkit will interrupt the application to inspect it and generate a sample. It is recommended to select a period which results in no more than a couple hundred samples per second, since more frequent sampling will increase overhead. Finding the right period for an event may take several tries.
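The arithmetic behind choosing a period can be sketched in a few lines of Python. The event rates used here (a 2.5 GHz cycle rate and 50 million L2 misses per second) are assumed figures for illustration only, not measurements from any LC system, and the function names are hypothetical.

```python
import math


def samples_per_second(event_rate, period):
    """Expected sample rate when one sample is taken every `period` events."""
    return event_rate / period


def period_for_rate(event_rate, target_samples_per_second):
    """Smallest whole period that keeps the sample rate at or below the target."""
    return math.ceil(event_rate / target_samples_per_second)


# Assumption: the core retires about 2.5e9 cycles per second.
print(round(samples_per_second(2.5e9, 15_000_000)))  # 167 samples/s for PAPI_TOT_CYC@15000000

# Assumption: the application incurs about 5e7 L2 misses per second.
print(samples_per_second(5e7, 400_000))  # 125.0 samples/s for PAPI_L2_TCM@400000

# Period needed to stay at or below 200 samples/s at 2.5e9 events/s.
print(period_for_rate(2.5e9, 200))  # 12500000
```

Because the real event rate depends on the application, a first run with a conservative (large) period gives a rate estimate from which a better period can be derived.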

5. When your job completes, HPCToolkit will produce a measurements database that contains separate measurement information for each MPI rank and thread in the application. The database directory is named according to the form:

hpctoolkit-myapp-measurements-jobid

Within the database directory, there is an individual measurements file for each task/thread, named using the template:

myapp-mpirank-threadid-hostid-processid-generationid.hpcrun

For example:

% ls hpctoolkit-myapp-measurements-1814406/
myapp-000000-000-a8c00170-56020-0.hpcrun  myapp-000002-000-a8c00170-56022-0.hpcrun
myapp-000000-000-a8c00170-56020-0.log     myapp-000002-000-a8c00170-56022-0.log
myapp-000001-000-a8c00170-56021-0.hpcrun  myapp-000003-000-a8c00170-56023-0.hpcrun
myapp-000001-000-a8c00170-56021-0.log     myapp-000003-000-a8c00170-56023-0.log

If you included tracing, there will also be a *.hpctrace file for each process/thread.
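The file naming template can be taken apart with a regular expression, which is handy for checking that every rank produced output. This is a minimal Python sketch under stated assumptions: the host ID is treated as a lowercase hexadecimal string (as in the listing above), and the names PATTERN, Measurement, and parse_measurement_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Template: myapp-mpirank-threadid-hostid-processid-generationid.<extension>
PATTERN = re.compile(
    r"^(?P<app>.+)-(?P<rank>\d+)-(?P<thread>\d+)-(?P<host>[0-9a-f]+)"
    r"-(?P<pid>\d+)-(?P<gen>\d+)\.(?P<kind>hpcrun|hpctrace|log)$"
)


class Measurement(NamedTuple):
    app: str
    rank: int
    thread: int
    host: str
    pid: int
    generation: int
    kind: str


def parse_measurement_name(name: str) -> Optional[Measurement]:
    """Split a measurements file name into its fields; None if it does not match."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return Measurement(
        m["app"], int(m["rank"]), int(m["thread"]), m["host"],
        int(m["pid"]), int(m["gen"]), m["kind"],
    )


info = parse_measurement_name("myapp-000002-000-a8c00170-56022-0.hpcrun")
print(info.rank, info.thread, info.pid)  # 2 0 56022
```

Collecting the rank field over all *.hpcrun files in the directory and comparing it with the expected task count is one quick sanity check before running hpcprof.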

6. Generate an HPCToolkit program structure file for your application using the hpcstruct command:

% hpcstruct myapp

The resulting file will be named myapp.hpcstruct. It will be used in the next step.

7. Generate an HPCToolkit performance database summary using the hpcprof command and specifying the hpcstruct file (previous step) and the name of your application's database directory. For example:

% hpcprof -S myapp.hpcstruct hpctoolkit-myapp-measurements-1814406

The resulting database summary, in this case, will be contained in a directory named hpctoolkit-myapp-database-1814406.

8. Use the hpcviewer utility to interactively view and analyze the HPCToolkit performance database:

% hpcviewer hpctoolkit-myapp-database-1814406

To view/analyze tracefile data, use the hpctraceviewer utility:

% hpctraceviewer hpctoolkit-myapp-database-1814406

Several examples of hpcviewer and hpctraceviewer are shown in the Output section below.
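Steps 4 through 8 follow a fixed naming pattern, which the Python sketch below makes explicit by generating the command lines rather than running them. It uses only the commands and options shown in the steps above. The function name workflow_commands is hypothetical, and the job ID is passed in as an argument even though in practice it is known only after the job has been submitted.

```python
import shlex


def workflow_commands(app, job_id, events, ntasks, trace=False):
    """Return the command lines for steps 4-8 as strings (nothing is executed)."""
    measurements = f"hpctoolkit-{app}-measurements-{job_id}"
    database = f"hpctoolkit-{app}-database-{job_id}"

    # Step 4: run under hpcrun, one -e option per event@period pair.
    run = ["srun", f"-n{ntasks}", "hpcrun"]
    if trace:
        run.append("-t")
    for name, period in events.items():
        run += ["-e", f"{name}@{period}"]
    run.append(app)

    steps = [
        run,
        ["hpcstruct", app],                                    # step 6
        ["hpcprof", "-S", f"{app}.hpcstruct", measurements],   # step 7
        ["hpcviewer", database],                               # step 8
    ]
    if trace:
        steps.append(["hpctraceviewer", database])
    return [shlex.join(step) for step in steps]


for line in workflow_commands(
    "myapp", 1814406, {"PAPI_TOT_CYC": 15000000, "PAPI_L2_TCM": 400000}, ntasks=4
):
    print("%", line)
```

With the arguments shown, the printed lines match the example commands in steps 4, 6, 7, and 8 above.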

BG/Q Systems

On BG/Q systems, executables are built statically linked by default. For the most part, using HPCToolkit on BG/Q systems is similar to using it on Linux systems. The important differences at LC are noted in the instructions below.

1. Load the desired HPCToolkit package, analogous to step 1 under Linux systems. On BG/Q, use Dotkit to load it.

2. Compile your source files with -g and full optimization to produce *.o files. Then use HPCToolkit's hpclink command to perform the final link in your build. For example:

% mpixlc -g -O3 -c routine1.c routine2.c ...
% hpclink mpixlc -o myapp routine1.o routine2.o ...

3. Decide which events you want to monitor. This is a little less convenient on BG/Q than it is on Linux. The hpcrun -L command does not currently work on BG/Q. A workaround is to set the HPCRUN_EVENT_LIST environment variable to a value of "LIST", and then build/run a single task dummy job (hello world) to have HPCToolkit generate the list. For example:

% setenv HPCRUN_EVENT_LIST LIST
% hpclink mpixlc -o hello hello.c
% srun -n1 -ppdebug hello

Example output showing BG/Q events is provided HERE.

4. Now set the HPCRUN_EVENT_LIST environment variable to specify which events you really want to monitor and their periods. Use commas to separate multiple events. If you want to include tracing, set the HPCRUN_TRACE environment variable to "1". For example:

% setenv HPCRUN_EVENT_LIST "WALLCLOCK@5000"
% setenv HPCRUN_TRACE 1 
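The comma-separated format of HPCRUN_EVENT_LIST is easy to get wrong by hand when several events are involved. The following Python sketch, with the hypothetical helper name event_list_value, builds the value from event/period pairs and rejects obviously malformed input. EVENT_B is a placeholder, not a real BG/Q event name.

```python
def event_list_value(events):
    """Join {event name: sampling period} pairs into 'NAME@period,NAME@period'."""
    parts = []
    for name, period in events.items():
        if not name or "," in name or "@" in name:
            raise ValueError(f"bad event name: {name!r}")
        if not isinstance(period, int) or period <= 0:
            raise ValueError(f"bad sampling period for {name}: {period!r}")
        parts.append(f"{name}@{period}")
    return ",".join(parts)


value = event_list_value({"WALLCLOCK": 5000})
print(f'setenv HPCRUN_EVENT_LIST "{value}"')
# setenv HPCRUN_EVENT_LIST "WALLCLOCK@5000"

# Two events; EVENT_B is a placeholder name used only to show the separator.
print(event_list_value({"WALLCLOCK": 5000, "EVENT_B": 100000}))
# WALLCLOCK@5000,EVENT_B@100000
```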

5. Run your program as usual (do not use hpcrun). HPCToolkit will produce its output files, which may be further processed and viewed as described in steps 5 - 8 under Linux systems above.

Important note: As of 5/15, LC has not built/installed the hpcviewer and hpctraceviewer utilities for BG/Q. However, you can use the ones built/installed on Linux systems to view BG/Q produced output files.

Output

HPCToolkit will produce one or more measurements output files for each process/thread, depending upon the run-time options used. These files are written to the hpctoolkit-myapp-measurements-jobid directory. Further processing by the hpcstruct and hpcprof commands results in performance database files written to the hpctoolkit-myapp-database-jobid directory.

The hpcviewer and hpctraceviewer utilities use the files in the performance database directory to graphically view and analyze an application's behavior. Examples of both utilities are shown below; however, users will want to consult the HPCToolkit documentation for details.

Compiling and Linking

Compiling: applications should be compiled with the -g flag plus full optimization. Optimization flags differ between compilers, but something like -O3 is typical.

Linking: for dynamically linked applications, no special HPCToolkit linking instructions are required. However, statically linked applications (such as those on BG/Q) will need to use the HPCToolkit hpclink command as part of their application build process. This is described in the HPCToolkit User's Manual and under BG/Q systems above.

Run-time Options

Each of the commands used in the HPCToolkit workflow has run-time options associated with it. The HPCToolkit documentation, specifically the User's Manual and Man Pages, covers the details.

Troubleshooting

  • HPCToolkit is a complex toolkit, and as such, troubleshooting problems may be difficult for the average user.
  • The most common problem at LC is probably forgetting to load the HPCToolkit environment, using the Dotkit use command or, on TOSS 3 systems, module load.
  • Statically linked applications need to follow a different workflow path, covered in the HPCToolkit documentation and under BG/Q systems above.
  • Most problems, if not easily resolved, should be reported to the LC Hotline.

LLNL-WEB-670397