As described on this page, LC provides three tools for performance analysis and tracing on the El Capitan systems: AMD OmniPerf, AMD OmniTrace, and HPCToolkit.

LC also provides tools for debugging.

AMD OmniPerf

See the official documentation page and an Introduction Tutorial (slides, presentation).

Overview

OmniPerf is a performance analysis tool built on top of the ROC Profiler. It records hardware performance counters for AMD GPUs during application execution and provides high-level performance analysis features, including System Speed-of-Light, IP block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more.

  • OmniPerf is best used for analyzing single-node application runs, where a limited number of GPU and CPU kernels are instrumented.

For detailed usage instructions, see the online documentation, including the Getting Started Guide.

Loading the Tool

The OmniPerf tool is easily accessible by loading the module:

module load omniperf

Quickstart

See the Getting Started guide for details on profiling and selecting metrics.
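As a minimal sketch of a profiling run, assuming the `omniperf` module is loaded: the application path `./my_app` and the workload label `my_app_profile` are hypothetical placeholders, and on LC systems the command would typically be run inside a job allocation. The guard simply avoids running anything when the tool is not on the `PATH`.

```shell
# Hypothetical names: replace with your own application and workload label.
APP=./my_app
WORKLOAD=my_app_profile

if command -v omniperf >/dev/null 2>&1; then
    # Profile the application; results are written under ./workloads/$WORKLOAD/
    omniperf profile -n "$WORKLOAD" -- "$APP"
else
    echo "omniperf not found; run 'module load omniperf' first" >&2
fi
```

The Getting Started guide remains the reference for restricting the run to specific kernels or metrics.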

Analysis

OmniPerf can provide analysis in two ways: through a CLI tool and through a Grafana-based GUI built on top of MongoDB. The current recommendation is to use the CLI analysis tool.

LC is currently in the process of deploying Grafana and MongoDB such that they can be used for OmniPerf analysis.
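For the recommended CLI path, the following sketch analyzes the output of an earlier `omniperf profile` run. It assumes the profile was recorded under the hypothetical workload label `my_app_profile`, and that results sit in an architecture-named subdirectory of `workloads/<label>/`; the loop avoids hard-coding that subdirectory name.

```shell
# Hypothetical label previously passed to 'omniperf profile -n'.
WORKLOAD=my_app_profile

if command -v omniperf >/dev/null 2>&1; then
    # Profile output is grouped by GPU architecture under workloads/<label>/
    for dir in workloads/"$WORKLOAD"/*/; do
        [ -d "$dir" ] || continue   # skip if no profile data exists yet
        omniperf analyze -p "$dir"
    done
else
    echo "omniperf not found; run 'module load omniperf' first" >&2
fi
```

The analysis is printed to the terminal, so redirecting it to a file can make long reports easier to read.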

AMD OmniTrace

See the official documentation page and an Introduction Tutorial (slides, presentation).

Overview

OmniTrace is designed for both high-level profiling and comprehensive tracing of applications running on the CPU or the CPU+GPU via dynamic binary instrumentation, call-stack sampling, and various other means of determining the currently executing function and line information. It records a timeline of kernel execution (including kernel launches) along with various metrics.

  • OmniTrace is best used for analyzing single-node application runs, where a limited number of GPU and CPU kernels are instrumented. The online visualization tools have a 1 GB memory limit, so very large trace files cannot be analyzed.

For detailed usage instructions, see the online documentation.

Loading the Tool

The OmniTrace tool is easily accessed by loading the module:

module load omnitrace

Instrumenting Binaries

OmniTrace has two distinct configuration steps when instrumenting:

  1. Instrumenting with OmniTrace: configuring which functions and modules are instrumented in the target binaries (i.e., executables and/or libraries)

  2. Customizing the OmniTrace Runtime: configuring what the instrumentation does when the instrumented binaries are executed
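The two steps can be sketched as follows, assuming a recent OmniTrace release that ships the `omnitrace-instrument`, `omnitrace-run`, and `omnitrace-avail` executables (older releases used different command names, so check what the loaded module provides). The application `./my_app`, the output name `./my_app.inst`, and the config path `./omnitrace.cfg` are hypothetical placeholders.

```shell
# Hypothetical names: replace with your own binary and file paths.
APP=./my_app
INST_APP=./my_app.inst
CFG=./omnitrace.cfg

if command -v omnitrace-instrument >/dev/null 2>&1; then
    # Step 1: rewrite the binary with instrumentation (default selection rules).
    # 'omnitrace-instrument --help' lists options for choosing functions/modules.
    omnitrace-instrument -o "$INST_APP" -- "$APP"

    # Step 2: generate a default runtime config, edit it as needed,
    # and point the runtime at it before executing the instrumented binary.
    omnitrace-avail -G "$CFG"
    export OMNITRACE_CONFIG_FILE="$CFG"
    omnitrace-run -- "$INST_APP"
else
    echo "omnitrace not found; run 'module load omnitrace' first" >&2
fi
```

Keeping the runtime settings in a config file means the same instrumented binary can be rerun with different collection options without being re-instrumented.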

Visualizing Results

The comprehensive OmniTrace results (proto files) can be visualized with Perfetto, Google's web-based tool. LC has deployed CZ, RZ, and SCF instances.

NOTE: Do NOT use the public ui.perfetto.dev website to upload any traces generated on Livermore machines.

Aggregated high-level results are available in text files for human consumption and in JSON files for programmatic analysis. The JSON output files are compatible with the Python package hatchet, which converts the performance data into pandas dataframes and facilitates multi-run comparisons, filtering, visualization in Jupyter notebooks, and much more.

HPCToolkit

See the official documentation.

It is recommended that users begin with QuickStart (User Manual Chapter 3): http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf.

Overview

HPCToolkit is a suite of performance analysis tools designed from the ground up to work with HPC applications. It uses statistical sampling of timers and hardware counters on CPUs, and monitors GPU operations, gathering instruction-level metrics. HPCToolkit works with multilingual, fully optimized applications that are statically or dynamically linked. It supports measurement and analysis of serial codes, threaded codes (e.g., pthreads, OpenMP), MPI, and hybrid (MPI+threads) parallel codes, as well as GPU-accelerated codes that offload computation to AMD, Intel, or NVIDIA GPUs.

Components

HPCToolkit is a collection of distinct tools.

Diagram of HPCToolkit.

  • hpcrun: hpcrun collects accurate and precise calling-context-sensitive performance measurements for unmodified fully optimized applications at very low overhead (1-5%). It uses asynchronous sampling triggered by system timers and performance monitoring unit events to drive collection of call path profiles and optionally traces.

  • hpcstruct: To associate calling-context-sensitive measurements with source code structure, hpcstruct analyzes fully optimized application binaries and recovers information about their relationship to source code. In particular, hpcstruct relates object code to source code files, procedures, and loop nests, and identifies inlined code.

  • hpcprof: hpcprof overlays call path profiles and traces with program structure computed by hpcstruct and correlates the result with source code. hpcprof-mpi handles thousands of profiles from a parallel execution by performing this correlation in parallel. hpcprof and hpcprof-mpi generate a performance database that can be explored using the hpcviewer user interface.

  • hpcviewer: hpcviewer is a graphical user interface that interactively presents performance data in three complementary code-centric views (top-down, bottom-up, and flat), as well as a graphical view that enables one to assess performance variability across threads and processes. hpcviewer is designed to facilitate rapid top-down analysis using derived metrics that highlight scalability losses and inefficiency rather than focusing exclusively on program hot spots.

    • hpcviewer also presents a hierarchical, time-centric view of a program execution. The tool can rapidly render graphical views of trace lines for thousands of processors for an execution tens of minutes long, even on a laptop. hpcviewer's hierarchical graphical presentation is quite different from that of other tools: it renders execution traces at multiple levels of abstraction by showing activity over time at different call stack depths.

Loading the Tool

The HPCToolkit tools are easily accessible by loading the module:

module load hpctoolkit

Using the tools

It is recommended that users begin with QuickStart (User Manual Chapter 3): http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf
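As a rough sketch of how the components described above fit together, assuming the `hpctoolkit` module is loaded and the application offloads to AMD GPUs: the application `./my_app` and the measurement and database directory names are hypothetical placeholders, and the event choices are only one reasonable starting point. The user manual is the authoritative reference.

```shell
# Hypothetical names: replace with your own binary and output directories.
APP=./my_app
MEASUREMENTS=hpctoolkit-my_app-measurements
DATABASE=hpctoolkit-my_app-database

if command -v hpcrun >/dev/null 2>&1; then
    # 1. Measure: sample CPU time, monitor AMD GPU operations, collect traces (-t).
    hpcrun -o "$MEASUREMENTS" -e CPUTIME -e gpu=amd -t "$APP" &&
    # 2. Recover program structure for the binaries seen during measurement.
    hpcstruct "$MEASUREMENTS" &&
    # 3. Correlate measurements with structure into a performance database.
    hpcprof -o "$DATABASE" "$MEASUREMENTS"
    # 4. Explore the database interactively (requires a display):
    #    hpcviewer "$DATABASE"
else
    echo "hpcrun not found; run 'module load hpctoolkit' first" >&2
fi
```

For large MPI runs, hpcprof-mpi can take the place of hpcprof in step 3, as noted in the component descriptions.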