Intel VTune Amplifier
Intel's VTune Amplifier is a performance profiling tool for C, C++, and Fortran code that can identify where in the code time is being spent in both serial and threaded applications. For threaded applications, it can also determine the amount of concurrency and identify bottlenecks created by synchronization primitives.
Platforms and Locations
|Platform||Location||Notes|
|x86_64 CHAOS 5||/usr/local/tools/vtune*||Multiple versions are available. Use Dotkit to load.|
|x86_64 TOSS 3||/usr/tce/packages/vtune/*||Multiple versions are available. Use module to load.|
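As a rough sketch of loading the tool on a TOSS 3 system: the module name vtune below is an assumption inferred from the /usr/tce/packages/vtune path, and the load_vtune helper is a hypothetical wrapper, so check the output of module avail for the names and versions actually installed.

```shell
# Hypothetical helper: list and load VTune through the module system.
# The module name "vtune" is an assumption based on the install path.
load_vtune() {
    if command -v module >/dev/null 2>&1; then
        module avail vtune      # list the installed versions
        module load vtune       # load the default version
    else
        echo "module command not found; run this on a TOSS 3 system"
        return 1
    fi
}

load_vtune || echo "VTune was not loaded"
```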
Before getting started, compile your code with -g to generate debug information that allows VTune Amplifier to correlate timing information with specific locations in your source code. Users should still run with optimizations to get an accurate representation of production run times. VTune Amplifier uses dynamic instrumentation and thus does not require use of Intel compilers or the use of any special compiler flags.
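A minimal sketch of such a build line follows. It assumes GCC and a hypothetical source file my_app.c; neither is required by VTune Amplifier, and -O2 simply stands in for whatever optimization level your production build uses.

```shell
# Hypothetical source and executable names; substitute your own.
SRC=my_app.c
EXE=my_app

# CC defaults to gcc only as an assumption; any compiler works.
CC=${CC:-gcc}

# -g adds debug information; -O2 keeps production-level optimization.
BUILD_CMD="$CC -g -O2 -o $EXE $SRC"
echo "$BUILD_CMD"

# Run the build only when the source file is actually present.
if [ -f "$SRC" ]; then
    $BUILD_CMD
fi
```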
VTune Amplifier includes both a graphical user interface (GUI) and a command line (CL) interface, accessed with the amplxe-gui and amplxe-cl commands, respectively. When running the GUI, start by creating a new project and entering the executable path, arguments, environment variables, and any other options. Once the VTune Amplifier project is created, set up a new analysis. The analysis types include Basic Hotspots, Advanced Hotspots, Concurrency, and Locks and Waits.
Basic Hotspots profiles your code's execution to determine which functions consume the most time and are therefore targets for optimization. The hotspots analysis includes timing information from all threads and from sub-processes. Advanced Hotspots also profiles your application, and can additionally collect call stacks, context switches, statistical call counts, and cycles per instruction. It uses Linux perf events, which allow more detailed analysis at a higher sampling frequency and with lower overhead than Basic Hotspots. The Concurrency analysis measures how well a threaded application takes advantage of multi-core hardware and identifies functions and times during execution where available CPUs aren't fully utilized. The Locks and Waits analysis adds the ability to identify synchronization points that contribute to underutilization of CPUs. VTune Amplifier uses sampling to gather profile information and should incur only about a 5% execution-time overhead.
Once the analysis type is configured, click "Start" to run the analysis. Note that there is also a "Show Command Line" button in the bottom right-hand corner, which is a handy way to generate the CL equivalent of your project and analysis configuration. There are other, more advanced analyses, such as HPC Performance Characterization, General Exploration, and Memory Access. Please refer to the VTune documentation to learn more about those analyses.
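For the command line interface, the four analysis types described above correspond to arguments of the -collect option. The sketch below only prints the command lines. The application name is hypothetical, collect_cmd is an illustrative helper, and the analysis type names are those used by amplxe-cl releases of this era as best we know; run amplxe-cl -help collect to confirm what your installed version accepts.

```shell
# Hypothetical application name and arguments.
APP="./my_app arg1 arg2"

# Illustrative helper: print the collection command for one analysis type.
#   $1 = analysis type, $2 = result directory
collect_cmd() {
    echo "amplxe-cl -collect $1 -r $2 -- $APP"
}

# Analysis type names are assumptions; list the valid ones with:
#   amplxe-cl -help collect
for analysis in hotspots advanced-hotspots concurrency locksandwaits; do
    collect_cmd "$analysis" "r_$analysis"
done
```

Writing each analysis to its own result directory keeps the runs separate so they can be opened or compared later.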
After the analysis is complete, several options allow for viewing the collected data. A set of tabs/buttons near the top of the GUI allows you to choose various windows. The Summary window gives an overview of the analysis. The Bottom-up window displays performance data from the perspective of the bottom-level functions. The Top-Down Tree window displays inclusive and exclusive performance data from the perspective of the function call stacks during execution. Within the Bottom-up and Top-Down Tree windows, there is a call stack pane to display the stack trace for the sampled data and also a timeline pane that shows the CPU activity of the threads over time. At the bottom of these windows there are also filters that let you sort the data by the executed module, by individual thread, or by specific process. Above the window tabs/buttons you can also change the viewpoint by clicking on the "change" link next to the analysis title. Each viewpoint is a preset configuration that filters the performance data to focus on specific performance issues. The GUI contains a wealth of other features that are beyond the scope of this document; refer to the product documentation for more information.
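Much of the same data can also be printed as text from the command line with the -report option, which is convenient on systems where running the GUI is impractical. The following is a sketch only: my_result is a hypothetical result directory, report_all is an illustrative helper, and the pairing of report types with GUI windows is an approximation rather than an exact equivalence.

```shell
# Illustrative helper: print text reports for one result directory.
#   $1 = result directory from an earlier collection
report_all() {
    result=$1
    if command -v amplxe-cl >/dev/null 2>&1 && [ -d "$result" ]; then
        amplxe-cl -report summary  -r "$result"   # roughly the Summary window
        amplxe-cl -report hotspots -r "$result"   # roughly the Bottom-up window
        amplxe-cl -report top-down -r "$result"   # roughly the Top-Down Tree window
    else
        echo "amplxe-cl or $result not found; nothing to report"
    fi
}

# my_result is a hypothetical result directory name.
report_all my_result
```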
LC CHAOS Linux systems have an MPI-enabled CL that can be accessed via amplxe-cl-mpi. This command takes the same arguments as the serial amplxe-cl command and will automatically append the MPI rank of each process to the name of the results directory. An example usage with MPI would be to run
srun -n 16 amplxe-cl-mpi -r my_result -collect hotspots -- my_mpi_app arg1 arg2
to create results directories my_result.0 through my_result.15. The GUI does not provide a mechanism to run an analysis of an MPI application, nor a way to view aggregated MPI results, so each MPI task's results must be opened individually in the GUI. The 2017 version of VTune Amplifier includes a -trace-mpi option that also appends MPI ranks to results:
srun -n 16 amplxe-cl -trace-mpi -r my_result -collect hotspots -- my_mpi_app arg1 arg2
This will create a my_result.hostname directory for each host that your MPI application is run on. You can then open the .amplxe file from the amplxe-gui GUI. The results are aggregated per host, with each MPI process on that host displayed as a thread. There is currently no way to aggregate results across hosts.
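Both naming schemes (my_result.<rank> and my_result.<hostname>) leave one result directory per rank or per host, so a small loop can produce a text summary for each without opening them one at a time. This is a sketch under the assumption that the results were collected with -r my_result in the current directory; summarize_results and the .summary.txt file names are illustrative.

```shell
# Illustrative helper: write a summary report for every result directory
# that shares a prefix, e.g. my_result.0, my_result.1, ...
#   $1 = result prefix passed to -r during collection
summarize_results() {
    for dir in "$1".*; do
        [ -d "$dir" ] || continue      # skip files and unmatched globs
        echo "== $dir =="
        if command -v amplxe-cl >/dev/null 2>&1; then
            amplxe-cl -report summary -r "$dir" > "$dir.summary.txt"
        fi
    done
}

summarize_results my_result
```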
As of VTune version 2015, some, but not all, of the features that were previously unavailable without the Intel drivers can be used via Linux Perf, particularly with respect to gathering performance counters. The "Advanced Hotspots" and "General Exploration" analyses should now be runnable. However, some options in the Advanced Hotspots analysis, such as gathering call counts, loop counts, and context switches, are not obtainable without the Intel drivers installed. Similarly, memory access and bandwidth information is not obtainable with just Linux Perf. Options that are unavailable may result in an error message stating, "Cannot enable Hardware Event-based Sampling: problem with the driver (sep*/sepdrv*)." We have these capabilities enabled for a small set of nodes on various LC systems (rzoz).
VTune Amplifier provides a tutorial in /usr/tce/packages/vtune/default/documentation/en/tutorials/index.htm or on Intel's Web site. The example C++ code in the tutorial can be found in /usr/tce/packages/vtune/default/samples/en/.
Documentation and References
The VTune Amplifier documentation can be found in /usr/tce/packages/vtune/default/documentation/en/welcomepage/get_started.htm or on Intel's Web site.
For more information, visit Intel's VTune Amplifier Web page.
Some applications generate errors about read/execv access even though the executable being run does have read and execute permissions:
amplxe: Error: [Instrumentation Engine]: Pin can't be injected to the application: 232484 since it does not have read and execv access to it
amplxe: Collection failed.
amplxe: Internal Error
We have seen this happen often with applications that end up invoking other processes for which the user does not have read privileges. To work around this issue, you may need to pass the -no-follow-child option to the amplxe-cl command or uncheck the "Analyze child processes" option in the GUI.
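A sketch of the command line form of this workaround follows. The application name, arguments, and result directory are hypothetical; only the -no-follow-child option comes from the description above.

```shell
# Hypothetical application and result directory names.
# -no-follow-child stops VTune Amplifier from analyzing child processes.
CMD="amplxe-cl -collect hotspots -no-follow-child -r my_result -- ./my_app arg1 arg2"
echo "$CMD"

# Run the collection only when the tool and the application are both present.
if command -v amplxe-cl >/dev/null 2>&1 && [ -x ./my_app ]; then
    $CMD
fi
```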