Intel's VTune Amplifier is a performance profiling tool for C, C++, and Fortran code that can identify where in the code time is being spent in both serial and threaded applications. For threaded applications, it can also determine the amount of concurrency and identify bottlenecks created by synchronization primitives.
Platforms and Locations
|Platform||Location||Notes|
|x86_64 TOSS 3||/usr/tce/packages/vtune/*||Multiple versions are available. Use module to load.|
Before getting started, compile your code with -g to generate debug information that allows VTune Amplifier to correlate timing information with specific locations in your source code. Users should still run with optimizations to get an accurate representation of production run times. VTune Amplifier uses dynamic instrumentation and thus does not require use of Intel compilers or the use of any special compiler flags.
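As a minimal sketch of the build step described above, the helper below compiles a single C source file with both debug information and optimization. The function name, the file names, and the -O2 level are illustrative assumptions; substitute the compiler and optimization flags your production build actually uses.

```shell
# Hypothetical helper: compile one C source file for profiling.
# -g adds the debug information VTune uses to map samples to source lines.
# -O2 is an assumed stand-in for your production optimization level.
# Usage: build_for_profiling <source.c> <executable>
build_for_profiling() {
    src=$1
    exe=$2
    # CC can be set in the environment to choose a compiler; cc is the default.
    "${CC:-cc}" -g -O2 -o "$exe" "$src"
}
```

For example, `build_for_profiling my_app.c my_app` would produce an optimized executable that still carries debug information.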
VTune Amplifier includes both a graphical user interface (GUI) and a command line (CL) interface that can be accessed with the vtune-gui and vtune commands, respectively. When running the GUI, start by creating a new project: enter the executable path, arguments, and environment variables, and set any other options. Once the VTune Amplifier project is created, set up a new analysis.
The basic analysis types are Hotspots, Threading, and Memory Consumption. The Hotspots analysis profiles your code's execution to determine which functions consume the most time and thus are targets for optimization; it includes timing information from all threads and from sub-processes. The Threading analysis shows how well a threaded application takes advantage of multi-core hardware and identifies functions and times during execution where available CPUs aren't fully utilized. The Memory Consumption analysis tracks RAM usage over time and identifies memory objects allocated and released during the analysis run. VTune Amplifier uses sampling to gather profile information and should incur only a 5% execution-time overhead.
Once the analysis type is configured, click "Start" to run the analysis. Note that there is also a "Show Command Line" button, which is a handy way to generate the CL equivalent of your project and analysis configuration.
There are other, more advanced analyses, such as HPC Performance Characterization, Microarchitecture Exploration, and Memory Access. Please refer to the VTune documentation to learn more about those analyses. Some of these advanced analyses require VTune drivers, which are not installed on production LC systems. Please refer to CZ Confluence for information on testbed systems with the appropriate drivers.
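The basic analyses can also be launched directly from the command line. The sketch below wraps the vtune command in a small helper; the helper name, result directory names, and application name are hypothetical. The analysis names hotspots, threading, and memory-consumption are the ones recent vtune releases accept, but this may differ by version, so check `vtune -help collect` for the version you have loaded.

```shell
# Hypothetical helper: run one VTune analysis type on an application.
# Usage: collect_profile <analysis> <result_dir> <app> [app args...]
# Run "module load vtune" first so the vtune command is on your PATH.
collect_profile() {
    analysis=$1
    result_dir=$2
    shift 2
    # Everything after "--" is the application and its arguments.
    vtune -collect "$analysis" -r "$result_dir" -- "$@"
}
```

For example, `collect_profile hotspots r_hotspots ./my_app arg1 arg2` would collect a Hotspots profile into r_hotspots, and `collect_profile threading r_threading ./my_app` would do the same for the Threading analysis.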
After the analysis is complete, several options allow for viewing the collected data. A set of tabs/buttons near the top of the GUI allows you to choose various windows. The Summary window gives an overview of the analysis. The Bottom-up window displays performance data from the perspective of the bottom-level functions. The Top-Down Tree window displays inclusive and exclusive performance data from the perspective of the function call stacks during execution. Within the Bottom-up and Top-Down Tree windows, there is a call stack pane to display the stack trace for the sampled data and also a timeline pane that shows the CPU activity of the threads over time. At the bottom of these windows there are also filters that let you sort the data by the executed module, by individual thread, or by specific process. The GUI contains a wealth of other features that are beyond the scope of this document; refer to the product documentation for more information.
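Collected results can also be inspected without the GUI by asking the CL to print a text report. The helper below is a sketch with a hypothetical name. The report names summary, hotspots, and top-down are accepted by the vtune command, but how closely each one matches the GUI window of a similar name is an assumption.

```shell
# Hypothetical helper: print a text report from an existing result directory.
# Usage: text_report <report_type> <result_dir>
#   report_type: e.g. summary, hotspots, or top-down
text_report() {
    report_type=$1
    result_dir=$2
    vtune -report "$report_type" -r "$result_dir"
}
```

For example, `text_report summary my_result` prints an overview of the run, and `text_report hotspots my_result` lists the most time-consuming functions.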
LC CHAOS Linux systems have an MPI-enabled CL that can be accessed via vtune-mpi. This command takes the same arguments as the serial vtune command and automatically appends the hostname of each node to the name of the results directory. Note that the "-r <result_dir_name>" option is required. For example, running
srun -N 2 -n 16 vtune-mpi -r my_result -collect hotspots -- my_mpi_app arg1 arg2
creates the results directories my_result.hostX, ..., my_result.hostY. The GUI does not provide a mechanism to run analyses of MPI applications. The 2017 and newer versions of VTune Amplifier include a -trace-mpi option that also appends hostnames to the results directory name:
srun -N 2 -n 16 vtune -trace-mpi -r my_result -collect hotspots -- my_mpi_app arg1 arg2
This will create a my_result.<hostname> directory for each host that your MPI application is run on. You can then open the .vtune file for a given host from the vtune-gui GUI. The results are aggregated per host, with each MPI process on that host displayed as a thread. The process filter also displays the MPI rank of a given process. There is currently no way to aggregate results across hosts.
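Because results cannot be aggregated across hosts, one option is to print a text summary for each per-host directory in turn. The following is a minimal sketch: the helper name is hypothetical, and it assumes the directories follow the <prefix>.<hostname> naming described above.

```shell
# Hypothetical helper: print a summary report for every per-host result
# directory whose name starts with the given prefix (e.g. my_result.hostX).
# Usage: summarize_hosts <result_prefix>
summarize_hosts() {
    prefix=$1
    for dir in "$prefix".*; do
        # Skip anything that is not a directory, including an unmatched glob.
        [ -d "$dir" ] || continue
        echo "== $dir =="
        vtune -report summary -r "$dir"
    done
}
```

For example, `summarize_hosts my_result` run from the directory containing the results prints one summary per host, each preceded by a header line naming the directory.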
As of VTune version 2015, some, but not all, of the previously unavailable features are now available via Linux perf, particularly those that gather performance counters. Options that remain unavailable may produce an error message stating, "vtune: Error: This analysis type requires either an access to system-wide monitoring in the Linux perf subsystem or installation of the VTune Profiler drivers." We have these capabilities enabled for a small set of nodes on various LC systems (rzwiz, ipa).
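One quick way to see whether a node is likely to allow perf-based collection is to read the kernel's perf_event_paranoid setting. The helper below is a sketch with a hypothetical name. It assumes that the Linux rule (values above 0 block CPU-wide events for unprivileged users) is what determines whether VTune reports the error above; other site configuration may also matter, so treat the output as a hint only.

```shell
# Hypothetical helper: report the kernel's perf access setting.
# The optional argument overrides the file to read (useful for testing).
# Usage: check_perf_access [path]
check_perf_access() {
    file=${1:-/proc/sys/kernel/perf_event_paranoid}
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 1
    fi
    level=$(cat "$file")
    # Assumption: values of 0 or below permit system-wide monitoring
    # for unprivileged users; higher values restrict it.
    if [ "$level" -le 0 ]; then
        echo "perf_event_paranoid=$level: system-wide perf monitoring is open"
    else
        echo "perf_event_paranoid=$level: system-wide perf monitoring is restricted"
    fi
}
```

Running `check_perf_access` on a compute node before starting an advanced analysis shows which case applies there.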
Intel provides a VTune Amplifier tutorial on Intel's Tutorials Website. The example C++ code in the tutorial can be found in /usr/tce/packages/vtune/default/samples/en/C++.
Documentation and References
The VTune Amplifier documentation can be found in /usr/tce/packages/vtune/default/documentation/en/welcomepage/get_started.htm or on Intel's Amplifier Help Page.
For more information, visit Intel's VTune Amplifier Web page.
Some applications generate errors about read/execv access even though the executable being run does have read and execute privileges:
vtune: Error: [Instrumentation Engine]: Pin can't be injected to the application: 232484 since it does not have read and execv access to it
vtune: Collection failed.
vtune: Internal Error
We have seen this happen often with applications that end up invoking other processes for which the user does not have read privileges. To work around this issue, either pass the -no-follow-child option to the vtune command or uncheck the "Analyze child processes" option in the GUI.
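On the command line, the workaround amounts to adding -no-follow-child to the collection command. The sketch below shows this for a Hotspots run; the helper name, result directory, and application name are hypothetical, and the same option should apply to other analysis types.

```shell
# Hypothetical helper: collect a Hotspots profile of the parent process only,
# so VTune does not try to instrument child processes it cannot read.
# Usage: collect_without_children <result_dir> <app> [app args...]
collect_without_children() {
    result_dir=$1
    shift
    vtune -collect hotspots -no-follow-child -r "$result_dir" -- "$@"
}
```

For example, `collect_without_children my_result ./my_app arg1` profiles my_app itself while skipping any processes it launches.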