Open|SpeedShop is a comprehensive, open source performance analysis tool framework that integrates the most common performance analysis steps all in one tool. Open|SpeedShop supports:
- Program Counter Sampling
- Callstack Analysis
- Hardware Performance Counters
- MPI Profiling and Tracing
- I/O Profiling and Tracing
- Floating Point Exception Analysis
- Memory Function Tracing
- Pthreads Function Tracing
- NVIDIA CUDA Event Tracing
All analysis is performed on unmodified binaries, and Open|SpeedShop can be used on serial, MPI, and threaded codes. Open|SpeedShop performance data collection is built around "experiments". Each experiment collects a specific type of performance data.
Three user interface options are provided: Graphical user interface, command line, and a Python scripting API.
The Component Based Tool Framework (CBTF) is a new, experimental implementation of Open|SpeedShop that improves tool scalability and adds new features. It is a younger product still under development, but should be stable on LLNL machines.
Platforms and Locations
|Platform||Location||Notes|
|x86_64 Linux||/usr/global/tools/openspeedshop/||Multiple versions are available. Load with Dotkit.|
|BG/Q||/usr/global/tools/openspeedshop/||Multiple versions are available. Load with Dotkit.|
- Determine which Dotkit version of Open|SpeedShop you want to load, and then load that Dotkit package. Note that at LC, the CBTF version has been made available. For example:
% use -l openss
performance/profile ----------
openss-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
openss21        - Open|Speedshop (Version 2.1)
openss          - Open|Speedshop (Version 2.1 CBTF for x8664)
% use -l cbtf
performance/profile ----------
cbtf-mvapich2   - Open|Speedshop (Version 2.1 CBTF for x8664)
cbtf            - Open|Speedshop (Version 2.1 CBTF for x8664)
openss-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
openss          - Open|Speedshop (Version 2.1 CBTF for x8664)
% use openss
Prepending: openss (ok)
- Determine which experiment you want to run, based upon the type(s) of performance data you are interested in collecting. The available experiments are shown in the table below.
|Experiment||Description|
|pcsamp||Periodic sampling of the program counter gives a low-overhead view of where the time is being spent in the user application.|
|usertime||Periodic sampling of the call path allows the user to view inclusive and exclusive time spent in application routines. It also allows the user to see which routines called which routines. Several views are available, including the "hot" path.|
|hwc||Hardware events (including clock cycles, graduated instructions, instruction and data cache and TLB misses, and floating-point operations) are counted at the machine instruction, source line, and function levels.|
|hwcsamp||Similar to hwc, except that sampling is based on time, not PAPI event overflows. Up to six events may be sampled during the same experiment.|
|hwctime||Similar to hwc, except that call path sampling is also included.|
|io||Accumulated wall-clock durations of input/output (I/O) system calls: read, readv, write, writev, open, close, dup, pipe, creat, and others. Shows call paths for each unique I/O call path.|
|iop*||Lightweight I/O profiling: accumulated wall-clock durations of I/O system calls (read, readv, write, writev, open, close, dup, pipe, creat, and others), but individual call information is not recorded.|
|iot||Similar to io, except that more information is gathered, such as bytes moved, file names, etc.|
|mpi||Captures the time spent in, and the number of times, each MPI function is called. Shows call paths for each unique MPI call path.|
|mpip*||Lightweight MPI profiling: captures the time spent in, and the number of times, each MPI function is called. Shows call paths for each unique MPI call path, but individual call information is not recorded.|
|mpit||Records each MPI function call event with specific data for display using a GUI or a command line interface (CLI). The trace format option displays the data for each call, showing its start and end times.|
|fpe||Finds where each floating-point exception occurred. A trace collects each exception with its exception type and the call stack contents. These measurements are exact, not statistical.|
|mpiotf||Writes an MPI call trace to Open Trace Format (OTF) files to allow viewing with Vampir or converting to the formats of other tools.|
|mem*||Captures the time spent in, and the number of times, each memory function is called. Shows call paths for each memory function's unique call path.|
|pthreads*||Captures the time spent in, and the number of times, each POSIX thread function is called. Shows call paths for each POSIX thread function's unique call path.|
|cuda*||Captures the NVIDIA CUDA events that occur during the application's execution and reports the time spent in each event, along with the arguments for each event, in an event-by-event trace.|
* Only available in Open|SpeedShop using CBTF collection mechanism (currently under development)
- Each experiment maps to an Open|SpeedShop command: the experiment name prefixed with "oss". For example, to run the "fpe" experiment, you would use the "ossfpe" command.
- Most experiments have options. To get additional information on a command, simply enter the command name with no arguments. You can also enter the command followed by "help" or "--help".
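The naming convention can be sketched as below. This uses the "fpe" experiment from the example above; any experiment name from the table works the same way, and the commands are only echoed so the sketch runs without Open|SpeedShop loaded:

```shell
# Sketch of the experiment-to-command naming convention described above:
# the command is simply the experiment name with "oss" prefixed.
experiment=fpe
cmd="oss${experiment}"
echo "${cmd}"          # the command that runs the fpe experiment
echo "${cmd} --help"   # one way to ask a command for usage information
```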
- Run the experiment of choice on your application, providing any required arguments and any Open|SpeedShop options. The general syntax for MPI codes at LC and examples are shown below.
osscmd "srun [srun options] usercmd [usercmd options]" [osscmd options]
Example: collect PAPI hardware counter events for matmult executable running with 4 MPI tasks. Uses the default PAPI event PAPI_TOT_CYC.
% osshwc "srun -n4 matmult"
Example: collect multiple PAPI hardware counter events plus program counter sampling for matmult executable running with 4 MPI tasks. Specifies L1, L2 and L3 cache misses in addition to CPU time provided by PC sampling.
% osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM
- Open|SpeedShop output consists of text written to stdout and a database file. These are discussed in the Output section below.
As your program executes, Open|SpeedShop writes status/diagnostic information to stdout. Upon program completion, a report is also written to stdout. The format and content of the report depend upon the Open|SpeedShop experiment that was run.
Open|SpeedShop will also create a database file in the working directory, named after the executable and the experiment (for example, matmult-hwcsamp.openss).
The database file is used for analysis with the Open|SpeedShop GUI.
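The naming pattern can be sketched as follows. The pattern is inferred from the matmult example in this document, so treat it as an assumption rather than a guaranteed format:

```shell
# Hypothetical sketch of the database file naming pattern, inferred from
# the matmult/hwcsamp example in this document: <executable>-<experiment>.openss
app=matmult
experiment=hwcsamp
dbfile="${app}-${experiment}.openss"
echo "${dbfile}"
```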
Examples of both types of output are shown below.
% osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM
[openss]: hwcsamp experiment using input papi event: "PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM".
[openss]: hwcsamp experiment using the hwc experiment default sampling_rate: "100".
[openss]: hwcsamp experiment calling openss.
[openss]: Setting up offline raw data directory in /p/lscratche/blaise/offline-oss
[openss]: Running offline hwcsamp experiment using the command:
"srun -n4 /collab/usr/global/tools/openspeedshop/oss-dev/x8664/oss_offline_v2.1u4/bin/ossrun -c hwcsamp ./matmult"

[Output from matmult executable removed for clarity]

[openss]: Converting raw data from /p/lscratche/blaise/offline-oss into temp file X.0.openss
Processing raw data for matmult ...
Processing processes and threads ...
Processing performance data ...
Processing symbols ...
Resolving symbols for /g/g0/blaise/matmult/matmult
Resolving symbols for /usr/local/tools/mvapich-gnu-1.2/lib/shared/libmpich.so.1.0
Resolving symbols for /lib64/libc-2.12.so
Resolving symbols for /lib64/libpthread-2.12.so
Resolving symbols for /usr/lib64/libpsm_infinipath.so.1.14
Resolving symbols for /usr/lib64/libinfinipath.so.4.0
Updating database with symbols ...
Finished ...
[openss]: Restoring and displaying default view for: /g/g0/blaise/tau/workshop/matmult/matmult-hwcsamp.openss
[openss]: The restored experiment identifier is: -x 1

Exclusive CPU time   % of CPU   papi_l1_tcm  papi_l2_tcm  papi_l3_tcm  Function (defining location)
in seconds.          Time
29.130000  75.978091   6737152983  1648401202  554678152  multiply_matrices_ (matmult: matmult.f90,25)
 5.770000  15.049557     58094130    25937153    4460313  ips_ptl_poll (libpsm_infinipath.so.1.14)
 1.260000   3.286385      6211521     4328113     572851  __psmi_poll_internal (libpsm_infinipath.so.1.14)
 1.180000   3.077726      4660014     3480615     262262  psm_mq_wait (libpsm_infinipath.so.1.14)
 0.440000   1.147626      4010322     1844752     282742  __GI___sched_yield (libc-2.12.so: syscall-template.S,82)
 0.270000   0.704225      1280815      958226      79788  MAIN__ (matmult: matmult.f90,39)
 0.160000   0.417319     10697294     5052727    1891272  ipath_dwordcpy (libinfinipath.so.4.0)
 0.050000   0.130412        12954        7680       1581  initialize_ (matmult: matmult.f90,4)
 0.010000   0.026082      2297350      559612     186424  MPID_PSM_RecvComplete (libmpich.so.1.0: psmrecv.c,73)
 0.010000   0.026082      2317920      568133     192646  MPID_PSM_Send (libmpich.so.1.0: psmsend.c,36)
 0.010000   0.026082      1117412      515398     270530  pthread_spin_lock (libpthread-2.12.so: pthread_spin_lock.c,35)
 0.010000   0.026082       893876      490998     136812  psmi_amsh_long_reply (libpsm_infinipath.so.1.14)
 0.010000   0.026082        41437       31068       2099  ips_spio_transfer_frame (libpsm_infinipath.so.1.14)
 0.010000   0.026082      2315250      569334     193507  ips_proto_flow_enqueue (libpsm_infinipath.so.1.14)
 0.010000   0.026082      2297885      559894     186956  ips_proto_process_packet_inner (libpsm_infinipath.so.1.14)
 0.010000   0.026082      1114033      512588     270345  ipath_dwordcpy_safe (libinfinipath.so.4.0)
38.340000 100.000000   6834515196  1693817493  563668280  Report Summary
To view the database output, find the relevant *.openss file in your working directory, and then call the Open|SpeedShop GUI with that file. For example:
% openss matmult-hwcsamp.openss
The GUI will then appear, displaying the results of the experiment. Users should consult the Open|SpeedShop documentation for details.
Compiling and Linking
Open|SpeedShop experiments operate on executable binaries, so there are no special compilation or linking requirements. Users just need to ensure they load an appropriate Open|SpeedShop dotkit package - see step 1 under the Quick Start instructions.
All Open|SpeedShop experiment commands have options. To view the available options, simply enter the name of the experiment command by itself, or call it with the --help flag.
In addition to these, there are a number of environment variables that can be used to direct run-time behavior. Some of these are optional, and some may be required. For details, see Open|SpeedShop documentation. For convenience, key environment variables are listed in the table below, reproduced from the Open|SpeedShop User's Guide.
|Environment Variable||Description|
|OPENSS_RAWDATA_DIR||Used on cluster systems where the /tmp file system is unique on each node. Specifies the location of a shared file system path, which is required for O|SS to save the raw data files on distributed systems. Syntax: OPENSS_RAWDATA_DIR=shared file system path. Example: export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid|
|OPENSS_ENABLE_MPI_PCONTROL||Activates MPI_Pcontrol function recognition; otherwise, MPI_Pcontrol function calls are ignored by O|SS.|
|OPENSS_DATABASE_ONLY||When running the Open|SpeedShop convenience scripts, only create the database file and do NOT write the default report. Used to reduce the size of batch output files when the user is not interested in the default report.|
|OPENSS_RAWDATA_ONLY||When running the Open|SpeedShop convenience scripts, only gather the performance information into the OPENSS_RAWDATA_DIR directory; do NOT create the database file and do NOT write the default report.|
|OPENSS_DB_DIR||Specifies the path where O|SS will build the database file. On a file system without file locking enabled, the SQLite component cannot create the database file; this variable specifies a path to a file system with locking enabled. This is usually needed on Lustre file systems that don't have locking enabled. Syntax: OPENSS_DB_DIR=file system path. Example: export OPENSS_DB_DIR=/opt/filesys/userid|
|OPENSS_MPI_IMPLEMENTATION||Specifies the MPI implementation in use by the application; only needed for the mpi, mpit, and mpiotf experiments. Currently supported MPI implementations: openmpi, lampi, mpich, mpich2, mpt, lam, mvapich, mvapich2. For Cray, IBM, and Intel MPI implementations, use mpich2. In most cases, O|SS can auto-detect the MPI in use. Syntax: OPENSS_MPI_IMPLEMENTATION=MPI impl. name. Example: export OPENSS_MPI_IMPLEMENTATION=openmpi|
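Putting several of these variables together, a batch-script fragment might look like the sketch below. The paths, MPI implementation, and task count are placeholders rather than LC-specific recommendations, and the experiment command is echoed rather than executed so the fragment runs safely anywhere:

```shell
# Hedged sketch: environment variables combined with an mpi experiment run.
# All paths and values are placeholders -- adjust for your own system.
export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid   # shared file system for raw data
export OPENSS_DB_DIR=/lustre4/fsys/userid        # file system with locking for SQLite
export OPENSS_MPI_IMPLEMENTATION=mvapich2        # match your MPI stack if not auto-detected
# Echoed so this sketch is safe to run; remove the echo to launch for real:
echo ossmpi '"srun -n4 ./matmult"'
```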
- Open|SpeedShop is a complex toolkit, and as such, troubleshooting problems may be difficult for the average user.
- The most common problem is forgetting to load the Open|SpeedShop environment with the "use openss-packagename" command.
- Most problems, if not easily resolved, should be reported to the LC Hotline.
Documentation and References
The most important Open|SpeedShop links are listed below. Searching the web will find additional Open|SpeedShop documentation and presentations hosted by third parties.
- Open|SpeedShop Home Page, with links to Documentation, Tutorials, Presentations and more: http://www.openspeedshop.org
- Open|SpeedShop Documentation - direct link: http://www.openspeedshop.org/documentation/