TAU (Tuning and Analysis Utilities) is a comprehensive profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. It is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. All C++ language features are supported including templates and namespaces. The instrumentation consists of calls to TAU library routines, which can be incorporated into a program in several ways:

  • Automatic instrumentation using the compiler
  • Automatic instrumentation using the Program Database Toolkit (PDT)
  • Manual instrumentation using the instrumentation API
  • At runtime using library call interception through the tau_exec command
  • Dynamically using DyninstAPI
  • At runtime in the Java virtual machine

Data Analysis and Visualization:

  • Profile data: TAU's profile visualization tool, ParaProf, provides a variety of graphical displays for profile data to help users quickly identify sources of performance bottlenecks. The text based pprof tool is also available for analyzing profile data.
  • Trace data: TAU provides the JumpShot trace visualization tool for graphical viewing of trace data. TAU also provides utilities to convert trace data into formats for viewing with Vampir, Paraver, and other performance analysis tools.

Programming models and platforms: TAU supports most commonly used parallel hardware and programming models, including Intel, Cray, IBM, Sun, Apple, SGI, HP, NEC, Fujitsu, and MS Windows platforms, GPUs/accelerators, and the MPI, OpenMP, Pthreads, OpenCL, and CUDA programming models, as well as hybrid combinations of these.

Platforms and Locations

Platform       Location                  Notes
x86_64 Linux   /usr/global/tools/tau/    Load the module: module load tau
CORAL          /usr/global/tools/tau/    Load the module: module load tau

Quick Start

TAU is a sophisticated, full-featured toolkit. Only a small subset of TAU's features is covered below, at a very basic level. Users will need to consult the TAU documentation to learn more.

1. Profiling

The easiest and quickest way to profile an application is to use the tau_exec command. It automatically instruments your executable at run time and requires no special compilation or modifications to source code. All you need to do is make sure your TAU environment is set up correctly.

1. Set up your TAU environment by loading the TAU module. Also, just to be sure, set the TAU_PROFILE environment variable to "1". Optionally, you can specify where the profile files are written (the default is the working directory).

% module load tau
% setenv TAU_PROFILE 1
% setenv PROFILEDIR  /p/lscratche/joesmith/matmultProfiles
% mkdir -p $PROFILEDIR
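
If you use a bash-family shell rather than csh/tcsh, the equivalent commands use export, for example:

$ export TAU_PROFILE=1
$ export PROFILEDIR=/p/lscratche/joesmith/matmultProfiles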

2. Run your program using the tau_exec command. For example, launching a 64 task MPI job in the pdebug partition:

% srun -n64 -ppdebug tau_exec matmult 

3. Following completion of your job, you will have a set of files named profile.#.* where # denotes the MPI rank. Viewing these files is discussed in the Output section below.

Another way to automatically instrument your application is to use the TAU Makefile scripts. This is slightly more work, but it is required for profiling some metrics, such as hardware counter (PAPI) events.

1. Set up your TAU environment by loading the TAU module. Also, just to be sure, set the TAU_PROFILE environment variable to "1". Optionally, you can specify where the profile files are written (the default is the working directory).

% module load tau
% setenv TAU_PROFILE 1
% setenv PROFILEDIR  /p/lscratche/joesmith/matmultProfiles
% mkdir -p $PROFILEDIR

2. Determine the path of the TAU libraries you've loaded. An easy way to do this is shown below:

% module show tau |& grep LIBRARY
prepend_path("LD_LIBRARY_PATH","/usr/global/tools/tau/training/tau-2.29/x86_64/lib")

3. Using the TAU library path from above, select the appropriate TAU Makefile for what you want to profile. They are named according to what they instrument. For example:

% ls /usr/global/tools/tau/training/tau-2.29/x86_64/lib/Makefile*
/usr/global/tools/tau/training/tau-2.29/x86_64/lib/Makefile.tau-icpc-ompt-v5-pdt-openmp
/usr/global/tools/tau/training/tau-2.29/x86_64/lib/Makefile.tau-icpc-papi-mpi-pthread-pdt
/usr/global/tools/tau/training/tau-2.29/x86_64/lib/Makefile.tau-icpc-papi-ompt-v5-mpi-pdt-openmp
/usr/global/tools/tau/training/tau-2.29/x86_64/lib/Makefile.tau-icpc-papi-ompt-v5-pdt-openmp

4. Set TAU_MAKEFILE to the full pathname of the Makefile you choose. This example uses the Intel compiler to profile MPI and PAPI:

% setenv TAU_MAKEFILE /usr/global/tools/tau/training/tau-2.29/x86_64/lib/Makefile.tau-icpc-papi-mpi-pthread-pdt

5. Compile your application using the appropriate TAU compiler wrapper script. These are located in the bin directory of the TAU package you loaded and should be in your path. The choices are shown in the table below. Note that if you are using makefiles, you will need to substitute these wrapper scripts accordingly (a brief sketch is shown after the compile example below).

Language     TAU Compiler Wrapper
C            tau_cc.sh
C++          tau_cxx.sh
Fortran77    tau_f77.sh
Fortran90    tau_f90.sh

For example, compiling a simple C program:

% tau_cc.sh -O3 -g -o matmult matmult.c

Note that compiler options are passed through to the underlying native compiler. Also note that TAU provides a number of its own compiler options, not discussed here. For details, see TAU Compiler Options.
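
If your application is built with a makefile, the usual approach is to point the compiler variables at the TAU wrapper scripts. A minimal sketch is shown below; the variable names are illustrative, so use whatever your makefile actually defines:

CC  = tau_cc.sh
CXX = tau_cxx.sh
F90 = tau_f90.sh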

6. Run your TAU instrumented executable as usual. For example, launching a 64 task MPI job in the pdebug partition:

% srun -n64 -ppdebug matmult

7. Following completion of your job, you will have a set of files named profile.#.* where # denotes the MPI rank. Viewing these files is discussed in the Output section below.

2. Tracing

TAU can be used to trace events during a program's execution. Unlike profiling, which aggregates the time spent in each routine, loop, etc., tracing allows you to view events as they relate to each other along a timeline. One caveat about tracing, however, is that trace files can quickly grow very large, which makes tracing difficult or impossible for long-running, many-process jobs.

As with profiling, the easiest and quickest way to trace an application is to use the tau_exec command. It automatically instruments your executable at run time and requires no special compilation or modifications to source code. All you need to do is make sure your TAU environment is set up correctly.

1. Set up your TAU environment by loading the TAU module. Also, make sure the TAU_TRACE environment variable is set to "1". If you want to specify a directory where the trace files should be written (the default is the working directory), use the TRACEDIR environment variable.

% module load tau
% setenv TAU_TRACE 1
% setenv TRACEDIR /p/lscratche/joesmith/matmultTracefiles
% mkdir -p $TRACEDIR

Note: If you want to include TAU profiling at the same time as tracing, set the TAU_PROFILE environment variable to "1". By default, it is turned off when tracing.
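
For example, to collect both profiles and traces in the same run:

% setenv TAU_TRACE 1
% setenv TAU_PROFILE 1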

2. Run your program using the tau_exec command. For example, launching a 64 task MPI job in the pdebug partition:

% srun -n64 -ppdebug tau_exec matmult 

3. Following completion of your job, you will have two sets of files named tautrace.#.*.trc and events.#.edf, where # denotes the MPI rank. Viewing these files is discussed in the Output section below.

As with profiling, you can automatically instrument your application for tracing by using the TAU makefile scripts, as an alternative to using the tau_exec command.

1. First, set up your TAU environment by loading the TAU module. Also, make sure the TAU_TRACE environment variable is set to "1". If you want to specify a directory where the trace files should be written (the default is the working directory), use the TRACEDIR environment variable.

% module load tau
% setenv TAU_TRACE 1
% setenv TRACEDIR /p/lscratche/joesmith/matmultTracefiles
% mkdir -p $TRACEDIR

2. Then, follow steps 2 through 6 above under Profiling.

3. Following completion of your job, you will have two sets of files named tautrace.#.*.trc and events.#.edf, where # denotes the MPI rank. Viewing these files is discussed in the Output section below.

3. PAPI Hardware Counters

TAU can be used to record hardware events through PAPI hardware counters. This is actually a type of profiling, so the instructions are very similar to those for Profiling above.

1. Follow steps 1 through 5 under Profiling using TAU Makefiles to build an instrumented executable.

2. Determine which PAPI events are available on the platform you are using with the papi_avail command:

% papi_avail
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version             : 5.2.0.0
Vendor string and code   : GenuineIntel (1)
Model string and code    : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz (79)
CPU Revision             : 1.000000
CPUID Info               : Family: 6  Model: 79  Stepping: 1
CPU Max Megahertz        : 2101
CPU Min Megahertz        : 1200
Hdw Threads per core     : 1
Cores per Socket         : 18
Sockets                  : 4
NUMA Nodes               : 2
CPUs per Node            : 36
Total CPUs               : 72
Running in a VM          : no
Number Hardware Counters : 11
Max Multiplex Counters   : 64
--------------------------------------------------------------------------------

    Name        Code    Avail Deriv Description (Note)
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes   No   Level 2 instruction cache misses
PAPI_L3_DCM  0x80000004  No    No   Level 3 data cache misses
PAPI_L3_ICM  0x80000005  No    No   Level 3 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes   Yes  Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes   No   Level 2 cache misses
PAPI_L3_TCM  0x80000008  Yes   No   Level 3 cache misses
PAPI_CA_SNP  0x80000009  Yes   No   Requests for a snoop
PAPI_CA_SHR  0x8000000a  Yes   No   Requests for exclusive access to shared cache line
PAPI_CA_CLN  0x8000000b  Yes   No   Requests for exclusive access to clean cache line
PAPI_CA_INV  0x8000000c  Yes   No   Requests for cache line invalidation
PAPI_CA_ITV  0x8000000d  Yes   No   Requests for cache line intervention
PAPI_L3_LDM  0x8000000e  Yes   No   Level 3 load misses
PAPI_L3_STM  0x8000000f  No    No   Level 3 store misses
PAPI_BRU_IDL 0x80000010  No    No   Cycles branch units are idle
PAPI_FXU_IDL 0x80000011  No    No   Cycles integer units are idle
PAPI_FPU_IDL 0x80000012  No    No   Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013  No    No   Cycles load/store units are idle
PAPI_TLB_DM  0x80000014  Yes   Yes  Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  Yes   No   Instruction translation lookaside buffer misses
PAPI_TLB_TL  0x80000016  No    No   Total translation lookaside buffer misses
PAPI_L1_LDM  0x80000017  Yes   No   Level 1 load misses
PAPI_L1_STM  0x80000018  Yes   No   Level 1 store misses
PAPI_L2_LDM  0x80000019  Yes   No   Level 2 load misses
PAPI_L2_STM  0x8000001a  Yes   No   Level 2 store misses
PAPI_BTAC_M  0x8000001b  No    No   Branch target address cache misses
PAPI_PRF_DM  0x8000001c  Yes   No   Data prefetch cache misses
PAPI_L3_DCH  0x8000001d  No    No   Level 3 data cache hits
PAPI_TLB_SD  0x8000001e  No    No   Translation lookaside buffer shootdowns
PAPI_CSR_FAL 0x8000001f  No    No   Failed store conditional instructions
PAPI_CSR_SUC 0x80000020  No    No   Successful store conditional instructions
PAPI_CSR_TOT 0x80000021  No    No   Total store conditional instructions
PAPI_MEM_SCY 0x80000022  No    No   Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY 0x80000023  No    No   Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY 0x80000024  Yes   No   Cycles Stalled Waiting for memory writes
PAPI_STL_ICY 0x80000025  Yes   No   Cycles with no instruction issue
PAPI_FUL_ICY 0x80000026  Yes   Yes  Cycles with maximum instruction issue
PAPI_STL_CCY 0x80000027  Yes   No   Cycles with no instructions completed
PAPI_FUL_CCY 0x80000028  Yes   No   Cycles with maximum instructions completed
PAPI_HW_INT  0x80000029  No    No   Hardware interrupts
PAPI_BR_UCN  0x8000002a  Yes   Yes  Unconditional branch instructions
PAPI_BR_CN   0x8000002b  Yes   No   Conditional branch instructions
PAPI_BR_TKN  0x8000002c  Yes   Yes  Conditional branch instructions taken
PAPI_BR_NTK  0x8000002d  Yes   No   Conditional branch instructions not taken
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_BR_PRC  0x8000002f  Yes   Yes  Conditional branch instructions correctly predicted
PAPI_FMA_INS 0x80000030  No    No   FMA instructions completed
PAPI_TOT_IIS 0x80000031  No    No   Instructions issued
PAPI_TOT_INS 0x80000032  Yes   No   Instructions completed
PAPI_INT_INS 0x80000033  No    No   Integer instructions
PAPI_FP_INS  0x80000034  No    No   Floating point instructions
PAPI_LD_INS  0x80000035  Yes   No   Load instructions
PAPI_SR_INS  0x80000036  Yes   No   Store instructions
PAPI_BR_INS  0x80000037  Yes   No   Branch instructions
PAPI_VEC_INS 0x80000038  No    No   Vector/SIMD instructions (could include integer)
PAPI_RES_STL 0x80000039  Yes   No   Cycles stalled on any resource
PAPI_FP_STAL 0x8000003a  No    No   Cycles the FP unit(s) are stalled
PAPI_TOT_CYC 0x8000003b  Yes   No   Total cycles
PAPI_LST_INS 0x8000003c  Yes   Yes  Load/store instructions completed
PAPI_SYC_INS 0x8000003d  No    No   Synchronization instructions completed
PAPI_L1_DCH  0x8000003e  No    No   Level 1 data cache hits
PAPI_L2_DCH  0x8000003f  No    No   Level 2 data cache hits
PAPI_L1_DCA  0x80000040  No    No   Level 1 data cache accesses
PAPI_L2_DCA  0x80000041  Yes   No   Level 2 data cache accesses
PAPI_L3_DCA  0x80000042  Yes   Yes  Level 3 data cache accesses
PAPI_L1_DCR  0x80000043  No    No   Level 1 data cache reads
PAPI_L2_DCR  0x80000044  Yes   No   Level 2 data cache reads
PAPI_L3_DCR  0x80000045  Yes   No   Level 3 data cache reads
PAPI_L1_DCW  0x80000046  No    No   Level 1 data cache writes
PAPI_L2_DCW  0x80000047  Yes   No   Level 2 data cache writes
PAPI_L3_DCW  0x80000048  Yes   No   Level 3 data cache writes
PAPI_L1_ICH  0x80000049  No    No   Level 1 instruction cache hits
PAPI_L2_ICH  0x8000004a  Yes   No   Level 2 instruction cache hits
PAPI_L3_ICH  0x8000004b  No    No   Level 3 instruction cache hits
PAPI_L1_ICA  0x8000004c  No    No   Level 1 instruction cache accesses
PAPI_L2_ICA  0x8000004d  Yes   No   Level 2 instruction cache accesses
PAPI_L3_ICA  0x8000004e  Yes   No   Level 3 instruction cache accesses
PAPI_L1_ICR  0x8000004f  No    No   Level 1 instruction cache reads
PAPI_L2_ICR  0x80000050  Yes   No   Level 2 instruction cache reads
PAPI_L3_ICR  0x80000051  Yes   No   Level 3 instruction cache reads
PAPI_L1_ICW  0x80000052  No    No   Level 1 instruction cache writes
PAPI_L2_ICW  0x80000053  No    No   Level 2 instruction cache writes
PAPI_L3_ICW  0x80000054  No    No   Level 3 instruction cache writes
PAPI_L1_TCH  0x80000055  No    No   Level 1 total cache hits
PAPI_L2_TCH  0x80000056  No    No   Level 2 total cache hits
PAPI_L3_TCH  0x80000057  No    No   Level 3 total cache hits
PAPI_L1_TCA  0x80000058  No    No   Level 1 total cache accesses
PAPI_L2_TCA  0x80000059  Yes   Yes  Level 2 total cache accesses
PAPI_L3_TCA  0x8000005a  Yes   No   Level 3 total cache accesses
PAPI_L1_TCR  0x8000005b  No    No   Level 1 total cache reads
PAPI_L2_TCR  0x8000005c  Yes   Yes  Level 2 total cache reads
PAPI_L3_TCR  0x8000005d  Yes   Yes  Level 3 total cache reads
PAPI_L1_TCW  0x8000005e  No    No   Level 1 total cache writes
PAPI_L2_TCW  0x8000005f  Yes   No   Level 2 total cache writes
PAPI_L3_TCW  0x80000060  Yes   No   Level 3 total cache writes
PAPI_FML_INS 0x80000061  No    No   Floating point multiply instructions
PAPI_FAD_INS 0x80000062  No    No   Floating point add instructions
PAPI_FDV_INS 0x80000063  No    No   Floating point divide instructions
PAPI_FSQ_INS 0x80000064  No    No   Floating point square root instructions
PAPI_FNV_INS 0x80000065  No    No   Floating point inverse instructions
PAPI_FP_OPS  0x80000066  No    No   Floating point operations
PAPI_SP_OPS  0x80000067  Yes   Yes  Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS  0x80000068  Yes   Yes  Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP  0x80000069  Yes   Yes  Single precision vector/SIMD instructions
PAPI_VEC_DP  0x8000006a  Yes   Yes  Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b  Yes   No   Reference clock cycles
-------------------------------------------------------------------------
Of 108 possible events, 60 are available, of which 16 are derived.

avail.c                                     PASSED

3. There are many counters available, but in practice you can usually only use a few at a time, because not all events can be counted together. Decide which events you want to count, and then use the papi_event_chooser command to find out whether they are compatible. The example below shows an incompatibility.

% papi_event_chooser PAPI_LD_INS PAPI_SR_INS PAPI_L1_DCM PAPI_L1_ICH
Event Chooser: Available events which can be added with given events.
--------------------------------------------------------------------------------
PAPI Version             : 5.2.0.0
Vendor string and code   : GenuineIntel (1)
Model string and code    : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz (79)
CPU Revision             : 1.000000
CPUID Info               : Family: 6  Model: 79  Stepping: 1
CPU Max Megahertz        : 2101
CPU Min Megahertz        : 1200
Hdw Threads per core     : 1
Cores per Socket         : 18
Sockets                  : 4
NUMA Nodes               : 2
CPUs per Node            : 36
Total CPUs               : 72
Running in a VM          : no
Number Hardware Counters : 11
Max Multiplex Counters   : 64
--------------------------------------------------------------------------------

Event PAPI_L1_ICH can't be counted with others -7
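
When an incompatibility like this is reported, a reasonable next step is to drop the offending event and re-check the remaining set, for example:

% papi_event_chooser PAPI_LD_INS PAPI_SR_INS PAPI_L1_DCM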

4. Set the COUNTER environment variables for the compatible events of interest. COUNTER1 should be set to an available timer, such as GET_TIME_OF_DAY. For example:

% setenv COUNTER1 GET_TIME_OF_DAY
% setenv COUNTER2 PAPI_L1_DCM
% setenv COUNTER3 PAPI_L1_ICM
% setenv COUNTER4 PAPI_L2_DCM
% setenv COUNTER5 PAPI_L2_ICM 

5. Run your application as usual. Following execution, you will have a unique directory for each PAPI event. Inside each directory, there will be a set of files named profile.#.* where # denotes the MPI rank. Viewing these files is discussed in the Output section below.

% ls
MULTI__GET_TIME_OF_DAY  MULTI__PAPI_L2_DCM  matmult.f90 
MULTI__PAPI_L1_DCM      MULTI__PAPI_L2_ICM  matmult.o
MULTI__PAPI_L1_ICM      Makefile            matmult       
% ls MULTI__PAPI_L1_DCM
profile.0.0.0  profile.2.0.0  profile.4.0.0  profile.6.0.0
profile.1.0.0  profile.3.0.0  profile.5.0.0  profile.7.0.0 

Note: you can perform tracing at the same time as recording PAPI events by setting the TAU_TRACE environment variable to "1". You cannot, however, perform normal TAU profiling at the same time as PAPI.

4. Selective and Manual Instrumentation

TAU provides the ability for users to customize their application's instrumentation, thereby enabling them to focus on specific areas of interest, and reduce run-time overhead associated with profiling the entire application. There are two ways to do this, as discussed below.

Selective Instrumentation

1. Create a text file that lists the routines and/or source files that should (or should not) be instrumented. The type of instrumentation can also be specified: loops, memory, I/O, etc. (A short example is shown at the end of this subsection.)

2. Build your instrumented executable, following steps 1 through 5 under Profiling with TAU Makefiles above, and be sure to do the following:

  • Use a TAU Makefile that includes -pdt in its name
  • Include the TAU compiler wrapper script option -optTauSelectFile=filename, where filename is the name of your selective instrumentation text file.

For additional details, including the required syntax for the selective instrumentation text file, see the TAU documentation at www.cs.uoregon.edu/research/tau/docs/newguide/bk01ch01s03.html
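
As an illustration, a minimal selective instrumentation file might look like the following. The routine and file names are hypothetical; consult the documentation above for the full directive syntax:

BEGIN_EXCLUDE_LIST
void print_usage(void)
END_EXCLUDE_LIST

BEGIN_FILE_INCLUDE_LIST
matmult.c
END_FILE_INCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION

Here, routines in the exclude list are skipped, only the listed source files are instrumented, and the instrument section requests loop-level instrumentation ("#" is the wildcard for all routines).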

Manual Instrumentation

The TAU Instrumentation API provides a means for users to place TAU routines in their source code to explicitly direct how TAU should instrument their application. There are over 125 routines available. For details, see the TAU documentation at www.cs.uoregon.edu/research/tau/docs/newguide/bk03rn01.html.
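
As a rough sketch, a C code might wrap a region of interest with a TAU timer using a few of the basic API macros; the routine, timer name, and variables below are purely illustrative:

#include <TAU.h>

int work(int n)
{
  int i, sum = 0;
  /* Create a timer and start/stop it around the region of interest */
  TAU_PROFILE_TIMER(t, "work loop", "", TAU_USER);
  TAU_PROFILE_START(t);
  for (i = 0; i < n; i++)
    sum += i;
  TAU_PROFILE_STOP(t);
  return sum;
}

int main(int argc, char **argv)
{
  TAU_PROFILE_INIT(argc, argv);   /* initialize TAU profiling */
  TAU_PROFILE_SET_NODE(0);        /* non-MPI codes set the node id explicitly */
  work(1000000);
  return 0;
}

A file like this would be compiled with one of the TAU compiler wrapper scripts, for example tau_cc.sh.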

Output

1. Profiling

TAU profiling output consists of a set of files named profile.X.Y.Z where:
X = MPI rank number
Y = context
Z = thread number

pprof

To get a quick, text-based summary of your job's profile data, use the TAU pprof utility. By default, it will process all of the profile.* files in the current directory and produce a report showing profile data for each rank/context/thread. An example for one MPI rank is shown below:

% module load tau
% pprof
NODE 1;CONTEXT 0;THREAD 0:
-----------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
-----------------------------------------------------------------------
100.0            5        4,489           1           1    4489358 .TAU application
99.9            5        4,484           1        4432    4484326 main
61.9        2,778        2,778         461           0       6028 multiply_matrices
32.8        1,472        1,473           1          44    1473099 MPI_Init()
  2.8          126          126        3010           0         42 MPI_Bcast()
  1.7           77           77         462           0        168 MPI_Recv()
  0.3           11           11         461           0         25 MPI_Send()
  0.3           11           11           1          48      11425 MPI_Finalize()
  0.0        0.171        0.171           5           0         34 MPI_Allgather()
  0.0        0.067         0.07           5          15         14 MPI_Comm_split()
  0.0        0.053        0.058          10          21          6 MPI_Comm_create()
  0.0        0.052        0.052          10           0          5 MPI_Allreduce()
  0.0        0.016        0.016          52           0          0 MPI_Errhandler_set()
  0.0         0.01         0.01          10           0          1 MPI_Group_incl()
  0.0        0.009        0.009          12           0          1 MPI_Comm_free()
  0.0        0.008        0.008           6           0          1 MPI_Type_contiguous()
  0.0        0.006        0.006          15           0          0 MPI_Group_free()
  0.0        0.004        0.004           4           0          1 MPI_Attr_put()
  0.0        0.004        0.004           5           0          1 MPI_Type_struct()
  0.0        0.003        0.003          13           0          0 MPI_Comm_rank()
  0.0        0.003        0.003          11           0          0 MPI_Type_commit()
  0.0        0.001        0.001           5           0          0 MPI_Comm_group()
  0.0            0            0           1           0          0 MPI_Comm_size()
-----------------------------------------------------------------------

USER EVENTS Profile :NODE 1, CONTEXT 0, THREAD 0
-----------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
-----------------------------------------------------------------------
         5          4          4          4          0  Message size for all-gather
        10          4          4          4          0  Message size for all-reduce
      3010    2.4E+04          4  2.392E+04       1381  Message size for broadcast
       462    2.4E+04          8  2.395E+04       1115  Message size received from all nodes
       461    2.4E+04    2.4E+04    2.4E+04          0  Message size sent to all nodes
-----------------------------------------------------------------------

ParaProf

ParaProf is TAU's graphical profile analysis utility. To use ParaProf:

1. Go to the directory containing the profile.* files.

2. If you have not already done so, load the TAU environment and then issue the paraprof command:

% module load tau
% paraprof

3. A manager window and profile summary window will appear. Left/right clicking on items in the summary window allows you to view different types of information, as do the various menu selections.

ParaProf includes many features for diving more deeply into your application's behavior; see the TAU documentation for details. Representative screenshots include:

  • Manager and Profile Summary windows
  • Thread Bar Chart
  • Function Bar Chart
  • 3D Visualization
  • Profile Summary for the PAPI_L2_DCM event
  • Function Bar Chart for the PAPI_L2_DCM event

2. Tracing

TAU tracing output consists of two sets of files named tautrace.X.Y.Z and events.X.edf where:
X = MPI rank number
Y = context
Z = thread number

Before TAU trace files can be viewed, they must first be merged (for parallel jobs) and then converted to a format suitable for the chosen trace viewer. Two trace viewers are covered here.

Vampir/VampirServer

The Vampir trace visualizer provides a variety of means of examining OTF trace data, as generated through VampirTrace, OpenSpeedShop, or TAU. VampirServer is a client/server version of Vampir that can quickly extract and analyze data from large trace files by using a parallel backend. See the Vampir documentation for details.

1. If you have not already done so, load the TAU and Vampir environments:

% module load tau
% module load vampir

2. Go to the directory containing the TAU tracefiles and issue the tau_treemerge.pl command. This will merge all tautrace.* and events.* files into a single tau.trc file and a single tau.edf file.

% tau_treemerge.pl
/usr/global/tools/tau/training/tau-2.23.1/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.1.edf
events.2.edf events.3.edf events.4.edf events.5.edf events.6.edf events.7.edf tautrace.0.0.0.trc
tautrace.1.0.0.trc tautrace.2.0.0.trc tautrace.3.0.0.trc tautrace.4.0.0.trc tautrace.5.0.0.trc
tautrace.6.0.0.trc tautrace.7.0.0.trc tau.trc
tautrace.0.0.0.trc: 51511 records read.
tautrace.1.0.0.trc: 21897 records read.
tautrace.2.0.0.trc: 21897 records read.
tautrace.3.0.0.trc: 21897 records read.
tautrace.4.0.0.trc: 21099 records read.
tautrace.5.0.0.trc: 21113 records read.
tautrace.6.0.0.trc: 21099 records read.
tautrace.7.0.0.trc: 21099 records read. 

3. Convert the two merged TAU trace files into the Vampir OTF format using the tau2otf utility:

% tau2otf tau.trc tau.edf matmult.otf 

4. Launch Vampir using the name of your otf file: 

% vampir matmult.otf

Alternatively, to generate OTF2 trace files directly, which can also be visualized with Vampir, use:

% module load vampir 
% setenv TAU_TRACE 1 
% setenv TAU_TRACE_FORMAT otf2 
% srun -n 4 tau_exec ./matmult 
% vampir matmult.otf2

5. The Vampir main window will appear, allowing you to examine your application's trace events. Representative screenshots show the main window unzoomed and zoomed in with an event selected.

Jumpshot

The Jumpshot trace viewer from Argonne National Laboratory is included with the TAU installation.

1. If you have not already done so, load the TAU environment:

% module load tau

2. Go to the directory containing the TAU tracefiles and issue the tau_treemerge.pl command. This will merge all tautrace.* and events.* files into a single tau.trc file and a single tau.edf file.

% tau_treemerge.pl
/usr/global/tools/tau/training/tau-2.23.1/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.1.edf
events.2.edf events.3.edf events.4.edf events.5.edf events.6.edf events.7.edf tautrace.0.0.0.trc
tautrace.1.0.0.trc tautrace.2.0.0.trc tautrace.3.0.0.trc tautrace.4.0.0.trc tautrace.5.0.0.trc
tautrace.6.0.0.trc tautrace.7.0.0.trc tau.trc
tautrace.0.0.0.trc: 51511 records read.
tautrace.1.0.0.trc: 21897 records read.
tautrace.2.0.0.trc: 21897 records read.
tautrace.3.0.0.trc: 21897 records read.
tautrace.4.0.0.trc: 21099 records read.
tautrace.5.0.0.trc: 21113 records read.
tautrace.6.0.0.trc: 21099 records read.
tautrace.7.0.0.trc: 21099 records read. 

3. Convert the two merged TAU trace files into the Jumpshot slog2 format using the tau2slog2 utility:

% tau2slog2 tau.trc tau.edf -o matmult.slog2

4. Launch Jumpshot using the name of your slog2 file:

% jumpshot matmult.slog2

5. The Jumpshot main window will appear, allowing you to examine your application's trace events. Representative screenshots show the main window unzoomed and zoomed in with an event selected.

Compiling and Linking

As discussed in the Quick Start section above, TAU can be used to instrument applications without any special compilation or linking. However, using some TAU features does require compiling and linking with TAU components. For the most part, this is accomplished by using the TAU Makefiles and compiler wrappers as described in steps 1 through 5 under Profiling with TAU Makefiles. Additional information can be found in the TAU documentation.

Run-time Options

TAU has over 30 environment variables that can be used to control run-time behaviors such as:

  • Where to write profile/trace files
  • Which events/metrics to profile
  • Depth of call path to profile
  • Throttling
  • Verbosity/feedback
  • Sampling parameters
  • And more...

These are described in the TAU documentation at: https://www.cs.uoregon.edu/research/tau/docs/newguide/bk03apa.html.
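
For example, a few commonly used settings, shown here in csh syntax with illustrative values: TAU_CALLPATH enables call path profiling, TAU_CALLPATH_DEPTH limits the recorded call path depth, TAU_THROTTLE disables instrumentation of very frequently called, short-running routines, and TAU_VERBOSE turns on diagnostic output.

% setenv TAU_CALLPATH 1
% setenv TAU_CALLPATH_DEPTH 10
% setenv TAU_THROTTLE 1
% setenv TAU_VERBOSE 1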

Known Issues

On CORAL systems (lassen, rzansel, sierra), if you get the following error:

$ lrun -n 4 tau_exec a.out
a.out: error while loading shared libraries: libmpi.so.12: cannot open shared object file: No such file or directory
...

You may need to add "-T mpi,SPECTRUM" to your tau_exec command line:

$ lrun -n 4 tau_exec -T mpi,SPECTRUM a.out
rank 001 I am a worker: lassen405 (rank=1/4)
rank 000 I am the master: lassen405
rank 003 I am a worker: lassen405 (rank=3/4)
rank 002 I am a worker: lassen405 (rank=2/4)

If you are getting a segmentation fault with the -ebs option, then you will need to set your TAU_PROFILE_FORMAT environment variable to "merged" (export in bash or setenv in csh):

$ export TAU_PROFILE_FORMAT=merged
$ setenv TAU_PROFILE_FORMAT merged

Troubleshooting

  • TAU is a complex toolkit, and as such, troubleshooting problems may be difficult for the average user.
  • The most common problem is forgetting to load the TAU environment using the module load tau command.
  • Most problems, if not easily resolved, should be reported to the LC Hotline.
  • These may be referred to the TAU development team under LC's support contract.

Documentation and References

The most important TAU links are listed below. Searching the web will find additional TAU documentation and presentations hosted by third parties.