Intel Compiler Vectorization
Modern x86 processors include vector units that operate on multiple data elements with a single instruction, known as Single Instruction, Multiple Data (SIMD) units. These are exposed through the 128-bit Streaming SIMD Extensions (SSE) and, starting with Intel's Sandy Bridge architecture, the 256-bit Advanced Vector eXtensions (AVX). An SSE instruction can perform four 32-bit (single-precision) or two 64-bit (double-precision) floating-point operations; an AVX instruction can perform eight 32-bit or four 64-bit floating-point operations. Using these vector instructions is therefore important for making efficient use of the hardware.
Most compilers generate vector instructions automatically; however, they must be conservative when analyzing source code to decide whether vectorization is safe. In particular, the compiler needs the loop iterations to be independent, that is, free of data dependences across iterations. This often requires loops to be structured in a specific manner and may require assurances that pointers do not alias in ways that create dependences.
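The two requirements above (independent iterations and no aliasing) can be illustrated with a minimal C sketch. The function names scale_add and running_sum are hypothetical and chosen only for this illustration. The restrict qualifier is standard C99; with icc it may need to be enabled with -std=c99 or -restrict.

```c
#include <stddef.h>

/* Independent iterations: element i of y depends only on element i of
 * x and y. The C99 `restrict` qualifier is the programmer's promise
 * that x and y never overlap, so the compiler does not have to assume
 * that a store to y[i] could change a later x[j]. */
void scale_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Loop-carried dependence: iteration i reads the value that iteration
 * i-1 has just written, so the iterations cannot simply be executed
 * side by side in the lanes of a vector register. */
void running_sum(size_t n, float *v)
{
    for (size_t i = 1; i < n; ++i) {
        v[i] = v[i] + v[i - 1];
    }
}
```

The first loop is the kind of candidate a vectorizer looks for. The second is a recurrence, and the compiler would be expected to leave it scalar unless the algorithm is restructured.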
The Intel compilers can be accessed with the icc, icpc, and ifort commands for C, C++, and Fortran, respectively. These commands will run the LC "default" version. MPI wrappers exist for each command and can be accessed by prefixing the compiler command with "mpi" (i.e., mpiicc, mpiicpc, and mpiifort). Additional versions can be run by appending the version number to the compiler command, such as icc-19.1.0 or mpiifort-18.0.0, or by loading version-specific modules (e.g., module load intel/19.1.0) and then invoking icc, mpiifort, etc.
Compiler Auto Vectorization
The Intel compiler has several options for vectorization. One is the -x flag, which tells the compiler to generate code for a specific vector instruction set. The -x flag takes a mandatory argument, such as AVX (e.g., -xAVX), SSE4.2, SSE4.1, SSE3, or SSE2. The processor on which the executable runs must support the instruction set specified. To determine what is supported, examine the /proc/cpuinfo file and look for avx or sse entries in the flags field. LC's Peloton and TLCC1 AMD Opteron clusters support SSE2; Intel Westmere-based clusters support SSE2, SSE3, SSE4.1, and SSE4.2; and Intel Sandy Bridge-based TLCC2 clusters support SSE2, SSE3, SSE4.1, SSE4.2, and AVX. The -xHost flag enables the highest level of vectorization supported by the processor on which the code is compiled. Note that the Intel compiler will try to vectorize code with SSE2 instructions at optimization levels of -O2 or higher; to disable this, specify -no-vec.
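The /proc/cpuinfo check described above can also be done programmatically. The following is a minimal sketch, assuming a Linux system; the helper names flags_contain and cpu_has_flag are hypothetical. Note that Linux spells some flags differently from the compiler options: SSE3 is listed as pni, and SSE4.1 and SSE4.2 as sse4_1 and sse4_2.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `name` occurs in `flags` as a whole whitespace-separated
 * token, so that "avx" does not match "avx2" and "sse3" does not match
 * "ssse3". Return 0 otherwise. */
int flags_contain(const char *flags, const char *name)
{
    size_t len = strlen(name);
    const char *p = flags;

    if (len == 0) {
        return 0;
    }
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        int ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');
        if (starts && ends) {
            return 1;
        }
        p += 1; /* keep searching after this partial match */
    }
    return 0;
}

/* Look up `name` in the first "flags" line of /proc/cpuinfo.
 * Returns 1 if present, 0 if absent, -1 if the file cannot be read.
 * Assumption: the flags line fits in the buffer below. */
int cpu_has_flag(const char *name)
{
    char line[8192];
    int found = 0;
    FILE *fp = fopen("/proc/cpuinfo", "r");

    if (fp == NULL) {
        return -1;
    }
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = flags_contain(line, name);
            break;
        }
    }
    fclose(fp);
    return found;
}
```

For example, cpu_has_flag("avx") would be expected to return 1 on a Sandy Bridge node and 0 on a Westmere node.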
The Intel compiler can generate a single executable with multiple levels of vectorization using the -ax flag, which takes the same arguments as the -x flag (AVX, ..., SSE2). This flag generates run-time checks that determine the level of vectorization supported by the processor and then choose the best execution path for that processor. It also generates a baseline execution path that is taken if none of the specified -ax levels is supported. The baseline can be set with the -x flag, with -xSSE2 recommended. Multiple -ax flags can be specified to create several code paths. For example, compile with -axAVX -axSSE4.2 -xSSE2. In this case, the baseline SSE2 execution path is taken on an AMD Opteron processor, the SSE4.2 path on an Intel Westmere processor, and the AVX path on an Intel Sandy Bridge processor.
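The selection behavior described for -axAVX -axSSE4.2 -xSSE2 can be modeled in a few lines of C. This is only a conceptual sketch of the decision, not the dispatch code the compiler actually emits; the enum and function names are hypothetical.

```c
/* Hypothetical feature bits describing what a processor supports. */
enum cpu_feature {
    FEAT_SSE2   = 0x1,
    FEAT_SSE4_2 = 0x2,
    FEAT_AVX    = 0x4
};

/* The three code paths built by -axAVX -axSSE4.2 -xSSE2. */
enum code_path {
    PATH_BASELINE_SSE2,
    PATH_SSE4_2,
    PATH_AVX
};

/* Pick the most capable path the processor supports, falling back to
 * the baseline when none of the -ax levels is available. */
enum code_path select_path(unsigned supported)
{
    if (supported & FEAT_AVX) {
        return PATH_AVX;
    }
    if (supported & FEAT_SSE4_2) {
        return PATH_SSE4_2;
    }
    return PATH_BASELINE_SSE2;
}
```

In this model, a processor reporting only FEAT_SSE2 selects the baseline, which corresponds to the AMD Opteron case above.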
Another useful option for the Intel compiler is the -vec-report flag, which writes vectorization diagnostics to stdout. The flag takes an optional number between 0 and 5 (e.g., -vec-report0), with 0 disabling diagnostics and 5 providing the most detail about which loops were vectorized, which were not, and why not. The output can help identify strategies for getting a loop to vectorize.
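As an illustration of acting on such diagnostics, consider a loop that the compiler may decline to vectorize because it assumes a dependence it cannot disprove. The sketch below uses hypothetical function names. In the second version, the Intel-specific ivdep pragma tells the compiler to ignore assumed dependences, and it is guarded so that other compilers skip it. The pragma is only valid if the caller guarantees that idx contains no repeated values, and the compiler may still judge vectorization unprofitable.

```c
#include <stddef.h>

/* The compiler cannot prove that idx[] holds no repeated values, so it
 * must assume that two iterations might update the same element of a[]. */
void scatter_add(size_t n, float *a, const int *idx, const float *b)
{
    for (size_t i = 0; i < n; ++i) {
        a[idx[i]] += b[i];
    }
}

/* Same loop, with the programmer asserting that the assumed dependence
 * does not exist. Correct only when every entry of idx[] is unique. */
void scatter_add_unique(size_t n, float *a, const int *idx, const float *b)
{
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (size_t i = 0; i < n; ++i) {
        a[idx[i]] += b[i];
    }
}
```

Compiling a file like this with a report level such as -vec-report2 and comparing the messages for the two loops is one way to see whether the assertion changed the compiler's decision.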
Refer to the man pages or the compiler documentation for more details about these options.
Guided Auto Parallelization
The Intel compiler includes a Guided Auto Parallelization (GAP) feature that can help analyze source code and generate advice on how to obtain better performance. In particular, GAP will suggest code changes or compiler options that will lead to better vectorized code. GAP may optionally allow the user to take advantage of the auto-parallelization capability that can generate multithreaded code for independent loop iterations; however, developers are encouraged to use explicit thread parallelism through mechanisms like OpenMP.
The GAP feature is enabled with the -guide option, which takes an optional =# parameter, where # is a number between 1 and 4, with 1 being the lowest level of guidance and 4 (the default) the most advanced. The compiler prints the GAP report to stderr by default. The report can instead be written to a file with the -guide-file=filename option or appended to an existing file with the -guide-file-append=filename option. The GAP analysis can be restricted to a specific file, function, or source line with the -guide-opt=specification option. Refer to the compiler man pages or documentation for details on this option. More details are also available on Intel's Web site.
Documentation for the Intel C/C++ compiler can be found in /usr/tce/packages/intel/default/documentation_2019/en/compiler_c/ps2019/get_started_lc.htm, on Intel's Web site, or by running man icc or man icpc. Documentation for the Intel Fortran compiler can be found in /usr/tce/packages/intel/default/documentation_2019/en/compiler_f/ps2019/get_started_lf.htm, on Intel's Web site, or by running man ifort. Documentation specific to the Guided Auto Parallelization feature can be found on Intel's Web site.