Livermore Computing Linux Commodity Clusters Overview Part One

Table of Contents

Part One

  1. Abstract
  2. Background of Linux Commodity Clusters at LLNL
  3. Commodity Cluster Configurations and Scalable Units
  4. LC Linux Commodity Cluster Systems
  5. Intel Xeon Hardware Overview
  6. Infiniband Interconnect Overview
  7. Software and Development Environment
  8. Compilers
  9. Exercise 1
  10. MPI
  11. Running Jobs
    1. Overview
    2. Batch Versus Interactive
    3. Starting Jobs - srun
    4. Interacting With Jobs
    5. Optimizing CPU Usage
    6. Memory Considerations
    7. Vectorization and Hyper-threading
    8. Process and Thread Binding
  12. Debugging
  13. Tools
  14. Exercise 2

Or go to Part Two

  1. GPU Clusters
    1. Available GPU Clusters
    2. Hardware Overview
    3. GPU Programming APIs
      1. CUDA
      2. OpenMP
      3. OpenACC
      4. OpenCL
    4. Compiling
      1. CUDA
      2. OpenMP
      3. OpenACC
      4. Misc. Tips & Tools
    5. References and More Information

 Abstract


This tutorial is intended to be an introduction to using LC's "Commodity" Linux clusters. It begins by providing a brief historical background of Linux clusters at LC, noting their success and adoption as a production, high performance computing platform. The primary hardware components of LC's Linux clusters are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.

After covering the hardware related topics, software topics are discussed, including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are noted. Available debuggers and performance related tools/topics are briefly discussed, however detailed usage is beyond the scope of this tutorial. A lab exercise using one of LC's Linux clusters follows the presentation.

Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Slurm and Moab

 Background of Linux Commodity Clusters at LLNL

The Linux Project

  • LLNL first began experimenting with Linux clusters in 1999-2000 in a partnership with Compaq and Quadrics to port Quadrics software to Alpha Linux.
  • The Linux Project was started for several reasons:
    • Cost: price-performance analysis demonstrated that near-commodity hardware in clusters running Linux could be more cost-effective than proprietary solutions;
    • Focus: the decreasing importance of high-performance computing (HPC) relative to commodity purchases was making it more difficult to convince proprietary systems vendors to implement HPC specific solutions;
    • Control: it was believed that by controlling the OS in-house, Livermore Computing could better support its customers;
    • Community: the platform created could be leveraged by the general HPC community.
  • The objective of this effort was to apply LC's scalable systems strategy (the "Livermore Model") to commodity hardware running the open source Linux OS:
    • Based on SMP compute nodes attached to a high-speed, low-latency interconnect.
    • Uses OpenMP to exploit SMP parallelism within a node and MPI to exploit parallelism between nodes.
    • Provides a POSIX interface parallel filesystem.
    • Application toolset: C, C++ and Fortran compilers, scalable MPI/OpenMP GUI debugger, performance analysis tools.
    • System management toolset: parallel cluster management tools, resource management, job scheduling, near-real-time accounting.

Alpha Linux Clusters

  • The first Linux cluster implemented by LC was LX, a Compaq Alpha Linux system with no high-speed interconnect.
  • The first production Alpha cluster targeted to implement the full Livermore Model was Furnace, a 64-node system built from dual-CPU Alpha EV68 nodes with a QsNet interconnect. However...
    • Compaq announced the eventual discontinuation of the Alpha server line
    • Intel Pentium 4 with favorable SPECfp performance was released just as Furnace was delivered.
  • This prompted Livermore to shift to an Intel IA32-based model for its Linux systems in July 2001.
  • Furnace's interconnect was allocated to the IA32-based PCR clusters (below) instead. It then operated as a loosely coupled cluster until it was decommissioned in 10/03.

PCR Clusters

  • August 2001: The Parallel Capacity Resource (PCR) clusters were purchased from Silicon Graphics and Linux NetworX. Consisted of:
    • Adelie: 128-node production cluster
    • Emperor: 88-node production cluster
    • Dev: 26-node development cluster
  • Each PCR compute node had two 1.7-GHz Intel Pentium 4 CPUs and a QsNet Elan3 interconnect.
  • Parallel file system was not implemented at that time - instead dedicated BlueArc NFS servers were used.
  • SCF resource only
  • The 16-node Pengra cluster was procured for the OCF in July 2002 to provide a less restrictive development environment for PCR-related work.
  • For more information see the: 2002 Linux Project Report.

MCR Cluster...and More

  • The success of the PCR clusters was followed by the purchase of the Multiprogrammatic Capability Resource (MCR) cluster in July, 2002 from Linux NetworX.
  • 1152 node cluster comprised of dual-processor, 2.4 GHz Intel Xeons
  • MCR's procurement was intended to significantly increase the resources available to Multiprogrammatic and Institutional Computing (M&IC) users.
  • MCR's configuration included the first production implementation of the Lustre parallel file system, an integral part of the "Livermore Model".
  • Debuted as #5 on the Top500 Supercomputers list in November, 2002, and then peaked at #3 in June, 2003.
  • For more information see: MCR Background
  • Convinced of the success of this path, LC implemented several other IA-32 Linux clusters simultaneously with, or after, the MCR Linux cluster:
    System   Network   Nodes   CPUs/Cores   Gflops
    ALC      OCF       960     1,920        9,216
    LILAC    SCF       768     1,536        9,186
    ACE      SCF       160     320          1,792
    SPHERE   OCF       96      192          1,075
    GVIZ     SCF       64      128          717
    ILX      OCF       67      134          678
    PVC      OCF       64      128          614

Which Led To Thunder...

In September, 2003 the RFP for LC's first IA-64 cluster was released. The proposal from California Digital Corporation, a small local company, was accepted.


Thunder

The Peloton Systems


Atlas Peloton Cluster
  • In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement with the release of the Peloton RFP.
  • Appro was awarded the contract in June, 2006.
  • Peloton clusters were built in 5.5 teraflop "scalable units" (SU) of ~144 nodes
  • All Peloton clusters used AMD dual-core Socket F Opterons:
    • 8 cpus per node
    • 2.4 GHz clock
    • Option to upgrade to 4-core Opteron "Deerhound" later (not taken)
  • The six Peloton systems represented a mix of resources: OCF, SCF, ASC, M&IC, Capability and Capacity:
    System   Network   Nodes   Cores   Teraflops
    Atlas    OCF       1,152   9,216   44.2
    Minos    SCF       864     6,912   33.2
    Rhea     SCF       576     4,608   22.1
    Zeus     OCF       288     2,304   11.1
    Yana     OCF       83      640     3.1
    Hopi     SCF       76      608     2.9
  • The last Peloton clusters were retired in June 2012.

 

And Then, TLCC and TLCC2


Juno, Eos TLCC Clusters
  • In July, 2007 the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was released.
  • The TLCC procurement represents the first time that the Department of Energy/National Nuclear Security Administration (DOE/NNSA) has awarded a single purchase contract that covers all three national defense laboratories: Los Alamos, Sandia, and Livermore. Read the announcement HERE.
  • The TLCC architecture is very similar to the Peloton architecture: Opteron multi-core processors with an Infiniband interconnect. The primary difference is that TLCC clusters are quad-core instead of dual-core.
  • TLCC clusters were/are:
    System   Network   Nodes   Cores    Teraflops
    Juno     SCF       1,152   18,432   162.2
    Hera     OCF       864     13,824   127.2
    Eos      SCF       288     4,608    40.6
  • In June, 2011 the TLCC2 procurement was announced as a follow-on to the successful TLCC systems.
  • The TLCC2 systems consist of multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband based clusters:
    System   Network   Nodes   Cores    Teraflops
    Zin      SCF       2,916   46,656   961.1
    Cab      OCF-CZ    1,296   20,736   426.0
    Rzmerl   OCF-RZ    162     2,592    53.9
    Pinot    SNSI      162     2,592    53.9
  • Additionally, LC procured other Linux clusters similar to TLCC2 systems for various purposes.

Commodity Technology Systems (CTS)

  • CTS systems are the follow-on to TLCC2 systems.
  • CTS-1 systems became available in late 2016 - early 2017. These systems are based on Intel Broadwell E5-2695 v4 processors, 36 cores per node, 128 GB node memory, with Intel Omni-Path 100 Gb/s interconnect. They include:
    System    Network   Nodes   Cores     Teraflops
    Agate     SCF       48      1,728     58.1
    Borax     OCF-CZ    48      1,728     58.1
    Jade      SCF       2,688   96,768    3,251.4
    Mica      SCF       384     13,824    530.8
    Quartz    OCF-CZ    3,072   110,592   3,715.9
    RZGenie   OCF-RZ    48      1,728     58.1
    RZTopaz   OCF-RZ    768     27,648    464.5
    RZTrona   OCF-RZ    20      720       24.2
  • CTS-2 systems are expected to start becoming available in the 2020-2021 time frame.

Zin TLCC2 Cluster

Quartz CTS-1 Cluster

Cluster Configurations and Scalable Units

Basic Components

  • Currently, LC has several types of production Linux clusters based on the following processor architectures:
    • Intel Xeon 18-core E5-2695 v4 (Broadwell)
    • Intel Xeon 8-core E5-2670 (Sandy Bridge - TLCC2) w/without NVIDIA GPUs
    • Intel Xeon 12-core E5-2695 v2 (Ivy Bridge)
  • All of LC's Linux clusters differ in their configuration details, however they do share the same basic hardware building blocks:
    • Nodes
    • Frames / racks
    • High speed interconnect (most clusters)
    • Other hardware (file systems, management hardware, etc.)

Nodes

  • The basic building block of a Linux cluster is the node. A node is essentially an independent computer. Key features:
    • Self-contained, diskless, multi-core computer.
    • Low form-factor - cluster nodes are very thin in order to save space.
    • Rack mounted - nodes are mounted compactly in a drawer fashion to facilitate maintenance and reduce footprint.
    • Remote Management - There is no keyboard, mouse, monitor or other device typically used to interact with a computer. All node management occurs over the network from a "management" node.
  • Examples (click for larger image):

Single compute node - TLCC2


Single compute node - CTS-1
 
  • In general, an LC production cluster has four types of nodes, based upon function, which can differ in configuration details:

  • Login
  • Interactive/debug
  • Batch
  • I/O and service nodes (unavailable to users)
  • Login nodes:
    • Every system has a designated number of login nodes - depends upon the size of the system. Some examples:
      • agate = 2
      • sierra = 5
      • quartz = 14
      • zin = 20
    • Login nodes are shared by multiple users
    • Primarily used for interactive work such as editing files, submitting batch jobs, compiling, running GUIs, etc.
    • Interactive use exclusively - login-only nodes do not permit batch jobs.
    • DO NOT run production jobs on login nodes! Remember, you are sharing login nodes with other users.
  • Interactive/debug (pdebug) nodes:
    • Most LC systems have nodes that are designated for interactive work.
    • Meant for testing, prototyping, debugging, and small, short jobs
    • Cannot be logged into unless you already have a job running on them
    • Nodes run one job at a time - not shared like login nodes
    • Can also be used through the batch system
  • Batch (pbatch) nodes:
    • Comprise the majority of nodes on each system
    • Meant for production work
    • Work is submitted via a batch scheduler (Slurm, Moab)
    • Cannot be logged into unless you already have a job running on them
    • Nodes run one job at a time - not shared like login nodes

Frames / Racks

  • Frames are the physical cabinets that hold most of a cluster's components:
    • Nodes of various types
    • Switch components
    • Other network components
    • Parallel file system disk resources (usually in separate racks)
  • Vary in size/appearance between the different Linux clusters at LC.
  • Power and console management - frames include hardware and software that allow system administrators to perform most tasks remotely.
  • Example images below (click for larger image):

    Frames—TLCC2

    Frames—CTS-1

Scalable Unit

  • The basic building block of LC's production Linux clusters is called a "Scalable Unit" (SU). An SU consists of:
    • Nodes (compute, login, management, gateway)
    • First stage switches that connect to each node directly
    • Miscellaneous management hardware
    • Frames sufficient to house all of the hardware
    • Additionally, second stage switch hardware is needed to connect multi-SU clusters (not shown).
  • The number of nodes in an SU depends upon the type of switch hardware being used. For example:
    • QLogic = 162 nodes
    • Intel Omni-Path = 192 nodes
  • Multiple SUs are combined to create a cluster. For example:
    • 2 SU = 324 / 384 nodes
    • 4 SU = 648 / 768 nodes
    • 8 SU = 1296 / 1536 nodes
  • The SU design is meant to:
    • Standardize configuration details across the enterprise
    • Easily "grow" clusters in incremental units
    • Leverage procurements and reduce costs across the Tri-labs
  • An example of a 2 SU cluster is shown below for illustrative purposes. Note that a frame holding the second level switch hardware is not shown.

     

LC Linux Commodity Cluster Systems

LC Linux Clusters Summary

  • The table below summarizes the key characteristics of LC's Linux commodity clusters.
  • Note that some systems are limited access and not Generally Available (GA)

     

Intel Xeon Hardware Overview

Intel Xeon Processor

  • LC has a long history of Linux clusters using Intel processors.
  • Currently, several different types of Xeons are used in LC clusters. Some representative Xeons are discussed below.

Xeon E5-2670 Processor

  • This is the Intel "Sandy Bridge EP" product.
  • Used in the Tri-lab TLCC2 clusters
  • 64-bit, x86 architecture
  • Clockspeed: 2.6 GHz (at LC)
  • 8-core (at LC)
  • Two threads per core (hyper-threading)
  • Cache:
    • L1 Data: 32 KB, private
    • L1 Instruction: 32 KB, private
    • L2: 256 KB, private
    • L3: 20 MB, shared
  • Memory bandwidth: 51.2 GB/sec
  • "Turbo Boost" Technology: automatically allows processor cores to run faster than the base operating frequency if they're operating below power, current, and temperature specification limits. For additional information see: http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html.
  • Intel AVX (Advanced Vector eXtensions) Instructions:
    • New and enhanced instruction set for 256-bit wide vector SIMD operations
    • Designed for floating point intensive applications
    • Operate on 4 double precision or 8 single precision operands.
    • Details: Introduction to Intel Advanced Vector Extensions
  • Xeon E5-2670 Spec Sheet

Intel Sandy Bridge EP (Image source: David Kanter)

Intel Sandy Bridge EP (Image source: Intel)

    Xeon E5-2695 v4 Processor

    • This is the Intel "Broadwell" product.
    • Used in the Tri-lab CTS-1 clusters
    • 64-bit, x86 architecture
    • Clockspeed: 2.1 GHz (at LC)
    • 18-core (at LC)
    • Two threads per core (hyper-threading)
    • Cache:
      • L1 Data: 32 KB, private
      • L1 Instruction: 32 KB, private
      • L2: 256 KB, private
      • L3: 45 MB, shared
    • Memory bandwidth: 76.8 GB/s
    • Turbo Boost Technology
    • Intel AVX (Advanced Vector eXtensions) instructions
    • Xeon E5-2695 v4 Spec Sheet

    Additional Information

    Infiniband Interconnect Overview

    Interconnects

    • Types of interconnects:
      • Varies by cluster; a few clusters do not have interconnects.
      • CTS-1 clusters use Intel Omni-Path switches and adapters.
      • Most other Intel Xeon clusters use 4x QDR QLogic InfiniBand switches and adapters.
    • Bandwidths:
      • 4x = 4 times the base InfiniBand link rate of 2.5 Gbits/sec, which equals 10 Gbits/sec, full duplex.
      • SDR (Single Data Rate) = 10 Gbits/sec
      • DDR (Double Data Rate) = 20 Gbits/sec
      • QDR (Quad Data Rate) = 40 Gbits/sec
      • Intel Omni-Path = 100 Gbits/sec

    Primary components

    • Adapter Card:
      • Communications processor packaged on network PCI Express adapter card.
      • Remote Direct Memory Access (RDMA) improves communication bandwidth by off-loading communications from the CPU.
      • Provides the interface between a node and a two-stage network.
      • Connected to a first stage switch by copper cable (most cases) or optic fiber.
      • Types: Intel Omni-Path, QLogic 4x QDR IB

    Omni-Path Fabric Adapter
    (Image source: Intel)
     

    QLogic IB Adapter
    (Image source: QLogic)
     
    • 1st Stage Switch:
      • Intel Omni-Path 48-port: 32 ports connect to adapters in nodes and 16 ports connect to second stage switches.
      • QLogic QDR 36-port: 18 ports connect to adapters in nodes and 18 ports connect to second stage switches.
    • 2nd Stage Switch:
      • Intel Omni-Path 768-port: all used ports connect to first stage switches via optic fiber cabling.
      • QLogic QDR 18-864 port: all used ports connect to first stage switches via optic fiber cabling.
    • Example image below (click for a larger image):

    Topology

    • Two-stage, federated, bidirectional, fat-tree.
    • The number of second stage switches depends upon the number of scalable units (SUs) that comprise the cluster and the type of switch hardware used.
    • Example:

      2688-way Interconnect | Quartz/Jade - 14 SU

    Performance:

    • The inter-node bandwidth measurements below were taken on live, heavily loaded, LC machines using a simple MPI non-blocking test code. One task on each of two nodes. Not all systems are represented. Your mileage may vary.
      System Type                                        Latency   Bandwidth
      Intel Xeon Clusters with QDR QLogic                ~1-2 us   ~4.1 GB/sec
      Intel Xeon Clusters with QDR QLogic (TLCC2)        ~1 us     ~5.0 GB/sec
      Intel Xeon Clusters with Intel Omni-Path (CTS-1)   ~1 us     ~21 GB/sec

    Software and Development Environment

    This section only provides a summary of the software and development environment for LC's Linux commodity clusters. Please see the Livermore Computing Resources and Environment tutorial for details.

    TOSS Operating System

    • All LC Linux clusters use TOSS (Tri-Laboratory Operating System Stack).
    • The primary components of TOSS include:
      • Red Hat Enterprise Linux (RHEL) distribution with modifications to support targeted HPC hardware and cluster computing
      • RHEL kernel optimized for large scale cluster computing
      • OpenFabrics Enterprise Distribution InfiniBand software stack including MVAPICH and OpenMPI libraries
      • Slurm Workload Manager
      • Integrated Lustre and Panasas parallel file system software
      • Scalable cluster administration tools
      • Cluster monitoring tools
      • C, C++ and Fortran90 compilers (GNU, Intel, PGI)
      • Testing software framework for hardware and operating system validation

    Batch Systems

    • Work on LC's Linux clusters is submitted through a batch scheduler (Slurm, Moab) - see the Running Jobs section of this tutorial for details.

    File Systems

    • Home directories:
      • Globally mounted under /g/g#
      • Backed up
      • Not purged
      • 16 GB quota in effect
      • Convenient .snapshot directory for recent backups
    • Lustre parallel file systems:
      • Mounted under /p/lustre#
      • Very large
      • Not backed up
      • Quota in effect
      • Shared by all users on a cluster or multiple clusters
      • Lustre is discussed in the Parallel File Systems section of the Introduction to Livermore Computing Resources tutorial.
      • Are usually mounted by multiple clusters.
    • /usr/workspace - 1 TB NFS file system available for each user and each group. Similar to home directories, but not backed up to tape.
    • /var/tmp, /usr/tmp, /tmp - different names for the same file system, local (non-NFS) mounted, moderate size, not backed up, purged, shared by all users on a given node.
    • Archival HPSS storage - accessed via FTP to the storage system. Virtually unlimited file space, not backed up or purged. More info: https://hpc.llnl.gov/software/archival-storage-software.
    • /usr/gapps - globally mounted project workspace that may be group and/or world readable. More info: hpc.llnl.gov/hardware/file-systems/usr-gapps-file-system.

    Modules

    • LC's Linux commodity clusters support the Lmod Modules package.
      • Provide a convenient, uniform way to select among multiple versions of software installed on LC systems.
      • Many LC software applications require that you load a particular "package" in order to use the software.
    • Using Modules:
      List available modules:     module avail
      Load a module:              module add|load modulefile
      Unload a module:            module rm|unload modulefile
      List loaded modules:        module list
      Read module help info:      module help
      Display module contents:    module display|show modulefile
      

    Dotkit

    • Dotkit is no longer used on LC production Linux clusters. It has been replaced by Lmod Modules (see above).

    Compilers, Tools, Graphics and Other Software

    The table below lists and provides links to the majority of software available through LC or related organizations.

    • Compilers - lists which compilers are available for each LC system:
      https://hpc.llnl.gov/software/development-environment-software/compilers
    • Supported Software and Computing Tools - Development Environment Group supported software includes compilers, libraries, debugging, profiling, trace generation/visualization, performance analysis tools, correctness tools, and several utilities:
      https://hpc.llnl.gov/software/development-environment-software
    • Graphics Software - Graphics Group supported software includes visualization tools, graphics libraries, and utilities for the plotting and conversion of data:
      https://hpc.llnl.gov/data-vis/vis-software
    • Mathematical Software Overview - lists and describes the primary mathematical libraries and interactive mathematical tools available on LC machines:
      https://hpc.llnl.gov/software/mathematical-software
    • LINMath - the Livermore Interactive Numerical Mathematical Software Access Utility, a Web-based access utility for math library software. The LINMath Web site also has pointers to packages available from external sources:
      https://www-lc.llnl.gov/linmath/
    • Center for Applied Scientific Computing (CASC) Software - a wide range of software available for download from LLNL's CASC. Includes mathematical software, language tools, PDE software frameworks, visualization, data analysis, program analysis, debugging, and benchmarks:
      https://computing.llnl.gov/hpc/software
      https://software.llnl.gov
    • LLNL Software Portal - Lab-wide portal of software repositories:
      https://software.llnl.gov/

      Atlassian Tools

      • LC supports a suite of web-based collaboration tools from Atlassian:
        • Confluence Wiki: used for documentation, collaboration, knowledge sharing, file sharing, mockups, diagrams... anything you can put on a webpage.
        • JIRA: issue tracking and project management system
        • Bitbucket: for git repository hosting. Similar to popular hosting sites such as GitHub, but intended for internal use on LC intranets.
        • Bamboo: a continuous integration and delivery tool that combines automated builds, tests, and releases in a single workflow.
      • All of these collaboration tools:
        • Are based on LC usernames / groups and are intended to foster collaboration between LC users working on HPC projects.
        • Are installed on the CZ, RZ and SCF networks
        • Require authentication with your LC username and RSA PIN + token
        • Have a User Guide for usage information
      • Locations:
        Network   Confluence Wiki                     JIRA                          Bitbucket
        CZ        https://lc.llnl.gov/confluence/     https://lc.llnl.gov/jira/     lc.llnl.gov/bitbucket/
        RZ        https://rzlc.llnl.gov/confluence/   https://rzlc.llnl.gov/jira/   rzlc.llnl.gov/bitbucket/
        SCF       https://lc.llnl.gov/confluence/     https://lc.llnl.gov/jira/     lc.llnl.gov/bitbucket/

      Spack

      • Spack is a flexible package manager for HPC
      • Easy to download and install. For example:
        % . spack/share/spack/setup-env.csh  (or setup-env.sh)
        
      • There is an increasing number of software packages (over 3,000 currently) available for installation with Spack. Many open source contributions from the international community.
      • To view available packages: spack list
      • Then, to install a desired package: spack install packagename
      • Additional Spack features:
        • Allows installations to be customized. Users can specify the version, build compiler, compile-time options, and cross-compile platform, all on the command line.
        • Allows dependencies of a particular installation to be customized extensively.
        • Non-destructive installs - Spack installs every unique package/dependency configuration into its own prefix, so new installs will not break existing ones.
        • Creation of packages is made easy.
      • Extensive documentation is available at: https://spack.readthedocs.io

      Compilers

      General Information

      Available Compilers and Invocation Commands

      • The table below summarizes compiler availability and invocation commands on LC Linux clusters.
      • Note that parallel compiler commands are actually LC scripts that ultimately invoke the corresponding serial compiler.
      • For details on the MPI parallel compiler commands, see https://computing.llnl.gov/tutorials/mpi/#LLNL.
        Linux Cluster Compilers
        Compiler               Serial Command            Parallel Commands
        Intel        C         icc                       mpicc
                     C++       icpc                      mpicxx, mpic++
                     Fortran   ifort                     mpif77, mpif90, mpifort
        GNU          C         gcc                       mpicc
                     C++       g++                       mpicxx, mpic++
                     Fortran   gfortran                  mpif77, mpif90, mpifort
        PGI          C         pgcc                      mpicc
                     C++       pgc++                     mpicxx, mpic++
                     Fortran   pgf77, pgf90, pgfortran   mpif77, mpif90, mpifort
        LLVM/Clang   C         clang                     mpicc
                     C++       clang++                   mpicxx, mpic++

      Compiler Versions and Defaults

      • LC maintains multiple versions of each compiler.
      • The Modules module avail command is used to list available compilers and versions:
        module avail intel
        module avail gcc
        module avail pgi
        module avail clang
      • Versions: to determine the actual version you are using, issue the compiler invocation command with its "version" option. For example:
        Compiler   Option      Example
        Intel      --version   ifort --version
        GNU        --version   g++ --version
        PGI        -V          pgf90 -V
        Clang      --version   clang --version
      • Using an alternate version: issue the Modules command:
        module load module-name

      Compiler Options

      • Each compiler has hundreds of options that determine what the compiler does and how it behaves.
      • The options supported by one compiler mostly differ from those of other compilers.
      • Additionally, compilers have different default options.
      • An in-depth discussion of compiler options is beyond the scope of this tutorial.
      • See the compiler's documentation, man pages, and/or -help or --help option for details.

      Compiler Documentation

      • Intel and PGI: compiler docs are included in the /opt/compilername directory. Otherwise, see Intel or PGI web pages.
      • GNU: see the web pages at https://gcc.gnu.org/
      • LLVM/Clang: see the web pages at http://clang.llvm.org/docs/
      • Man pages may or may not be available

      Optimizations

      • All compilers are able to perform optimizations, though the actual optimizations performed differ between compilers even when the flags appear to be the same.
      • Optimizations are intended to make codes run faster, though this isn't guaranteed.
      • Some optimizations "rewrite" your code, and can make debugging difficult, since the source may not match the executable.
      • Optimizations can also produce wrong results, reduced precision, increased compile times and increased executable size.
      • The table below summarizes common compiler optimization options. See the compiler documentation for details and other optimization options.
        • -O: Intel - same as -O2. GNU - same as -O1. PGI - O1 + global optimizations; no SIMD.
        • -O0: Intel - no optimization. GNU - DEFAULT; no optimization (same as omitting any -O flag). PGI - no optimization.
        • -O1: Intel - optimize for size; basic optimizations to create smallest code. GNU - reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. PGI - local optimizations, block scheduling and register allocation.
        • -O2: Intel - DEFAULT; optimize for speed: O1 + additional optimizations such as basic loop optimization and vectorization. GNU - optimize even more: O1 + nearly all supported optimizations that do not involve a space-speed tradeoff.
        • -O3: Intel - O2 + aggressive loop optimizations; recommended for loop dominated codes. GNU - O2 + further optimizations. PGI - O2 + aggressive global optimizations.
        • -O4: Intel - n/a. GNU - n/a. PGI - O3 + hoisting of guarded invariant floating point expressions.
        • -Ofast: Intel - same as O3 (mostly). GNU - same as O3 + optimizations that disregard strict standards compliance. PGI - n/a.
        • -fast: Intel - O3 + several additional optimizations. GNU - n/a. PGI - generally specifies global optimization; actual optimizations vary from release to release.
        • -Og: Intel - n/a. GNU - enables optimizations that do not interfere with debugging. PGI - n/a.
        • Optimization / vectorization reports: Intel - -opt-report, -vec-report. GNU - -ftree-vectorizer-verbose=[1-7], e.g. -ftree-vectorizer-verbose=7. PGI - -Minfo=[option], e.g. -Minfo=all.

      Floating-point Exceptions

      • The IEEE floating point standard defines several exceptions (FPEs) that occur when the result of a floating point operation is unclear or undesirable:
        • overflow: an operation's result is too large to be represented as a float. Can be trapped, or else returned as a +/- infinity.
        • underflow: an operation's result is too small to be represented as a normalized float. Can be trapped, or else represented as a denormalized float (zero exponent with non-zero fraction) or zero.
        • divide-by-zero: attempting to divide a float by zero. Can be trapped, or else returned as a +/- infinity.
        • inexact: result was rounded off. Can be trapped or returned as rounded result.
        • invalid: an operation's result is ill-defined, such as 0/0 or the sqrt of a negative number. Can be trapped or returned as NaN (not a number).
      • By default, the Xeon processors used at LC mask/ignore FPEs. Programs that encounter FPEs will not terminate abnormally, but instead, will continue execution with the potential of producing wrong results.
      • Compilers differ in their ability to handle FPEs. See the relevant compiler documentation for details.

      Precision, Performance and IEEE 754 Compliance

      • Typically, most compilers do not guarantee IEEE 754 compliance for floating-point arithmetic unless it is explicitly specified by a compiler flag. This is because compiler optimizations are performed at the possible expense of precision.
      • Unfortunately for most programs, adhering to IEEE floating-point arithmetic adversely affects performance.
      • If you are not sure whether your application needs this, try compiling and running your program both with and without it to evaluate the effects on both performance and precision.
      • See the relevant compiler documentation for details.

      Mixing C and Fortran

      • Modern Fortran (Fortran 2003 and later) provides substantial support for interoperability with C/C++ - see the respective compiler documentation.
      • If you are linking C/C++ and FORTRAN90/77 code together, and need to explicitly specify the FORTRAN or C/C++ libraries on the link line, LC provides a general recommendation and example in the /usr/local/docs/linux.basics file. See the "MIXING C AND FORTRAN" section.
      • All of the other issues involved with mixed language programming apply, such as:
        • Column-major vs. row-major array ordering
        • Routine name differences - appended underscores
        • Arguments passed by reference versus by value
        • Common blocks vs. extern structs
        • Memory alignment differences
        • File I/O - Fortran unit numbers vs. C/C++ file pointers
        • C++ name mangling
        • Data type differences

      Linux Clusters Overview Exercise 1

      Getting Started

      Overview:

      • Login to an LC cluster using your workshop username and OTP token
      • Copy the exercise files to your home directory
      • Familiarize yourself with the cluster's configuration
      • Familiarize yourself with available compilers
      • Build and run serial applications
      • Compare compiler optimizations

      GO TO THE EXERCISE HERE