Using LC's Sierra Systems

Table of Contents

  1. Abstract
  2. Sierra Overview
    1. CORAL
    2. CORAL Early Access Systems
    3. Sierra Systems
  3. Hardware
    1. Sierra Systems General Configuration
    2. IBM POWER8 Architecture
    3. IBM POWER9 Architecture
    4. NVIDIA Tesla P100 (Pascal) Architecture
    5. NVIDIA Tesla V100 (Volta) Architecture
    6. NVLink
    7. Mellanox EDR InfiniBand Network
    8. NVMe PCIe SSD (Burst Buffer)
  4. Accounts, Allocations and Banks
  5. Accessing LC's Sierra Machines
  6. Software and Development Environment - Summary and links to further information about:
    1. Login Nodes and Launch Nodes
    2. Login Shells and Files
    3. Operating System
    4. Batch System
    5. File Systems
    6. HPSS Storage
    7. Modules
    8. Compilers
    9. Math Libraries
    10. Debuggers, Performance Analysis Tools
    11. Visualization Software
  7. Compilers
    1. Wrapper Scripts
    2. Versions
    3. Selecting Your Compiler Version
    4. IBM XL Compilers
    5. Clang Compiler
    6. GNU Compilers
    7. PGI Compilers
    8. NVIDIA NVCC Compiler
  8. MPI
  9. OpenMP
  10. System Configuration and Status Information
  11. Running Jobs
    1. Overview
    2. Batch Scripts and #BSUB / bsub
    3. Interactive Jobs
    4. jsrun Command and Resource Sets
    5. Job Dependencies
  12. Monitoring Jobs
  13. Interacting With Jobs
    1. Suspending / Resuming Jobs
    2. Modifying Jobs
    3. Signaling / Killing Jobs
  14. LSF - Additional Information
    1.  LSF Documentation
    2. LSF Configuration Commands
  15. Math Libraries
  16. Code-Correctness Tools
  17. Debugging
  18. Performance Analysis Tools
  19. References & Documentation

Abstract

This tutorial is intended for users of Livermore Computing's Sierra systems. It begins by providing a brief background on CORAL, leading to the CORAL EA and Sierra systems at LLNL. The CORAL EA and Sierra hybrid hardware architectures are discussed, including details on IBM POWER8 and POWER9 nodes, NVIDIA Pascal and Volta GPUs, Mellanox network hardware, NVLink and NVMe SSD hardware.

Information about user accounts and accessing these systems follows. User environment topics common to all LC systems are reviewed. These are followed by more in-depth usage information on compilers, MPI and OpenMP. The topic of running jobs is covered in detail in several sections, including obtaining system status and configuration information, creating and submitting LSF batch scripts, interactive jobs, monitoring jobs and interacting with jobs using LSF commands.

A summary of available math libraries is presented, as is a summary on parallel I/O. The tutorial concludes with discussions on available debuggers and performance analysis tools.

Level/Prerequisites: Intended for those who are new to developing parallel programs in the Sierra environment. A basic understanding of parallel programming in C or Fortran is required. Familiarity with MPI and OpenMP is desirable. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.

Sierra Overview

CORAL:

  • CORAL is the Collaboration of Oak Ridge, Argonne and Livermore, the joint U.S. Department of Energy procurement effort under which LLNL's Sierra systems were acquired.

CORAL Early Access (EA) Systems:

  • In preparation for the delivery of the final Sierra systems, LLNL implemented three "early access" systems, one on each network:
    • ray - OCF-CZ
    • rzmanta - OCF-RZ
    • shark - SCF
  • Their primary purpose is to provide platforms where Tri-lab users can begin porting and preparing for the hardware and software that will be delivered with the final Sierra systems.
  • They are similar to the final Sierra systems, but use the previous generation of IBM Power processors and NVIDIA GPUs.
  • IBM Power Systems S822LC Server:
    • Hybrid architecture using IBM POWER8+ processors and NVIDIA Pascal GPUs.
  • IBM POWER8+ processors:
    • 2 per node (dual-socket)
    • 10 cores/socket; 20 cores per node
    • 8 SMT threads per core; 160 SMT threads per node
    • Clock: due to adaptive power management options, the clock speed can vary depending upon the system load. At LC speeds can vary from approximately 2 GHz - 4 GHz.
  • NVIDIA GPUs:
    • 4 NVIDIA Tesla P100 (Pascal) GPUs per compute node (not on login/service nodes)
    • 3584 CUDA cores per GPU; 14,336 per node
  • Memory:
    • 256 GB DDR4 per node
    • 16 GB HBM2 (High Bandwidth Memory 2) per GPU; 732 GB/s peak bandwidth
  • NVLINK 1.0:
    • Interconnect for GPU-GPU and CPU-GPU shared memory
    • 4 links per GPU/CPU with 160 GB/s total bandwidth (bidirectional)
  • NVRAM:
    • 1.6 TB NVMe PCIe SSD per compute node (CZ ray system only)
  • Network:
    • Mellanox 100 Gb/s Enhanced Data Rate (EDR) InfiniBand
    • One dual-port 100 Gb/s EDR Mellanox adapter per node
  • Parallel File System: IBM Spectrum Scale (GPFS)
    • ray: 1.3 PB
    • rzmanta: 431 TB
    • shark: 431 TB
  • Batch System: IBM Spectrum LSF
  • System Details:
CORAL Early Access (EA) Systems

ray (OCF-CZ; ASC and M&IC)
    Architecture:       IBM POWER8 + NVIDIA Tesla P100 (Pascal)
    Clock speed:        2.0-4.0 GHz (CPU); 1481 MHz (GPU)
    Nodes / GPUs:       62 nodes; 54 nodes x 4 GPUs = 216 GPUs
    Cores:              20 per node (CPU); 3,584 per GPU
    Cores total:        1,240 (CPU); 774,144 (GPU)
    Memory per node:    256 GB (CPU); 4 x 16 GB (GPU)
    Memory total (GB):  15,872 (CPU); 3,456 (GPU)
    TFLOPS peak:        39.7 (CPU); 1,144.8 (GPU)
    Switch:             IB EDR

rzmanta (OCF-RZ; ASC)
    Architecture:       IBM POWER8 + NVIDIA Tesla P100 (Pascal)
    Clock speed:        2.0-4.0 GHz (CPU); 1481 MHz (GPU)
    Nodes / GPUs:       44 nodes; 36 nodes x 4 GPUs = 144 GPUs
    Cores:              20 per node (CPU); 3,584 per GPU
    Cores total:        880 (CPU); 516,096 (GPU)
    Memory per node:    256 GB (CPU); 4 x 16 GB (GPU)
    Memory total (GB):  11,264 (CPU); 2,304 (GPU)
    TFLOPS peak:        28.2 (CPU); 763.2 (GPU)
    Switch:             IB EDR

shark (SCF; ASC)
    Architecture:       IBM POWER8 + NVIDIA Tesla P100 (Pascal)
    Clock speed:        2.0-4.0 GHz (CPU); 1481 MHz (GPU)
    Nodes / GPUs:       44 nodes; 36 nodes x 4 GPUs = 144 GPUs
    Cores:              20 per node (CPU); 3,584 per GPU
    Cores total:        880 (CPU); 516,096 (GPU)
    Memory per node:    256 GB (CPU); 4 x 16 GB (GPU)
    Memory total (GB):  11,264 (CPU); 2,304 (GPU)
    TFLOPS peak:        28.2 (CPU); 763.2 (GPU)
    Switch:             IB EDR

Sierra Systems:

  • Sierra is a classified, 125-petaflop IBM Power Systems AC922 hybrid architecture system comprising IBM POWER9 nodes with NVIDIA Volta GPUs. Sierra is a Tri-lab resource sited at Lawrence Livermore National Laboratory.
  • The unclassified Sierra systems are similar, but smaller, and include:
    • lassen - a 20 petaflop system located in LC's CZ zone.
    • rzansel - a 1.5 petaflop system located in LC's RZ zone.
  • IBM Power Systems AC922 Server:
    • Hybrid architecture using IBM POWER9 processors and NVIDIA Volta GPUs.
  • IBM POWER9 processors (compute nodes):
    • 2 per node (dual-socket)
    • 22 cores/socket; 44 cores per node
    • 4 SMT threads per core; 176 SMT threads per node
    • Clock: due to adaptive power management options, the clock speed can vary depending upon the system load. At LC speeds can vary from approximately 2.3 - 3.8 GHz. LC can also set the clock to a specific speed regardless of workload.
  • NVIDIA GPUs:
    • 4 NVIDIA Tesla V100 (Volta) GPUs per compute, login, and launch node
    • 5120 CUDA cores per GPU; 20,480 per node
  • Memory:
    • 256 GB DDR4 per compute node
    • 16 GB HBM2 (High Bandwidth Memory 2) per GPU; 900 GB/s peak bandwidth
  • NVLINK 2.0:
    • Interconnect for GPU-GPU and CPU-GPU shared memory
    • 6 links per GPU/CPU with 300 GB/s total bandwidth (bidirectional)
  • NVRAM:
    • 1.6 TB NVMe PCIe SSD per compute node
  • Network:
    • Mellanox 100 Gb/s Enhanced Data Rate (EDR) InfiniBand
    • One dual-port 100 Gb/s EDR Mellanox adapter per node
  • Parallel File System: IBM Spectrum Scale (GPFS)
  • Batch System: IBM Spectrum LSF
  • Water (warm) cooled compute nodes
  • System Details:
Sierra Systems (compute nodes)

sierra (SCF; ASC)
    Architecture:       IBM POWER9 + NVIDIA Tesla V100 (Volta)
    Clock speed:        2.0-3.1 GHz (CPU); 1530 MHz (GPU)
    Nodes / GPUs:       4,320 nodes; 4,320 x 4 = 17,280 GPUs
    Cores:              44 per node (CPU); 5,120 per GPU
    Cores total:        190,080 (CPU); 88,473,600 (GPU)
    Memory per node:    256 GB (CPU); 4 x 16 GB (GPU)
    Memory total (GB):  1,105,920 (CPU); 276,480 (GPU)
    TFLOPS peak:        125,000
    Switch:             IB EDR

lassen (OCF-CZ; ASC and M&IC)
    Architecture:       IBM POWER9 + NVIDIA Tesla V100 (Volta)
    Clock speed:        2.0-3.1 GHz (CPU); 1530 MHz (GPU)
    Nodes / GPUs:       684 nodes; 684 x 4 = 2,736 GPUs
    Cores:              44 per node (CPU); 5,120 per GPU
    Cores total:        30,096 (CPU); 14,008,320 (GPU)
    Memory per node:    256 GB (CPU); 4 x 16 GB (GPU)
    Memory total (GB):  175,104 (CPU); 43,776 (GPU)
    TFLOPS peak:        19,900
    Switch:             IB EDR

rzansel (OCF-RZ; ASC)
    Architecture:       IBM POWER9 + NVIDIA Tesla V100 (Volta)
    Clock speed:        2.0-3.1 GHz (CPU); 1530 MHz (GPU)
    Nodes / GPUs:       54 nodes; 54 x 4 = 216 GPUs
    Cores:              44 per node (CPU); 5,120 per GPU
    Cores total:        2,376 (CPU); 1,105,920 (GPU)
    Memory per node:    256 GB (CPU); 4 x 16 GB (GPU)
    Memory total (GB):  13,824 (CPU); 3,456 (GPU)
    TFLOPS peak:        1,570
    Switch:             IB EDR


Hardware

Sierra Systems General Configuration

System Components:

  • The basic components of a Sierra system are the same as other LC systems. They include:
    • Frames / Racks
    • Nodes
    • File Systems
    • Networks
    • HPSS Archival Storage

Frames / Racks:

  • Frames are the physical cabinets that hold most of a cluster's components:
    • Nodes of various types
    • Switch components
    • Other network and cluster management components
    • Parallel file system disk resources (usually in separate racks)
  • Power and console management - frames include hardware and software that allow system administrators to perform most tasks remotely.

Nodes:

  • Sierra systems consist of several different node types:
    • Compute nodes
    • Login / Launch nodes
    • I/O nodes
    • Service / management nodes
  • Compute Nodes:
    • Comprise the heart of a system. This is where parallel user jobs run.
    • Dual-socket IBM POWER9 (AC922) nodes
    • 4 NVIDIA Tesla V100 (Volta) GPUs per node
  • Login / Launch Nodes:
    • When you connect to Sierra, you are placed on a login node. This is where users edit, compile, submit jobs and interact with the batch system.
    • Launch nodes are similar to login nodes, but are dedicated to running user batch scripts, which in turn launch parallel jobs on compute nodes using jsrun (discussed later).
    • Login / launch nodes are shared by multiple users and should not themselves be used to run parallel jobs.
    • IBM Power9 with 4 NVIDIA Volta GPUs (same as compute nodes)
  • I/O Nodes:
    • Dedicated file servers for IBM Spectrum Scale parallel file systems
    • Not directly accessible to users
    • IBM Power9, dual-socket; no GPUs
  • Service / Management Nodes:
    • Reserved for system related functions and services
    • Not directly accessible to users
    • IBM Power9, dual-socket; no GPUs

Networks:

  • Sierra systems have a Mellanox 100 Gb/s Enhanced Data Rate (EDR) InfiniBand network:
    • Internal, inter-node network for MPI communications and I/O traffic between compute nodes and I/O nodes.
    • See the Mellanox EDR InfiniBand Network section for details.
  • InfiniBand networks connect other clusters and parallel file servers (lscratch and gscratch).
  • A GigE network connects InfiniBand networks, HPSS and external networks and systems.

File Systems:

  • Parallel file systems: Sierra systems use IBM Spectrum Scale (gscratch). Other clusters use Lustre (lscratch).
  • Other file systems, such as NFS (home directories, temporary space) and infrastructure services, are also mounted

Archival HPSS Storage:

Hardware

IBM POWER8 Architecture

Used by LLNL's Early Access systems only (ray, rzmanta, shark)

IBM POWER8 S822LC Node Key Features:

  • 2 IBM "POWER8+" processors (dual-socket)
  • Up to 4 NVIDIA Tesla P100 (Pascal) GPUs
  • NVLink GPU-CPU and GPU-GPU interconnect technology
  • Memory:
    • Up to 1024 GB DDR4 memory per node
    • LC's Early Access systems compute nodes have 256 GB memory
    • Each processor connects to 4 memory riser cards, each with 4 DIMMs
    • Processor-to-memory bandwidth of 115 GB/s per processor; 230 GB/s memory bandwidth per node
  • L4 cache: up to 64 MB per processor, in 16 MB banks of memory buffers
  • Storage: 2 disk bays for 2 hard disk drives (HDD) or 2 solid state drives (SSD). Optional NVMe SSD support in PCIe slots.
  • Coherent Accelerator Processor Interface (CAPI), which allows accelerators plugged into a PCIe slot to access the processor bus by using a low latency, high-speed protocol interface.
  • 5 integrated PCIe Gen 3 slots:
    • 1 PCIe x8 G3 LP slot, CAPI enabled
    • 1 PCIe x16 G3, CAPI enabled
    • 1 PCIe x8 G3
    • 2 PCIe x16 G3, CAPI enabled that support GPU or PCIe adapters
  • Adaptive power management
  • I/O ports: 2x USB 3.0; 2x 1 Gb Ethernet; VGA
  • 2 hotswap, redundant power supplies (no power redundancy with GPU(s) installed)
  • 19-inch rackmount hardware (2U)
  • LLNL's Early Access POWER8 nodes:
    • Compute nodes are model 8335-GTB and login nodes are model 8335-GCA. The primary difference is that compute nodes include 4 NVIDIA Pascal GPUs and Power8 processors with NVLink technology.
    • POWER8 processors have 10 cores per socket
    • Memory: 256 GB per node
    • The CZ Early Access cluster "Ray" also has 1.6 TB NVMe PCIe SSD (attached solid state storage).
  • Images:
    • A POWER8 compute node and its primary components are shown below. Relevant individual components are discussed in more detail in sections below.
    • Click for a larger image. (Source: "IBM Power Systems S822LC for High Performance Computing Technical Overview and Introduction". IBM Redpaper publication REDP-5405-00 by Alexandre Bicas Caldeira, Volker Haug, Scott Vetter. September, 2016)

POWER8 S822LC node with 4 NVIDIA Pascal GPUs

POWER8 S822LC node logical system diagram

POWER8 Processor Key Characteristics:

  • IBM 22 nm Silicon-On-Insulator (SOI) technology; 4.2 billion transistors
  • Up to 12 cores (LLNL's Early Access processors have 10 cores)
  • L1 data cache: 64 KB per core, 8-way, private
  • L1 instruction cache: 32 KB per core, 8-way, private
  • L2 cache: 512 KB per core, 8-way, private
  • L3 cache: 96 MB (12 core version), 8-way, shared as 8 MB banks per core
  • Hardware transactional memory
  • Clock: due to adaptive power management options, the clock speed can vary depending upon the system load. At LLNL speeds can vary from approximately 2 GHz - 4 GHz.
  • Images:
    • Images of the POWER8 processor chip (12 core version) are shown below. Click for a larger version. (Source: "An Introduction to POWER8 Processor". IBM presentation by Joel M. Tendler. Georgia IBM POWER User Group, January 16, 2014)

POWER8 Core Key Features:

  • The POWER8 processor core is a 64-bit implementation of the IBM Power Instruction Set Architecture (ISA) Version 2.07
  • Little Endian
  • 8-way Simultaneous Multithreading (SMT)
  • Floating point units: Two integrated multi-pipeline vector-scalar. Run both scalar and SIMD-type instructions, including the Vector Multimedia Extension (VMX) instruction set and the improved Vector Scalar Extension (VSX) instruction set. Each is capable of up to eight single precision floating point operations per cycle (four double precision floating point operations per cycle)
  • Two symmetric fixed-point execution units
  • Two symmetric load and store units and two load units, all four of which can also run simple fixed-point instructions
  • Enhanced prefetch, branch prediction, out-of-order execution
  • Images:
    • Images of the POWER8 cores are shown below. Click for a larger version. (Source: "An Introduction to POWER8 Processor". IBM presentation by Joel M. Tendler. Georgia IBM POWER User Group, January 16, 2014)

References and More Information:

Hardware

IBM POWER9 Architecture

Used by LLNL's Sierra systems only (sierra, lassen, rzansel)

IBM POWER9 AC922 Node Key Features:

  • 2 IBM POWER9 processors (dual-socket)
  • Up to 6 NVIDIA Tesla V100 (Volta) GPUs
  • NVLink2 GPU-CPU and GPU-GPU interconnect technology
  • Memory:
    • Up to 2 TB DDR4 memory per node (16 DIMM sockets)
    • LC's Sierra systems compute nodes have 256 GB memory
    • Each processor connects to 8 DDR4 DIMMs
    • Processor-to-memory bandwidth of 170 GB/s per processor, 340 GB/s per node.
  • Storage: 2 disk bays for 2 hard disk drives (HDD) or 2 solid state drives (SSD). Optional NVMe SSD support in PCIe slots.
  • Coherent Accelerator Processor Interface (CAPI) 2.0, which allows accelerators plugged into a PCIe slot to access the processor bus by using a low latency, high-speed protocol interface.
  • 4 integrated PCIe Gen 4 slots providing ~2x the data bandwidth of PCIe Gen 3:
    • 2 PCIe x16 G4, CAPI enabled
    • 1 PCIe x8 G4, CAPI enabled
    • 1 PCIe x4 G4
  • Adaptive power management
  • I/O ports: 2x USB 3.0; 2x 1 Gb Ethernet; VGA
  • 2 hotswap, redundant power supplies
  • 19-inch rackmount hardware (2U)
  • Images
    • A POWER9 AC922 compute node and its primary components are shown below. Relevant individual components are discussed in more detail in sections below.
    • Click for a larger image. (Source: "IBM Power System AC922 Introduction and Technical Overview". IBM Redpaper publication REDP-5472-00 by Alexandre Bicas Caldeira. March, 2018)

POWER9 AC922 node with 6 NVIDIA Volta GPUs

POWER9 AC922 server logical system diagram

POWER9 Processor Key Characteristics:

  • IBM 14 nm Silicon-On-Insulator (SOI) technology; 8 billion transistors
  • IBM offers POWER9 in two different designs: Scale-Out and Scale-Up
  • Scale-Out:
    • Designed for traditional datacenter clusters utilizing single-socket and dual-socket servers.
    • Optimized for Linux servers
    • Up to 120 GB/s bandwidth to directly attached DDR4 memory
    • 24-core and 12-core models
  • Scale-Up:
    • Designed for NUMA servers with four or more sockets, supporting large amounts of memory capacity and throughput.
    • Optimized for PowerVM servers
    • Up to 230 GB/s bandwidth to buffered memory
    • 24-core and 12-core models
  • Core variants: Some POWER9 models vary the number of active cores and have 16, 18, 20 or 22 cores. LLNL's AC922 compute nodes use 22 cores.
  • Hardware threads:
    • 12-core processors are SMT8 (8 hardware threads/core)
    • 24-core processors are SMT4 (4 hardware threads/core).
  • L1 data cache: 32 KB per core, 8-way, private
  • L1 instruction cache: 32 KB per core, 8-way, private
  • L2 cache: 512 KB per core (SMT8), 512 KB per core pair (SMT4), 8-way, private
  • L3 cache: 120 MB, 20-way, shared as twelve 10 MB banks
  • Clock: due to adaptive power management options, the clock speed can vary depending upon the system load. At LC speeds can vary from approximately 2.3 - 3.8 GHz. LC can also set the clock to a specific speed regardless of workload.
  • High-throughput on-chip fabric: Over 7 TB/s aggregate bandwidth via on-chip switch connecting cores to memory, PCIe, GPUs, etc.
  • Images:
    • Schematics of the POWER9 processor chip variants are shown below. Click for a larger version. (Source: "POWER9 Processor for the Cognitive Era". IBM presentation by Brian Thompto. Hot Chips 28 Symposium, October 2016)

Scale-Out Models

Scale-Up Models

  • Images of the POWER9 processor chip die are shown below. Click for a larger version. (Source: "POWER9 Processor for the Cognitive Era". IBM presentation by Brian Thompto. Hot Chips 28 Symposium, October 2016)

POWER9 Core Key Features:

  • The POWER9 processor core is a 64-bit implementation of the IBM Power Instruction Set Architecture (ISA) Version 3.0
  • Little Endian
  • 8-way (SMT8) or 4-way (SMT4) hardware threads
  • Basic building block of both SMT4 and SMT8 cores is a slice:
    • A slice is a rudimentary 64-bit single threaded processing element with a load store unit (LSU), integer unit (ALU) and vector scalar unit (VSU, doing SIMD and floating point).
    • Two slices are combined to make a 128-bit "super-slice"
    • Both chip variants contain the same total number of slices per chip: 96 (24 SMT4 cores x 4 slices each, or 12 SMT8 cores x 8 slices each)
  • Shorter fetch-to-compute pipeline than POWER8; reduced by 5 cycles.
  • Instructions per cycle: 128 for SMT8, 64 for SMT4
  • Images:
    • Schematic of a POWER9 SMT4 core is shown below. Click for a larger version. (Source: "POWER9 Processor for the Cognitive Era". IBM presentation by Brian Thompto. Hot Chips 28 Symposium, October 2016)
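
As a quick check of the core and thread configuration described above, the standard Linux lscpu command reports socket, core, and thread counts; on LC's Sierra compute and login nodes you would expect 2 sockets, 22 cores per socket, and 4 threads per core (exact output formatting varies):

    % lscpu | grep -E 'Socket|Core|Thread'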

References and More Information:

Hardware

NVIDIA Tesla P100 (Pascal) Architecture

Used by LLNL's Early Access systems only (ray, rzmanta, shark)

Tesla P100 Key Features:

  • "Extreme performance" for HPC and Deep Learning:
    • 5.3 TFLOPS of double-precision floating point (FP64) performance
    • 10.6 TFLOPS of single-precision (FP32) performance
    • 21.2 TFLOPS of half-precision (FP16) performance
  • NVLink: NVIDIA's high speed, high bandwidth interconnect
    • Connects multiple GPUs to each other, and GPUs to the CPUs
    • 4 NVLinks per GPU
    • Up to 160 GB/s bidirectional bandwidth between GPUs (5x the bandwidth of PCIe Gen 3 x16)
  • HBM2: High Bandwidth Memory 2
    • Memory is located on same physical package as the GPU, providing 3x the bandwidth of previous GPUs such as the Maxwell GM200
    • Highly tuned 16 GB HBM2 memory subsystem delivers 732 GB/sec peak memory bandwidth on Pascal.
  • Unified Memory:
    • Significant advancement and a major new hardware and software-based feature of the Pascal GP100 GPU architecture.
    • First NVIDIA GPU to support hardware page faulting, and when combined with new 49-bit (512 TB) virtual addressing, allows transparent migration of data between the full virtual address spaces of both the GPU and CPU.
    • Provides a single, seamless unified virtual address space for CPU and GPU memory.
    • Greatly simplifies GPU programming - programmers no longer need to manage data sharing between two different virtual memory systems.
  • Compute Preemption:
    • New hardware and software feature that allows compute tasks to be preempted at instruction-level granularity.
    • Prevents long-running applications from either monopolizing the system or timing out. For example, both interactive graphics tasks and interactive debuggers can run simultaneously with long-running compute tasks.
  • Images:
    • NVIDIA Tesla P100 with Pascal GP100 GPU. Click for larger image. (Source: NVIDIA Tesla P100 Whitepaper. NVIDIA publication WP-08019-001_v01.1. 2016)

Pascal GP100 GPU Components:

  • A full GP100 includes 6 Graphics Processing Clusters (GPC)
  • Each GPC has 10 Pascal Streaming Multiprocessors (SM) for a total of 60 SMs
  • Each SM has:
    • 64 single-precision CUDA cores for a total of 3840 single-precision cores
    • 4 Texture Units for a total of 240 texture units
    • 32 double-precision units for a total of 1920 double-precision units
    • 16 load/store units, 16 special function units, register files, instruction buffers and cache, warp schedulers and dispatch units
  • L2 cache size of 4096 KB
  • Note the Tesla P100 does not use a full Pascal GP100. It uses 56 SMs instead of 60, for a total core count of 3584
  • Images:
    • Diagrams of a full Pascal GP100 GPU and a single SM. Click for larger image. (Source: NVIDIA Tesla P100 Whitepaper. NVIDIA publication WP-08019-001_v01.1. 2016)

Pascal GP100 Full GPU with 60 SM Units

Pascal GP100 SM Unit

References and More Information:

Hardware

NVIDIA Tesla V100 (Volta) Architecture

Used by LLNL's Sierra systems only (sierra, lassen, rzansel)

Tesla V100 Key Features:

  • New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning:
    • 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope.
    • Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPS for training and 6x higher peak TFLOPS for inference.
    • With independent parallel integer and floating-point data paths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations.
    • Independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads.
    • Combined L1 data cache and shared memory unit significantly improves performance while also simplifying programming.
  • Performance:
    • 7.8 TFLOPS of double-precision floating point (FP64) performance
    • 15.7 TFLOPS of single-precision (FP32) performance
    • 125 Tensor TFLOPS
  • Second-Generation NVIDIA NVLink:
    • Delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations.
    • Supports up to six NVLink links and total bandwidth of 300 GB/sec, compared to four NVLink links and 160 GB/s total bandwidth on Pascal.
    • Now supports CPU mastering and cache coherence capabilities with IBM Power 9 CPU-based servers.
    • The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
  • HBM2 Memory: Faster, Higher Efficiency
    • Highly tuned 16 GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth.
    • The combination of both a new generation HBM2 memory from Samsung, and a new generation memory controller in Volta, provides 1.5x delivered memory bandwidth versus Pascal GP100, with up to 95% memory bandwidth utilization running many workloads.
  • Volta Multi-Process Service (MPS):
    • Enables multiple compute applications to share GPUs.
    • Volta MPS also triples the maximum number of MPS clients from 16 on Pascal to 48 on Volta.
  • Enhanced Unified Memory and Address Translation Services:
    • Provides a single, seamless unified virtual address space for CPU and GPU memory.
    • Greatly simplifies GPU programming - programmers no longer need to manage data sharing between two different virtual memory systems.
    • Includes new access counters to allow more accurate migration of memory pages to the processor that accesses them most frequently, improving efficiency for memory ranges shared between processors.
    • On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU's page tables directly.
  • Maximum Performance and Maximum Efficiency Modes:
    • In Maximum Performance mode, the Tesla V100 accelerator will operate up to its TDP (Thermal Design Power) level of 300 W to accelerate applications that require the fastest computational speed and highest data throughput.
    • Maximum Efficiency Mode allows data center managers to tune power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically, while still obtaining excellent rack performance.
  • Cooperative Groups and New Cooperative Launch APIs:
    • Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads.
    • Allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions.
    • Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new cooperative launch APIs that support synchronization amongst CUDA thread blocks. Volta adds support for new synchronization patterns.
  • Volta Optimized Software:
    • New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance.
    • Volta-optimized versions of GPU accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning inference and High Performance Computing (HPC) applications.
    • The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.
  • Images:
    • NVIDIA Tesla V100 with Volta GV100 GPU. Click for larger image. (Source: NVIDIA Tesla V100 Whitepaper. NVIDIA publication WP-08608-001_v1.1. August 2017)

Volta GV100 GPU Components:

  • A full GV100 includes 6 Graphics Processing Clusters (GPC)
  • Each GPC has 14 Volta Streaming Multiprocessors (SM) for a total of 84 SMs
  • Each SM has:
    • 64 single-precision floating-point cores; GPU total of 5376
    • 64 integer (INT32) cores; GPU total of 5376
    • 32 double-precision floating-point cores; GPU total of 2688
    • 8 Tensor Cores; GPU total of 672
    • 4 Texture Units; GPU total of 168
    • 32 load/store units, 4 special function units, register files, instruction buffers and cache, warp schedulers and dispatch units
  • L2 cache size of 6144 KB
  • Note the Tesla V100 does not use a full Volta GV100. It uses 80 SMs instead of 84, for a total "CUDA" core count of 5120 versus 5376.
  • Images:
    • Diagrams of a full Volta GV100 GPU and a single SM. Click for larger image. (Source: NVIDIA Tesla V100 Whitepaper. NVIDIA publication WP-08608-001_v1.1. August 2017)

Volta GV100 Full GPU with 84 SM Units

Volta GV100 SM Unit

References and More Information:

Hardware

NVLink

Overview:

  • NVLink is NVIDIA's high-speed interconnect technology for GPU accelerated computing. Used to connect GPUs to GPUs and/or GPUs to CPUs.
  • Significantly increases performance for both GPU-to-GPU and GPU-to-CPU communications.
  • NVLink - first generation
    • Debuted with Pascal GPUs
    • Used on LC's Early Access systems (ray, rzmanta, shark)
    • Supports up to 4 NVLink links per GPU.
    • Each link provides a 40 GB/s bidirectional connection to another GPU or a CPU, yielding an aggregate bandwidth of 160 GB/s.
  • NVLink 2.0 - second generation
    • Debuted with Volta GPUs
    • Used on LC's Sierra systems (sierra, lassen, rzansel)
    • Supports up to 6 NVLink links per GPU.
    • Each link provides a 50 GB/s bidirectional connection to another GPU or a CPU, yielding an aggregate bandwidth of 300 GB/s.
  • Multiple links can be "ganged" to increase bandwidth between two endpoints
  • Numerous NVLink topologies are possible, and different configurations can be optimized for different applications.
  • LC's NVLink configurations:
    • Early Access systems (ray, rzmanta, shark): Each CPU is connected to 2 GPUs by 2 NVLinks each. Those GPUs are connected to each other by 2 NVLinks each
    • Sierra systems (sierra, lassen, rzansel): Each CPU is connected to 2 GPUs by 3 NVLinks each. Those GPUs are connected to each other by 3 NVLinks each
    • GPUs on different CPUs do not connect to each other with NVLinks
  • Images:
    • Two representative NVLink 2.0 topologies are shown below. (Source: NVIDIA Tesla V100 Whitepaper. NVIDIA publication WP-08608-001_v1.1. August 2017)

V100 with NVLink Connected GPU-to-GPU and GPU-to-CPU
(LC's Sierra systems)

Hybrid Cube Mesh NVLink GPU-to-GPU Topology with V100
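
The NVLink connectivity described above can be inspected on a node with NVIDIA's nvidia-smi utility (assuming the NVIDIA driver utilities are installed, as on LC's GPU nodes):

    % nvidia-smi topo -m      # prints a matrix showing NVLink (NV#) connections between GPUs and their CPU affinities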

References and More Information:

Hardware

Mellanox EDR InfiniBand Network

Hardware:

  • Mellanox EDR InfiniBand is used for both Early Access and Sierra systems:
    • EDR = Enhanced Data Rate
    • 100 Gb/s bandwidth rating
  • Adapters:
    • Nodes have one dual-port Mellanox ConnectX EDR InfiniBand adapter (at LC)
    • Both PCIe Gen 3.0 and Gen 4.0 capable
    • Adapter ports connect to level 1 switches
  • Top-of-Rack (TOR) level 1 (edge) switches:
    • Mellanox Switch-IB with 36 ports
    • Down ports connect to node adapters
    • Up ports connect to level 2 switches
  • Director level 2 (core) switches:
    • Mellanox CS7500 with 648 ports
    • Holds 18 Mellanox Switch-IB 36-port leafs
    • Ports connect down to level 1 switches
  • Images:
    • Mellanox EDR InfiniBand network hardware components are shown below. Click for larger image. (Source: mellanox.com)

Mellanox ConnectX dual-port IB adapter

Mellanox Switch-IB Top-of-Rack
(edge) switches

Mellanox CS7500 Director (core) switch

Mellanox CS7500 labeled

Topology and LC Sierra Configuration:

  • 2 to 1 Tapered Fat Tree, Single Plane Topology
    • Fat Tree: switches form a hierarchy with higher level switches having more (hence, fat) connections down than lower level switches.
    • 2 to 1 Tapered: the number of connections down for lower level switches are increased by a ratio of two-to-one.
    • Single Plane: nodes connect to a single fat tree network.
  • Sierra configuration details:
    • Each rack has 18 nodes and 2 TOR switches
    • Each node's dual-port adapter connects to both of its rack's TOR switches with one port each, giving each TOR switch 18 node links within the rack
    • Each TOR switch has 12 uplinks to Director switches, at least one per Director switch
    • There are 9 Director switches
    • Because each TOR switch has 12 uplinks and there are only 9 Director switches, there are 3 extra uplinks per TOR switch. These are used to connect twice to 3 of the 9 Director switches.
    • Note Sierra has a "modified" 2:1 Tapered Fat Tree. It's actually 1.5 to 1 (18 links down, 12 links up for each TOR switch).
  • At LC, adapters connect to level 1 switches via copper cable. Level 1 switches connect to level 2 switches via optic fiber.
  • Images:
    • Topology diagrams shown below. Click for larger image.

Fat Tree Network

Sierra Network

References and More Information:

Hardware

NVMe PCIe SSD (Burst Buffer)

Overview:

  • Sierra compute nodes and login nodes are configured with 1.6 TB of NVMe PCIe SSD:
    • SSD = Solid State Drive; non-volatile storage device with no moving parts
    • PCIe = Peripheral Component Interconnect Express; standard high-speed serial bus connection.
    • NVMe = Non-Volatile Memory Express; device interface specification for accessing non-volatile storage media attached via PCIe bus
  • CORAL Early Access systems: Ray compute nodes have 1.6 TB of NVMe PCIe SSD. The shark and rzmanta systems do not have SSD.
  • Primary purpose of this fast storage is to act as a "Burst Buffer" for improving I/O performance. Computation can continue while the fast SSD "holds" data (such as checkpoint files) being written to slower disk.
  • At LC, the SSD is mounted under /l/nvme (lower case "L" / nvme).
    • Users can write/read directly to this location (see the staging example below)
    • Local to the node (not a globally shared file system)
    • LC has not implemented specific use policies yet
    • Not backed up, not purged
    • As with all SSDs, life span is shortened with writes
  • There are a number of ongoing efforts to develop software around using the burst buffers.
  • For more information on LC specifics, see https://lc.llnl.gov/confluence/display/CORALEA/Burst+Buffers.
  • Images:
    • 1.6 TB NVMe PCIe SSD. Click for larger image. (Sources: samsung.com and hgst.com)

Samsung PM1725

HGST Ultrastar SN100 (front)

HGST Ultrastar SN100 (back)
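
As a simple illustration of staging data through the node-local SSD described above (paths and file names are illustrative only):

    % df -h /l/nvme                                          # confirm the SSD is mounted and check free space
    % cp checkpoint_0042.chk /l/nvme/                        # hold data on the fast local SSD while computation continues
    % cp /l/nvme/checkpoint_0042.chk /p/gscratch1/jsmith/    # drain it to the parallel file system before the job ends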

References and More Information:

Accounts, Allocations and Banks

Accounts:

  • Only a brief summary of LC account request procedures is included below. For details, see: https://hpc.llnl.gov/accounts
  • Sierra:
    • Sierra is considered a Tri-lab Advanced Technology System (ATS).
    • Accounts on the classified sierra system are restricted to approved Tri-lab (LLNL, LANL, SNL) users.
    • Guided by the ASC Advanced Technology Computing Campaign (ATCC) proposal process and usage model.
  • Accounts for the other Sierra systems (lassen, rzansel) and Early Access systems (ray, shark, rzmanta) follow the usual account request processes, summarized below.
  • LLNL and Collaborators:
  • LANL and Sandia:
  • PSAAP centers:
  • For any questions or problems regarding accounts, please contact the LC Hotline account specialists:

Allocations and Banks:

  • TO BE COMPLETED LATER

Accessing LC's Sierra Machines

Overview:

  • RSA tokens are used for authentication:
    • Static 4-8 character PIN + 6 digits from token
    • There is one token for the CZ and SCF, and one token for the RZ.
    • Sandia / LANL Tri-lab logins can be done without tokens
  • Machine names and login nodes:
    • Each system has a single cluster login name, such as sierra, lassen, ray, etc.
    • A full llnl.gov domain name is required if coming from outside LLNL.
    • Successfully logging into the cluster will place you on one of the available login nodes.
    • User logins are distributed across login nodes for load balancing.
    • To view available login nodes use the nodeattr -c login command.
    • You can ssh from one login node to another, which may be useful if there are problems with the login node you are on.
  • X11 Forwarding
    • In order to display GUIs back to your local workstation, your SSH session will need to have X11 Forwarding enabled.
    • This is easily done by including the -X (uppercase X) or -Y option with your ssh command. For example: ssh -X sierra.llnl.gov
    • Your local workstation will also need to have X server software running. This comes with Linux by default. For Macs, something like XQuartz (http://www.xquartz.org/) can be used. For Windows, there are several options - LLNL provides X-Win32 with a site license.
  • SSH Clients

How To Connect:

  • Use the information below to connect to LC's Sierra systems, depending on where you are going (SCF, OCF-CZ, OCF-RZ) and where you are coming from (LLNL, LANL/Sandia, or Other/Internet).

SCF: sierra, shark

  • Coming from LLNL:
    • Need to be logged into an SCF network machine
    • ssh machinename, or connect to machinename via your local SSH application
    • Userid: LC username
    • Password: PIN + OTP token code
  • Coming from LANL/Sandia:
    • Login and kerberos authenticate with forwardable credentials (kinit -f) on a local, classified network machine.
    • For LANL only: then connect to the LANL gateway:
      ssh red-wtrw
    • SSH to LLNL using your LLNL username. For example:
      ssh -l joesmith sierra.llnl.gov
    • No password required
  • Coming from Other/Internet:
    • Login and authenticate on a local Securenet attached machine
    • ssh -l lc_userid machinename.llnl.gov
    • Password: PIN + OTP token code

OCF-CZ: lassen, ray

  • Coming from LLNL:
    • Need to be logged into an OCF network machine
    • ssh machinename or connect via your local SSH application
    • Userid: LC username
    • Password: PIN + OTP token code
  • Coming from LANL/Sandia:
    • Begin on a LANL/Sandia iHPC login node. For example, at Sandia start from ihpc-login.sandia.gov
    • ssh -l llnl-username machinename.llnl.gov
    • No password required
  • Coming from Other/Internet:
    • Login to a local unclassified network machine
    • ssh using your LC username or connect via your local SSH application. For example:
      ssh -l lc_userid machinename.llnl.gov
    • Userid: LC username
    • Password: PIN + OTP token code

OCF-RZ: rzansel, rzmanta

  • Coming from LLNL:
    • Need to be logged into a machine that is not part of the OCF Collaboration Zone (CZ)
    • ssh rzgw or connect to rzgw via your local SSH application
    • Userid: LC username
    • Password: PIN + CRYPTOCard token code
    • Then, ssh machinename
    • Userid: LC username
    • Password: PIN + OTP token code
  • Coming from LANL/Sandia:
    • Begin on a LANL/Sandia iHPC login node. For example, at Sandia start from ihpc-login.sandia.gov
    • ssh -l llnl-username rzgw.llnl.gov
    • Password: LLNL PIN + CRYPTOCard
    • On rzgw: kinit sandia-username@dce.sandia.gov or kinit lanl-username@lanl.gov
    • Enter Sandia/LANL kerberos password
    • ssh machinename
    • Password: not required
  • Coming from Other/Internet:
    • Start LLNL VPN client on local machine and authenticate to VPN with your LLNL OUN and PIN + OTP token code
    • ssh rzgw or connect to rzgw via your local SSH application
    • Userid: LC username
    • Password: PIN + CRYPTOCard token code
    • Then, ssh machinename
    • Userid: LC username
    • Password: PIN + OTP token code

Software and Development Environment

Similarities and Differences:

  • The Sierra software and development environment is similar in a number of ways to LC's other production clusters. Common topics are briefly discussed below, and covered in more detail in the Introduction to LC Resources tutorial.
  • Sierra systems are also very different from other LC systems in important ways. These differences are summarized below and covered in detail later in other sections.

Login Nodes:

  • Each LC cluster has a single, unique hostname used for login connections. This is called the "cluster login".
  • The cluster login is actually an alias for the real login nodes. It "rotates" logins between the actual login nodes for load balancing purposes.
  • For example: sierra.llnl.gov is the cluster login which distributes user logins over any number of physical login nodes.
  • The number of physical login nodes on any given LC cluster varies.
  • Login nodes are where you build your code, launch tools, edit files, submit batch jobs, run interactive jobs, etc.
    • Shared by multiple users
    • Should not be used to run production or parallel jobs, as this can impact other users
  • In most cases, users don't need to know which actual login node they have been placed on. If problems arise, the hostname command will show the actual login node name for support purposes.
  • If the login node you are on is having problems, you can ssh directly to another one. To find the list of available login nodes, use the command: nodeattr -c login
  • Cross-compilation is not necessary on Sierra clusters because login nodes have the same architecture as compute nodes.
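  • For example (an illustrative session; actual node names will differ):
    % nodeattr -c login          # list this cluster's login nodes
    % hostname                   # show which login node you were placed on
    % ssh <other-login-node>     # move to a different login node if yours is having problems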

Launch Nodes:

  • In addition to login nodes, Sierra systems have a set of nodes that are dedicated to launching user jobs. These are called launch nodes.
  • Typically, the bsub command is used to submit jobs:
    • Batch jobs: a job script is submitted which then automatically runs on a launch node.
    • Interactive jobs: a shell or xterm is opened, which runs on a launch node.
    • For a parallel job, an allocation of compute nodes is also acquired; the jsrun command then launches the parallel tasks on those compute nodes.
  • Further details on launch nodes are discussed as relevant in the Running Jobs section.

Login Shells and Files:

  • Your login shell is established when your LC account is initially set up. The usual login shells are supported:
    • /bin/bash
    • /bin/csh
    • /bin/ksh
    • /bin/sh
    • /bin/tcsh
    • /bin/zsh
  • All LC users automatically receive a set of login files. These include:
    .cshrc        .kshenv       .login        .profile
                  .kshrc        .logout
    .cshrc.linux  .kshrc.linux  .login.linux  .profile.linux

Operating System:

  • Sierra systems run Red Hat Enterprise Linux (RHEL) version 7.4 (as of Mar 1, 2018, subject to change).
  • Although they do not run the standard TOSS stack like other LC Linux clusters, LC has implemented some TOSS configurations, such as using /usr/tce instead of /usr/local.
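  • To check the installed OS version on a node:
    % cat /etc/redhat-release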

Batch System:

  • Unlike most other LC clusters, Sierra systems do NOT use Slurm as their workload manager / batch system.
  • IBM's Platform LSF Batch System software is used to schedule/manage jobs run on all Sierra systems.
  • LSF is very different from Slurm:
    • Will require a bit of a learning curve for new users.
    • Existing job scripts will require modification.
    • Other scripts using Slurm commands will also require modification
  • LSF is discussed in detail in the Running Jobs section of this tutorial.
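  • As a brief preview (the job script name below is illustrative; the script would contain #BSUB options):
    % bsub < myjob.lsf           # submit a batch script
    % bjobs                      # check the status of your jobs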

File Systems:

  • Sierra systems mount the usual LC file systems.
  • The only significant differences are:
    • Parallel file systems: IBM's Spectrum Scale product is used instead of Lustre.
    • NVMe SSD (burst buffer) storage is available
  • Available file systems are summarized below and discussed in more detail in the File Systems section of the Livermore Computing Resources and Environment tutorial.
    • Home directories (/g/g0 - /g/g99): backed up; not purged. 16 GB quota; safest file system; includes a .snapshot directory for online backups.
    • Workspace (/usr/workspace/ws*): not backed up; not purged. 1 TB quota for each user and each group; includes a .snapshot directory for online backups.
    • Local tmp (/tmp, /usr/tmp, /var/tmp): not backed up; purged. Node-local temporary file space; small; actually resides in node memory, not physical disk.
    • NFS tmp (/nfs/tmp2): not backed up; purged. Large NFS-mounted temporary file space; shared by all users and multiple clusters.
    • Collaboration (/usr/gapps, /usr/gdata, /collab/usr/gapps, /collab/usr/gdata): backed up; not purged. User-managed application directories; intended for collaborative development and usage.
    • Parallel (/p/gscratch*): not backed up; purged. Intended for parallel I/O; large, shared by all users on a cluster. IBM Spectrum Scale (not Lustre).
    • Burst buffer (/l/nvme): not backed up; purged. Available on sierra, lassen, rzansel and ray. Each compute node has a 1.6 TB NVMe PCIe SSD.
    • HPSS archival storage (server based): not backed up; not purged. Virtually unlimited archival storage; accessed by "ftp storage" from LC machines.
    • FIS (server based): not backed up; purged. File Interchange System; for transferring files between unclassified/classified networks.

HPSS Storage:

  • As with all other production LC systems, Sierra systems have access to LC's High Performance Storage System (HPSS) archival storage.
  • The HPSS system is named storage.llnl.gov on both the OCF and SCF.
  • LC does not backup temporary file systems, including the scratch parallel file systems. Users should backup their important files to storage.
  • Several different file transfer tools are available.
  • See https://computing.llnl.gov/tutorials/lc_resources/#Archival for details on using HPSS storage.
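  • For example, a minimal interactive transfer to storage (file name illustrative):
    % ftp storage.llnl.gov
    ftp> put results.tar
    ftp> quit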

Modules:

  • As with LC's TOSS3 systems, Lmod modules are used for most software packages, such as compilers, MPI and tools.
  • Dotkits should no longer be used.
  • Users only need to know a few commands to effectively use modules - see the table below.
  • Note: the "ml" shorthand can be used instead of "module" - for example: "ml avail"
  • See Using TOSS 3#Modules for more information.
Command                   Shorthand            Description
module avail              ml avail             List available modules
module load package       ml load package      Load a selected module
module list               ml                   Show modules currently loaded
module unload package     ml unload package    Unload a previously loaded module
module purge              ml purge             Unload all loaded modules
module reset              ml reset             Reset loaded modules to system defaults
module update             ml update            Reload all currently loaded modules
module display package    n/a                  Display the contents of a selected module
module spider             ml spider            List all modules (not just available ones)
module keyword key        ml keyword key       Search for available modules by keyword
module help               ml help              Display module help
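
A short illustrative session (the package/version shown is an example taken from the module listings later in this tutorial):

    % ml avail                # list available modules
    % ml load gcc/4.9.3       # load a specific package/version
    % ml                      # confirm what is currently loaded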

Compilers:

  • The following compilers are available and supported on LC's Sierra systems:
Compiler         Description
XL               IBM's XL C/C++ and Fortran compilers
Clang            IBM's C/C++ clang compiler
GNU              GNU compiler collection: C, C++, Fortran
PGI              Portland Group compilers
NVCC             NVIDIA's C/C++ compiler
Wrapper scripts  LC provides wrappers for most compiler commands (serial GNU compilers are the only exceptions). Additionally, LC provides wrappers for the MPI compiler commands.
  • Compilers are discussed in detail in the Compilers section.

Math Libraries

  • The following math libraries are available and supported on LC's Sierra systems:
Library                  Description
ESSL                     IBM's Engineering and Scientific Subroutine Library
MASS, MASSV              IBM's Mathematical Acceleration Subsystem libraries
BLAS, LAPACK, ScaLAPACK  Netlib Linear Algebra Packages
FFTW                     Fast Fourier Transform library
PETSc                    Portable, Extensible Toolkit for Scientific Computation library
GSL                      GNU Scientific Library
CUDA Tools               Math libraries included in the NVIDIA CUDA toolkit

Debuggers and Performance Analysis Tools:

Visualization Software and Compute Resources:

Compilers

Available Compilers:

  • The following compilers are available on Sierra systems, and are discussed in detail below, along with other relevant compiler related information:
    • XL: IBM's XL C/C++ and Fortran compilers
    • Clang: IBM's C/C++ clang compiler
    • GNU: GNU compiler collection, C, C++, Fortran
    • PGI: Portland Group compilers
    • NVCC: NVIDIA's C/C++ compiler

Compiler Recommendations:

  • The recommended and supported compilers are those delivered by IBM (XL and Clang) and NVIDIA (NVCC):
    • Only XL and Clang compilers from IBM provide OpenMP 4.5 with GPU support.
    • NVCC offers direct CUDA support
    • The IBM xlcuf compiler also provides direct CUDA support
    • Please report any problems you may have with these compilers to the LC Hotline so that fixes can be obtained from IBM and NVIDIA.
  • The other available compilers (GNU and PGI) can be used for experimentation and for comparisons to the IBM compilers:
    • Versions installed at LC do not provide OpenMP 4.5 with GPU support
    • If you experience problems with the PGI compilers, LC can forward those issues to PGI.
  • Using OpenACC on LC's Sierra clusters is not recommended nor supported.

Wrapper Scripts:

  • LC has created wrappers for most compiler commands, both serial and MPI versions.
  • The wrappers perform LC customization and error checking. They also follow a string of links, which include other wrappers.
  • The wrappers located in /usr/tce/bin (in your PATH) will always point (symbolic link) to the default versions.
  • Note There may also be versions of the serial compiler commands in /usr/bin. Do not use these, as they are missing the LC customizations.
  • If you load a different module version, your PATH will change, and the location may then be in either /usr/tce/bin or /usr/tcetmp/bin.
  • To determine the actual location of the wrapper, simply use the command which compilercommand to view its path.
  • Example: show location of default/current xlc wrapper, load a new version, and show new location:
    % which xlc
    /usr/tce/packages/xl/xl-beta-2017.11.28/bin/xlc

    % module load xl/beta-2018.02.05

    Due to MODULEPATH changes the following have been reloaded:
      1) spectrum-mpi/2017.11.10

    The following have been reloaded with a version change:
      1) xl/beta-2017.11.28 => xl/beta-2018.02.05

    % which xlc
    /usr/tce/packages/xl/xl-beta-2018.02.05/bin/xlc

Versions:

  • There are several ways to determine compiler versions, discussed below.
  • The default versions of the compiler wrappers are those pointed to from /usr/tce/bin.
  • To see available compiler module versions use the command module avail:
    • An (L) indicates which version is currently loaded.
    • A (D) indicates the default version.
  • For example:
    % module avail
    ------------------- /usr/tce/modulefiles/Compiler/xl/beta-2017.11.28 --------------------
       spectrum-mpi/2017.11.10 (L)

    ----------------------------- /usr/tcetmp/modulefiles/Core ------------------------------
       StdEnv                 (L)      gsl/2.3                   python/2.7.14
       clang/coral-2017.11.09 (D)      gsl/2.4            (D)    python/3.6.4         (D)
       clang/coral-2017.12.06          ibmppt/alpha-2.4.0        scorep/3.0.0
       cmake/3.7.2                     ibmppt/beta-2.4.0         tau/2.26.2
       cmake/3.9.2            (D)      ibmppt/beta2-2.4.0 (D)    tau/2.26.3           (D)
       cuda/8.0                        ibmppt/2.3                totalview/2016.07.22
       cuda/9.0.176                    makedepend/1.0.5          totalview/2017X.3.1
       cuda/9.0.184                    petsc/3.7.6               totalview/2017.0.12
       cuda/9.1.76            (L,D)    petsc/3.8.3        (D)    totalview/2017.1.21  (D)
       cuda/9.1.85                     pgi/17.4                  totalview/2017.2.11
       flex/2.6.4                      pgi/17.7                  xl/beta-2017.11.28   (L,D)
       gcc/4.9.3                       pgi/17.9           (D)    xl/beta-2018.02.05
       git/2.9.3                       pgi/17.10
       gmake/4.2.1                     python/2.7.13

    ------------------------- /usr/share/lmod/lmod/modulefiles/Core -------------------------
       lmod/6.5.1    settarg/6.5.1

      Where:
       L:  Module is loaded
       D:  Default Module

    Use "module spider" to find all possible modules.
    Use "module keyword key1 key2 ..." to search for all possible modules matching any of
    the "keys".
  • You can also use any of the following commands to get version information:
        module display compiler
        module help compiler
        module key compiler
        module spider compiler.
  • Examples below, using the IBM XL compiler (some output omitted):
    % module display xl

    -------------------------------------------------------------------------------------
       /usr/tcetmp/modulefiles/Core/xl/beta-2017.11.28.lua:
    -------------------------------------------------------------------------------------
    help([[LLVM/XL compiler beta 2017.11.28

    IBM XL C/C++ for Linux, V13.1.6 (5725-C73, 5765-J08)
    Version: 13.01.0006.0000
    The license for the ESP version of IBM XL C/C++ for Linux, V13.1.6 (Beta) compiler
    product will expire in 330 days on Wed Oct 31 22:00:00 2018.

    IBM XL Fortran for Linux, V15.1.6 (5725-C75, 5765-J10)
    Version: 15.01.0006.0000
    The license for the ESP version of IBM XL Fortran for Linux, V15.1.6 (Beta) compiler
    product will expire in 330 days on Wed Oct 31 22:00:00 2018.
    ]])
    whatis("Name: XL compilers")
    whatis("Version: beta-2017.11.28")
    whatis("Category: Compilers")
    whatis("URL: http://www.ibm.com/software/products/en/xlcpp-linux")
    family("compiler")
    prepend_path("MODULEPATH","/usr/tce/modulefiles/Compiler/xl/beta-2017.11.28")
    prepend_path("PATH","/usr/tce/packages/xl/xl-beta-2017.11.28/bin")
    prepend_path("MANPATH","/usr/tce/packages/xl/xl-beta-2017.11.28/xlC/13.1.6/man/en_US")
    prepend_path("MANPATH","/usr/tce/packages/xl/xl-beta-2017.11.28/xlf/15.1.6/man/en_US")

    % module help xl

    --------------------- Module Specific Help for "xl/beta-2017.11.28" ---------------------
    LLVM/XL compiler beta 2017.11.28

    IBM XL C/C++ for Linux, V13.1.6 (5725-C73, 5765-J08)
    Version: 13.01.0006.0000
    The license for the ESP version of IBM XL C/C++ for Linux, V13.1.6 (Beta) compiler
     product will expire in 330 days on Wed Oct 31 22:00:00 2018.

    IBM XL Fortran for Linux, V15.1.6 (5725-C75, 5765-J10)
    Version: 15.01.0006.0000
    The license for the ESP version of IBM XL Fortran for Linux, V15.1.6 (Beta) compiler
     product will expire in 330 days on Wed Oct 31 22:00:00 2018.

    % module key xl

    ------------------------------------------------------------------------------------
    The following modules match your search criteria: "xl"
    ------------------------------------------------------------------------------------
      spectrum-mpi: spectrum-mpi/2017.11.10
      xl: xl/beta-2017.11.28, xl/beta-2018.02.05
    ------------------------------------------------------------------------------------
    To learn more about a package enter:
       $ module spider Foo
    where "Foo" is the name of a module
    To find detailed information about a particular package you
    must enter the version if there is more than one version:
       $ module spider Foo/11.1

    % module spider xl

    ------------------------------------------------------------------------------------
      xl:
    ------------------------------------------------------------------------------------
         Versions:
            xl/beta-2017.11.28
            xl/beta-2018.02.05
    ------------------------------------------------------------------------------------
      For detailed information about a specific "xl" module (including how to load the
     modules) use the module's full name.
      For example:
         $ module spider xl/beta-2018.02.05
    ------------------------------------------------------------------------------------

    % module spider xl/beta-2018.02.05

    ------------------------------------------------------------------------------------
      xl: xl/beta-2018.02.05
    ------------------------------------------------------------------------------------
        This module can be loaded directly: module load xl/beta-2018.02.05
        Help:
          LLVM/XL compiler beta beta-2018.02.05
         
          IBM XL C/C++ for Linux, V13.1.7 (Beta 1)
          Version: 13.01.0007.0000
          The license for the ESP version of IBM XL C/C++ for Linux, V13.1.7 (Beta)
     compiler product will expire in 155 days on Thu Jul 12 22:00:00 2018.
         
          IBM XL Fortran for Linux, V15.1.7 (Beta 1)
          Version: 15.01.0007.0000
          The license for the ESP version of IBM XL Fortran for Linux, V15.1.7 (Beta)
     compiler product will expire in 155 days on Thu Jul 12 22:00:00 2018.
  • Finally, simply passing the --version option to the compiler invocation command will usually provide the version of the compiler. For example:
    % xlc --version
    IBM XL C/C++ for Linux, V13.1.6 (5725-C73, 5765-J08)
    Version: 13.01.0006.0000
    The license for the ESP version of IBM XL C/C++ for Linux, V13.1.6 (Beta) compiler
     product will expire in 251 days on Wed Oct 31 22:00:00 2018.

    % gcc --version
    gcc (GCC) 4.9.3
    Copyright (C) 2015 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    % clang --version
    clang version 3.8.0 (ibmgithub:/CORAL-LLVM-Compilers/clang.git 0c6657a4a7f35c9c0c4f7
    eab6e2dafac297918c1) (ibmgithub:/CORAL-LLVM-Compilers/llvm.git 2782ffd085eaa09d1c1ea
    0670851fa403b532f0)
    Target: powerpc64le-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /usr/tce/packages/clang/clang-coral-2017.11.09/ibm/bin

Selecting Your Compiler and MPI Version:

  • Compiler and MPI software is installed as packages under /usr/tce/packages and/or /usr/tcetmp/packages.
  • LC provides default packages for compilers and MPI. To see the current defaults, use the module avail command, as shown above in the Versions discussion. Note that a (D) next to a package shows that it is the default.
  • The default versions will change as newer versions are released.
    • It's recommended that you use the most recent default compilers to stay abreast of new fixes and features.
    • You may need to recompile your entire application when the default compilers change.
  • LMOD modules are used to select alternate compiler and MPI packages.
  • To select an alternate version of a compiler and/or MPI, use the following procedure:
  1. Use module list to see what's currently loaded
  2. Use module key compiler to see what compilers and MPI packages are available.
  3. Use module load package to load the selected package.
  4. Use module list again to confirm your selection was loaded.
  • Some examples (some output omitted):
    % module list

    Currently Loaded Modules:
      1) xl/beta-2017.11.28   2) spectrum-mpi/2017.11.10   3) cuda/9.1.76   4) StdEnv

    % module key compiler

    -----------------------------------------------------------------------------------
    The following modules match your search criteria: "compiler"
    -----------------------------------------------------------------------------------
      clang: clang/coral-2017.11.09, clang/coral-2017.12.06
      cuda: cuda/8.0, cuda/9.0.176, cuda/9.0.184, cuda/9.1.76, cuda/9.1.85
      gcc: gcc/4.9.3
      pgi: pgi/17.4, pgi/17.7, pgi/17.9, pgi/17.10
      spectrum-mpi: spectrum-mpi/2017.11.10
      xl: xl/beta-2017.11.28, xl/beta-2018.02.05
    -----------------------------------------------------------------------------------

    % module load xl/beta-2018.02.05

    Due to MODULEPATH changes the following have been reloaded:
      1) spectrum-mpi/2017.11.10

    The following have been reloaded with a version change:
      1) xl/beta-2017.11.28 => xl/beta-2018.02.05

    % module list

    Currently Loaded Modules:
      1) cuda/9.1.76   2) StdEnv   3) xl/beta-2018.02.05   4) spectrum-mpi/2017.11.10


    % module load pgi

    Lmod is automatically replacing "xl/beta-2018.02.05" with "pgi/17.9"

    Due to MODULEPATH changes the following have been reloaded:
      1) spectrum-mpi/2017.11.10

    % module list

    Currently Loaded Modules:
      1) cuda/9.1.76   2) StdEnv   3) pgi/17.9   4) spectrum-mpi/2017.11.10
  • Notes:
    • When a new compiler package is loaded, the MPI package will be reloaded to use a version built with the selected compiler.
    • Only one compiler package is loaded at a time, with a version of the IBM XL compiler being the default. If a new compiler package is loaded, it will replace what is currently loaded. The default compiler commands for all compilers will remain in your PATH however.

IBM XL Compilers:

IBM XL Compiler Commands
  C
    Serial:              xlc
    Serial + OpenMP 4.5: xlc-gpu
    MPI:                 mpixlc, mpicc
    MPI + OpenMP 4.5:    mpixlc-gpu, mpicc-gpu
  C++
    Serial:              xlC, xlc++
    Serial + OpenMP 4.5: xlC-gpu, xlc++-gpu
    MPI:                 mpixlC, mpiCC, mpic++, mpicxx
    MPI + OpenMP 4.5:    mpixlC-gpu, mpiCC-gpu, mpic++-gpu, mpicxx-gpu
  Fortran
    Serial:              xlf, xlf90, xlf95, xlf2003, xlf2008
    Serial + OpenMP 4.5: xlf-gpu, xlf90-gpu, xlf95-gpu, xlf2003-gpu, xlf2008-gpu
    MPI:                 mpixlf, mpixlf90, mpixlf95, mpixlf2003, mpixlf2008
    MPI + OpenMP 4.5:    mpixlf-gpu, mpixlf90-gpu, mpixlf95-gpu, mpixlf2003-gpu, mpixlf2008-gpu
  Comments: the -gpu commands add the flags -qsmp=omp and -qoffload
  • Thread safety: LC always aliases the XL compiler commands to their _r (thread-safe) versions. This is to prevent some known problems, particularly with Fortran. Note that the /usr/bin/xl* commands are not aliased this way, and they are not LC wrapper scripts - their use is discouraged.
  • OpenMP and GPU support: For convenience, LC provides the -gpu commands, which set the option -qsmp=omp for OpenMP and -qoffload for GPU offloading. Users can do this themselves without using the -gpu commands.
  • Optimizations:
    • The -O0, -O2, -O3 and -Ofast options control the level of optimizing transformations the compiler applies to user code, for both CPU and GPU code.
    • Options to target the Power8 architecture: -qarch=pwr8 -qtune=pwr8
    • Options to target the Power9 (Sierra) architecture: -qarch=pwr9 -qtune=pwr9
  • Debugging - recommended options:
    • -g -O0 -qsmp=omp:noopt -qoffload -qfullpath
    • noopt - minimizes optimization of OpenMP code. Without it, the XL compilers will still optimize your OpenMP code despite -O0. It also disables runtime inlining, which enables GPU debug information.
    • -qfullpath - adds the absolute paths of your source files into DWARF helping TotalView locate the source even if your executable moves to a different directory.
  • Documentation:
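  • Example - a sketch of typical XL compile lines using the commands and flags above (file and executable names are placeholders):

    # OpenMP 4.5 + GPU offload build using an LC -gpu wrapper command
    xlc-gpu -O2 -o myapp myapp.c
    # The equivalent build with the flags given explicitly
    xlc -O2 -qsmp=omp -qoffload -o myapp myapp.c
    # Debug build using the recommended options
    xlf2003 -g -O0 -qsmp=omp:noopt -qoffload -qfullpath -o myapp myapp.f90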

IBM Clang Compiler:

  • The Sierra systems use the Clang compiler from IBM.
  • As discussed previously:
  • Clang compiler commands are shown in the table below.
Clang Compiler Commands
  C
    Serial:              clang
    Serial + OpenMP 4.5: clang-gpu
    MPI:                 mpiclang
    MPI + OpenMP 4.5:    mpiclang-gpu
  C++
    Serial:              clang++
    Serial + OpenMP 4.5: clang++-gpu
    MPI:                 mpiclang++
    MPI + OpenMP 4.5:    mpiclang++-gpu
  Comments: the -gpu commands add the flags -fopenmp and -fopenmp-targets=nvptx64-nvidia-cuda
  • Use of LC's -gpu commands for OpenMP 4.5 and GPU support is recommended at this time since the native Clang flags are verbose and subject to change.
  • Documentation:
    • TO BE ADDED LATER

GNU Compilers:

GNU Compiler Commands
  C
    Serial:  gcc, cc
    MPI:     mpigcc
  C++
    Serial:  g++, c++
    MPI:     mpig++
  Fortran
    Serial:  gfortran
    MPI:     mpigfortran
  Comments: there are no -gpu (OpenMP 4.5 + GPU offload) commands. For OpenMP use the flag -fopenmp.
  For OpenMP 4.0 use module load gcc/4.9.3 (or more recent). OpenMP 4.5 will be supported in GCC 6.1.

PGI Compilers:

PGI Compiler Commands
  C
    Serial:  pgcc, cc
    MPI:     mpipgcc
  C++
    Serial:  pgc++
    MPI:     mpig++
  Fortran
    Serial:  pgf90, pgfortran
    MPI:     mpipgf90, mpipgfortran
  Comments: there are no -gpu (OpenMP 4.5 + GPU offload) commands. pgf90 and pgfortran are the same compiler, supporting the Fortran 2003 language specification.
  • OpenMP 4.5 and GPU support: GPU offload is not currently provided. Most of OpenMP 4.5 is supported, but not for GPU offload; target regions are implemented on the multicore host instead. See the product documentation (link below) "Installation Guide and Release Notes" for details.
  • GPU support is via CUDA and OpenACC.
  • Documentation:

NVIDIA NVCC Compiler:

  • The NVIDIA nvcc compiler driver is used to compile C/C++ CUDA code:
    • nvcc compiles the CUDA code.
    • Non-CUDA compilation steps are forwarded to a C/C++ host (backend) compiler supported by nvcc.
    • nvcc also translates its options to appropriate host compiler command line options.
    • NVCC currently supports XL, GCC, and PGI C++ backends, with GCC being the default.
  • Location:
    • The NVCC C/C++ compiler is located under /usr/tce/packages/cuda/.
    • Other NVIDIA software and utilities (like nvprof, nvvp) are located here also.
    • The default CUDA build should be in your default PATH.
  • As discussed previously:
  • Architecture flag:
    • Tesla P100 (Pascal) for Early Access systems: -arch=sm_60
    • Tesla V100 (Volta) for Sierra systems: -arch=sm_70
  • Selecting a host compiler:
    • The GNU C/C++ compiler is used as the backend compiler by default.
    • To select a different backend compiler, use the -ccbin=compiler flag. For example:
nvcc -arch=sm_70 -ccbin=xlC myprog.cu
nvcc -arch=sm_70 -ccbin=clang myprog.cu
  • The alternate backend compiler needs to be in your path. Otherwise you need to specify the full pathname.
  • Source file suffixes:
  • Source files with CUDA code should have a .cu suffix.
  • If source files have a different suffix, use the -x cu flag. For example:
nvcc -arch=sm_70 -ccbin=xlc -x cu myprog.c
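  • Example - a hypothetical two-step build that compiles CUDA code with nvcc and links with an MPI wrapper script. File names are placeholders, and the CUDA library path shown is illustrative - use the path of the cuda package you actually have loaded under /usr/tce/packages/cuda/:

    nvcc -arch=sm_70 -ccbin=xlC -c kernels.cu -o kernels.o     # CUDA code, XL as the host (backend) compiler
    mpixlC -c main.cpp -o main.o                                # non-CUDA code via the MPI wrapper
    mpixlC main.o kernels.o -o myapp -L/usr/tce/packages/cuda/<version>/lib64 -lcudart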

MPI

IBM Spectrum MPI:

  • IBM Spectrum MPI is the only supported MPI library on LC's Sierra and CORAL EA systems.
  • IBM Spectrum MPI supports many, but not all of the features offered by Open MPI. It also adds some unique features of its own.
  • Implements MPI API 3.1.0
  • Supported features and usage notes:
    • 64-bit Little Endian for IBM Power Systems, with and without GPUs.
    • Thread safety: MPI_THREAD_MULTIPLE (multiple threads executing within the MPI library). However, multithreaded I/O is not supported.
    • GPU support using CUDA-aware MPI and NVIDIA GPUDirect RDMA. See product documentation for restrictions and limitations.
    • Parallel I/O: supports only ROMIO version 3.1.4. Multithreaded I/O is not supported. See the Spectrum MPI User's Guide for details.
    • MPI Collective Operations: defaults to using IBM's libcollectives library. Provides optimized collective algorithms and GPU memory buffer support. Using the Open MPI collectives is also supported. See the Spectrum MPI User's Guide for details.
    • Mellanox Fabric Collective Accelerator (FCA) support for accelerating collective operations.
    • Portable Hardware Locality (hwloc) support for displaying hardware topology information.
    • IBM Platform LSF workload manager is supported
    • Debugger support for Allinea DDT and Rogue Wave TotalView.
    • Process Management Interface Exascale (PMIx) support - see https://pmix.github.io/pmix/ for details.
  • Spectrum MPI provides the ompi_info command for reporting detailed information on the MPI installation. Simply type ompi_info.
  • Limitations: see the IBM Spectrum MPI Release Notes, excerpted HERE.
  • For additional information about IBM Spectrum MPI, see the links under "Documentation" below.

Versions:

  • Use the module avail mpi command to display available MPI packages. For example:
    % module avail mpi

    --------------------- /usr/tce/modulefiles/Compiler/xl/beta-2018.02.22 ---------------------
       spectrum-mpi/2017.11.10    spectrum-mpi/2018.02.05 (L,D)

      Where:
       L:  Module is loaded
       D:  Default Module
  • As noted above, the default version is indicated with a (D), and the currently loaded version with a (L).
  • For more detailed information about versions, see the discussion under Compilers ==> Versions.
  • Selecting an alternate MPI version: simply use the command module load package.
  • For additional discussion on selecting alternate versions, see Compilers ==> Selecting Your Compiler and MPI Version.

MPI and Compiler Dependency:

  • Each available version of MPI is built with each version of the available compilers.
  • The MPI package you have loaded will depend upon the compiler package you have loaded, and vice-versa:
    • Changing the compiler will automatically load the appropriate MPI-compiler build.
    • Changing the MPI package will automatically load an appropriate MPI-compiler build.
  • For example:
    • Show the currently loaded modules
    • Show details on the loaded MPI module
    • Load a different compiler and show how it changes the MPI build that's loaded
    % module list
    Currently Loaded Modules:
      1) xl/beta-2018.02.22   2) spectrum-mpi/2018.02.05   3) cuda/9.1.76   4) StdEnv

    % module whatis spectrum-mpi/2018.02.05
    spectrum-mpi/2018.02.05         : mpi/spectrum-mpi
    spectrum-mpi/2018.02.05         : spectrum-mpi-2018.02.05 for xl-beta-2018.02.22 compilers

    % module load xl/beta-2017.11.28
    Due to MODULEPATH changes the following have been reloaded:
      1) spectrum-mpi/2018.02.05

    The following have been reloaded with a version change:
      1) xl/beta-2018.02.22 => xl/beta-2017.11.28

    % module whatis spectrum-mpi/2018.02.05
    spectrum-mpi/2018.02.05         : mpi/spectrum-mpi
    spectrum-mpi/2018.02.05         : spectrum-mpi-2018.02.05 for xl-beta-2017.11.28 compilers

MPI Compiler Commands:

  • LC uses wrapper scripts for all of its MPI compiler commands. Wrapper scripts are discussed HERE.
  • The table below lists the MPI commands for each compiler family.
MPI Compiler Commands
  IBM XL
    C:        mpixlc, mpicc
              (+ OpenMP 4.5: mpixlc-gpu, mpicc-gpu)
    C++:      mpixlC, mpiCC, mpic++, mpicxx
              (+ OpenMP 4.5: mpixlC-gpu, mpiCC-gpu, mpic++-gpu, mpicxx-gpu)
    Fortran:  mpixlf, mpixlf90, mpixlf95, mpixlf2003, mpixlf2008
              (+ OpenMP 4.5: mpixlf-gpu, mpixlf90-gpu, mpixlf95-gpu, mpixlf2003-gpu, mpixlf2008-gpu)
    The -gpu commands add the flags -qsmp=omp and -qoffload.
  Clang
    C:        mpiclang       (+ OpenMP 4.5: mpiclang-gpu)
    C++:      mpiclang++     (+ OpenMP 4.5: mpiclang++-gpu)
    The -gpu commands add the flags -fopenmp and -fopenmp-targets=nvptx64-nvidia-cuda.
  GNU
    C:        mpigcc
    C++:      mpig++
    Fortran:  mpigfortran
    There are no -gpu commands. For OpenMP use the flag -fopenmp. For OpenMP 4.0 use module load gcc/4.9.3 (or more recent). OpenMP 4.5 will be supported in GCC 6.1.
  PGI
    C:        mpipgcc
    C++:      mpig++
    Fortran:  mpipgf90, mpipgfortran
    There are no -gpu commands. pgf90 and pgfortran are the same compiler, supporting the Fortran 2003 language specification.

Running MPI Jobs:

  • Running MPI jobs on LC's Sierra systems is very different from running them on other LC clusters.
  • IBM Platform LSF is used as the workload manager, not SLURM:
    • LSF syntax is used in batch scripts
    • LSF commands are used to submit, monitor and interact with jobs
  • The MPI job launch command is jsrun:
    • Replaces srun and mpirun on other clusters.
    • Developed by IBM for the Oak Ridge and Livermore CORAL systems.
    • Runs jobs on resources allocated through the LSF batch scheduler
    • Similar to srun and mpirun in functionality, but has a very different syntax, mostly due to its concept of Resource Sets
  • See the Running Jobs section for details on running MPI jobs.
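  • Example - a minimal sketch of compiling and launching an MPI code (the details are covered in the Running Jobs section; file, account and queue names are placeholders):

    # Compile with an MPI wrapper script on a login node
    mpixlc -O2 -o hello_mpi hello_mpi.c
    # Request an interactive allocation of 1 node, then launch 4 tasks, each with 1 core and 1 GPU
    bsub -nnodes 1 -Ip -G guests -q pdebug /usr/bin/tcsh
    jsrun -n4 -a1 -c1 -g1 ./hello_mpi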

Documentation:

OpenMP

OpenMP Support:

  • The OpenMP API is supported on Sierra systems for single-node, shared-memory parallel programming in C/C++ and Fortran.
  • On Sierra systems, the primary motivation for using OpenMP is to take advantage of the GPUs on each node:
    • OpenMP is used in combination with MPI as usual
    • On-node: MPI tasks identify computationally intensive sections of code for offloading to the node's GPUs
    • On-node: Parallel regions are executed on the node's GPUs
    • Inter-node: Tasks coordinate work across the network using MPI message passing communications
  • Note The ability to perform GPU offloading depends upon the compiler being used - see the table below.
  • The version of OpenMP support depends upon the compiler used. For example:
Compiler                        OpenMP Support       GPU Offloading?
IBM XL C/C++ version 13+        Most of OpenMP 4.5   Yes
IBM XL Fortran version 15+      Most of OpenMP 4.5   Yes
IBM Clang C/C++ version 3.8     Most of OpenMP 4.5   Yes
GNU version 4.9.3               OpenMP 4.0           No
PGI version 17+                 Most of OpenMP 4.5   No

See http://www.openmp.org/resources/openmp-compilers/ for the latest information.

Compiling:

  • The usual compiler flags are used to turn on OpenMP compilation.
  • GPU offloading currently requires additional flag(s) when supported.
  • Note For convenience, LC has created *-gpu wrapper scripts which turn on both OpenMP and GPU offloading (if applicable). Simply append -gpu to the usual compiler command. For example: mpixlc-gpu.
  • Also for convenience, LC aliases all IBM XL compiler commands to their thread-safe (_r) counterparts.
  • The table below summarizes OpenMP compiler flags and wrapper scripts.
Compiler    OpenMP flag   GPU offloading flag                    LC *-gpu wrappers?
IBM XL      -qsmp=omp     -qoffload                              Yes
IBM Clang   -fopenmp      -fopenmp-targets=nvptx64-nvidia-cuda   Yes
GNU         -fopenmp      n/a                                    No
PGI         -mp           n/a                                    No
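  • Example - OpenMP compile lines for each compiler family, based on the flags and wrapper scripts above (file names are placeholders):

    mpixlc-gpu -o app app.c                                               # IBM XL wrapper: adds -qsmp=omp -qoffload
    mpiclang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o app app.c   # Clang with the offload flags given explicitly
    mpigcc -fopenmp -o app app.c                                          # GNU: host-only OpenMP
    pgcc -mp -o app app.c                                                 # PGI: host-only OpenMP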

More Information:

System Configuration and Status Information

First Things First:

  • Before you attempt to run your parallel application, it is important to know a few details about the way the system is configured. This is especially true at LC where every system is configured differently and where things change frequently.
  • It is also useful to know the status of the machines you intend to use. Are they available or down for maintenance?
  • System configuration and status information for all LC systems is readily available from the LC Homepage and the MyLC Portal, as summarized below.


MyLC User Portal: mylc.llnl.gov

System Configuration Information:

System Configuration Commands:

  • After logging into a machine, there are a number of commands that can be used for determining detailed, real-time machine hardware and configuration information.
  • A table of some useful commands with example output is provided below. Hyperlinked commands display their man page.
Command                       Description
news job.lim.machinename      LC command for displaying system configuration, job limits and usage policies, where machinename is the actual name of the machine.
lscpu                         Basic information about the CPU(s), including model, cores, sockets, threads, clock and cache.
lscpu -e                      One line of basic information about the CPU(s), cores, sockets, threads and clock.
cat /proc/cpuinfo             Model and clock information for each thread of each core.
lstopo                        Display a graphical topological map of node hardware.
lstopo --only cores           List the physical cores only.
lstopo -v                     Detailed (verbose) information about a node's hardware components.
vmstat -s                     Memory configuration and usage details.
cat /proc/meminfo             Memory configuration and usage details.
uname -a                      Display operating system details, version.
distro_version
cat /etc/redhat-release
cat /etc/toss-release
bdf                           Show mounted file systems.
df -h
bparams                       Display LSF system settings and options.
bqueues                       Display LSF queue information.
bhosts                        Display information about LSF hosts.
lshosts                       Display information about LSF hosts.

See the LSF Configuration Commands section for additional information.

System Status Information:

  • LC Homepage:
    • hpc.llnl.gov (User Portal toggle) - just look on the main page for the System Status links.
    • The same links appear under the Hardware menu.
    • Unclassified systems only
  • MyLC Portal:
  • Machine status email lists:
    • Provide the most timely status information for system maintenance, problems, and system changes/updates
    • ocf-status and scf-status cover all machines on the OCF / SCF
    • Additionally, each machine has its own status list - for example:
      sierra-status@llnl.gov
  • Login banner & news items - always displayed immediately after logging in
    • Login banner includes basic configuration information, announcements and news items. Example login banner HERE.
    • News items (unread) appear at the bottom of the login banner. For usage, type news -h.
  • Direct links for systems and file systems status pages:
Description                      Network     Link
System status web pages          OCF CZ      https://lc.llnl.gov/cgi-bin/lccgi/customstatus.cgi
                                 OCF RZ      https://rzlc.llnl.gov/cgi-bin/lccgi/customstatus.cgi
                                 SCF         https://lc.llnl.gov/cgi-bin/lccgi/customstatus.cgi
File Systems status web pages    OCF CZ      https://lc.llnl.gov/fsstatus/fsstatus.cgi
                                 OCF RZ      https://rzlc.llnl.gov/fsstatus/fsstatus.cgi
                                 OCF CZ+RZ   https://rzlc.llnl.gov/fsstatus/allfsstatus.cgi
                                 SCF         https://lc.llnl.gov/fsstatus/fsstatus.cgi

Running Jobs

Overview

  • A brief summary of running jobs is provided below, with more detail in sections that follow.

Very Different From Other LC Systems:

  • Although Sierra systems share a number of similarities with other LC clusters, running jobs is very different.
  • IBM Spectrum LSF is used as the Workload Manager instead of Slurm:
    • Entirely new command set for submitting, monitoring and interacting with jobs.
    • Entirely new command set for querying the system's configuration, queues, job statistics and accounting information.
    • New syntax for creating job scripts.
  • The jsrun command is used to launch jobs instead of Slurm's srun command:
    • Developed by IBM for the LLNL and Oak Ridge CORAL systems.
    • Command syntax is very different.
    • New concept of resource sets for defining how a node looks to a job.
  • There are both login nodes and launch nodes:
    • Users login to login nodes, which are shared by other users. Intended for interactive activities such as editing files, compiling, submitting batch/interactive jobs, running GUIs. Not intended for running production, parallel jobs.
    • Batch/interactive jobs submitted from a login node are executed on a launch node. Launch nodes are shared among user jobs.
    • Serial jobs (those not using the jsrun command) run on the launch node.
    • Parallel jobs using the jsrun command will run on compute nodes, which are dedicated to a single user's job. Note that the jsrun command itself still executes on the launch node though.

Accounts and Allocations:

  • In order to run jobs on any LC system, users must have a valid login account.
  • Additionally, users must have a valid allocation (bank) on the system.

Queues:

  • As with other LC systems, compute nodes are divided into queues:
    • pbatch: contains the majority of compute nodes; where most production work is done; larger job size and time limits.
    • pdebug: contains a smaller subset of compute nodes; intended for short, small debugging jobs.
    • Other queues are often configured for specific purposes.
  • Real production work must run in a compute node queue, not on a login or launch node.
  • Each queue has specific limits that can include:
    • Default and maximum number of nodes that a job may use
    • Default and maximum amount of time a job may run
    • Number of jobs that may run simultaneously
    • Other limits and restrictions as configured by LC

Batch Jobs - General Workflow:

  1. Login to a login node.
  2. Create / prepare executables and associated files.
  3. Create an LSF job script.
  4. Submit the job script to LSF with the bsub command. For example:
    bsub < myjobscript
  5. LSF will migrate the job to a launch node and acquire the requested allocation of compute nodes from the requested queue. If not specified, the default queue (usually pbatch) will be used.
  6. The jsrun command is used within the job script to launch the job on compute nodes. If jsrun is not used, then the job will run on the launch node only.
  7. Monitor and interact with the job from a login node using the relevant LSF commands.

Interactive Jobs - General Workflow:

  1. Login to a login node.
  2. Create / prepare executables and associated files.
  3. From the login node command line, request an interactive allocation of compute nodes from LSF with the bsub command. For example:
    bsub -nnodes 16 -Ip -G guests -q pdebug /usr/bin/tcsh
    Requests 16 nodes, Interactive pseudo-terminal, guests account, pdebug queue, running the tcsh shell.
  4. LSF will migrate the job to a launch node and acquire the requested allocation of compute nodes from the requested queue. If not specified, the default queue (usually pbatch) will be used.
  5. When ready, an interactive terminal session will begin on the launch node.
  6. From here, shell commands, scripts or parallel jobs can be executed from the launch node:
    Parallel jobs are launched with the jsrun command from the shell command line or from within a user script; they will execute on the allocated compute nodes.
    Non-jsrun jobs will run on the launch node only.
  7. LSF commands can be used to monitor and interact with the job, either from a login node or the launch node.

Running Jobs

Batch Scripts and #BSUB / bsub

LSF Batch Scripts:

  • As with all other LC systems, running batch jobs requires the use of a batch job script:
    • Plain text file created by the user to describe job requirements, environment and execution logic
    • Commands, directives and syntax specific to a given batch system
    • Shell scripting
    • References to environment and script variables
    • The application(s) to execute along with input arguments and options
  • What makes Sierra systems different is that IBM Spectrum LSF is used as the Workload Manager instead of Slurm:
    • Batch scripts are required to use LSF #BSUB syntax
    • Shell scripting, environment variables, etc. are the same as other batch scripts
  • An example LSF batch script is shown below. The #BSUB syntax is discussed next.

    #!/bin/tcsh
    ### LSF syntax
    #BSUB -nnodes 8                   #number of nodes
    #BSUB -W 120                      #walltime in minutes
    #BSUB -G guests                   #account
    #BSUB -e myerrors.txt             #stderr
    #BSUB -o myoutput.txt             #stdout
    #BSUB -J myjob                    #name of job
    #BSUB -q pbatch                   #queue to use

    ### Shell scripting
    date; hostname
    echo -n 'JobID is '; echo $LSB_JOBID
    cd /p/gscratch1/joeuser/project
    cp ~/inputs/run2048.inp .

    ### Launch parallel executable
    jsrun -n16 -r2 -a20 -g2 -c20 myexec

    echo 'Done'

  • Usage notes:
    • The #BSUB keyword is case sensitive
    • The jsrun command is used to launch parallel jobs

#BSUB / bsub:

  • Within a batch script, #BSUB keyword syntax is used to specify LSF job options.
  • The bsub command is then used to submit the batch script to LSF for execution. For example:
    bsub < mybatchscript
    Note the use of input redirection to submit the batch script. This is required.
  • The exact same options specified by #BSUB in a batch script can be specified on the command line with the bsub command. For example:
    bsub -q pdebug < mybatchscript
  • If bsub and #BSUB options conflict, the command line option will take precedence.
  • The table below lists some of the more common #BSUB / bsub options.
    For other options and more in-depth information, consult the bsub man page and/or the LSF documentation.
Common BSUB Options (the same options can also be given on the bsub command line)
  -B        Send email when the job begins.
            Example:  #BSUB -B
  -b        Dispatch the job for execution on or after the specified date and time - in this case 3:00 pm. Time format is [[[YY:]MM:]DD:]hh:mm.
            Example:  #BSUB -b 15:00
  -cwd      Specifies the current working directory for job execution. The default is the directory from where the job was submitted.
            Example:  #BSUB -cwd /p/gscratch1/joeuser/
  -e        File into which job stderr will be written. If used, %J will be replaced with the job ID number. If the file already exists, it is appended by default; use -eo to overwrite. If -e is not used, stderr is combined with stdout in the stdout file by default.
            Examples: #BSUB -e mystderr.txt
                      #BSUB -e joberrors.%J
                      #BSUB -eo mystderr.txt
  -G        At LC this option specifies the account to be used for the job. Required.
            Example:  #BSUB -G guests
  -H        Holds the job in the PSUSP state when the job is submitted. The job is not scheduled until you tell the system to resume it using the bresume command.
            Example:  #BSUB -H
  -i        Gets the standard input for the job from the specified file path.
            Example:  #BSUB -i myinputfile.txt
  -Ip       Interactive only. Submits an interactive job and creates a pseudo-terminal when the job starts. See the Interactive Jobs section for details.
            Example:  bsub -Ip -G guests /bin/tcsh
  -J        Specifies the name of the job. The default name is the name of the job script.
            Example:  #BSUB -J myjobname
  -N        Send email when the job ends.
            Example:  #BSUB -N
  -nnodes   Number of nodes to use.
            Example:  #BSUB -nnodes 128
  -o        File into which job stdout will be written. If used, %J will be replaced with the job ID number. The default output file name is jobid.out. stderr is combined with stdout by default. If the output file already exists, it is appended by default; use -oo to overwrite.
            Examples: #BSUB -o myoutput.txt
                      #BSUB -o joboutput.%J
                      #BSUB -oo myoutput.txt
  -q        Specifies the name of the queue to use.
            Example:  #BSUB -q pdebug
  -r, -rn   Rerun the job if the system fails. Will not rerun if the job itself fails. Use -rn to never rerun the job.
            Examples: #BSUB -r
                      #BSUB -rn
  -W        Requested maximum walltime - 60 minutes in the example shown. Format is [hours:]minutes, not [[hours:]minutes:]seconds as in Slurm.
            Example:  #BSUB -W 60
  -w        Specifies a job dependency - in this case, waiting for jobid 22438 to complete. See the man page and/or documentation for dependency expression options.
            Example:  #BSUB -w ended(22438)
  -XF       Use X11 forwarding.
            Example:  #BSUB -XF
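  Example - because command line options take precedence over #BSUB options, a script can be resubmitted with different resources without editing it (the script name is a placeholder):

    # Same script, but run in pdebug with 2 nodes and a 30 minute limit
    bsub -q pdebug -nnodes 2 -W 30 < mybatchscript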

What Happens After You Submit Your Job?:

  • As shown previously, the bsub command is used to submit your job to LSF from a login node. For example:

    bsub  <  mybatchscript

  • If successful, LSF will migrate execution of your script to a launch node.
  • An allocation of compute nodes will be acquired for your job in a batch queue - either one specified by you, or the default queue.
  • The jsrun command is used from within your script to launch your job on the allocation of compute nodes. Your executable then runs on the compute nodes.
  • Without jsrun, your executable will run on the launch node - not typically desired, but certainly possible to do.

Environment Variables:

  • By default, LSF will import most (if not all) of your environment variables so they are available to your job.
  • If for some reason you are missing environment variables, you can use the #BSUB/bsub -env option to specify variables to import. See the man page for details.
  • Additionally, LSF provides a number of its own environment variables. Some of these may be useful for querying purposes within your batch script. The table below lists a few common ones.
Variable Description
LSB_JOBID The ID assigned to the job by LSF
LSB_JOBNAME The job's name
LS_JOBPID The job's process ID
LSB_JOBINDEX The job's index (if it belongs to a job array)
LSB_HOSTS The hosts assigned to run the job
LSB_QUEUE The queue from which the job was dispatched
LS_SUBCWD The directory from which the job was submitted
  • To see the entire list of LSF environment variables, simply use a command like printenv, set or setenv (shell dependent) in your batch script, and look for variables that start with LSB_ or LS_.
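  • Example - a sketch of using LSF environment variables inside a batch script (directory paths are placeholders; the syntax matches the tcsh example script shown earlier):

    ### Shell scripting
    echo "Job $LSB_JOBID ($LSB_JOBNAME) was submitted from $LS_SUBCWD"
    # Give each run its own working directory, keyed by the LSF job ID
    mkdir -p /p/gscratch1/joeuser/run.$LSB_JOBID
    cd /p/gscratch1/joeuser/run.$LSB_JOBID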

Running Jobs

Interactive Jobs

  • Interactive jobs are often useful for quick debugging and testing purposes:
    • Allow you to acquire an allocation of compute nodes that can be interacted with from the shell command line.
    • No handing things over to LSF, and then waiting for the job to complete.
    • Easy to experiment with multiple "on the fly" runs.
  • There are two main "flavors" of interactive jobs:
    • Pseudo-terminal shell - uses your existing SSH login window
    • Xterm - launches a new window using your default login shell
  • Example: Starting a pseudo-terminal interactive job:

    sierra4358% bsub -nnodes 4 -Ip -XF -W 180 -G guests /bin/tcsh
    Job <22524> is submitted to default queue <pbatch>.
    <<ssh X11 forwarding job>>
    <<Waiting for dispatch ...>>
    <<Starting on sierra4371>>

    sierra4371% nodeattr -c launch
    sierra4367,sierra4368,sierra4369,sierra4370,sierra4371

    sierra4371% bjobs -X
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    22524   blaise  RUN   pbatch     sierra4358  1*sierra437 /bin/tcsh  Mar 29 15:55
                                                 40*sierra402
                                                 40*sierra405
                                                 40*sierra408
                                                 40*sierra409
    sierra4371%

From a login node, the bsub command is used to request 4 nodes in an Interactive pseudo-terminal, X11 Forwarding, Wall clock limit of 180 minutes, with the guests account in a tcsh shell.
After dispatch, the interactive session starts on a different node.
The nodeattr command is used to verify it is a launch node.
The bjobs -X command is used to display the compute nodes allocated for this job.

  • Example: Starting an xterm interactive job:

    sierra4358% bsub -nnodes 4 -XF -W 180 -G guests xterm -sb -ls -fn ergo17 -rightbar
    Job <22530> is submitted to default queue <pbatch>.
    <<ssh X11 forwarding job>>
    <<Waiting for dispatch ...>>
    sierra4358%
    [ xterm running on a launch node appears on screen at this point ]

Similar, but opens a new xterm window on a launch node instead of a tcsh shell in the existing window.
The xterm options follow the xterm command.

  • How it works:
    • Issuing the bsub command from a login node results in control being dispatched to a launch node.
    • An allocation of compute nodes is acquired with the -nnodes option. If not specified, the default is one node.
    • The compute node allocation will be in the default queue, usually pbatch. The desired queue can be explicitly specified with the bsub -q option.
    • When ready, your pseudo-terminal or xterm session will run on the launch node. From there, you can use the jsrun command to launch parallel tasks on the compute nodes.
    • Without jsrun, commands will execute on the launch node.
  • Usage notes:
    • Most of the other bsub options not shown should work as expected.
    • In addition to the -Ip option, LSF provides -Is, -IS, -ISp, -ISs options which seem similar. See the man page/documentation for details. Your mileage may vary.
    • Exiting the pseudo-terminal shell, or the xterm, will terminate the job.

Running Jobs

jsrun Command and Resource Sets

jsrun Overview:

  • The jsrun command is the parallel job launch command for Sierra systems.
  • Replaces srun and mpirun used on other LC systems:
    • Similar in function, but very different conceptually and in syntax.
    • Based upon an abstraction called resource sets.
  • Basic syntax (described in detail below):
    jsrun  [options]  [executable]
  • Developed by IBM for the LLNL and Oak Ridge CORAL systems:
    • Part of the IBM Job Step Manager (JSM) software package for managing a job allocation provided by the resource manager.
    • Integrated into the IBM Spectrum LSF Workload Manager.

Resource Sets:

  • A Sierra node provides the following resources to user jobs:
    • 40 cores; 20 per socket. Note: two cores on each socket are reserved for the operating system and are therefore not included in this count.
    • 160 hardware threads; 4 per core
    • 4 GPUs; 2 per socket
  • In the simplest sense, a resource set describes how a node's resources should look to a job.
  • A basic resource set definition consists of:
    • Number of tasks
    • Number of cores
    • Number of GPUs
    • Memory allocation
  • Rules:
    • Described in terms of a single node's resources
    • Can span sockets on a node
    • Cannot span multiple nodes
    • Defaults are used if any resource is not explicitly specified.
  • Example resource sets:
    • 4 tasks ♦ 4 cores ♦ 1 GPU - fits on 1 socket
    • 4 tasks ♦ 16 cores ♦ 2 GPUs - fits on 1 socket
    • 16 tasks ♦ 16 cores ♦ 4 GPUs - requires both sockets
  • After defining the resource set, you need to define:
    • The number of Nodes required for the job
    • How many Resource Sets should be on each node
    • The total number of Resource Sets for the entire job
  • These parameters are then provided to the jsrun command as options/flags.
  • Examples with the corresponding jsrun options shown:
    • Resource set: 4 tasks ♦ 4 cores ♦ 1 GPU  =>  -a4 -c4 -g1
      Job: 2 nodes, 4 resource sets per node, 8 resource sets total  =>  -r4 -n8
    • Resource set: 4 tasks ♦ 16 cores ♦ 2 GPUs  =>  -a4 -c16 -g2
      Job: 2 nodes, 2 resource sets per node, 4 resource sets total  =>  -r2 -n4

 

jsrun Options:

Option (short)   Option (long)           Description
-a               --tasks_per_rs          Number of tasks per resource set
-b               --bind                  Binding of tasks within a resource set. Can be none, rs, or packed:#
-c               --cpu_per_rs            Number of CPUs (cores) per resource set
-d               --launch_distribution   Specifies how tasks are started on resource sets. Options are cyclic, packed, plane:#. See the man page for details.
-E, -F, -D       --env var, --env_eval, --env_no_propagate
                                         Specify how to handle environment variables. See the man page for details.
-g               --gpu_per_rs            Number of GPUs per resource set
-l               --latency_priority      Latency priority. Controls layout priorities. Can currently be cpu-cpu, gpu-cpu, gpu-gpu, memory-memory, cpu-memory or gpu-memory. See the man page for details.
-n               --nrs                   Total number of resource sets for the job
-m               --memory_per_rs         Specifies the number of megabytes (1,048,576 bytes) of memory to assign to a resource set. Use the -S option to view the memory setting.
-p               --np                    Number of tasks to start. By default, each task is assigned its own resource set that contains a single CPU.
-r               --rs_per_host           Number of resource sets per host (node)
-S filename      --save_resources        Specifies that the resources used for the job step are written to filename
-t, -o, -e, -k   --stdio_input, --stdio_stdout, --stdio_mode, --stdio_stderr
                                         Specify how to handle stdin, stdout and stderr. See the man page for details.
-V               --version               Displays the version of jsrun / Job Step Manager (JSM)
  • Examples:
    These examples assume that 40 cores per node are available for user tasks (4 are reserved for the operating system), and each node has 4 GPUs.
    White space between an option and its argument is optional.
jsrun Command                             Description
jsrun -p72 a.out                          72 tasks, no GPUs; 2 nodes, 40 tasks on node1, 32 tasks on node2
jsrun -n8 -a1 -c1 -g1 a.out               8 resource sets, each with 1 task and 1 GPU; 2 nodes, 2 tasks per socket
jsrun -n8 -a1 -c4 -g1 -bpacked:4 a.out    8 resource sets, each with 1 task with 4 threads (cores) and 1 GPU; 2 nodes, 2 tasks per socket
jsrun -n8 -a2 -c2 -g1 a.out               8 resource sets, each with 2 tasks and 1 GPU; 2 nodes, 4 tasks per socket
jsrun -n4 -a1 -c1 -g2 a.out               4 resource sets, each with 1 task and 2 GPUs; 2 nodes, 1 task per socket

Running Jobs

Job Dependencies

#BSUB -w Option:

  • As with other batch systems, LSF provides a way to place dependencies on jobs to prevent them from running until other jobs have started, completed, etc.
  • The #BSUB -w option is used to accomplish this. The syntax is:
#BSUB -w  dependency_expression
  • A dependency expression is a logical expression comprised of one or more dependency conditions. It can include logical and relational operators such as:
    && (AND)          || (OR)            ! (NOT)
    >                 >=                 <
    <=                ==                 !=
  • Several dependency examples are shown in the table below:
Example Description
#BSUB -w started(22345) Job will not start until job 22345 starts. Job 22345 is considered to have started if it is in any of the following states: USUSP, SSUSP, DONE, EXIT or RUN (with any pre-execution command specified by bsub -E completed)
#BSUB -w done(22345)
#BSUB -w 22345
Job will not start until job 22345 has a state of DONE (completed normally). If a job ID is given with no condition, done() is assumed.
#BSUB -w exit(22345) Job will not start until job 22345 has a state of EXIT (completed abnormally)
#BSUB -w ended(22345) Job will not start until job 22345 has a state of EXIT or DONE
#BSUB -w done(22345) && started(33445) Job will not start until job 22345 has a state of DONE and job 33445 has started
  • Usage notes:
    • The -w option can be used with the bsub command, but it is extremely limited there because parentheses and relational operators cannot be included on the command line.
    • LSF requires that valid jobids be specified - can't use non-existent jobids.
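  • Example - chaining two jobs so the second starts only after the first completes (script names and the job ID are placeholders; a bare job ID given to -w is treated as done()):

    % bsub -G guests < preprocess_script
    Job <22345> is submitted to default queue <pbatch>.
    % bsub -G guests -w 22345 < simulate_script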

bjdepinfo Command:

  • The bjdepinfo command can be used to view job dependency information. More useful than the bjobs -l command.
  • See the bjdepinfo man page and/or the LSF Documentation for details.
  • Examples are shown below:
    % bjdepinfo 30290
    JOBID          PARENT         PARENT_STATUS  PARENT_NAME  LEVEL
    30290          30285          RUN            *mmat 500    1


    % bjdepinfo -r3 30290
    JOBID          PARENT         PARENT_STATUS  PARENT_NAME  LEVEL
    30290          30285          RUN            *mmat 500    1
    30285          30271          DONE           *mmat 500    2
    30271          30267          DONE           *mmat 500    3

Monitoring Jobs

  • LSF provides several commands for monitoring jobs. The most useful one is the bjobs command.
  • Additionally LC provides a locally developed command for monitoring jobs called lsfjobs.

bjobs:

  • Provides a number of options for displaying a range of job information - from summary to detailed.
  • The table below shows some of the more commonly used options.
  • See the bjobs man page and/or the LSF Documentation for details.
Command                     Description
bjobs                       Show your currently queued and running jobs
bjobs -u all                Show queued and running jobs for all users
bjobs -a                    Show jobs in all states, including recently completed jobs
bjobs -d                    Show only recently completed jobs
bjobs -l                    Show a long listing of detailed job information
bjobs -l 22334              Show a long listing for job 22334
bjobs -l -u all             Show long listings for all user jobs
bjobs -o [format string]    Specifies a customized output format for bjobs. See the documentation for details.
bjobs -p                    Show pending jobs and the reason why
bjobs -p -u all             Show pending jobs for all users
bjobs -r                    Show running jobs
bjobs -r -u all             Show running jobs for all users
bjobs -X                    Show host names (uncondensed)

lsfjobs:

  • LC's lsfjobs command is useful for displaying a summary of queued and running jobs, along with a summary of each queue's usage.
  • No documentation is available at this time, however using the command lsfjobs -help will display usage information, available HERE.
    • Various options are available for filtering output by user, group, jobid, queue, job state, completion time, etc.
    • Job states are described - approx. 10 different states possible.
  • Example output is shown below.
    % lsfjobs

     ********************************
     * Host:  - sierra - sierra4359 *
     * Date: 04/17/2018 15:39:59    *
     * lsfjobs                      *
     ********************************


     *******************************************************************************************************
     * JOBID    PROCS PTILE NODES USER       STATE       PRIO     QUEUE     GROUP    REMAINING       LIMIT *
     *******************************************************************************************************
       35609      640    40    16 user3444     RUN          -    pbatch    guests      0:02:00    01:12:00                
       35614       40    40     1 user22       RUN          -    pbatch    guests      0:48:00    01:00:00                
       35615    31400    40   785 user38       RUN          -    pbatch    guests      0:49:00    01:00:00                

     *******************************************************************************************************
     * QUEUE           NODE GROUP      Total   Down    Busy   Free  NODES                                  *
     *******************************************************************************************************
       batch_hosts     -                1044    242     802      0  sierra[397-720,1081-1440,1801-2160]
       debug_hosts     -                  36     36       0      0  sierra[361-396]
       ibm_hosts       -                3007    271       0   2736  sierra[1-115,117-192,194-201,203-360,
    721-791,937-959,961-991,993-1043,1045-1051,1053-1062,1064-1080,1441-1535,1537-1584,1603-1800,2161-2197,
    2199-2441,2443-2448,2485-2506,2508-2509,2511-2520,2523,2533,2539-2557,2559-2753,2755-2911,2913-2915,2917-
    3136,3138-4320]


     *******************************************************************************************************
     * QUEUE          Total  Down   Busy  Free   DefaultTime          MaxTime   NODE GROUP(S)              *
     *******************************************************************************************************
       exempt          1080   278    802     0          None        Unlimited    batch_hosts,debug_hosts
       expedite        1080   278    802     0         30:00          8:00:00    batch_hosts,debug_hosts
       pall            1080   278    802     0          None        Unlimited    batch_hosts,debug_hosts
       pbatch          1044   242    802     0         30:00         16:00:00    batch_hosts
       pdebug            36    36      0     0         30:00          1:00:00    debug_hosts
       pibm            3007   271      0  2736         30:00          8:00:00    ibm_hosts
       standby         1080   278    802     0          None        Unlimited    batch_hosts,debug_hosts

bpeek:

  • Allows you to view stdout/stderr of currently running jobs.
  • Provides several options for selecting jobs by queue, name, jobid.
  • See the bpeek man page and/or LSF documentation for details.
  • Examples below
Command Description
bpeek 27239 Show output from jobid 27239
bpeek -J myjob Show output for most recent job named "myjob"
bpeek -f Shows output of most recent job by looping with the command tail -f. When the job is done, the bpeek command exits.
bpeek -q Displays output of the most recent job in the specified queue.

bhist:

  • By default, displays information about your pending, running, and suspended jobs.
  • Also provides options for displaying information about recently completed jobs, and for filtering output by job name, queue, user, group, start-end times, and more.
  • See the bhist man page and/or LSF documentation for details.
  • Example below - shows running, queued and recently completed jobs:
    % bhist -a
    Summary of time in seconds spent in various states:
    JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
    27227   user22  run.245   2       0       204     0       0       0       206      
    27228   user22  run.247   2       0       294     0       0       0       296      
    27239   user22  runtest   4       0       344     0       0       0       348      
    27240   user22  run.248   2       0       314     0       0       0       316      
    27241   user22  runtest   1       0       313     0       0       0       314      
    27243   user22  run.249   13      0       1532    0       0       0       1545     
    27244   user22  run.255   0       0       186     0       0       0       186      
    27245   user22  run.267   1       0       15      0       0       0       16       
    27246   user22  run.288   2       0       12      0       0       0       14       

Job States:

  • LSF job monitoring commands display a job's state. The most commonly seen ones are shown in the table below.
State Description
DONE Job completed normally
EXIT Job completed abnormally
PEND Job is pending, queued
PSUSP Job was suspended (either by the user or an administrator) while pending
RUN Job is running
SSUSP Job was suspended by the system after starting
USUSP Job was suspended (either by the user or an administrator) after starting

Interacting With Jobs

Suspending / Resuming Jobs

bstop and bresume Commands:

  • LSF provides support for user-level suspension and resumption of jobs.
  • The bstop command is used to suspend both queued and running jobs:
    • Running jobs will show a USUSP state following suspension
    • Queued jobs will show a PSUSP state
  • The bresume command is used to resume jobs that have been suspended.
  • Jobs to suspend / resume can be specified by jobid, host, job name, group, queue and other criteria. In the examples below, jobid is used.
  • See the bstop man page, bresume man page and/or LSF documentation for details.

Suspend and then resume a running job

    % bjobs -X
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    31403   user22  RUN   pdebug     sierra4360  1*sierra436 bmbtest    Apr 13 12:02
                                                 40*sierra361
                                                 40*sierra362
    % bstop 31403
    Job <31403> is being stopped

    % bjobs -X
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    31403   user22  USUSP pdebug     sierra4360  1*sierra436 bmbtest    Apr 13 12:02
                                                 40*sierra361
                                                 40*sierra362
    % bresume 31403
    Job <31403> is being resumed

    % bjobs -X
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    31403   user22  RUN   pdebug     sierra4360  1*launch_ho bmbtest    Apr 13 12:02
                                                 40*sierra361
                                                 40*sierra362

Suspend a queued job, and then resume

    % bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    31411   user22  PEND  pdebug     sierra4360              bmbtest    Apr 13 12:11

    % bstop 31411
    Job <31411> is being stopped

    % bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    31411   user22  PSUSP pdebug     sierra4360              bmbtest    Apr 13 12:11

    % bresume 31411
    Job <31411> is being resumed

    % bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    31411   user22  RUN   pdebug     sierra4360  1*launch_ho bmbtest    Apr 13 12:11
                                                 400*debug_hosts

Interacting With Jobs

Modifying Jobs

bmod Command:

  • The bmod command is used to modify the options of a previously submitted job.
  • Simply use the desired bsub option with bmod, providing a new value. For example, to modify the wallclock time for jobid 22345:
bmod -W 500 22345
  • You can modify all options for a pending job, even if the corresponding bsub command option was not specified. This comes in handy in case you forgot an option when the job was originally submitted.
  • You can also "reset" options to their original or default values by appending a lowercase n to the desired option (no whitespace). For example to reset the queue to the original submission value:
    bmod -qn 22345
  • The bhist -l command can be used to view a history of which job parameters have been changed - they appear near the end of the output. For example:

    % bhist -l 31788

    ...[previous output omitted]

    Fri Apr 13 14:10:20: Parameters of Job are changed:
        Output file change to : /g/g0/user22/lsf/
        User group changes to: guests
        run limit changes to : 55.0 minutes;
    Fri Apr 13 14:13:40: Parameters of Job are changed:
        Job queue changes to : pbatch
        Output file change to : /g/g0/user22/lsf/
        User group changes to: guests;
    Fri Apr 13 14:30:08: Parameters of Job are changed:
        Job queue changes to : standby
        Output file change to : /g/g0/user22/lsf/
        User group changes to: guests;

    ...[following output omitted]

  • For running jobs, there are very few, if any, useful options that can be changed.
  • See the bmod man page and/or LSF documentation for details.

Interacting With Jobs

Signaling / Killing Jobs

bkill Command:

  • The bkill command is used to both terminate jobs and to send signals to jobs.
  • Similar to the kill command found in Unix/Linux operating systems - can be used to send various signals (not just SIGTERM and SIGKILL) to jobs.
  • Can accept both numbers and names for signals.
  • In addition to jobid, jobs can be identified by queue, host, group, job name, user, and more.
  • For a list of accepted signal names, run bkill -l
  • See the bkill man page and/or LSF documentation for details.
    For general details on Linux signals see http://man7.org/linux/man-pages/man7/signal.7.html.
  • Examples:
Command Description
bkill 22345
bkill 34455 24455
Force a job(s) to stop by sending SIGINT, SIGTERM, and SIGKILL. These signals are sent in that order, so users can write applications such that they will trap SIGINT and/or SIGTERM and exit in a controlled manner.
bkill -s HUP 22345 Send SIGHUP to job 22345. Note When specifying a signal by name, omit SIG from the name.
bkill -s 9 22345 Send signal 9 to job 22345
bkill -s STOP -q pdebug Send a SIGSTOP signal to the most recent job in the pdebug queue

LSF - Additional Information

LSF Documentation:

LSF Configuration Commands:

  • LSF provides several commands that can be used to display configuration information, such as:
    • LSF system configuration parameters: bparams
    • Job queues: bqueues
    • Batch hosts: bhosts and lshosts
  • These commands are described in more detail below.
bparams Command:
  • This command can be used to display the many configuration options and settings for the LSF system. Currently over 180 parameters.
  • Probably of most interest to LSF administrators/managers.
  • Examples:
  • See the bparams man page and/or LSF documentation for details.
bqueues Command:
  • This command can be used to display information about the LSF queues
  • By default, returns one line of information for each queue.
  • Provides several options, including a long listing -l.
  • Examples:
    % bqueues
    QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
    pall             60  Open:Active       -    -    -    -     0     0     0     0
    expedite         50  Open:Active       -    -    -    -     0     0     0     0
    pbatch           25  Open:Active       -    -    -    - 32083     0 32083     0
    exempt           25  Open:Active       -    -    -    -     0     0     0     0
    pdebug           25  Open:Active       -    -    -    -     0     0     0     0
    pibm             25  Open:Active       -    -    -    -     0     0     0     0
    standby           1  Open:Active       -    -    -    -     0     0     0     0

Long listing format:

  • See the bqueues man page and/or LSF documentation for details.
bhosts Command:
  • This command can be used to display information about LSF hosts.
  • By default, returns a one line summary for each host group.
  • Provides several options, including a long listing -l.
  • Examples:
    % bhosts
    HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
    batch_hosts        ok              -  45936  32080  32080      0      0      0
    debug_hosts        unavail         -   1584      0      0      0      0      0
    ibm_hosts          ok              - 132286      0      0      0      0      0
    launch_hosts       ok              -  49995      3      3      0      0      0
    sierra4372         closed          -      0      0      0      0      0      0
    sierra4373         unavail         -      0      0      0      0      0      0

Long listing format:

  • See the bhosts man page and/or LSF documentation for details.
lshosts Command:
  • This is another command used for displaying information about LSF hosts.
  • By default, returns one line of information for every LSF host.
  • Provides several options, including a long listing -l.
  • Examples:
    % lshosts
    HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
    sierra4372  LINUXPP   POWER9 250.0    32 251.5G   3.9G    Yes (mg)
    sierra4373  UNKNOWN   UNKNOWN  1.0     -      -      -    Yes (mg)
    sierra4367  LINUXPP   POWER9 250.0    32 570.3G   3.9G    Yes (LN)
    sierra4368  LINUXPP   POWER9 250.0    32 570.3G   3.9G    Yes (LN)
    sierra4369  LINUXPP   POWER9 250.0    32 570.3G   3.9G    Yes (LN)
    sierra4370  LINUXPP   POWER9 250.0    32 570.3G   3.9G    Yes (LN)
    sierra4371  LINUXPP   POWER9 250.0    32 570.3G   3.9G    Yes (LN)
    sierra1     LINUXPP   POWER9 250.0    44 255.4G      -    Yes (CN)
    sierra10    LINUXPP   POWER9 250.0    44 255.4G      -    Yes (CN)
    ...
    ...

Long listing format:

  • See the lshosts man page and/or LSF documentation for details.

Math Libraries

ESSL:

  • IBM's Engineering and Scientific Subroutine Library (ESSL) is a collection of high-performance subroutines providing a wide range of highly optimized mathematical functions for many different scientific and engineering applications, including:
    • Linear Algebra Subprograms
    • Matrix Operations
    • Linear Algebraic Equations
    • Eigensystem Analysis
    • Fourier Transforms
    • Sorting and Searching
    • Interpolation
    • Numerical Quadrature
    • Random Number Generation
  • Location: the ESSL libraries are available through modules. Use the module avail command to see what's available, and then load the desired module. For example:
    % module avail essl

    ------------------------- /usr/tcetmp/modulefiles/Core -------------------------
       essl/5.5    essl/6.1.0 (D)

    % module load essl/6.1.0

    % module list

    Currently Loaded Modules:
      1) xl/beta-2018.03.21        3) cuda/9.0.176   5) essl/6.1.0
      2) spectrum-mpi/2017.04.03   4) StdEnv
  • In the "Guide and Reference" documentation, some useful references are:
    • Chapter 5 for compile examples
    • Appendix B for a list of LAPACK functions supported by ESSL
    • For CUDA, search for a section labeled "Using the ESSL SMP CUDA Library"
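  • Example - a hypothetical link line once the essl module is loaded. The -L path is a placeholder, and the library names (-lessl for the serial library, -lesslsmp for the SMP library) follow standard ESSL naming; see Chapter 5 of the "Guide and Reference" for the exact compile/link options:

    mpixlf90 -o mysolver mysolver.f90 -L/path/to/essl/lib64 -lessl
    mpixlc -qsmp=omp -o mysolver mysolver.c -L/path/to/essl/lib64 -lesslsmp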

IBM's Mathematical Acceleration Subsystem (MASS) Libraries:

  • The IBM XL C/C++ and XL Fortran compilers include a set of highly tuned libraries for mathematical intrinsic functions (sin, log, tan, cos, sqrt, etc.).
  • Typically provide significant performance improvement over the standard system math library routines.
  • Three different versions are available:
    • Scalar - libmass.a
    • Vector - libmassv.a
    • SIMD - libmass_simdp8.a (POWER8) and libmass_simdp9.a (POWER9)
  • Location: /opt/ibm/xlmass/version#
  • How to use:
    • Automatic through compiler options
    • Explicit by including MASS routines in your source code
  • Automatic usage:
    • Compile using any of these sets of compiler options:
      C/C++:
        -qhot -qignerrno -qnostrict
        -qhot -qignerrno -qstrict=nolibrary
        -qhot -O3
        -O4
        -O5
      Fortran:
        -qhot -qnostrict
        -qhot -O3 -qstrict=nolibrary
        -qhot -O3
        -O4
        -O5
  • The IBM XL compilers will automatically attempt to vectorize calls to system math functions by using the equivalent MASS vector functions
  • If the vector function can't be used, then the compiler will attempt to use the scalar version of the function
  • Does not apply to the SIMD library functions
  • Explicit usage:
    • Familiarize yourself with the MASS routines by consulting the relevant IBM documentation (see below)
    • Include selected MASS routines in your source code
    • Include the relevant mass*.h in your source files (see MASS documentation)
    • Link with the required MASS library/libraries - no library path (-L) is needed. For example:
        xlc myprog.c -o myprog -lmass -lmassv
        xlf myprog.f -o myprog -lmass -lmassv
        mpixlc myprog.c -o myprog -lmass_simdp9
        mpixlf90 myprog.f -o myprog -lmass_simdp9
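
For illustration only, the hedged sketch below uses the MASS vector library to evaluate sin() over an array. It assumes the massv.h prototype vsin(y, x, &n), with the vector length passed by reference; consult the MASS documentation for the exact interfaces. The file name is hypothetical.

    /* Hedged sketch: y[i] = sin(x[i]) via the MASS vector library.      */
    /* Assumes vsin(y, x, &n) from massv.h (length passed by reference); */
    /* see the MASS documentation for the exact prototypes.              */
    #include <stdio.h>
    #include <massv.h>

    int main(void)
    {
        int i, n = 8;
        double x[8], y[8];

        for (i = 0; i < n; i++)
            x[i] = 0.1 * i;

        vsin(y, x, &n);                    /* vectorized sine */

        for (i = 0; i < n; i++)
            printf("sin(%4.1f) = %f\n", x[i], y[i]);
        return 0;
    }

An illustrative compile line would be xlc mass_vsin.c -o mass_vsin -lmassv.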

LAPACK, ScaLAPACK, BLAS, BLACS:

  • Netlib's well known Linear Algebra Packages.
  • Two flavors of these libraries are provided: GNU and IBM XL.
  • Trailing underscores:
    • By default, GNU Fortran appends an underscore to external names, so the functions in the gfortran-built versions have trailing underscores (e.g., dgemm_).
    • By default, the IBM XL compilers do not append trailing underscores.
    • Some codes explicitly append underscores to external function names, expecting to be linked with a library that provides functions with trailing underscores, i.e., the GNU gfortran-built libraries.
    • If your application looks for the name without the trailing underscore, try linking with the XL-built library.
    • To handle other specific cases, the gfortran compiler provides the -fno-underscoring option to reference external functions without trailing underscores.
    • The XL compilers provide the -qextname[=name] option to append trailing underscores to all, or specifically named, global entities.
    • A short C sketch illustrating the naming difference appears after this list.
  • Compiling/Linking:
Library     Compiler  Flags
BLAS        GNU       -L/usr/tcetmp/packages/blas/blas-3.6.0-gfortran-4.8.5/lib -lblas
            XL        -L/usr/tcetmp/packages/blas/blas-3.6.0-xlf-15.1.5/lib -lblas
LAPACK      GNU       -L/usr/tcetmp/packages/lapack/lapack-3.6.0-gfortran-4.8.5/lib -llapack
            XL        -L/usr/tcetmp/packages/lapack/lapack-3.6.0-xlf-15.1.5/lib -llapack
ScaLAPACK   GNU       -L/usr/tcetmp/packages/scalapack/scalapack-2.0.2-gfortran-4.8.5/lib -lscalapack
            XL        -L/usr/tcetmp/packages/scalapack/scalapack-2.0.2-xlf-15.1.5/lib -lscalapack
  • These are provided as an aid to porting, and have not been optimized for POWER/NVIDIA architectures. They do not use IBM's tuned BLAS routines as ESSL does.
  • See www.netlib.org/lapack for more information.
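
The hedged sketch below shows the trailing-underscore issue described above, for a C caller of the Fortran BLAS routine ddot: the gfortran-built library exports ddot_ (trailing underscore), while the XL-built library exports ddot. The declaration is written by hand for illustration, arguments are passed by reference following Fortran conventions, and linking the GNU build may also require the gfortran runtime (-lgfortran).

    /* Hedged sketch of the trailing-underscore issue.                   */
    /* Declares the symbol exported by the gfortran-built BLAS (ddot_);  */
    /* for the XL-built BLAS, drop the trailing underscore (ddot).       */
    #include <stdio.h>

    extern double ddot_(const int *n, const double *x, const int *incx,
                        const double *y, const int *incy);

    int main(void)
    {
        int n = 3, inc = 1;
        double x[] = {1.0, 2.0, 3.0};
        double y[] = {4.0, 5.0, 6.0};

        printf("x . y = %f\n", ddot_(&n, x, &inc, y, &inc));   /* expect 32.0 */
        return 0;
    }

Linking with the GNU BLAS flags from the table above (plus -lgfortran if needed) resolves ddot_; linking with the XL flags requires the undecorated name instead.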

FFTW:

  • "Fastest Fourier Transform in the West"
  • FFTW is a C subroutine library for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data.
  • Multiple versions located under /usr/tcetmp/packages/fftw/
  • These are provided as an aid to porting and have not been optimized for POWER/NVIDIA architectures.
  • Additional information:
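
For illustration, a minimal 1-D complex-to-complex transform using the standard FFTW3 API looks roughly like the sketch below; the include and library paths under /usr/tcetmp/packages/fftw/ depend on the version you choose, and the file name is hypothetical.

    /* Sketch: forward 1-D complex-to-complex FFT with FFTW3.           */
    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 8;
        int i;
        fftw_complex *in  = fftw_alloc_complex(n);
        fftw_complex *out = fftw_alloc_complex(n);
        fftw_plan plan;

        for (i = 0; i < n; i++) {
            in[i][0] = (double)i;   /* real part */
            in[i][1] = 0.0;         /* imaginary part */
        }

        plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(plan);

        for (i = 0; i < n; i++)
            printf("out[%d] = %f + %fi\n", i, out[i][0], out[i][1]);

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }

An illustrative build line would be gcc fft1d.c -I<fftw-install>/include -L<fftw-install>/lib -lfftw3 -o fft1d, where <fftw-install> stands for the chosen installation directory.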

PETSc:

  • "Portable, Extensible Toolkit for Scientific Computation"
  • Provides a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It supports MPI, and GPUs through CUDA or OpenCL, as well as hybrid MPI-GPU parallelism.
  • Location:
    • Current locations as of May 2018 are shown - subject to change as versions change.
    • /usr/tcetmp/packages/petsc/petsc-3.8.3
    • Use module load petsc/3.8.3 to set the PETSC_DIR environment variable and put the ${PETSC_DIR}/bin directory in your PATH.
  • Documentation:
  • PETSc website: https://www.mcs.anl.gov/petsc/
  • GPU support: https://www.mcs.anl.gov/petsc/features/gpus.html
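
A minimal PETSc program, sketched below under the assumption that the petsc module has been loaded and the code is built with an MPI compiler wrapper using the include and library paths under ${PETSC_DIR}, simply initializes PETSc, creates a vector, and prints its norm. Error checking (the CHKERRQ macros) is omitted for brevity; see the PETSc documentation for complete build instructions.

    /* Sketch: initialize PETSc, create a 100-element vector of 2's,     */
    /* and print its 2-norm.  Error checking (CHKERRQ) omitted.          */
    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec       x;
        PetscReal norm;

        PetscInitialize(&argc, &argv, NULL, NULL);

        VecCreate(PETSC_COMM_WORLD, &x);
        VecSetSizes(x, PETSC_DECIDE, 100);
        VecSetFromOptions(x);
        VecSet(x, 2.0);
        VecNorm(x, NORM_2, &norm);

        PetscPrintf(PETSC_COMM_WORLD, "||x||_2 = %g\n", (double)norm);

        VecDestroy(&x);
        PetscFinalize();
        return 0;
    }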

GSL - GNU Scientific Library:

  • Provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.
  • Location: multiple versions located under /usr/tcetmp/packages/gsl/
  • GNU documentation: https://www.gnu.org/software/gsl/
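
For illustration, the short sketch below evaluates the Bessel function J0(5.0) with GSL's special-functions interface; link with -lgsl -lgslcblas -lm against the chosen installation under /usr/tcetmp/packages/gsl/.

    /* Sketch: evaluate the regular cylindrical Bessel function J0(x).  */
    #include <stdio.h>
    #include <gsl/gsl_sf_bessel.h>

    int main(void)
    {
        double x = 5.0;
        double y = gsl_sf_bessel_J0(x);

        printf("J0(%g) = %.18e\n", x, y);
        return 0;
    }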

NVIDIA CUDA Tools:

  • The NVIDIA CUDA toolkit comes with several math libraries, which are described in the CUDA toolkit documentation: https://docs.nvidia.com/cuda/.
    • Intended to replace existing CPU math libraries with versions that execute on the GPU, without requiring the user to explicitly write any GPU code.
    • Note that the GPU-based IBM ESSL routines are built on libraries like cuBLAS and in certain cases may take better advantage of the CPU and multiple GPUs together than a pure CUDA program would.
  • The primary math libraries are discussed briefly below. See the CUDA Toolkit documentation for details.
    • cuBLAS: provides drop-in replacements for Level 1, 2, and 3 BLAS routines. In general, wherever a BLAS routine was being used, a cuBLAS routine can be applied instead. cuBLAS also provides a set of extensions that perform BLAS-like operations.
    • cuSPARSE: provides a set of routines for sparse matrix operations (for example, sparse matrix-vector multiply). cuSPARSE can represent data in multiple formats for compatibility with other libraries, such as the compressed sparse row (CSR) format. As with cuBLAS, these are intended to be drop-in replacements for other libraries when you are computing on NVIDIA GPUs.
    • cuSOLVER: is a higher level package built on cuBLAS and cuSPARSE that is intended to provide LAPACK-like operations. The documentation provides some examples of usage.
    • cuFFT: provides FFT operations as replacements for programs that were using existing CPU libraries. The documentation includes a table indicating how to convert from FFTW to cuFFT, and a description of the FFTW interface to cuFFT.
    • cuRAND: is a set of tools for pseudo-random number generation.
    • Thrust: provides a set of STL-like templated libraries for performing common parallel operations without explicitly writing GPU code. Common operations include sorting, reductions, saxpy, etc. It also allows you to define your own functional transformation to apply to the vector.
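
As a hedged illustration of the drop-in style described above, the sketch below performs a DAXPY (y = alpha*x + y) with cuBLAS: data are copied to the GPU, the library routine does the computation, and the result is copied back. Error-status checks are omitted for brevity; compile with nvcc and link -lcublas (the file name is hypothetical).

    /* Sketch: y = alpha*x + y on the GPU using cuBLAS (error checks omitted). */
    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 4;
        double x[] = {1.0, 2.0, 3.0, 4.0};
        double y[] = {10.0, 20.0, 30.0, 40.0};
        double alpha = 2.0;
        double *d_x, *d_y;
        cublasHandle_t handle;
        int i;

        cudaMalloc((void **)&d_x, n * sizeof(double));
        cudaMalloc((void **)&d_y, n * sizeof(double));
        cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

        cublasCreate(&handle);
        cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
        cublasDestroy(handle);

        cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
        for (i = 0; i < n; i++)
            printf("y[%d] = %g\n", i, y[i]);

        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }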

Code-Correctness Tools

Sierra provides a set of correctness checkers that can assist users in checking the memory, thread and MPI correctness of code:

Debugging

Both live and postmortem (a.k.a. core-file) debugging are supported:

Performance Analysis Tools

We support a rich set of open-source and vendor-provided performance analysis tools, ranging from MPI tracing tools to performance profilers for both CPU and GPU code analysis.

References & Documentation

  • Author: Blaise Barney, Livermore Computing.
  • Ray cluster photos: Randy Wong, Sandia National Laboratories.
  • Sierra cluster photos: Adam Bertsch and Meg Epperly, Lawrence Livermore National Laboratory.

Livermore Computing General Documentation:

CORAL Early Access systems, POWER8, NVIDIA Pascal:

Sierra systems, POWER9, NVIDIA Volta:

LSF Documentation:

Compilers and MPI Documentation:


This completes the tutorial.

Please complete the online evaluation form.


LLNL-WEB-750771