Exercise 1
Preparation:
- Login to the workshop machine
The instructor will demonstrate how to do this
- After logging in, review the login banner. Specifically notice the various sections:
- Welcome section - where to get more info, help
- Announcements - All LC Machines and local machine related
- Any unread news items - try reading a news item: news news_item
- Copy the example files
First create a linux_clusters subdirectory, then copy the files, and then cd into the linux_clusters subdirectory:
mkdir linux_clusters
cp -R /usr/global/docs/training/blaise/linux_clusters/* ~/linux_clusters
cd linux_clusters
- Verify your exercise files
Issue the ls -l command. You should see output similar to:
drwx------ 2 class07 class07 4096 Jan 15 07:31 benchmarks
-rw------- 1 class07 class07  108 Jan 15 07:35 hello.c
-rw------- 1 class07 class07   67 Jan 15 07:35 hello.f
drwx------ 4 class07 class07 4096 Jan 15 07:31 mpi
drwx------ 4 class07 class07 4096 Jan 15 07:31 openMP
drwx------ 2 class07 class07 4096 Jan 15 07:31 pthreads
drwx------ 4 class07 class07 4096 Jan 15 07:31 serial
Configuration Information:
- Before we attempt to actually compile and run anything, let's get familiar with some basic usage and configuration commands. For the most part, these commands can be used on any LC cluster.
- Login Nodes
Which node are you logged into? Use the nodeattr -c login command to display all of the login nodes on this cluster. Recall that the generic cluster login rotates between these to balance user logins across the available nodes.
- Compute Nodes and Partitions
Use the sinfo -s command to display a summary of this cluster's configuration.
- What compute partitions are available?
- How many nodes are in each partition? Active? Idle? Other/unavailable?
- Which nodes are in each partition?
Now try the sinfo command (no flags). Note that its output is similar to that of sinfo -s, but provides more detail by breaking out nodes according to their "state".
- Batch Limits
All of LC's clusters have different batch limits. It's important to know these limits so that you don't submit jobs that request too many nodes or too many hours. Try the following command to view the limits for the machine you are logged into:
news job.lim.machinename
where machinename is the actual name of your cluster.
- File Systems
Use the bdf or df -h command to view available file systems. To view the Lustre parallel file systems, pipe the output into grep lscratch. For example: bdf | grep lscratch
Job Information:
- Try each of the following commands, comparing and contrasting them to each other. Consult the man pages if you need more information.
Command     Description
ju          Concise summary of partitions and running jobs
mjstat      Partition summary plus one line of detailed information for each running job
squeue      One line of detailed information per running job
showq       Show all jobs: running, queued and blocked
showq -r    Show only running jobs - note the additional details
showq -i    Show only queued, eligible/idle jobs - note the additional details
showq -b    Show only blocked jobs
Compilers - What's Available?
- Visit the Compilers Currently Installed on LC Platforms webpage.
- Look for one of LC's Linux clusters, such as cab, zin or sierra, in the section near the top of the page. Then, click on a specific cluster name/link to see additional detail for that cluster. Note that this page shows the default compilers only. Most systems have several versions of each.
- Now, try the use -l compilers command to display available compilers on the cluster you're logged into. You should see GNU, Intel, and PGI compilers - several versions of each.
- Question: Which version is the default version?
- Answer: Use the dpkg-defaults command and look for the asterisk.
- Also, try the use -l command to see the full list of all available packages. The list is pretty long, you may want to pipe it through more. Remember this command for later on - it'll come in handy on LC systems.
Hello World
- Now try to compile your serial hello.c and/or hello.f files with any/all of the available compilers. If you're not sure which command to use:
- See the Compilers section of the tutorial.
- After you've successfully built your hello world, execute it. Did it work?
Building and Running Serial Applications:
- Go to either the C or Fortran versions of the serial codes:
cd serial/c   -or-   cd serial/fortran
- Try your hand at compiling and executing any/all of the ser_* codes with any/all of the compilers available.
- Notes:
- If using gcc, you will need the -lm flag for several of the C examples
- Fortran - use ifort, pgf90 or gfortran (not F77 flavors)
- Consult the compiler man page(s) for any compiler flags you'd like to try
- A Makefile has been provided for convenience - if you use it, be sure to edit the choice of compiler/flags first.
Compiler Optimizations:
- Compilers differ in their ability to optimize code. They also differ in their default level of optimization, as demonstrated by this exercise.
- Review the optimize code and the opttest script so that you understand what's going on.
- Execute opttest. When it completes, compare the various timings.
- Which compiler performed best/worst without optimization?
- Which compiler performed best/worst with -O3?
- Which compiler had the least/greatest difference between no opt and -O3 ?
- The Intel and PGI compilers perform some optimizations by default; the GNU compilers do not. To see the effects of this, modify the opttest file to remove all occurrences of -O0 and rerun the test.
Note: if you try both C and Fortran, the result differences are due to loop index variables - C starts at 0 and Fortran at 1.
This completes Exercise 1
Exercise 2
- Still logged into the workshop cluster?
If so, then continue to the next step. If not, then login as you did previously for Exercise 1.
Building and Running Parallel MPI Applications:
- MPI is covered in the MPI tutorial later in the workshop. This part of the exercise simply shows how to compile and run codes using MPI.
- Go to either the C or Fortran versions of the MPI applications:
cd ~/linux_clusters/mpi/c or cd ~/linux_clusters/mpi/fortran
- Compile (but don't run yet) any/all of the mpi_* codes with any/all of the available compilers. If you're not sure which command to use:
- See the Compilers section of the tutorial
- Notes:
- If using gcc, you will need the -lm flag for several of the C examples
- Fortran - use ifort, pgf90 or gfortran (not F77 flavors)
- Consult the compiler man page(s) for any compiler flags you'd like to try
- A Makefile has been provided for convenience - if you use it, be sure to edit the choice of compiler/flags first.
INTERACTIVE RUNS:
- There is a special partition setup for the workshop: pReserved. Use this partition for all exercises.
- Run any/all of the codes directly using srun in the pReserved partition. For example:
srun -n4 -ppReserved mpi_array
srun -N2 -ppReserved mpi_latency
srun -N4 -n16 -ppReserved mpi_bandwidth
NOTE: For interactive runs, if there aren't enough nodes available, your job will queue for awhile before it runs. The typical informational message looks something like below:
srun: job 68821 queued and waiting for resources
BATCH RUNS:
NOTE: This part of the exercise is trivial - it simply shows how to submit and monitor a batch job. The batch system is covered in depth later during the Moab tutorial.
- From the same directory where you ran your MPI codes interactively, open the msub_script file in a UNIX editor, such as vi (aliased to vim) or emacs.
- Review this very simple Moab script. The comments explain most of what's going on. IMPORTANT:
- The executable that will be run is mpi_bandwidth. Make sure that you have created this - if in doubt, just run make.
- Make sure that you edit the path specification line in the script to reflect the directory where your mpi_bandwidth executable is located - it will differ between C and Fortran.
- Submit the script to Moab. For example:
msub msub_script
- Monitor the job's status by using the command:
showq | grep classXX
where XX matches your workshop username/token. The sleep command in the script should allow you enough time to do so.
- After you are convinced that your job has completed, review the batch log file. It should be named something like output.NNNNN.
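For reference, a minimal Moab batch script has roughly the following shape. This is only a hedged sketch - the provided msub_script is authoritative, and the specific #MSUB directives and values shown here are assumptions, not the script's actual contents:

```shell
#!/bin/csh
##### Moab directives (a sketch - compare against the real msub_script)
#MSUB -l nodes=2              # request 2 nodes
#MSUB -l walltime=00:05:00    # 5 minute time limit
#MSUB -q pReserved            # workshop partition/queue

##### Commands to run
cd ~/linux_clusters/mpi/c     # edit this path to match your executable's location
srun -N2 -n16 mpi_bandwidth
sleep 60                      # leaves time to catch the job with showq
```

Note how the path specification line is exactly the part you are asked to edit for the C versus Fortran directories.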
Building and Running Parallel Pthreads Applications:
- Pthreads are covered in the POSIX Threads Programming tutorial later in the workshop. This part of the exercise simply shows how to compile and run codes using Pthreads.
- cd ~/linux_clusters/pthreads. You will see several C files written with Pthreads. There are no Fortran files because a standardized Fortran API for Pthreads was never adopted.
- If you are already familiar with Pthreads, you can review the files to see what is intended. If you are not familiar with Pthreads, this part of the exercise will probably not be of interest.
- Compiling with Pthreads is easy: just add the required flag to your compile command.
Compiler   Flag
Intel      -pthread
PGI        -lpthread
GNU        -pthread
For example:
icc -pthread hello.c -o hello
- Compile any/all of the example codes.
- To run, just enter the name of the executable.
Building and Running Parallel OpenMP Applications:
- OpenMP is covered in the OpenMP tutorial later in the workshop. This part of the exercise simply shows how to compile and run codes using OpenMP.
- Depending upon your preference for C or Fortran:
cd ~/linux_clusters/openMP/c/ -or- cd ~/linux_clusters/openMP/fortran/
You will see several OpenMP codes.
- If you are already familiar with OpenMP, you can review the files to see what is intended. If you are not familiar with OpenMP, this part of the exercise will probably not be of interest.
- Compiling with OpenMP is easy: just add the required flag to your compile command.
Compiler   Flag
Intel      -openmp
PGI        -mp
GNU        -fopenmp
For example:
icc -openmp omp_hello.c -o hello   -or-   ifort -openmp omp_reduction.f -o reduction
- Compile any/all of the example codes.
- Before running, set the OMP_NUM_THREADS environment variable to the number of threads that should be used. For example:
setenv OMP_NUM_THREADS 8
(use export OMP_NUM_THREADS=8 if your shell is bash)
- To run, just enter the name of the executable.
Run a Parallel Benchmark:
- Run the STREAM memory bandwidth benchmark:
- cd ~/linux_clusters/benchmarks
- Depending on whether you like C or Fortran, compile the code. Note: the executable needs to be named something other than stream, as this conflicts with /usr/local/bin/stream, an unrelated utility.
C:
icc -O3 -openmp stream.c -o streambench
Fortran:
icc -O3 -DUNDERSCORE -c mysecond.c
ifort -O3 -openmp stream.f mysecond.o -o streambench
- This benchmark uses OpenMP threads, so set OMP_NUM_THREADS - for example:
setenv OMP_NUM_THREADS 8
- Then run the code on a single node in the workshop queue:
srun -n1 -ppReserved streambench
- Note the bandwidths/timings when it completes.
- For more information on this benchmark, see http://www.cs.virginia.edu/stream/
Run an MPI Message Passing Bandwidth Test:
- This MPI message passing test shows the bandwidth depending upon the number of cores used and type of MPI routine used. This isn't an official benchmark - just a local test. MPI hasn't been covered yet - it will be in the MPI tutorial.
- Assuming you are still in your ~/linux_clusters/benchmarks subdirectory, compile the code (sorry, only a C version at this time):
mpiicc -O3 mpi_multibandwidth.c -o mpitest
- Run it using one core per node on 2 different nodes. Also be sure to specify where to send output instead of stdout:
srun -N2 -n2 -ppReserved mpitest > mpitest.output1
- After the test runs, check the output file for the results. Notice:
- How bandwidth improves with message size
- The variation in bandwidth between MPI message routines
- The variation between best / avg / worst bandwidths
- To find the best (or worst) OVERALL average do something like this:
grep OVERALL mpitest.output1 | sort
You can then search within your output file for the case that had the best (or worst) performance.
- Now repeat the run using all cores on 2 different nodes and send the output to a new file:
srun -N2 -n24 -ppReserved mpitest > mpitest.output2
- Find the best (or worst) OVERALL average again for this run:
grep OVERALL mpitest.output2 | sort
- Compare the results using 1 core per node against 12 cores per node:
xdiff mpitest.output1 mpitest.output2 -or- sdiff mpitest.output1 mpitest.output2
Using the "avg" bandwidth per case, which performs better?
Why?
Synopsis: The type of MPI routine used, the message size, the number of tasks per node, and the underlying hardware architecture all influence the communications throughput you can expect. Non-blocking operations with large message sizes perform best. The fewer tasks per node competing for the network adapter, the better (especially as the number of cores per node increases). Your mileage may vary.
Hyper-threading:
- LC's more recent Intel clusters support hyper-threading but it is turned "off" by default. To confirm this, run the following command:
srun -n1 -ppReserved /usr/sbin/hyperthread-control --report
What does the output tell you?
- Now run the following command, which uses srun's flag to turn hyper-threading "on":
srun -n1 -ppReserved --enable-hyperthreads /usr/sbin/hyperthread-control --report
What does the output tell you this time?
- Moral of the story: the performance benefits of using hyper-threads will vary by application. Try your real applications both with and without hyper-threading to see which performs best.
Online Machine Status Information...and More:
- Two of the most useful sources for system status and usage information are:
- MyLC User Portal: mylc.llnl.gov
- LC Home Page: computing.llnl.gov
Try visiting each of these and exploring the system status and usage information they provide.