Exercise 1
Preparation:
- Login to the workshop machine
The instructor will demonstrate how to do this
- After logging in, review the login banner. Specifically notice the various sections:
- Welcome section - where to get more info, help
- Announcements - All LC Machines and local machine related
- Any unread news items - try reading a news item: news news_item
- Copy the example files
First create a linux_clusters subdirectory, then copy the files, and then cd into the linux_clusters subdirectory:
mkdir linux_clusters
cp -R /usr/global/docs/training/blaise/linux_clusters/* ~/linux_clusters
cd linux_clusters
- Verify your exercise files
Issue the ls -l command. You should see output similar to:
drwx------ 2 class07 class07 4096 Jan 15 07:31 benchmarks
-rw------- 1 class07 class07  108 Jan 15 07:35 hello.c
-rw------- 1 class07 class07   67 Jan 15 07:35 hello.f
drwx------ 4 class07 class07 4096 Jan 15 07:31 mpi
drwx------ 4 class07 class07 4096 Jan 15 07:31 openMP
drwx------ 2 class07 class07 4096 Jan 15 07:31 pthreads
drwx------ 4 class07 class07 4096 Jan 15 07:31 serial
Configuration Information:
- Before we attempt to actually compile and run anything, let's get familiar with some basic usage and configuration commands. For the most part, these commands can be used on any LC cluster.
- Login Nodes
Which node are you logged into? Use the nodeattr -c login command to display all of the login nodes on this cluster. Recall that the generic cluster login rotates between these to balance user logins across the available nodes.
- Compute Nodes and Partitions
Use the sinfo -s command to display a summary of this cluster's configuration.
- What compute partitions are available?
- How many nodes are in each partition? Active? Idle? Other/unavailable?
- Which nodes are in each partition?
Now try the sinfo command (no flags). Note that its output is similar to that of sinfo -s, but provides more detail by breaking out nodes according to their "state".
- Batch Limits
All of LC's clusters have different batch limits. It's important to know these limits so that you don't submit jobs that request too many nodes or too many hours. Try the following command to view the limits for the machine you are logged into:
news job.lim.machinename
where machinename is the actual name of your cluster.
- File Systems
Use the bdf or df -h command to view available file systems. To view the Lustre parallel file systems, pipe the output into grep lscratch. For example: bdf | grep lscratch
Job Information:
- Try each of the following commands, comparing and contrasting them to each other. Consult the man pages if you need more information.
Command     Description
ju          Concise summary of partitions and running jobs
mjstat      Partition summary plus one line of detailed information for each running job
squeue      One line of detailed information per running job
showq       Show all jobs: running, queued and blocked
showq -r    Show only running jobs - note the additional details
showq -i    Show only queued, eligible/idle jobs - note the additional details
showq -b    Show only blocked jobs
Compilers - What's Available?
- Visit the Compilers Currently Installed on LC Platforms webpage.
- Look for one of LC's Linux clusters, such as cab, zin or sierra, in the section near the top of the page. Then, click on a specific cluster name/link to see additional detail for that cluster. Note that this page shows the default compilers only. Most systems have several versions of each.
- Now, try the use -l compilers command to display available compilers on the cluster you're logged into. You should see GNU, Intel, and PGI compilers - several versions of each.
- Question: Which version is the default version?
- Answer: Use the dpkg-defaults command and look for the asterisk.
- Also, try the use -l command to see the full list of all available packages. The list is pretty long, you may want to pipe it through more. Remember this command for later on - it'll come in handy on LC systems.
Hello World
- Now try to compile your serial hello.c and/or hello.f files with any/all of the available compilers. If you're not sure which command to use:
- See the Compilers section of the tutorial.
- After you've successfully built your hello world, execute it. Did it work?
Building and Running Serial Applications:
- Go to either the C or Fortran versions of the serial codes:
cd serial/c   -or-   cd serial/fortran
- Try your hand at compiling and executing any/all of the ser_* codes with any/all of the compilers available.
- Notes:
- If using gcc, you will need the -lm flag for several of the C examples
- Fortran - use ifort, pgf90 or gfortran (not F77 flavors)
- Consult the compiler man page(s) for any compiler flags you'd like to try
- A Makefile has been provided for convenience - if you use it, be sure to edit the choice of compiler/flags first.
Compiler Optimizations:
- Compilers differ in their ability to optimize code. They also differ in their default level of optimization, as demonstrated by this exercise.
- Review the optimize code and the opttest script so that you understand what's going on.
- Execute opttest. When it completes, compare the various timings.
- Which compiler performed best/worst without optimization?
- Which compiler performed best/worst with -O3?
- Which compiler had the least/greatest difference between no opt and -O3 ?
- The Intel and PGI compilers perform some optimizations by default; the GNU compilers do not. To see the effects of this, modify the opttest file to remove all occurrences of -O0 and rerun the test.
Note: if you try both C and Fortran, the result differences are due to loop index variables - C starts at 0 and Fortran at 1.
This completes Exercise 1
Exercise 2
- Still logged into the workshop cluster?
If so, then continue to the next step. If not, then login as you did previously for Exercise 1.
Building and Running Parallel MPI Applications:
- MPI is covered in the MPI tutorial later in the workshop. This part of the exercise simply shows how to compile and run codes using MPI.
- Go to either the C or Fortran versions of the MPI applications:
cd ~/linux_clusters/mpi/c or cd ~/linux_clusters/mpi/fortran
- Compile (but don't run yet) any/all of the mpi_* codes with any/all of the available compilers. If you're not sure which command to use:
- See the Compilers section of the tutorial
- Notes:
- If using gcc, you will need the -lm flag for several of the C examples
- Fortran - use ifort, pgf90 or gfortran (not F77 flavors)
- Consult the compiler man page(s) for any compiler flags you'd like to try
- A Makefile has been provided for convenience - if you use it, be sure to edit the choice of compiler/flags first.
INTERACTIVE RUNS:
- There is a special partition setup for the workshop: pReserved. Use this partition for all exercises.
- Run any/all of the codes directly using srun in the pReserved partition. For example:
srun -n4 -ppReserved mpi_array
srun -N2 -ppReserved mpi_latency
srun -N4 -n16 -ppReserved mpi_bandwidth
NOTE: For interactive runs, if there aren't enough nodes available, your job will queue for awhile before it runs. The typical informational message looks something like below:
srun: job 68821 queued and waiting for resources
BATCH RUNS:
NOTE: This part of the exercise is trivial - it simply shows how to submit and monitor a batch job. The batch system is covered in depth later during the Moab tutorial.
- From the same directory where you ran your MPI codes interactively, open the msub_script file in a UNIX editor, such as vi (aliased to vim) or emacs.
- Review this very simple Moab script. The comments explain most of what's going on. IMPORTANT:
- The executable that will be run is mpi_bandwidth. Make sure that you have created this - if in doubt, just run make.
- Make sure that you edit the path specification line in the script to reflect the directory where your mpi_bandwidth executable is located - it will differ between C and Fortran.
- Submit the script to Moab. For example:
msub msub_script
- Monitor the job's status by using the command:
showq | grep classXX
where XX matches your workshop username/token. The sleep command in the script should allow you enough time to do so.
- After you are convinced that your job has completed, review the batch log file. It should be named something like output.NNNNN.
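For reference, a minimal Moab batch script has roughly the following shape. This is only a hedged sketch - the provided msub_script is authoritative, and the specific #MSUB directives and values shown here are assumptions, not the script's actual contents:

```shell
#!/bin/csh
##### Moab directives (a sketch - compare against the real msub_script)
#MSUB -l nodes=2              # request 2 nodes
#MSUB -l walltime=00:05:00    # 5 minute time limit
#MSUB -q pReserved            # workshop partition/queue

##### Commands to run
cd ~/linux_clusters/mpi/c     # edit this path to match your executable's location
srun -N2 -n16 mpi_bandwidth
sleep 60                      # leaves time to catch the job with showq
```

Note how the path specification line is exactly the part you are asked to edit for the C versus Fortran directories.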
Building and Running Parallel Pthreads Applications:
- Pthreads are covered in the POSIX Threads Programming tutorial later in the workshop. This part of the exercise simply shows how to compile and run codes using Pthreads.
- cd ~/linux_clusters/pthreads. You will see several C files written with Pthreads. There are no Fortran files because a standardized Fortran API for Pthreads was never adopted.
- If you are already familiar with Pthreads, you can review the files to see what is intended. If you are not familiar with Pthreads, this part of the exercise will probably not be of interest.
- Compiling with Pthreads is easy: just add the required flag to your compile command.
Compiler   Flag
Intel      -pthread
PGI        -lpthread
GNU        -pthread
For example:
icc -pthread hello.c -o hello
- Compile any/all of the example codes.
- To run, just enter the name of the executable.
Building and Running Parallel OpenMP Applications:
- OpenMP is covered in the OpenMP tutorial later in the workshop. This part of the exercise simply shows how to compile and run codes using OpenMP.
- Depending upon your preference for C or Fortran:
cd ~/linux_clusters/openMP/c/ -or- cd ~/linux_clusters/openMP/fortran/
You will see several OpenMP codes.
- If you are already familiar with OpenMP, you can review the files to see what is intended. If you are not familiar with OpenMP, this part of the exercise will probably not be of interest.
- Compiling with OpenMP is easy: just add the required flag to your compile command.
Compiler   Flag
Intel      -openmp
PGI        -mp
GNU        -fopenmp
For example:
icc -openmp omp_hello.c -o hello   -or-   ifort -openmp omp_reduction.f -o reduction
- Compile any/all of the example codes.
- Before running, set the OMP_NUM_THREADS environment variable to the number of threads that should be used. For example:
setenv OMP_NUM_THREADS 8
(use export OMP_NUM_THREADS=8 if your shell is bash)
- To run, just enter the name of the executable.
Run a Parallel Benchmark:
- Run the STREAM memory bandwidth benchmark:
- cd ~/linux_clusters/benchmarks
- Depending on whether you like C or Fortran, compile the code. Note: the executable needs to be named something other than stream, as this conflicts with /usr/local/bin/stream, an unrelated utility.
C:
icc -O3 -openmp stream.c -o streambench
Fortran:
icc -O3 -DUNDERSCORE -c mysecond.c
ifort -O3 -openmp stream.f mysecond.o -o streambench
- This benchmark uses OpenMP threads, so set OMP_NUM_THREADS - for example:
setenv OMP_NUM_THREADS 8
- Then run the code on a single node in the workshop queue:
srun -n1 -ppReserved streambench
- Note the bandwidths/timings when it completes.
- For more information on this benchmark, see http://www.cs.virginia.edu/stream/
Run an MPI Message Passing Bandwidth Test:
- This MPI message passing test shows the bandwidth depending upon the number of cores used and type of MPI routine used. This isn't an official benchmark - just a local test. MPI hasn't been covered yet - it will be in the MPI tutorial.
- Assuming you are still in your ~/linux_clusters/benchmarks subdirectory, compile the code (sorry, only a C version at this time):
mpiicc -O3 mpi_multibandwidth.c -o mpitest
- Run it using one core per node on 2 different nodes. Also be sure to specify where to send output instead of stdout:
srun -N2 -n2 -ppReserved mpitest > mpitest.output1
- After the test runs, check the output file for the results. Notice:
- How bandwidth improves with message size
- The variation in bandwidth between MPI message routines
- The variation between best / avg / worst bandwidths
- To find the best (or worst) OVERALL average do something like this:
grep OVERALL mpitest.output1 | sort
You can then search within your output file for the case that had the best (or worst) performance.
- Now repeat the run using all cores on 2 different nodes and send the output to a new file:
srun -N2 -n24 -ppReserved mpitest > mpitest.output2
- Find the best (or worst) OVERALL average again for this run:
grep OVERALL mpitest.output2 | sort
- Compare the results using 1 core per node against 12 cores per node:
xdiff mpitest.output1 mpitest.output2 -or- sdiff mpitest.output1 mpitest.output2
Using the "avg" bandwidth per case, which performs better?
Why?
Synopsis: The type of MPI routine used, the message size, the number of tasks per node, and the underlying hardware architecture all influence the communications throughput you can expect. Non-blocking operations with large message sizes perform best. The fewer tasks per node competing for the network adapter, the better (especially as the number of cores per node increases). Your mileage may vary.
Hyper-threading:
- LC's more recent Intel clusters support hyper-threading but it is turned "off" by default. To confirm this, run the following command:
srun -n1 -ppReserved /usr/sbin/hyperthread-control --report
What does the output tell you?
- Now run the following command, which uses srun's flag to turn hyper-threading "on":
srun -n1 -ppReserved --enable-hyperthreads /usr/sbin/hyperthread-control --report
What does the output tell you this time?
- Moral of the story: the performance benefits of using hyper-threads will vary by application. Try your real applications both with and without hyper-threading to see which performs best.
Online Machine Status Information...and More:
- Two of the most useful sources for system status and usage information are:
- MyLC User Portal: mylc.llnl.gov
- LC Home Page: computing.llnl.gov
Try visiting each of these and exploring the system status and usage information they provide.