FPChecker (or Floating-Point Checker) is a framework to check for floating-point exceptions in CUDA. It is designed as a Clang/LLVM extension that instruments CUDA code to catch floating-point exceptions at runtime.
Detectable Errors and Warnings
FPChecker detects floating-point computations that produce:
- Overflows: +INF and -INF values
- Underflows: subnormal (or denormalized) values
- NANs: not-a-number values coming, for example, from 0.0/0.0
When at least one of the threads in a CUDA grid produces any of the above cases, an error report is generated.
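The three error cases above can be illustrated with a minimal host-side C++ sketch. It is not FPChecker's implementation: the `FpEvent` enum and the `classify` function are hypothetical names introduced here, and the sketch assumes that each category maps directly onto the standard `std::fpclassify` classes.

```cpp
#include <cmath>

// Hypothetical categories mirroring the three error cases listed above.
enum class FpEvent { None, Overflow, Underflow, NaN };

// Classify a single result value (host-side illustration only).
inline FpEvent classify(double v) {
    switch (std::fpclassify(v)) {
        case FP_INFINITE:  return FpEvent::Overflow;   // +INF or -INF
        case FP_SUBNORMAL: return FpEvent::Underflow;  // subnormal (denormalized) value
        case FP_NAN:       return FpEvent::NaN;        // e.g., the result of 0.0/0.0
        default:           return FpEvent::None;       // normal value or zero
    }
}
```

In this sketch a zero result is not treated as an underflow; only subnormal values are.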
FPChecker also generates warning reports for computations that are close to overflowing or underflowing, i.e., computations whose results fall within x% of the limits of normal values, where x is configurable. For example, in IEEE double precision, the largest representable number is approximately 1.798e+308; a computation that produces 1e+307 may trigger a warning report.
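One possible reading of the warning rule can be sketched in host-side C++ as follows. This is an assumption-laden illustration, not FPChecker's code: the function name `in_danger_zone` is hypothetical, and the sketch assumes the distance from the limits is measured linearly on the magnitude of the value.

```cpp
#include <cmath>
#include <limits>

// Hypothetical warning check. 'zone' plays the role of the configurable x,
// expressed as a fraction between 0.0 and 1.0 (an assumption of this sketch).
inline bool in_danger_zone(double v, double zone) {
    if (!std::isnormal(v)) return false;  // zero, subnormal, INF, NaN are not warnings here
    const double mag      = std::fabs(v);
    const double largest  = std::numeric_limits<double>::max();  // ~1.798e+308
    const double smallest = std::numeric_limits<double>::min();  // ~2.225e-308, smallest normal
    const bool near_overflow  = mag > largest * (1.0 - zone);
    const bool near_underflow = mag < smallest * (1.0 + zone);
    return near_overflow || near_underflow;
}
```

FPChecker may measure the distance differently (for example, on the exponent rather than on the magnitude), so the values that trigger a warning in practice can differ from this sketch.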
How to Use FPChecker
FPChecker instruments the CUDA application code. This instrumentation can be performed either in the clang front end or on the LLVM intermediate representation (IR). We call these two ways of using FPChecker the Clang version and the LLVM version, respectively.
The Clang version instruments the source code of the application using a clang plugin. The instrumentation changes every expression E that evaluates to a floating-point value to _FPC_CHECK_(E). After these transformations are performed, the code can be compiled with nvcc.
The LLVM version, on the other hand, performs the instrumentation in the LLVM compiler itself (on the intermediate representation, or IR).
Both versions have advantages and disadvantages:
- Clang version: the final code can be compiled with nvcc; however, this version can be slower than the LLVM version and requires a two-pass compilation process (i.e., first instrument using clang and then compile/link with nvcc).
- LLVM version: it is faster than the Clang version because the code is instrumented after optimizations are applied; however, it requires the application to be compiled entirely with clang (some CUDA applications cannot be compiled with clang).
Building and Installation
To build FPChecker please follow the README file here. FPChecker is installed in CORAL systems (Lassen and Sierra) and TOSS 3 systems in the following path:
You need to load the clang/9.0.0 module and CUDA (e.g., cuda/9.x or cuda/10.x).
Using the FPChecker Clang Version
Using this version requires two steps: (1) instrumenting the source code (with the clang plugin) and (2) compiling the instrumented code with nvcc.
Step 1: Instrumenting the source code
We provide a wrapper script called clang-fpchecker (located in the src directory) to execute this step. The wrapper script automatically passes the options required to load the plugin that instruments CUDA code. The clang-fpchecker wrapper can be used as if you were compiling files with clang. For example, to instrument the compute.cu CUDA file, call the wrapper this way:
clang-fpchecker --cuda-gpu-arch=sm_60 -x cuda -c compute.cu
Note that in clang, the --cuda-gpu-arch flag specifies the compute architecture (in nvcc, this is usually set by the -arch flag). The -x cuda flag tells clang that the input is a CUDA file.
Also note that this step does not generate object files; we only instrument the code in this step.
After this step, floating-point expressions in compute.cu will be instrumented. For example, if an expression originally was y = a+b, it now should look like this: y = _FPC_CHECK_(a+b, ...).
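The idea of wrapping an expression in a check can be illustrated with a small host-side C++ sketch. It is not the FPChecker runtime: `sketch_check`, `SKETCH_CHECK`, and `sketch_report_count` are hypothetical names, and the real _FPC_CHECK_ takes additional arguments that are elided above. The point shown is only that the wrapper inspects the value and returns it unchanged, so the surrounding statement keeps its meaning.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical stand-in for an inserted check; not the FPChecker runtime.
// Counts how many exceptional values have been reported so far.
static int sketch_report_count = 0;

// Inspect the value, report it if it is INF, NaN, or subnormal,
// and return it unchanged.
inline double sketch_check(double v, const char* file, int line) {
    const int cls = std::fpclassify(v);
    if (cls == FP_INFINITE || cls == FP_NAN || cls == FP_SUBNORMAL) {
        ++sketch_report_count;
        std::fprintf(stderr, "sketch: exceptional value %g at %s:%d\n", v, file, line);
    }
    return v;
}

// Wrap an expression and record where it appears in the source.
#define SKETCH_CHECK(E) sketch_check((E), __FILE__, __LINE__)
```

With this sketch, `y = a+b` would become `y = SKETCH_CHECK(a+b)`, and y receives the same value as before.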
Step 2: Compiling with nvcc
In this step, you compile the instrumented code with nvcc, as you regularly do. The only addition is that you need to pre-include the runtime header file using the -include flag; otherwise nvcc will complain that it cannot resolve the _FPC_CHECK_() function calls. You can add the following to your Makefile to pre-include the runtime header file:
FPCHECKER_PATH=/usr/global/tools/fpchecker/blueos_3_ppc64le_ib_p9/fpchecker-0.1.1-clang-9.0.0
FPCHECKER_RUNTIME=-include $(FPCHECKER_PATH)/src/Runtime_plugin.h
CXXFLAGS+=$(FPCHECKER_RUNTIME)
Requirements to Use the LLVM Version
The primary requirement for using the LLVM version is to be able to compile your entire CUDA application with the clang/LLVM compiler. Pure CUDA code or RAJA (with CUDA execution) are supported.
For more information about compiling CUDA with clang, please refer to Compiling CUDA with clang.
We have tested FPChecker so far with these versions of clang/LLVM:
- clang 9.0.1
- clang 9.0.0
Using the FPChecker LLVM Version
Once you are able to compile and run your CUDA application with clang, follow these steps to enable FPChecker:
1. Add this to your Makefile:
FPCHECKER_PATH = /path/to/install
LLVM_PASS = -Xclang -load -Xclang $(FPCHECKER_PATH)/lib64/libfpchecker.so \
-include Runtime.h -I$(FPCHECKER_PATH)/src
CXXFLAGS += $(LLVM_PASS)
This tells clang to load the FPChecker pass and where to find the FPChecker runtime header. FPCHECKER_PATH is the path where FPChecker is installed.
2. Compile your code with clang and run it.
If an exception is found, your kernel will be aborted and an error report like the following will be shown:
+----------------------- FPChecker Error Report -----------------------+
Error : Underflow
Operation : MUL (9.999888672e-321)
File : dot_product_raja.cpp
Line : 32
+----------------------------------------------------------------------+
The current version is not MPI aware, so every MPI process that encounters an error or warning will print its own report. When compiling MPI code, add the location of mpi.h to the include path; otherwise clang may not find the MPI call definitions.
Configuration options are passed via -D macros when invoking nvcc (for the Clang version) or when invoking clang (for the LLVM version).
In the Clang version, if you are only interested in detecting the most critical exceptions, i.e., the generation of NaN and infinity values, use these options: -DFPC_DISABLE_SUBNORMAL -DFPC_DISABLE_WARNINGS.
|Option|Description|Version|
|---|---|---|
|-D FPC_DISABLE_SUBNORMAL|Disable checking for subnormal numbers (underflows)|clang|
|-D FPC_DISABLE_WARNINGS|Disable warnings of small or large numbers (overflows and underflows)|clang|
|-D FPC_ERRORS_DONT_ABORT|By default FPChecker aborts the kernel that first encounters an error or warning. This option allows FPChecker to print reports without aborting. This allows you to check for errors/warnings in the entire execution of your program.|clang, llvm|
|-D FPC_DANGER_ZONE_PERCENT=x.x|Changes the size of the danger zone for warnings. By default, x.x is 0.05, and it should be a number between 0.0 and 1.0. Warning reports can be almost completely disabled by using a small danger zone, such as 0.01.|clang, llvm|
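How -D macros like these can change behavior at compile time is sketched below in host-side C++. The macro names come from the options above, but everything else is an assumption: the helpers `sketch_is_error` and `sketch_should_abort` and the constant `sketch_danger_zone` are hypothetical, and the FPChecker runtime may be organized quite differently.

```cpp
#include <cmath>

// Assumption: fall back to the documented default when the macro is not set.
#ifndef FPC_DANGER_ZONE_PERCENT
#define FPC_DANGER_ZONE_PERCENT 0.05
#endif

// Danger-zone size selected at compile time (hypothetical constant).
constexpr double sketch_danger_zone = FPC_DANGER_ZONE_PERCENT;

// Returns true when a value counts as an error under the compile-time options
// (hypothetical helper, not part of FPChecker).
inline bool sketch_is_error(double v) {
    const int cls = std::fpclassify(v);
    if (cls == FP_INFINITE || cls == FP_NAN) return true;
#ifndef FPC_DISABLE_SUBNORMAL
    if (cls == FP_SUBNORMAL) return true;  // compiled out with -DFPC_DISABLE_SUBNORMAL
#endif
    return false;
}

// Returns true when execution should stop after a report is printed.
inline bool sketch_should_abort() {
#ifdef FPC_ERRORS_DONT_ABORT
    return false;  // keep running and keep printing reports
#else
    return true;   // default: stop at the first error or warning
#endif
}
```

Because the options are resolved by the preprocessor, changing any of them requires recompiling the application.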
For questions, contact Ignacio Laguna email@example.com.
FPChecker is distributed under the terms of the Apache License (Version 2.0).
All new contributions must be made under the Apache-2.0 license.