## Monday, April 10, 2017

GCC can offload C, C++, and Fortran code to an accelerator when using OpenACC or OpenMP. The code to offload is marked by adding #pragma directives (or the corresponding directive comments in Fortran), such as
#pragma acc kernels
for (int j = 1; j < n-1; j++) {
    for (int i = 1; i < m-1; i++) {
        Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
    }
}

This blog post describes what I needed to do in order to build a GCC trunk compiler with support for offloading to NVIDIA GPUs on Ubuntu 16.10.

The first step is to install the NVIDIA CUDA toolkit. Googling shows lots of strange suggestions about what you need to do in order to get this to work (blacklisting drivers, adding the PCI address of your video card to config files, etc.), but it worked fine for me to just download the “deb (local)” file, and install it as
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

The toolkit is installed in /usr/local/cuda, and /usr/local/cuda/bin must be added to PATH so that GCC can find the ptxas tool.
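For example, in a shell or in ~/.bashrc (assuming the default install location used by the package above):

```shell
# Make the CUDA tools visible to GCC -- ptxas lives in /usr/local/cuda/bin.
export PATH=/usr/local/cuda/bin:$PATH
```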

The script below fetches the source code and builds the compiler and tools.
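The procedure can be sketched as follows. This is not the exact script — the repository URLs, configure flags, and directory names below follow the GCC wiki's offloading instructions and are assumptions, so adjust paths and flags to taste:

```shell
#!/bin/sh
set -e

# Placeholder locations -- change to taste.
install_dir=$HOME/offload/install
work_dir=$HOME/offload/wrk

mkdir -p "$work_dir"
cd "$work_dir"

# 1. Build the nvptx assembler/linker wrappers around ptxas/nvlink.
git clone https://github.com/MentorEmbedded/nvptx-tools.git
cd nvptx-tools
./configure --prefix="$install_dir"
make && make install
cd ..

# 2. Fetch the GCC sources and the newlib port used by the nvptx target.
git clone https://github.com/MentorEmbedded/nvptx-newlib.git
svn checkout svn://gcc.gnu.org/svn/gcc/trunk gcc
ln -s ../nvptx-newlib/newlib gcc/newlib
host_target=$(gcc/config.guess)

# 3. Build the accelerator (nvptx-none) compiler.
mkdir build-nvptx-gcc && cd build-nvptx-gcc
../gcc/configure --target=nvptx-none \
    --with-build-time-tools="$install_dir/nvptx-none/bin" \
    --enable-as-accelerator-for="$host_target" \
    --disable-sjlj-exceptions --enable-newlib-io-long-long \
    --enable-languages=c,c++,fortran,lto --prefix="$install_dir"
make -j"$(nproc)" && make install
cd ..

# 4. Build the host compiler with nvptx offloading enabled.
mkdir build-host-gcc && cd build-host-gcc
../gcc/configure --enable-offload-targets=nvptx-none \
    --with-cuda-driver-include=/usr/local/cuda/include \
    --with-cuda-driver-lib=/usr/local/cuda/lib64 \
    --disable-multilib --disable-bootstrap \
    --enable-languages=c,c++,fortran,lto --prefix="$install_dir"
make -j"$(nproc)" && make install
```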

Add $install_dir/lib64 to LD_LIBRARY_PATH, and the compiler can now be used to offload OpenACC code by compiling as
$install_dir/bin/gcc -O3 -fopenacc test.c

or OpenMP code as
$install_dir/bin/gcc -O3 -fopenmp test.c

You may need to pass -foffload=-lm to the compiler if the code you offload contains math functions that cannot be directly generated as PTX instructions.

1. Does gcc try to examine code and then decide whether to generate accelerator or normal CPU code based upon what it thinks will be useful in any given situation? What does it ensure about semantics? I would expect that optimal performance would often be achieved by having an accelerator perform some operations in parallel with the main CPU; does gcc use "restrict" to determine when that is and is not safe?

1. GCC does not try to schedule things on the GPU by itself — you need to decorate the code using #pragma directives to tell the compiler that you intend it to run on the GPU (and that it is safe to do so).

2. Will the #pragma be taken as an "order" to use the GPU, or merely as an invitation to see if using the GPU would appear advantageous? If a piece of code would take 20us to run on the main CPU, or 5us on the main CPU plus 30us on the GPU, it would seem that using the GPU would be a win if the main CPU could overlap enough computation with the GPU that it would end up being idle for 15us or less waiting for the GPU to finish. Does gcc try to handle overlap, and if so, how?

3. It will be taken as an "order" to use the GPU, and GCC does not try to handle overlap.

2. I'm having problems making it work. Can you help me? CUDA works on my machine. It is a CentOS system, and I compiled with gcc 7.2.

It compiles the code, but when I run it, I get:

libgomp: target function wasn't mapped

Any ideas?

When I compile without the -flto flag, it doesn't compile, and I get these errors:

gcc: warning: ‘-x lto’ after last input file has no effect
gcc: fatal error: no input files
compilation terminated.
lto-wrapper: fatal error: offload/wrk/install/bin/gcc returned 1 exit status
compilation terminated.
collect2: fatal error: lto-wrapper returned 1 exit status
compilation terminated.

2. I don't have any good idea what may be wrong...

But something seems strange with your installation – it should not need -flto. I'll try to figure out how it differs from my installation if you mail me (krister.walfridsson at gmail dot com) the output of

libgomp.spec is supposed to be present in the same directory as libgomp.so (i.e. install_dir/lib64).
You can see where GCC tries to find it if you compile using -v – the search path is the one shown as LIBRARY_PATH right before the error message.