14 November 2016

Playing with OpenCL

I spent last week reading up on modern C++ developments, including some great essays from Herb Sutter. I was particularly struck by his prescient series on Moore's Law, The Free Lunch Is Over and Welcome to the Jungle. The latter essay portrays all possible computer architectures on a 2D plane of CPU versus memory architecture. The axes are a bit tricky, but the general idea is that a platform at the "origin" is predictable and easy to program for, whereas things get trickier as you move up and/or right.


This figure describes everything from cellphones and game consoles to supercomputers and communications satellites. It also got me wondering how hard a simple "hello world" OpenCL program would be to get running on my Intel-only laptop. Can I do this in a day, or perhaps just an evening?

Personally, I don't want to fuss with hardware right now - I just want to see how the GPU/C++ pieces fit together. Conveniently, my laptop contains a low-end 24-core embedded GPU on its Broadwell chip. Running Debian, I was able to easily download the requisite packages and get started. I quickly discovered that consumer Intel GPUs, at present, do *not* support double-precision floating point at all, making this a less-than-ideal test-bed for scientific programming.

My test case was inspired by a performance issue that I ran into in my work. In a scientific simulation program that I use & help develop, Valgrind revealed that pow(double, double) was taking fully half of the total computational time. Poking around a bit, I see that pow() and log() really are quite complex to compute, particularly for doubles (since the total effort is a function of precision). With this in mind, I set up a simple example using both OpenCL and straight C++, and compared timings. Note - I strongly recommend using a sample size of greater than one to draw any conclusions with real-life consequences!
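For readers who want a feel for the straight C++ side, here is a minimal sketch of that kind of test. It is not the code from my repo: the names pow_all and time_pow are made up for illustration, and the exponent and input sizes are whatever you pass in. It assumes single-precision input and simply applies std::pow to every element, then times a few repeated passes.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive per-element work: y = pow(x, p).
std::vector<float> pow_all(const std::vector<float>& xs, float p) {
    std::vector<float> out(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) {
        out[i] = std::pow(xs[i], p);  // float overload from <cmath>
    }
    return out;
}

// Wall-clock seconds taken by `reps` full passes over a non-empty input.
double time_pow(const std::vector<float>& xs, float p, int reps) {
    assert(!xs.empty());
    volatile float sink = 0.0f;  // keeps the compiler from discarding the work
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        std::vector<float> out = pow_all(xs, p);
        sink = sink + out.back();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling something like time_pow(xs, 1.5f, 10) with a large xs gives a rough CPU baseline; repeat the measurement several times before trusting it.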

In this example, the vanilla C++ is clean and easy to read, but is ~20x slower than the OpenCL version. Worried about the possibility of "unintended optimizations", I tried using a different kernel function. I used float for both examples to keep the total computational complexity the same. The speedup persisted, but with the new kernel the two versions no longer produced identical answers. To the best of my understanding, this highlights differences in precision-sensitive operations between the OpenCL and stdc++ platforms. This is a pretty tricky area - just know that it's something to keep an eye on if you require perfect concordance between platforms.
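One way to keep an eye on this is to compare the two result arrays with a tolerance instead of expecting bit-for-bit equality. The sketch below is an assumption on my part about how you might do that, not something from the test repo: max_rel_diff and roughly_equal are hypothetical helpers, and the right tolerance depends entirely on your application.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest relative difference between two equal-length result vectors,
// e.g. one copied back from the GPU and one computed on the CPU.
double max_rel_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double x = a[i];
        double y = b[i];
        double scale = std::max(std::fabs(x), std::fabs(y));
        if (scale > 0.0) {  // skip pairs where both values are exactly zero
            worst = std::max(worst, std::fabs(x - y) / scale);
        }
    }
    return worst;
}

// True when every pair agrees to within rel_tol (a caller-chosen tolerance).
bool roughly_equal(const std::vector<float>& a, const std::vector<float>& b,
                   double rel_tol) {
    return max_rel_diff(a, b) <= rel_tol;
}
```

Printing max_rel_diff for the two outputs tells you how far apart the platforms are, which is more useful than a bare "the answers differ".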

EDIT: I also added an example using Boost.Compute today, which brings the best of both the C++ and OpenCL worlds. Boost.Compute has straightforward docs, and includes a nice closure macro that allows the direct incorporation of C++ code in kernel functions. The resulting code is *way* less verbose than vanilla OpenCL. The only downsides are the extra dependency and some *very* noisy compiler warnings.

Here's a full example that can be found in my GitHub test code repo (Boost not shown, timings comparable with OpenCL):

$ make; time ./opencl; time ./straight

g++ -Wall -std=c++11 -lOpenCL -o opencl opencl.cpp
g++ -Wall -std=c++11 -o straight straight.cpp
Using platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2

200 201 202 203 204 205 206 207 208 209
99990 99991 99992 99993 99994 99995 99996 99997 99998 99999
./opencl  0.20s user 0.09s system 97% cpu 0.295 total

200 201 202 203 204 205 206 207 208 209
99990 99991 99992 99993 99994 99995 99996 99997 99998 99999
./straight  6.03s user 0.00s system 99% cpu 6.032 total

Hopefully this example helps you get started experimenting with GPU computing. As Herb Sutter points out, we can expect more and greater hardware parallelism in the near future. Discrete GPUs are now commonly used in scientific computing, and Intel is now selling a massively-multicore add-on card, the Xeon Phi processor. Finally, floating-point precision remains an interesting question to keep an eye on in this domain.