A bit of CUDA

So, over the last couple of months I’ve been doing a Udacity course on parallel programming with CUDA (and thus Nvidia GPUs). The course introduced me to the basics of CUDA and to devising parallel solutions for everyday problems that are usually solved with sequential algorithms.

Why do this course? Well, it ties into my master’s studies. As mentioned earlier this year, I am looking at ways to process cellular automata with parallel algorithms, rather than the usual sequential solutions. A cellular automaton is fundamentally parallel, since each cell can be updated independently of the others. One therefore needs a solution that can actually exploit this behavior, and this is where the GPU comes in. But how exactly?

Before I explain the coding side of things, let’s first look at a couple of aspects of the hardware side of parallel programming on a typical Nvidia GPU. First off, each CUDA-enabled Nvidia GPU has a number of CUDA cores, grouped into Streaming Multiprocessors (or SMs). The GPU I am using, a GeForce GT 650M, has 2 SMs, each with 192 CUDA cores, for a grand total of 384. Each SM also has its own shared memory that it can use during a computation, and it has access to the GPU’s global memory, where data-sets are stored. When work is performed on the GPU, each SM is handed a thread-block, which contains a fixed number of threads; these threads are executed in parallel by the SM’s CUDA cores. Most GPUs these days allow a maximum of 1024 threads per thread-block.
There is a lot more information that can be mined about one’s GPU, and it can be accessed by running the deviceQuery sample from the terminal (if you’re a Linux user).
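The same information can also be queried programmatically through the CUDA runtime API. A minimal sketch using cudaGetDeviceProperties (querying device 0, i.e. the first GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Fill prop with the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);

    printf("Device name:           %s\n", prop.name);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory/block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory:         %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```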
Secondly, in order to process a data-set, we must understand the hierarchy of this whole system. The CPU is the host: it issues instructions to the GPU, and it copies data between main memory and GPU memory. The GPU is the subordinate device in this hierarchy, and it executes the instructions handed down to it by the CPU.

CPU and GPU hierarchy

Now, looking at the coding side of things, we must first select a CUDA-supported language. I am using CUDA C/C++, which is a set of CUDA-accelerated libraries, compiler directives, and extensions to C and C++. The same CUDA platform has also been developed for Fortran, and there are third-party wrappers available for other common programming languages, including Python, Perl, Java, Haskell and MATLAB (among others).

When tackling a problem in parallel, we must be aware of how the GPU will go about solving it. First, we must look at how the data-set on which the algorithm operates is divided among the GPU’s SMs. As mentioned earlier, each SM works on a single thread-block at a time. Since a thread-block is essentially a collection of threads, we must decide how to assign data to a single thread. For example, if we process an image, we can assign all the pixels in the image to a single thread, assign each row to its own thread, or assign each pixel to its own thread. This decision is influenced by the dimensionality of the data-set being worked on (an image being a two-dimensional data-set). Having decided how to allocate data to a thread, the entire set is then divided into thread-blocks, which the GPU schedules onto its SMs (in no guaranteed order) until all thread-blocks have been processed.
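For the pixel-per-thread mapping, the launch configuration might be sketched like this (the 16×16 block size and the image dimensions are assumptions for illustration, not requirements):

```cuda
// Assuming a width x height image with one thread per pixel.
const int width  = 1024;
const int height = 768;

dim3 block(16, 16);  // 256 threads per block, in a 2-D layout
dim3 grid((width  + block.x - 1) / block.x,   // round up so that every
          (height + block.y - 1) / block.y);  // pixel is covered by a thread

// myKernel<<<grid, block>>>(...);  // hypothetical kernel launch
```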

Now, we must take into account the hierarchy described above when implementing a solution. I have compiled my own list of instructions:

  1. Declare a SEQ data-set
  2. Allocate and assign a pointer to the SEQ result memory location
  3. Allocate and assign a pointer to the PAR input data memory location
  4. Allocate and assign a pointer to the PAR result memory location (when applicable)
  5. Load the SEQ data-set to PAR memory
  6. Set threads per thread-block and thread-block size/dimension
  7. Call KERNEL to perform PAR work
  8. Load results calculated by KERNEL from PAR memory to SEQ memory
  9. Free PAR memory used during work performed

In this list, SEQ refers to sequential operations and the host’s main memory, whereas PAR refers to GPU (device) operations and its memory. A KERNEL refers to a C/C++ function that is executed in parallel on the GPU.
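Under those conventions, the steps above can be sketched in CUDA C/C++ as follows (doubleValues is a hypothetical kernel that doubles each element; the numbered comments match the list):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for step 7: doubles each element of the input.
__global__ void doubleValues(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

int main() {
    const int N = 1024;
    int h_in[N], h_out[N];                 // 1, 2: SEQ data-set and result
    for (int i = 0; i < N; ++i) h_in[i] = i;

    int *d_in, *d_out;                     // 3, 4: PAR input and result
    cudaMalloc(&d_in,  N * sizeof(int));
    cudaMalloc(&d_out, N * sizeof(int));

    // 5: load the SEQ data-set into PAR memory
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    // 6, 7: set the launch configuration and call the KERNEL
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    doubleValues<<<blocks, threads>>>(d_in, d_out, N);

    // 8: load the results from PAR memory back to SEQ memory
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

    // 9: free the PAR memory used during the work
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```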

A KERNEL function is always preceded by the __global__ qualifier, so that the compiler can distinguish between sequential and parallel functions. The KERNEL must always have a reference to the location of the data-set, as well as to where the calculated results should be stored; these are passed as pointer arguments. The KERNEL is implemented around our decision of how to allocate data to a thread. To make it easy for a specific thread to access a specific value (or set of values) in the data-set, we use the built-in variables threadIdx (a thread’s index within its thread-block) and blockIdx (a thread-block’s index in the overall grid of thread-blocks). The variable blockDim (the number of threads per block in each dimension) is combined with these to compute a thread’s global index into the data-set. Once a thread is assigned its data to work on, the rest of the KERNEL function contains the algorithm to be performed on the data, and the result gained is written to the result memory location. (This is the overall strategy followed when implementing a KERNEL function.)
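For a two-dimensional data-set, the same index arithmetic applies along both axes. A sketch of the skeleton for the pixel-per-thread case (the invert operation and the row-major layout are assumptions for illustration):

```cuda
// Each thread handles one pixel of a width x height grayscale image,
// stored in row-major order. The inversion is just a placeholder operation.
__global__ void invertPixels(const unsigned char *in, unsigned char *out,
                             int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // this thread's row
    if (x < width && y < height) {
        int idx = y * width + x;  // flatten 2-D coordinates to a 1-D offset
        out[idx] = 255 - in[idx];
    }
}
```

The bounds check is needed because the grid is rounded up to whole blocks, so some threads along the edges may fall outside the image.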

Currently, I am applying this knowledge to a very basic cellular automaton: the Game of Life. I recently implemented a CUDA KERNEL to calculate the next generation of the automaton. The approach is known as a stencil operation: each cell in the automaton is assigned to a thread, and each thread then looks at its cell’s Moore neighborhood to count how many neighboring cells are alive. That count, together with the current status of the thread’s own cell, determines whether the cell will be alive in the next generation.
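A sketch of such a stencil KERNEL (the grid layout and the dead-cells-beyond-the-border boundary handling are my assumptions here):

```cuda
// One thread per cell; cells outside the grid are treated as dead.
__global__ void nextGeneration(const int *current, int *next,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Count live cells in the Moore neighborhood (the 8 surrounding cells).
    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                alive += current[ny * width + nx];
        }
    }

    // Conway's rules: a live cell survives with 2 or 3 live neighbors;
    // a dead cell becomes alive with exactly 3.
    int self = current[y * width + x];
    next[y * width + x] = (alive == 3 || (alive == 2 && self)) ? 1 : 0;
}
```

Note that the kernel reads from one buffer and writes to another: every thread must see the *current* generation, so the update cannot be done in place.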

When comparing the common CPU implementation of calculating the cellular automaton’s next generation (which uses a nested for-loop to iterate over the entire automaton) to the GPU implementation described above, I’ve gained a speed-up of between 5 and 6 times over the CPU implementation. Interestingly, even a different CUDA KERNEL I wrote, in which a thread processes an entire row of cells, is about twice as fast as the CPU implementation. This experiment was performed on a cellular automaton of 2500×2500 cells (1000×1000 cells for the KERNEL that assigns a row to each thread). Note that the CPU I used has a clock rate of 2.4 GHz, while the GPU has a clock rate of 950 MHz and only 2 SMs.

It is evident that we get a performance increase when using the GPU. However, this experiment did not include any optimization on either side. On the CPU we could add threading to see whether there is any significant increase in performance, and on the GPU we could use shared memory to limit the number of global-memory reads and writes.
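As a rough illustration of the shared-memory idea (a sketch only; the tile size and halo handling are assumptions): each block first copies its tile of cells, plus a one-cell halo around it, into shared memory, so that the eight neighborhood reads per cell hit fast on-chip memory instead of global memory.

```cuda
#define TILE 16

__global__ void nextGenerationShared(const int *current, int *next,
                                     int width, int height) {
    // The block's tile plus a one-cell halo on every side.
    __shared__ int tile[TILE + 2][TILE + 2];

    int x0 = blockIdx.x * TILE;  // global origin of this block's tile
    int y0 = blockIdx.y * TILE;

    // Cooperatively load the (TILE+2)^2 region; out-of-range cells are
    // dead. Each thread may load more than one cell.
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < (TILE + 2) * (TILE + 2);
         i += TILE * TILE) {
        int lx = i % (TILE + 2), ly = i / (TILE + 2);
        int gx = x0 + lx - 1, gy = y0 + ly - 1;
        tile[ly][lx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                           ? current[gy * width + gx] : 0;
    }
    __syncthreads();  // wait until the whole tile is loaded

    int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
    if (x >= width || y >= height) return;

    // Neighborhood count now reads only from shared memory.
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    int alive = tile[ty-1][tx-1] + tile[ty-1][tx] + tile[ty-1][tx+1]
              + tile[ty  ][tx-1]                  + tile[ty  ][tx+1]
              + tile[ty+1][tx-1] + tile[ty+1][tx] + tile[ty+1][tx+1];

    int self = tile[ty][tx];
    next[y * width + x] = (alive == 3 || (alive == 2 && self)) ? 1 : 0;
}
```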
