GPU support

Overview

GPU support can be essential for certain algorithms and computations provided by the data task components. This section gives a detailed explanation of how we foresee GPUs being integrated into the RTC Tk, and how this has been done in the provided example.

The use of GPUs is to be confined to the data tasks, as these implement complex algorithms that may require the power of GPUs to meet deadlines. As such, the GPU development tools should not interfere with the wider ELT development environment: the use of GPUs shall not add requirements or dependencies to the ELT dev env, be it compiler versions or anything else.

CUDA

Installation

If GPU support is required, it is assumed that the installation is handled directly by the user. The RTC Tk alpha release assumes that the CUDA_PATH environment variable is set. This is not planned as a long-term dependency, but for the alpha release it gives an easy method of checking for the CUDA installation.

Waf Support

Building

The current implementation of waf supports the use of CUDA and will offload any CUDA files (.cu) to NVCC for compilation. The .cu extension currently causes problems when files are specified in wscripts using source=[]. A workaround is shown below:

from wtools.module import declare_custom

def build(bld):
    bld.auto_cshlib(target='exampleGpuLib',
                    features='cxx',
                    source=bld.path.ant_glob('src/*.cu'),
                    cxxflags=[''],
                    use=['cuda', 'cudart', 'cublas'])

declare_custom(provides="exampleGpuLib")

Supported versions

The RTC Toolkit has only been tested with CUDA version 11.1 and is configured for that version. Other versions of CUDA may work but are not officially supported by the toolkit. To change the version used, it may be required to modify the wscripts.

As part of the current top-level wscript we explicitly check for cuBLAS version 11.1, but this can be modified to be generic or to check for a different version. This is done in the line below:

cnf.check_cfg(package='cublas-11.1', uselib_store='CUBLAS', args='--cflags --libs', mandatory=False)

CUDA has only been tested with the RTC Tk on CentOS 8 and is not supported on CentOS 7.

Example Data Task GPU

Alongside the exampleDataTask we have provided a Data Task example that uses a GPU to perform the computation.

Source Code Location

The example source code can be found in the following sub-directory of the rtctk project:

_examples/exampleDataTask/gpu

Modules

The provided example is composed of the following waf modules:

  • appGpu - Mirrors the app module from exampleDataTask

  • exampleGpuLib - Provides the GPU-specific code

  • Scripts - Provides Python scripts that support testing

The GPU-specific code has been confined to exampleGpuLib, and the Computation class in appGpu acts as an interface to this library.

appGpu

BusinessLogic

This module contains the BusinessLogic and Computation classes. The BusinessLogic is identical to the one used in the non-GPU exampleDataTask; it just instantiates the GPU Computation instead.

Computation

The Computation class calls functions from exampleGpuLib, which contains all the GPU-specific code that requires NVCC to compile.

exampleGpuLib

This is the example library that shows how GPU-specific code can be wrapped in a library and linked against later. The library provides the following interfaces:

exampleGpuLib(int input_length, int output_length, int gpu);
~exampleGpuLib();

void SetMatrix(float * mat, bool flip = true);             // upload the matrix, optionally transposing it
std::vector<float> GetMatrix();                            // download the matrix
void SetAvgSlopes(std::vector<float> avg_array);           // upload the average slopes vector
std::vector<float> GetAvgSlopes();                         // download the average slopes vector
std::vector<float> GetResults(bool download = false);      // fetch the computation results
void NewSample(const float * sample, int callback_count);  // data processing callback
void Compute();
void initReaderThread();

The majority of these interfaces upload or download a specific vector or matrix to or from the GPU, and their function is self-evident.
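
As an illustration, below is a minimal sketch of how a Computation class might drive these interfaces; the function name, dimensions and data values are placeholders rather than values taken from the example, and the exampleGpuLib header is assumed to be included.

#include <vector>

// Hypothetical driver code; sizes and data are illustrative only.
void RunGpuComputation()
{
    const int n_slopes = 8;    // assumed input length
    const int n_results = 4;   // assumed output length
    exampleGpuLib gpu_lib(n_slopes, n_results, /*gpu=*/0);

    // Upload the matrix; the default flip=true transposes it from the
    // row-major layout used by the RTC Tk to column-major for the GPU.
    std::vector<float> matrix(n_slopes * n_results, 0.0f);
    gpu_lib.SetMatrix(matrix.data());

    // The data task invokes NewSample() once per incoming sample.
    std::vector<float> sample(n_slopes, 1.0f);
    gpu_lib.NewSample(sample.data(), /*callback_count=*/0);

    gpu_lib.Compute();

    // Passing true requests that the results be copied back from the GPU.
    std::vector<float> results = gpu_lib.GetResults(true);
}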

SetMatrix has an optional bool that transposes the matrix: the RTC Tk supports only row-major matrices while Nvidia supports only column-major, so the transpose is needed when uploading. This is a simple function that we hope to later incorporate into the supporting class type.
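
A minimal sketch of the kind of host-side transpose this implies (illustrative only, not the toolkit's actual code) is:

// Convert a row-major matrix to the column-major storage expected on the
// GPU; the function name and signature are hypothetical.
static void TransposeRowToColMajor(const float *in, float *out,
                                   int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            out[c * rows + r] = in[r * cols + c];
        }
    }
}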

NewSample is the callback used for data processing.

void
exampleGpuLib::NewSample(const float *sample, int callback_count)
{
   cudaError_t error;
   error = cudaMemcpy(m_avg_slopes_d, sample, m_slopes * sizeof(float), cudaMemcpyHostToDevice);
   cudaDeviceSynchronize();
   PrintCudaError(error);
   CumulativeAverage_cuda<<<(m_slopes+255)/256, 256>>>(m_slopes_vector_d, m_avg_slopes_d, m_slopes, currentSample);
   currentSample++;
}

This copies the data onto the GPU via cudaMemcpy and then triggers the CumulativeAverage_cuda kernel, which executes asynchronously. This kernel performs a cumulative average on the incoming pixels, which means the slope averaging done during the exampleDataTask compute is no longer required.
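
The kernel itself lives in exampleGpuLib and is compiled by NVCC. As a hedged sketch of what a cumulative-average kernel of this shape could look like (the actual CumulativeAverage_cuda implementation may differ), consider:

// Illustrative sketch only, not the actual exampleGpuLib kernel: avg holds
// the running average, sample the latest data copied in by NewSample(),
// length the number of elements, and n the number of samples seen so far.
__global__ void CumulativeAverageSketch(float *avg, const float *sample,
                                        int length, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        // Incremental mean update: avg_{n+1} = avg_n + (x - avg_n) / (n + 1)
        avg[i] += (sample[i] - avg[i]) / static_cast<float>(n + 1);
    }
}

Launched with the <<<(m_slopes+255)/256, 256>>> configuration shown above, each thread updates one element of the running average.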