GPU support¶
Overview¶
GPU support can be essential for certain algorithms and computations provided by the data task components. This section gives a detailed explanation of how we foresee GPUs being integrated into the RTC Tk and how this has been done in the provided example.
The use of GPUs is to be confined to the data tasks, as these implement complex algorithms that may require the power of GPUs to meet deadlines. As such, the GPU development tools should not interfere with the wider ELT development environment. The use of GPUs shall not add requirements or dependencies to the ELT dev env, be it compiler versions or anything else.
CUDA¶
Installation¶
If GPU support is required, it is assumed that the installation is handled by the user directly. The RTC Tk alpha release assumes that the CUDA_PATH environment variable is set. This is not planned as a long-term dependency, but for the alpha it gives an easy method of checking for the CUDA installation.
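For illustration only, the sketch below shows how a program can verify that CUDA_PATH is set before relying on it. This is a hypothetical standalone check, not toolkit code; the actual check is performed by the build system at configure time.

#include <cstdlib>
#include <iostream>

int main()
{
    // Hypothetical check, mirroring what the alpha release expects:
    // CUDA_PATH must point at the local CUDA installation.
    const char* cuda_path = std::getenv("CUDA_PATH");
    if (cuda_path == nullptr) {
        std::cerr << "CUDA_PATH is not set - GPU support cannot be configured\n";
        return 1;
    }
    std::cout << "Using CUDA installation at " << cuda_path << "\n";
    return 0;
}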
Waf Support¶
Building¶
The current implementation of waf supports the use of CUDA and will offload any CUDA files (.cu) to NVCC for compilation. The .cu extension currently causes problems when specified in wscripts using source=[]. A workaround is shown below:
from wtools.module import declare_custom

def build(bld):
    bld.auto_cshlib(target='exampleGpuLib',
                    features='cxx',
                    source=bld.path.ant_glob('src/*.cu'),
                    cxxflags=[''],
                    use=['cuda', 'cudart', 'cublas'])

declare_custom(provides="exampleGpuLib")
Supported versions¶
The RTC Toolkit has only been tested with CUDA version 11.1 and is configured for that version. Other versions of CUDA may work but are not officially supported by the toolkit. To change the version used it may be required to modify the wscripts.
As part of the current top-level wscript we explicitly load cublas version 11.1, but this can be modified to be generic or to load a different version. This is done in the line below:
cnf.check_cfg(package='cublas-11.1', uselib_store='CUBLAS', args='--cflags --libs', mandatory=False)
CUDA has only been tested with the RTC Tk on CentOS 8 and is not supported on CentOS 7.
Example Data Task GPU¶
Alongside the exampleDataTask we have provided a Data Task example that uses a GPU to perform the computation.
Source Code Location¶
The example source code can be found in the following sub-directory of the rtctk project:
_examples/exampleDataTask/gpu
Modules¶
The provided example is composed of the following waf modules:
appGpu - Mirrors the app module from exampleDataTask
exampleGpuLib - Provides the GPU-specific code
Scripts - Provides Python scripts that support testing
The GPU-specific code has been confined to exampleGpuLib, and the computation class in appGpu acts as an interface to exampleGpuLib.
appGpu¶
BusinessLogic¶
This module contains the BusinessLogic and computation. The BusinessLogic is identical to the one used in the non-GPU exampleDataTask, except that it instantiates the GPU computation.
Computation¶
The computation calls functions from exampleGpuLib, which contains all the GPU-specific code that requires NVCC to compile.
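To illustrate this interface pattern, a hedged sketch is given below. The class name Computation comes from the example, but the method names OnSample and Run are hypothetical; the point is that all CUDA code stays behind the exampleGpuLib member, so only that library needs NVCC.

#include <vector>
// Assumes the exampleGpuLib header is included.

class Computation {
public:
    Computation(int input_length, int output_length, int gpu)
        : m_lib(input_length, output_length, gpu) {}

    // Forward each incoming sample to the GPU library's callback.
    void OnSample(const float* sample, int count) { m_lib.NewSample(sample, count); }

    // Run the GPU computation and download the results.
    std::vector<float> Run() { m_lib.Compute(); return m_lib.GetResults(true); }

private:
    exampleGpuLib m_lib;  // all GPU-specific code lives behind this member
};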
exampleGpuLib¶
This is the example library that shows how GPU-specific code can be wrapped in a library and linked against later. The library provides the following interfaces:
exampleGpuLib(int input_length, int output_length, int gpu);
~exampleGpuLib();
void SetMatrix(float * mat, bool flip = true);
std::vector<float> GetMatrix();
void SetAvgSlopes(std::vector<float> avg_array);
std::vector<float> GetAvgSlopes();
std::vector<float> GetResults(bool download = false);
void NewSample(const float * sample, int callback_count);
void Compute();
void initReaderThread();
The majority of these interfaces upload or download a specific vector or matrix to or from the GPU, and their function is self-explanatory.
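To make the call sequence concrete, a minimal usage sketch is shown below. The helper RunSketch and its arguments are hypothetical and not part of the toolkit; it simply strings the interfaces listed above together in their typical order.

#include <vector>
// Assumes the exampleGpuLib header is included.

void RunSketch(exampleGpuLib& lib,
               std::vector<float>& matrix,
               const std::vector<std::vector<float>>& samples)
{
    // Upload the row-major matrix; flip=true transposes it to column major.
    lib.SetMatrix(matrix.data(), true);

    // Feed each incoming sample to the data-processing callback.
    int callback_count = 0;
    for (const auto& sample : samples) {
        lib.NewSample(sample.data(), callback_count++);
    }

    // Run the main computation, then download the results from the GPU.
    lib.Compute();
    std::vector<float> results = lib.GetResults(true);
}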
SetMatrix has an optional bool that transposes the matrix: the RTC Tk supports only row-major matrices while Nvidia supports only column major, so the matrix is transposed on upload. This is a simple function that we hope to later incorporate into the supporting class type.
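For reference, a minimal sketch of the row-major to column-major conversion that the flip option implies is given below. The helper name and dimension parameters are hypothetical; the toolkit's actual implementation may differ.

#include <vector>

// Element (r, c) of an n_rows x n_cols row-major matrix is stored at
// index r * n_cols + c; in column-major storage it lives at c * n_rows + r.
std::vector<float> TransposeToColumnMajor(const float* mat, int n_rows, int n_cols)
{
    std::vector<float> out(static_cast<size_t>(n_rows) * n_cols);
    for (int r = 0; r < n_rows; ++r) {
        for (int c = 0; c < n_cols; ++c) {
            out[static_cast<size_t>(c) * n_rows + r] =
                mat[static_cast<size_t>(r) * n_cols + c];
        }
    }
    return out;
}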
NewSample is the callback used for data processing.
void
exampleGpuLib::NewSample(const float *sample, int callback_count)
{
    cudaError_t error;
    // Copy the new slope sample from the host into device memory.
    error = cudaMemcpy(m_avg_slopes_d, sample, m_slopes * sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    PrintCudaError(error);
    // Launch the kernel with one thread per slope (256 threads per block).
    CumulativeAverage_cuda<<<(m_slopes + 255) / 256, 256>>>(m_slopes_vector_d, m_avg_slopes_d, m_slopes, currentSample);
    currentSample++;
}
This copies the data onto the GPU via cudaMemcpy and then triggers the CumulativeAverage_cuda kernel, which runs asynchronously. The kernel computes a cumulative average over the incoming slope samples, which means the slope averaging done during the exampleDataTask compute is no longer required.
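The kernel body is not shown above; the following is a hedged sketch of what a cumulative-average kernel with this launch signature could look like. It assumes the first argument holds the running averages and the second the newly uploaded sample, matching the call in NewSample; the real toolkit kernel may differ.

__global__ void
CumulativeAverage_cuda(float* avg, const float* sample, int n_slopes, int sample_index)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_slopes) {
        // Incremental running mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n,
        // with n = sample_index + 1 for a zero-based sample counter.
        avg[i] += (sample[i] - avg[i]) / static_cast<float>(sample_index + 1);
    }
}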