Patrick Viry (Ateji), Mathias Beck 1,2
Philippe Coucaud 3
Diego Ordonez Blanco 1,2
Laurent Eyer 2
Peregrine Park 2
1 ISDC Data Centre for Astrophysics
2 Geneva Observatory, University of Geneva
GPU accelerators can provide impressive speedups for computationally intensive algorithms, such as the identification of periodic signals, at a very good cost/performance ratio. They are, however, notoriously complex to program, and they cannot be targeted directly from Java code, since they lack some of the features commonly found in CPUs that are required to run a Java virtual machine.
The goal of this project is to evaluate a novel approach based on language extensions. Parallel programming primitives are added at the language level, in the Java source code, so that the compiler becomes aware of parallel processing and can distribute and coordinate code fragments over multiple CPUs in general, and over GPUs in particular via source-to-source translation to OpenCL. This language-based approach has been pioneered by Ateji after years of research and remains unique. Compared to more traditional approaches based on libraries or interfaces to low-level code, it brings important improvements in development time, code quality, and ease of understanding and maintaining the source code. It can be integrated seamlessly into the existing Java development chain.
The Gaia experiment, entirely coded in Java, was in search of a solution for improving the speed of computationally intensive code fragments without leaving the Java ecosystem, and will be one of the first applications of this new tool.
The Java code for three of the computationally intensive period search algorithms (Lomb-Scargle, Deeming, and String-Length) has been made parallel using the new language constructs and first tested on multicore processors. Correctness of the numerical results has been validated. Annotations have then been added to the code to offload parallel computations to the GPU. Preliminary tests show slight performance improvements.
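To illustrate why period search lends itself to this kind of parallelization, here is a minimal sketch of a Lomb-Scargle periodogram in which each trial frequency is evaluated independently. It uses standard Java parallel streams as a stand-in and does not reproduce the language constructs described in this abstract; the class and method names are hypothetical, and the power is the classic Scargle form without variance normalization.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration, not the project's code.
final class LombScargleSketch {

    /** Lomb-Scargle power at one angular frequency (not divided by the variance). */
    static double power(double[] t, double[] y, double mean, double omega) {
        // Time offset tau that makes the sine and cosine terms orthogonal.
        double s2 = 0.0, c2 = 0.0;
        for (double tj : t) {
            s2 += Math.sin(2.0 * omega * tj);
            c2 += Math.cos(2.0 * omega * tj);
        }
        double tau = Math.atan2(s2, c2) / (2.0 * omega);

        double yc = 0.0, ys = 0.0, cc = 0.0, ss = 0.0;
        for (int j = 0; j < t.length; j++) {
            double arg = omega * (t[j] - tau);
            double c = Math.cos(arg);
            double s = Math.sin(arg);
            double d = y[j] - mean;
            yc += d * c;
            ys += d * s;
            cc += c * c;
            ss += s * s;
        }
        return 0.5 * (yc * yc / cc + ys * ys / ss);
    }

    /** Evaluates all trial frequencies (cycles per time unit) in parallel. */
    static double[] periodogram(double[] t, double[] y, double[] freqs) {
        double mean = Arrays.stream(y).average().orElse(0.0);
        // Each frequency is independent of the others, so the loop is data-parallel.
        return IntStream.range(0, freqs.length)
                .parallel()
                .mapToDouble(k -> power(t, y, mean, 2.0 * Math.PI * freqs[k]))
                .toArray();
    }

    public static void main(String[] args) {
        int n = 200;
        double[] t = new double[n];
        double[] y = new double[n];
        for (int j = 0; j < n; j++) {
            t[j] = 0.1 * j + 0.03 * Math.sin(j);        // unevenly sampled times
            y[j] = Math.sin(2.0 * Math.PI * 0.5 * t[j]); // signal at frequency 0.5
        }
        double[] freqs = new double[40];
        for (int k = 0; k < freqs.length; k++) freqs[k] = 0.05 * (k + 1);
        double[] p = periodogram(t, y, freqs);
        int best = 0;
        for (int k = 1; k < p.length; k++) if (p[k] > p[best]) best = k;
        System.out.println("Peak frequency: " + freqs[best]);
    }
}
```

The outer loop over frequencies carries no dependency between iterations, which is the property that makes it a candidate for multicore execution or GPU offloading; the inner sums over observations are reductions and are kept sequential in this sketch.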
The major bottlenecks that have been identified are (1) the Java/OpenCL interface, (2) the lack of true 2D arrays in the Java language, (3) the relative performance of OpenCL (portable) vs. CUDA (proprietary), and (4) the fact that some algorithms are inherently sequential. Additional performance should be gained with compiler improvements (1,2), a CUDA back-end (3) and algorithm choice (4).
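Bottleneck (2) can be made concrete with a small sketch. A Java `double[][]` is an array of references to separately allocated rows, so its elements are not guaranteed to be contiguous in memory, whereas a device buffer is typically a single contiguous block. The helper below, with hypothetical names, shows the manual workaround of copying a rectangular array into one row-major `double[]`; it is an illustration only and not the compiler improvement mentioned above.

```java
// Hypothetical illustration, not the project's code.
final class FlatArraySketch {

    /** Copies a rectangular double[][] into one contiguous row-major double[]. */
    static double[] flatten(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        double[] flat = new double[rows * cols];
        for (int i = 0; i < rows; i++) {
            if (a[i].length != cols) {
                throw new IllegalArgumentException("ragged row " + i);
            }
            // Row i occupies the index range [i * cols, (i + 1) * cols).
            System.arraycopy(a[i], 0, flat, i * cols, cols);
        }
        return flat;
    }

    /** Reads element (i, j) from the flattened row-major layout. */
    static double get(double[] flat, int cols, int i, int j) {
        return flat[i * cols + j];
    }
}
```

The extra copy is itself a cost on every transfer, which is one reason this point appears among the bottlenecks rather than as a solved problem.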
Applications fully (re-)written in CUDA can exhibit impressive speedup figures on specific algorithms, but this comes at the cost of a significant disruption in the development process. In contrast, our approach aims at a good performance-to-effort ratio, with lower speedup figures but minimal changes to the Java code and the development chain. Preliminary results confirm the validity of this approach.
Paper ID: P159