Laszlo Dobos (Eotvos University), Alexander Szalay (The Johns Hopkins University), José Blakeley (Microsoft Corporation), Tamás Budavári (The Johns Hopkins University), István Csabai (Eotvos University)
Today's scientific simulations produce output on the scale of 10–100 TB. Data volumes of this size require data handling techniques that go beyond ordinary files. Relational database systems have been used successfully to store and process scientific data, but new requirements constantly generate new challenges. Moving terabytes of data among servers in a timely manner is a hard problem, even with the newest high-throughput networks. It is therefore necessary to move the computations as close to the data as possible and to minimize client-server overhead. At a minimum, data subsetting and preprocessing have to be done inside the server process. Out-of-the-box commercial database systems perform very well in scientific applications from the perspective of data storage optimization, data retrieval and memory management, but they lack basic functionality such as handling scientific data structures or enabling advanced math inside the database server. The most important gap in Microsoft SQL Server is the lack of a native array data type. Fortunately, the technology exists to extend the database server with custom-written code that enables us to address these problems.
We present the prototype of a custom-built extension to Microsoft SQL Server that adds array handling functionality to the database system. With our Array Library, fixed-size arrays of all basic numeric data types can be created and manipulated efficiently. The library is also designed to integrate seamlessly with the most common math libraries, such as BLAS, LAPACK and FFTW. With the help of these libraries, complex operations such as matrix inversions or Fourier transformations can be done on the fly, from SQL code, inside the database server process.
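To illustrate the idea of a fixed-size array stored as a single database value, the following minimal Python sketch packs a numeric array into one binary blob and reads a single element back without unpacking the rest. It is not the Array Library's actual storage format or API: the header layout (type code, rank, dimension sizes), the row-major element order and the names `pack_array` and `get_item` are all assumptions made for this example.

```python
import struct

# Hypothetical element types: name -> (type code stored in the header, struct format)
DTYPES = {"int32": (1, "i"), "float32": (2, "f"), "float64": (3, "d")}
FORMATS = {code: fmt for code, fmt in DTYPES.values()}


def pack_array(values, shape, dtype="float64"):
    """Serialize a fixed-size array into bytes (assumed layout, little-endian).

    Layout: 1 byte type code, 1 byte rank, one int32 per dimension,
    then the elements in row-major order.
    """
    count = 1
    for n in shape:
        count *= n
    if count != len(values):
        raise ValueError("shape does not match the number of values")
    code, fmt = DTYPES[dtype]
    header = struct.pack("<BB", code, len(shape))
    header += struct.pack(f"<{len(shape)}i", *shape)
    return header + struct.pack(f"<{count}{fmt}", *values)


def get_item(blob, index):
    """Read one element directly from the blob by its multi-dimensional index."""
    code, rank = struct.unpack_from("<BB", blob, 0)
    shape = struct.unpack_from(f"<{rank}i", blob, 2)
    if len(index) != rank:
        raise IndexError("index rank does not match array rank")
    # Convert the multi-dimensional index to a row-major flat offset.
    flat = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of bounds")
        flat = flat * n + i
    fmt = "<" + FORMATS[code]
    offset = 2 + 4 * rank + flat * struct.calcsize(fmt)
    return struct.unpack_from(fmt, blob, offset)[0]


# Usage: a 2 x 3 array of doubles stored as one binary value.
blob = pack_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3))
item = get_item(blob, (1, 2))
```

Because the element size and shape are fixed and recorded in the header, an element's byte offset can be computed directly. This is the property that makes subsetting inside the server process cheap compared with shipping the whole array to a client.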
We are currently testing the prototype with several scientific data sets: the INDRA cosmological simulation will use it to store particle and density data from N-body simulations, and the Via Lactea project will use it to store Milky Way simulation data.
Paper ID: P034