
How to Design a SciDB Database

PGBWIP

This brief note is a "how to" on designing and creating a SciDB database. The specifics of this case come from the Large Synoptic Survey Telescope (LSST). We're not going to implement the whole thing here. We'll just explore how one of their data sets can be represented and queried in SciDB.

Note that the code base I'm building this on is our 0.75 release. We're going to have that baked and served by the end of the year.

Introduction and Overview

The general area of science we care about here is astronomy. Basically, what you see when you look up at night. There's quite a lot there. And quite a lot more than you think there is. The basic Objects catalog that LSST is going to support will hold about 50,000,000,000 (50 billion) objects. For each of these objects, the catalog will record about 400 attributes of scientific interest.

This is all a bit much for our little exercise here. So instead, we'll focus on a smaller, public data set: the USNO-B1 catalog, which, for more energetic readers with fat internet pipes, can be found here. For everyone's benefit, I've included a few lines below to illustrate what the data looks like:

# RA       DEC     A:1  2  3  4       5      6  7      8  9    10 11    12 13
0.00000, 89.09700, 0, 0, 0, 0, 1952.6, 20.47, 0, 19.37, 3, 0.00, 0, 0.00, 0
0.00000, 64.70845, 8, 3, 0, 1, 1977.5, 18.56, 0, 16.11, 9, 17.37, 10, 16.20, 8
0.00000, 42.62667, 0, 0, 0, 0, 1972.2, 20.29, 3, 19.71, 7, 16.25, 8, 15.29, 7
0.00000, 36.47284, -194, 62, -400, 38, 1983.2, 0.00, 0, 19.87, 8, 21.39, 7, 20.31, 5
0.00000, 13.51690, 0, 0, 0, 0, 1995.6, 0.00, 0, 0.00, 0, 0.00, 0, 20.64, 2
0.00000, 12.99846, -4, 4, -16, 2, 1976.7, 16.44, 4, 15.55, 8, 16.45, 2, 15.39, 7
0.00000, -27.74313, 196, 68, -402, 121, 1977.4, 0.00, 0, 19.58, 9, 20.53, 9, 20.30, 2
...

The first two columns here are the Right Ascension (RA) and Declination (DEC) describing the object's position in the sky. More about these later. The other attributes in the data set are observational measurements of these objects. The table below gives a (very) brief description of each attribute and a suitable data type for it.

Attribute  Description                                            Type
A1         Proper motion RA / yr (milli-arcsec)                   int32
A2         error in RA pm / yr (milli-arcsec)                     int32
A3         Proper motion in DEC / yr (milli-arcsec)               int32
A4         error in DEC pm / yr (milli-arcsec)                    int32
A5         epoch of observations in years with 1/10yr increments  double
A6         B mag                                                  double
A7         flag                                                   int32
A8         R mag                                                  double
A9         flag                                                   int32
A10        B mag2                                                 double
A11        flag2                                                  int32
A12        R mag2                                                 double
A13        flag2                                                  int32
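
To make the column layout concrete, here is a minimal Python sketch that turns one line of the sample data into a typed record, using the types from the table above (int32 becomes a Python int, double becomes a float). The names `UsnoRow`, `CASTS` and `parse_row` are made up for this illustration. They are not part of SciDB or of the USNO-B1 distribution, and this is not how the SciDB loader works.

```python
from typing import NamedTuple


class UsnoRow(NamedTuple):
    """One catalog line: RA, DEC, then attributes A1..A13 (hypothetical record type)."""
    ra: float
    dec: float
    a1: int     # proper motion in RA
    a2: int     # error in RA proper motion
    a3: int     # proper motion in DEC
    a4: int     # error in DEC proper motion
    a5: float   # epoch of observations
    a6: float   # B mag
    a7: int     # flag
    a8: float   # R mag
    a9: int     # flag
    a10: float  # B mag2
    a11: int    # flag2
    a12: float  # R mag2
    a13: int    # flag2


# One conversion function per column, in file order.
CASTS = (float, float, int, int, int, int, float,
         float, int, float, int, float, int, float, int)


def parse_row(line: str) -> UsnoRow:
    """Split a comma-separated catalog line and convert each field to its type."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(CASTS):
        raise ValueError(f"expected {len(CASTS)} fields, got {len(fields)}")
    return UsnoRow(*(cast(value) for cast, value in zip(CASTS, fields)))


row = parse_row("0.00000, 36.47284, -194, 62, -400, 38, 1983.2, "
                "0.00, 0, 19.87, 8, 21.39, 7, 20.31, 5")
print(row.dec, row.a1, row.a5)
```

A real loader would also skip the `#` header line and the trailing `...`; that is left out here to keep the sketch short.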

Our goal in this exercise is to create a suitable SciDB array to hold this data, to load the data into the array, and to run a few queries. At this point we begin to explore some of the ways SciDB differs from other database engines. In SQL databases, there is only one kind of column in a table: an attribute. The key to understanding how best to use SciDB is to understand the distinction we draw between dimensions and attributes. Dimensions are similar to a SQL key: for any combination of dimension values in an array, there can be only a single list of attributes. But dimensions convey more information than SQL keys do. Given two lists of dimension values, you can always figure out things like how far apart they are. That is, SciDB's dimensions emphasize ideas like relative position, co-locality, and distance.
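
The distinction between dimensions and attributes can be sketched in a few lines of Python. This is a toy model only, and none of it is SciDB code: `TinyArray` and `distance` are names invented for this note, the dimensions are two generic axes called x and y, and plain Euclidean distance is an assumption chosen because it is the simplest thing that shows the idea.

```python
import math


class TinyArray:
    """Toy array: each list of dimension values maps to exactly one attribute list."""

    def __init__(self, dimension_names):
        self.dimension_names = tuple(dimension_names)
        self._cells = {}

    def insert(self, coords, attributes):
        coords = tuple(coords)
        if len(coords) != len(self.dimension_names):
            raise ValueError("wrong number of dimension values")
        # Like a SQL key: a given combination of dimension values may appear once.
        if coords in self._cells:
            raise KeyError(f"cell {coords} already holds attributes")
        self._cells[coords] = tuple(attributes)

    def get(self, coords):
        return self._cells[tuple(coords)]


def distance(a, b):
    """Unlike a SQL key, two lists of dimension values have a distance between them."""
    return math.dist(a, b)


arr = TinyArray(["x", "y"])
arr.insert((0, 0), (20.47, 19.37))
arr.insert((3, 4), (18.56, 16.11))
print(arr.get((3, 4)), distance((0, 0), (3, 4)))
```

The dictionary lookup is the part a SQL key would also give you. The `distance` function is the part it would not.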

Relative position isn't an interesting concept for many data management applications. There is no need, in a business data processing system, to answer questions about how far apart two account identifiers are, or to find the two departments closest to "Human Resources". But in many science applications, co-locality is an important concept in the data model. And that's what SciDB addresses.
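
For sky positions, a "two closest" question does make sense. The Python sketch below shows the kind of co-locality query we have in mind: given a target position, find the two nearest catalog positions. It assumes RA and DEC are in degrees and measures distance as great-circle angular separation using the haversine formula. The function names `angular_separation` and `closest` are invented for this example, the target position is arbitrary, and the brute-force sort is for illustration only. It says nothing about how SciDB evaluates such a query.

```python
import math


def angular_separation(p, q):
    """Great-circle separation in degrees between two (RA, DEC) positions in degrees."""
    ra1, dec1 = (math.radians(v) for v in p)
    ra2, dec2 = (math.radians(v) for v in q)
    # Haversine formula, with DEC playing the role of latitude and RA of longitude.
    hav = (math.sin((dec2 - dec1) / 2) ** 2
           + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(hav))))


def closest(target, positions, k=2):
    """Return the k positions nearest to target (brute force, fine for a handful of rows)."""
    return sorted(positions, key=lambda pos: angular_separation(target, pos))[:k]


# (RA, DEC) pairs taken from the first four sample lines of the catalog.
positions = [
    (0.0, 89.09700),
    (0.0, 64.70845),
    (0.0, 42.62667),
    (0.0, 36.47284),
]

print(closest((0.0, 40.0), positions))
```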