
SciDB Install Guide

SciDB is currently available for Red Hat 5.4 and the following Ubuntu versions: 9.10, 10.04 and 10.10. SciDB has also been compiled and tested on other Unix and Linux variants, such as Mac OS X and Fedora 14, but we do not provide binaries for them. For installations under a virtual machine, we use Oracle VirtualBox, but other virtual machines should work. Be aware that running in a virtual machine demands considerable resources, so unless you have a very capable machine, we suggest you run only a single instance on a virtual machine. If you use Ubuntu on VirtualBox and want to run your Ubuntu window in full-screen mode, you must install the Guest Additions package and restart VirtualBox.

Regardless of how you installed Ubuntu, this section lists what you need to install before you can build and use SciDB.

Preparing the Platform

Remote Execution Configuration (ssh)

Running a multi-node system like SciDB requires that the Linux user account that will be running the SciDB executable have password-less remote execution via ssh enabled between all nodes including the coordinator. We use this to initialize SciDB on all nodes, to start and stop it, and to communicate data and commands between nodes.

There are many tutorials available online on how to set this up.
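Once the keys are in place, it can help to confirm that every node accepts a login without prompting. The following is a minimal sketch, not part of SciDB: the function names are hypothetical, and the node addresses in the usage comment are placeholders borrowed from the cluster example later in this guide. It relies on the standard OpenSSH BatchMode option, which makes ssh fail rather than ask for a password.

```python
import subprocess


def ssh_check_command(host, user=None):
    """Build an ssh command that fails instead of prompting for a password."""
    target = f"{user}@{host}" if user else host
    # BatchMode=yes disables interactive prompts, so the command can only
    # succeed when key-based (password-less) login is configured.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", target, "true"]


def passwordless_ssh_works(host, user=None):
    """Return True if running `true` on the remote host needs no interaction."""
    try:
        result = subprocess.run(
            ssh_check_command(host, user),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # No ssh client is installed on this machine.
        return False
    return result.returncode == 0


# Example usage (placeholder addresses; replace with your own nodes):
# for node in ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]:
#     print(node, "ok" if passwordless_ssh_works(node, user="scidb") else "FAILED")
```

Run the check from the coordinator as the account that will run SciDB, since that is the account whose keys matter.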

Postgres Installation, Configuration and Access Control

SciDB has been tested with Postgres 8.4.X. A suitable version of Postgres (8.4.6 or 8.4.7) is typically available on Ubuntu.

Red Hat 5.4 includes support for PostgreSQL 8.1. We recommend upgrading to version 8.4.7. Download the RPM from the PostgreSQL download site (postgresql.org) and install it using yum. Afterwards, initialize it by creating a postgres user (if one was not created during installation) and running the initdb command as user postgres.

SciDB requires the postgresql-contrib package. If your system doesn't already include this package it must be installed (apt-get or yum).

By default, Postgres is configured to allow only local access via Unix-domain sockets. In a clustered environment, the Postgres DBMS needs to be configured to allow access from the different nodes in the system.

The easy fix for that is to modify the pg_hba.conf file (usually at /etc/postgresql/8.4/main/) and add the following line:

host    all         all         10.0.0.0/8          trust

(assuming your local network is 10.x.x.x) and restart Postgres. This configuration might pose security issues, though. For a more secure installation, see the PostgreSQL documentation on client authentication, which describes how to list specific host IP addresses, user names and role mappings. You might also need to edit the postgresql.conf file so that Postgres listens on the relevant port and IP address, as listening might be turned off or limited to localhost by default.
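When tightening the access rule, it is useful to check which node addresses a given address range would admit. The sketch below uses Python's standard ipaddress module; the function name and the sample addresses are illustrative assumptions, and it does not read or validate pg_hba.conf itself.

```python
import ipaddress


def rule_admits(cidr, client_ip):
    """Return True if client_ip falls inside the CIDR range of an access rule."""
    network = ipaddress.ip_network(cidr)
    return ipaddress.ip_address(client_ip) in network


# A whole 10.x.x.x network admits every node in it ...
print(rule_admits("10.0.0.0/8", "10.0.0.3"))
# ... while a /32 entry admits exactly one host.
print(rule_admits("10.0.0.1/32", "10.0.0.2"))
```

Listing one /32 entry per node is more restrictive than opening the entire network, at the cost of one line per node.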

After all this is done, verify that the PostgreSQL instance is running (on the same node as the SciDB coordinator). Depending on your installation of Postgres, you may have to specify postgresql with a different version number, or without the version number. For example, if you are not running as root, you could use:

sudo /etc/init.d/postgresql-8.4 status
sudo /etc/init.d/postgresql-8.4 start

It is a good idea to add the Postgres start command to the scripts that run after boot, to ensure that Postgres is always running.

Installing SciDB

There are two ways to install SciDB: either from a binary package, or else by building the system from source.

Build and Install SciDB from Sources

If you have chosen to install from sources, please follow the directions in this section. If you have successfully installed SciDB from a binary distribution, you can skip to the following section.

In order to build from scratch, we use the following packages:

  • cmake (2.8.3 or newer)
  • boost (1.42; the code has not yet been adjusted for changes made in newer versions)
  • protobuf6
  • libpqxx (3.0 or higher)
  • flex (2.5.35 or newer)
  • bison
  • log4cxx
  • apr
  • apr-util
  • cppunit
  • readline 6
  • bz2-dev
  • swig (2.0 or higher)
  • paramiko (for Python)
  • crypto (for Python)
  • subversion
  • doxygen (optional)

On Ubuntu, it is easy to install all of them:

sudo apt-get update

sudo apt-get install -y build-essential cmake libboost1.42-all-dev \
postgresql-8.4 libpqxx-3.0 libpqxx3-dev libprotobuf6 libprotobuf-dev \
protobuf-compiler doxygen flex bison  \
liblog4cxx10 liblog4cxx10-dev libcppunit-1.12-1 libcppunit-dev \
libbz2-dev postgresql-contrib  \
subversion libreadline6-dev libreadline6 \
python-paramiko python-crypto

and then download, build and install swig, as we are not aware of a pre-built package.

After downloading the SciDB tar.gz file, expand it and cd into the resulting directory. You can then build and install SciDB itself:

tar xzf scidb-11.06.tgz
cd scidb-11.06

cmake .
make 
sudo make install 

Building from source on Red Hat 5.4 is much harder, and we do not recommend it. The build was straightforward on Fedora 14, so we expect it to be much easier on newer versions of Red Hat.

To start and run SciDB, you need to configure the environment of the user who will run the SciDB instances. Add the following lines to your shell's configuration file. For example, if your shell is bash, they go in the .bashrc or .bash_profile file:

export SCIDB_VER=11.06
export PATH=/opt/scidb/$SCIDB_VER/bin:/opt/scidb/$SCIDB_VER/share/scidb:$PATH
export LD_LIBRARY_PATH=/opt/scidb/$SCIDB_VER/lib:$LD_LIBRARY_PATH

From a Binary Package

If you are installing a downloaded pre-built binary package (for example, scidb-RelWithDebInfo-10.06.0.2803-Ubuntu-10.10-amd64.deb), you can install it using dpkg for Ubuntu and rpm or yum for Red Hat. We currently provide packages for Ubuntu 9.10, Ubuntu 10.04 and Ubuntu 10.10, and an RPM for Red Hat 5.4.

Ubuntu

sudo dpkg -i scidb-11.06....deb 

dpkg does not resolve dependencies. You need to either install the dependencies manually or use apt-get to resolve any unmet dependencies on the system:

sudo dpkg -i scidb-11.06....deb # Fails due to unmet dependencies
sudo apt-get -f install # installs all dependencies
sudo dpkg -i scidb-11.06....deb # Succeeds now

Uninstall the package using the package name scidb:

sudo dpkg -r scidb

Red Hat

One could use rpm, yum or other tools to install on Red Hat. We prefer rpm:

rpm -i scidb-11.06.???-RedHat-5.4-x86_64.rpm

and to uninstall it:

rpm -e scidb

Environment Changes

To start and run SciDB, you need to configure the environment of the user who will run the SciDB instances (we suggest creating a dedicated user, scidb, for this purpose). Add the following lines to the user's shell configuration file (often .profile):

export SCIDB_VER=11.06
export PATH=/opt/scidb/$SCIDB_VER/bin:/opt/scidb/$SCIDB_VER/share/scidb:$PATH
export LD_LIBRARY_PATH=/opt/scidb/$SCIDB_VER/lib:$LD_LIBRARY_PATH

Cluster Software Installation

In a cluster environment, every node must have a copy of the software. You could install it on each node separately, but that is tedious and error-prone. We recommend exporting /opt (or at least /opt/scidb/11.06) via NFS or Samba and mounting it on all nodes.

Configure SciDB

Configuring SciDB prior to initialization requires that you check that the PostgreSQL DBMS is running, and that the SciDB configuration file (usually /opt/scidb/11.06/etc/config.ini) is appropriately set up.

Metadata Catalog Initialization

The SciDB installation provides a mechanism for creating a Postgres database to hold the SciDB metadata catalog, but it requires sudo privileges. If you cannot obtain sudo privileges for the local scidb user account on the coordinator node, ask your system administrator to run this script as the postgres user:

/opt/scidb/11.06/bin/scidb-prepare-db.sh

This script does the following:

  1. creates a new role or account (say test1user) with password (say test1passwd)
  2. creates a database for testing SciDB (say test1) using the role created in the first step
  3. creates a schema in that newly created Postgres database to hold the SciDB catalog data.

SciDB Configuration File

You need to create a configuration file for SciDB. By default, it is named config.ini and resides in the etc sub-directory of the installation tree (i.e., by default it is /opt/scidb/11.06/etc/config.ini). The config file can have multiple sections, one per service instance. The configuration 'test1' below is an example of the simplest configuration: a single-node, single-instance system (coordinator only).

[test1]
node-0=localhost,0
db_user=test1user
db_passwd=test1passwd
install_root=/opt/scidb/11.06
metadata=/opt/scidb/11.06/share/scidb/meta.sql
pluginsdir=/opt/scidb/11.06/lib/scidb/plugins
logconf=/opt/scidb/11.06/share/scidb/log4cxx.properties
base-path=/home/scidb/data
base-port=1239
interface=eth0
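To sanity-check a configuration before initializing SciDB, the file can be read with a generic INI parser. The sketch below assumes that config.ini is plain INI syntax that Python's standard configparser accepts; scidb.py may parse the file differently, and the helper name load_section is hypothetical.

```python
import configparser

# The single-node example from this guide, inlined so the sketch is self-contained.
SAMPLE = """\
[test1]
node-0=localhost,0
db_user=test1user
db_passwd=test1passwd
install_root=/opt/scidb/11.06
metadata=/opt/scidb/11.06/share/scidb/meta.sql
pluginsdir=/opt/scidb/11.06/lib/scidb/plugins
logconf=/opt/scidb/11.06/share/scidb/log4cxx.properties
base-path=/home/scidb/data
base-port=1239
interface=eth0
"""


def load_section(text, name):
    """Return one named configuration section as a plain dict of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[name])


cfg = load_section(SAMPLE, "test1")
# node-N holds "host,number-of-worker-instances".
host, workers = cfg["node-0"].split(",")
print(host, int(workers), cfg["base-port"])
```

To check a real installation, read the file contents from /opt/scidb/11.06/etc/config.ini and pass them to load_section in place of SAMPLE.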

The install package contains a sample configuration file, sample_config.ini, which is used for SciDB testing. It is an example of a configuration file containing multiple configurations: one for a single instance and one for two instances on the same node.

The following table describes the config file contents and how to set them:

Key | Value
Basic Configuration
cluster name | Name of the SciDB cluster, which must appear as a section heading in the config.ini file, e.g., [cluster1]
node-N | The host name or IP address used by node N and the number of worker instances on it. Node 0 always has the coordinator running as instance 0, and may have additional worker instances running as well.
db_user | Username to use in the catalog connection string. In this example, test1user
db_passwd | Password to use in the catalog connection string. In this example, test1passwd
install_root | Path name of the install root. Must be the same on all nodes. In a multi-node environment, sharing this path using NFS reduces the number of installations needed.
metadata | Metadata definition file. The recommended NFS configuration makes this visible under the same path name on master and worker processes.
pluginsdir | Plugins folder: the location of all plugins. Must be visible under the same path name to all workers.
logconf | Config file for the SciDB log. Edit this to set a different file name or log level (the default log level is INFO and the default file name is scidb.log).
Cluster Configuration
base-path | The root data directory for each SciDB instance. Note that this directory will be the same for all nodes (all instances read from the same config.ini). When each SciDB node is initialized, it creates its own sub-directories for its local data. E.g., if the base is /scidb, then /scidb/000/0 will hold the data and logs for the coordinator, while /scidb/001/1 will hold data for the first worker instance on the first worker node.
base-port | Base port number. Connections to the coordinator (and therefore to the system) are via this number, while worker instances communicate at base-port + instance number. The default number that iquery expects is 1239.
interface | Ethernet interface used by SciDB on all nodes, master and workers. Used to bind SciDB to the correct local interface.
ssh-port | (optional) The port that ssh uses for communications within the cluster. By default it is 22.
key-file-list | (optional) A comma-separated list of filenames that include keys for ssh authentication.
tmp-path | (optional) A directory to use as temporary space.
Performance Configuration
save-ram | (optional) 'True', 'true', 'on' or 'On' will enable this option. Off by default.
merge-sort-buffer | (optional) Size of the memory buffer used in merge sort. Default is 512 MB.
mem-array-threshold | (optional) Memory footprint for temporary arrays. Default is 1024 MB.
smgr-cache-size | (optional) Size of the buffer cache. Default is 256 MB.
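The base-path and base-port rules in the table can be illustrated with a short calculation. This sketch follows our reading of the descriptions above (a three-digit, zero-padded node number followed by the instance number, and base-port plus instance number); the function names are hypothetical and nothing here is taken from the SciDB code.

```python
def instance_data_dir(base_path, node, instance):
    """Data directory for an instance, e.g. /scidb/001/1 for node 1, instance 1."""
    # The node number is zero-padded to three digits, as in the /scidb/000/0 example.
    return f"{base_path.rstrip('/')}/{node:03d}/{instance}"


def instance_port(base_port, instance):
    """Port for an instance: the coordinator (instance 0) listens on base_port."""
    return base_port + instance


print(instance_data_dir("/scidb", 0, 0), instance_port(1239, 0))
print(instance_data_dir("/scidb", 1, 1), instance_port(1239, 1))
```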

In the example shown above, the db_user field is set to test1user and db_passwd is set to test1passwd. Due to a current limitation in the system, using the same user name with different cluster configurations might cause problems, so it is preferable to generate a unique user name for each cluster configuration. These are Postgres user names, not system users. The name of the Postgres database that is created is the same as the section header, test1.

Cluster Config File Example

The configuration file for a cluster is very similar to the single-node, single-instance example. Here is an example of a configuration with four nodes, each with two instances:

[cluster1]
node-0=10.0.0.1,1
node-1=10.0.0.2,2
node-2=10.0.0.3,2
node-3=10.0.0.4,2
db_user=cluster1
db_passwd=cluster1
install_root=/opt/scidb/11.06
metadata=/opt/scidb/11.06/share/scidb/meta.sql
pluginsdir=/opt/scidb/11.06/lib/scidb/plugins
logconf=/opt/scidb/11.06/share/scidb/log4cxx.properties
base-path=/mnt/scidb_data
base-port=1239
interface=eth0
ssh-port=27
key-file-list=/home/scidb/.ssh/keyfile1
tmp-path=/tmp

Note that each node has two instances, but node-0 is listed with 1. This is because the value specifies the number of worker instances, not total instances, and node-0 also runs the coordinator.

In addition to the extra nodes and instances, we also set three optional parameters here: ssh-port, key-file-list and tmp-path.
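The worker-versus-total distinction is easy to get wrong, so here is a minimal sketch that counts the total instances per node from the node-N entries. It assumes the file parses as standard INI and that node 0 always carries one extra instance (the coordinator), as described in the table above; the function name is hypothetical.

```python
import configparser

# Only the node entries of the cluster example are needed for the count.
CLUSTER = """\
[cluster1]
node-0=10.0.0.1,1
node-1=10.0.0.2,2
node-2=10.0.0.3,2
node-3=10.0.0.4,2
"""


def instances_per_node(text, section):
    """Map node number -> (host, total instances including the coordinator)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    counts = {}
    for key, value in parser[section].items():
        if not key.startswith("node-"):
            continue
        node = int(key[len("node-"):])
        host, workers = value.split(",")
        # node-N lists worker instances only; node 0 also runs the coordinator.
        total = int(workers) + (1 if node == 0 else 0)
        counts[node] = (host.strip(), total)
    return counts


counts = instances_per_node(CLUSTER, "cluster1")
print(counts)
print("total instances:", sum(total for _, total in counts.values()))
```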

Initializing and Starting SciDB

Use the scidb.py script to launch SciDB. The service is referred to using its name, the section header in the config file.

Run the following command to initialize SciDB on all nodes. If the scidb user has sudo privileges (see the previous section), this script creates the Postgres user account/role and database automatically.

scidb.py initall test1

To start the set of local SciDB instances specified in your config.ini file, use the following command:

scidb.py startall test1

This will report the status of the various nodes:

scidb.py status test1

and this will shut it down:

scidb.py stopall test1

SciDB logs are written to the file scidb.log in the appropriate directory for each instance: <base-path>/000/0 for the coordinator and <base-path>/M/N for worker node M, instance N.

iquery client

iquery is the default SciDB client used to issue AQL and AFL commands. iquery connects by default to SciDB on port 1239. If you use a non-default port number, specify it using the "-p" option with iquery.

  iquery -aq "list('arrays')"
  iquery -aq "list('operators')"

iquery is the basic command-line tool we use. At the moment, it connects on the local node to a standard port, where the SciDB engine is expected to be listening. Each invocation of iquery connects to the SciDB coordinator node, passes in a query, and prints out the coordinator node's response.
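Because each iquery invocation is a single request and response, it is straightforward to drive from a script. The sketch below is a hypothetical wrapper, not something shipped with SciDB: it only assembles the -a, -q and -p options described in this section and assumes iquery is on the PATH.

```python
import subprocess


def iquery_command(query, afl=True, port=None):
    """Build the argument list for one iquery invocation."""
    cmd = ["iquery"]
    if port is not None:
        # Only needed when the coordinator does not listen on the default port.
        cmd += ["-p", str(port)]
    # -a selects AFL; without it iquery expects an AQL statement.
    cmd += ["-aq" if afl else "-q", query]
    return cmd


def run_iquery(query, afl=True, port=None):
    """Run the query and return whatever the coordinator printed."""
    result = subprocess.run(
        iquery_command(query, afl=afl, port=port),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(iquery_command("list('arrays')"))
```

Calling run_iquery("list('arrays')") requires a running SciDB instance; iquery_command alone can be used to preview the command line.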

Example SciDB session

Before reviewing the complete AFL and AQL documentation it is useful to review this example SciDB session. Note that it's convenient to run the iquery tool in a new terminal window.

Basic Data Definition Language (DDL) and Data Loads

SciDB uses a language we call 'AQL', for 'Array Query Language'. In AQL, the basic structural motif is the array. To create an array in SciDB, use the following command:

$ iquery -q "CREATE ARRAY  test <a: int32, b: int32 > [x=0:2,3,0, y=0:2,3,0]"
Query was executed successfully

This creates an array called test with two attributes named a and b of type int32. The new array has rank (number of dimensions) 2 and the two dimensions or indices are named x and y. These dimensions range from 0:2, have chunks (physical storage size) of 3 in both dimensions, and each chunk has 0 overlap with its neighbors in both dimensions.
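To make the relationship between dimension bounds and chunk size concrete, the following sketch counts the cells and chunks of an array from its dimension specifications. It assumes that chunks tile each dimension in steps of the chunk size starting at the lower bound, and it ignores overlap; the function names are hypothetical.

```python
import math


def chunks_along(low, high, chunk_size):
    """Number of chunks needed to cover the inclusive range low..high."""
    return math.ceil((high - low + 1) / chunk_size)


def array_shape(dims):
    """Return (total cells, total chunks) for a list of (low, high, chunk_size)."""
    cells = 1
    chunks = 1
    for low, high, chunk_size in dims:
        cells *= high - low + 1
        chunks *= chunks_along(low, high, chunk_size)
    return cells, chunks


# The 'test' array: x=0:2,3,0 and y=0:2,3,0 (overlap is not used here).
print(array_shape([(0, 2, 3), (0, 2, 3)]))
```

For the test array this gives 9 cells in a single chunk, because each dimension holds 3 values and the chunk size is also 3.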

At the moment, we only support bulk loading data into SciDB (you could use scidb operators to generate data, of course). We have two external formats: one suitable for dense arrays (cases where the vast majority of the elements in the array contain actual values) and one designed for sparse arrays (cases where many of the elements are absent). The idea is that you create and populate file(s) with ASCII data that comply with the rules of one of these representations and then direct SciDB to load data from those file(s).

Dense arrays have the following format:

[
[(0,0),(0,1),(0,2)],
[(1,0),(1,1),(1,2)],
[(2,0),(2,1),(2,2)]
]

Sparse arrays have the following format:

[[
{0,0} (0,0),
{0,1} (0,1),
{0,2} (0,2),
{1,0} (1,0),
{1,1} (1,1),
{1,2} (1,2),
{2,0} (2,0),
{2,1} (2,1),
{2,2} (2,2)
]]

In the dense format the dimension values are implicit in the ordering of the attributes. In the sparse format, the dimension indices are explicitly enumerated in the load file. Note also that, in the sparse format, the array is broken up into chunk-sized sections. In this example, as there is only one chunk for the entire array, there is only one block. The following example illustrates how the load file for a very sparse array might look.

[[
{5,16} (0.497321)
]]
;
[[
{2,132} (0.944702)
]]
;
[[
{0,244} (0.657221)
,
{53,255} (0.609632)
,
{68,226} (0.448509)
]]
;
[[
{21,451} (0.767433)
,
{44,427} (0.613046)
]]

Assuming that you have placed data in one of these formats into a file named '/tmp/data.txt', to load the data from one of these files into the new array, use the load command as shown below.

$ iquery -aq "load(test,'/tmp/data.txt')"
[[]]

Now you have a small array, loaded with data.

Running AQL and AFL Queries

You also use iquery to run queries that retrieve and manipulate array data. By default iquery accepts AQL statements. Use the "-a" switch to issue AFL queries to SciDB.

$ iquery -aq "subsample(scan(test), 0, 0, 2, 2)"

The result is printed to stdout. Use the --result (-r) option to name a file to which the result is written. You can also connect to a remote SciDB coordinator node by supplying additional connection information. For complete details, see iquery -h.

End

Congratulations! You're done.