wiki:Docs/Release_11.06/Intro_11.06

Introduction to SciDB

SciDB is designed to run on a shared-nothing cluster of commodity servers (or nodes), each with it's own local storage, and interconnected by an Ethernet network; a physical architecture also known as a Just a Bunch of Disks (JBOD) configuration. One physical server in the cluster is selected to host the SciDB coordinator instance while the others all host SciDB worker instances that perform their own localized data storage and query processing. Multiple SciDB instances may reside on a single physical server (although we do not recommend more than a single instance per physical processor). We also recommend that the SciDB executables and configuration files are stored on a single file system which is shared by all the physical servers using a network file system facility like NFS.

SciDB currently supports a single coordinator node that will run on node-0. The coordinator instance is where external client applications connect, and it is responsible for query parsing, planning and coordinating query execution operations over the collection of SciDB worker nodes. The coordinator instance can perform all the functions of a worker instance, so in a single-node, single-instance SciDB setup, the SciDB instance will run as both coordinator and worker.

Our software can be thought of as providing a kind of distributed file system, only the 'files' SciDB presents to users are logical 'arrays', and the bits and pieces of the arrays are distributed evenly over the physical nodes. To keep track of all of this information -- what arrays exist, which nodes are where, what instances are running on which nodes, how array data is partitioned over SciDB instances, and even what physical operators the instance has available to it -- we use a centralized  PostgresSQL DBMS.

What a Running SciDB Instance Looks Like

In this section we give a brief overview of what a running, multi-node SciDB instance looks like.

                               [1] Client Connection
 NOTE: In general, the iquery  +--------------------+   
       client will run on the  |       iquery       |   
       same run on the same    |   Client program.  |
       physical node as the    |(any client user ID)|
       SciDB coordinator.      +--------------------+
                                          ^
                                          |
                                          v
 [2] Meta-Data Management      [3] Coordinator Node
 +----------------------+      +--------------------+
 |      postgres        |      |       scidb        |      
 |    PostgreSQL DBMS   |<---->|    SciDB engine.   |<--.
 |   ( postgres user )  |   |  | (scidb_local user) |   |
 +----------------------+   |  +--------------------+   |
                            |                           |
NOTES:                      |  [ 4] Worker Node         |
   In theory the PostgreSQL |  +--------------------+   |
   DBMS can run anywhere.   |  |       scidb        |   |
   In practice we suggest   |->|    SciDB engine.   |<--|
   that you run it on the   |  | (scidb_local user) |   | 
   same physical node as    |  +--------------------+   |
   the coordinator.         |                           |
                           ...          ...            ...
                            |                           | 
                            |  [N] Worker Node          |
                            |  +--------------------+   |
                            |  |       scidb        |   |
                            `->|    SciDB engine.   |<--'
                               | (scidb_local user) |
                               +--------------------+
  1. We currently support a simple client side API, and a simple command line tool called iquery. When users start iquery they may supply details about the physical node and the port on which the SciDB coordinator node is running (by default on the local host and port 1239). While in theory a client program like iquery might be run from anywhere, we currently recommend that it is run on one of the physical nodes on which SciDB is installed, and use the same user ID as the scidb instances.
  2. The PostgreSQL DBMS is the meta-data management service for SciDB. It needs to be accessible for connections from all of the physical nodes in the cluster. Each SciDB instance reads from its config file details about the machine on which the Postgres DBMS is running, the port on which it is listening, and the credentials (Postgres database role and password) it will use to connect to the Postgres DBMS to access global meta data.
  3. For simplicity, it's wise to set up an account on each local node to run the local SciDB server processes, and to allow these users secure communication among the physical nodes on the cluster using OpenSSH (see below).
  4. On each physical name, the local SciDB instances will create a local directory structure to hold local data. Again, for simplicity, we recommend that you create, on every physical node, a consistently named root data directory. Upon initialization, each SciDB instance will create an independent, local sub-directory structure. (Note: This means that when two SciDB worker instances share the same physical node, they will create independent data directories beneath it.)