wiki:Docs/Release_11.06/ArrayFunctionalLanguage_11.06

Contents:

  1. AFL: SciDB Array Functional Language
  2. DDL
    1. 1. CREATE ARRAY command
      1. Basic examples
    2. Special array types
      1. Empty arrays
      2. Versioned arrays
      3. Immutable arrays
    3. Special dimension types
      1. Unbounded dimensions
      2. Non-integer dimensions
    4. 2. Loading Data into a SciDB Array - load
      1. Load formats
      2. Dense load format
      3. Sparse data format
      4. NULL attribute format
      5. Loading an unbounded array
      6. Loading in parallel
      7. Loading through a Unix pipe
      8. csv2scidb
    5. 3. Unload array data to a file - save
    6. 4. Building array data - build
    7. 5. Build a sparse array - build_sparse
    8. 6. Input - input
    9. 7. Store data in an array - store
    10. 8. Updates
    11. 9. Remove an array - remove
    12. 10. Rename an array - rename
    13. 11. Change the attribute names, or dimension names of an array - cast
    14. 12. Change the name of an attribute - attribute_rename
  3. Metadata Operators
    1. 1. List contents of SciDB database - list
    2. 2. Getting dimensions of an array - dimensions
    3. 3. Getting attributes of an array - attributes
    4. 4. Get public format of SciDB array - show
    5. 5. Getting versions of an array - versions
  4. DML
    1. Overview of AFL DML
    2. 1. Selecting all the data in the array - scan
    3. 2. Selecting subset of data by dimension range - subarray
    4. 3. Selecting a subset of data within a specified region - between
    5. 4. Selecting a subset of data by boolean expression - filter
    6. 5. Select from an array based on a pattern - lookup
    7. 6. Projecting (selecting) array attributes - project
    8. 7. Joining two arrays - join
    9. 8. Computing new values - apply
    10. 9. Statistical summaries - aggregate
      1. 9.1. Arithmetic mean - avg
      2. 9.2 Count of cells - count
      3. 9.3 Maximum value - max
      4. 9.4 Minimum value - min
      5. 9.5 Sum of attribute values - sum
      6. 9.6 Variance of attribute values - var
      7. 9.6 Standard deviation of attribute values - stdev
    11. 10. Reduce array by aggregating fixed subarrays - regrid
    12. 11. Full cross product join of two arrays -- cross
    13. 12. Cross product join with equality predicates -- cross_join
    14. 13. Random sample of cells/chunks from an array -- sample and …
    15. 14. Substitute NULL values -- substitute
  5. Changing array shapes
    1. 1. Adding and Deleting dimensions - adddim and deldim
    2. 2. Concatenate two arrays - concat
    3. 3. Merge attributes - merge
    4. 4. Reshape an array - reshape
    5. 5. Re-partition an array - repart
    6. 6. Transform attributes to dimensions - redimension and …
    7. 7. Expanding an element into a grid - xgrid
    8. 8. Select a subplane of the array - slice
    9. 9. Sort array by attribute - sort
    10. 10. Transform into a single dimensional array - unpack
    11. 11. Transpose an array - transpose
    12. 12. Reverse array elements - reverse
  6. Matrix and Vector Operations
    1. 1. Multiply two matrices - multiply
    2. 2. Normalize vector elements - norm
  7. Loading and Unloading Plugins
  8. Debug and Internal commands
    1. 1. Cancel a running query - cancel
    2. 2. Output a string - echo
    3. 3. Explain plan - explain_logical and explain_physical
    4. 4. Get the signature of an operator - help
    5. 5. Scatter/Gather an array - sg
    6. 6. Reduce the distribution of a replicated array - reduce_distro

AFL: SciDB Array Functional Language

SciDB provides two programming interfaces:

  • AQL, an array query language that is similar to SQL.

  • AFL, a functional language that provides the same capabilities as AQL but with a functional syntax. AFL also provides support for querying array metadata and additional array operators.

This page describes the features of AFL supported in this release.

While AFL may be revised over time as the implementation evolves and matures, the AQL interface will remain fixed with a commitment to backward compatibility.

AFL divides up into three sub-languages:

  1. Data Definition Language (DDL). The DDL allows users to create, load, remove and modify arrays and their structure in a SciDB instance.
  2. Data Manipulation Language (DML). DML allows users to query the contents of arrays in a SciDB instance.
  3. Metadata operators. These operators provide information about the state of the database, listing arrays, types, operators and so on.

This document also covers AFL operators used for debugging and Linear Algebra operators.


DDL

1. CREATE ARRAY command

The CREATE ARRAY statement is used to a create new array. The statement specifies the array schema, a description of the array properties which includes the array shape (the number of dimensions and their sizes), and the array attributes the data items that appear in each cell.

Signature:

CREATE [array_type] ARRAY 
    array_name 
    < attribute_name : type_name [ NULL | NOT NULL ] [ COMPRESSION compressor ] [ RESERVE reserve ] [, ...] > 
    [ dimension_name = start: end|*, chunk_interval, chunk_overlap [, ...] ]

SciDB arrays have three elements -- name, array type, an attribute list, and an array shape (dimensional organization).

Array FeatureOptionalDescription
array_nameNoThe string name of the array, which is an AFL identifier. The array name uniquely identifies the array in the SciDB instance. That is, you cannot use the same array name twice in one SciDB instance.
array_typeYesBy default, all arrays are updatable and versioned, with all cells present. Users can optionally specify 'IMMUTABLE' (i.e. the array cannot be updated), 'EMPTY' (i.e. some cells may be omitted), or both.
attribute_nameNoName of an attribute, an AFL identifier. No two attributes in the same array can share a name.
type_nameNoType identifier. One of the types supported by SciDB: integer types, float, double, bool, string, char. See list('types').
nullabilityYes By default, all attributes are 'NOT NULL', i.e. they will have a value. Optionally, users can specify 'NULL' to indicate attributes that are allowed to contain null values.
compressorYesThe default is 'NO COMPRESSION'. Alternatively users can specify the name of a compression algorithm. Possible values are 'ZLIB', 'BZLIB', or 'NULL FILTER'.
reserveYesThe percentage (0 to 100) of space in each chunk that is reserved for storing future updates to the chunk. The default is 20.
dimension_nameNoEach dimension has a name that is also an AFL identifier. Just like attributes, each dimension must be named, and dimension names cannot be repeated in the same array
start (integer)NoThe starting coordinate of a dimension
end (integer)NoThe ending coordinate of a dimension, or * if unknown
chunk_intervalNoThe length of the data chunk along a dimension.
chunk_overlapNoThe length of overlap along a dimension.

Basic examples

The CREATE ARRAY takes the logical array definition--a list of dimensions of the array, and for each dimension, the length or size of that dimension, its minimum and maximum values. SciDB breaks the data in it into chunks. Chunks can overlap--an important feature of SciDB--although no current operators depend on it.

We show several examples of creating arrays using the CREATE ARRAY statement.

Example One:

To create a 2D array with dimension indices

  1. having dimension names 'x' and 'y',
  2. each ranging between 0 and 49 (having length 50)
  3. and stored within 25 chunks each of size 10x10.
  4. with no chunk overlap

and a single integer attribute named 'Val'

 CREATE ARRAY Example_One < Val: int32 > [ X=0:49,10,0, Y=0:49,10,0]

Example Two:

Or a 3D array with dimension indices

  1. having names X, Y and Z ,
  2. with values for the X and Y dimensions ranging between 0 and 9 (length 10) and the Z dimension ranging over 0:99 (length = 100)
  3. stored entirely within a single chunk of size 10x10x100.
  4. with no overlap.

and two attributes A and B being a string and a double precision value:

 CREATE ARRAY Example_Two < A: string, B: double > [ X=0:9,10,0, Y=0:9,10,0, Z=0:99,100,0 ]

Example Three:

The following is a 2D array with dimensions

  1. having names I and J i with I ranging over 0 to 99 and J over 0 to 199, and
  2. broken into chunks of size 10x20

The array's cells contain a pair of attribute values, A and B, having types int32 and float.

 CREATE ARRAY Example_Three <A:int32, B:float> [I=0:99,10,0, J=0:199,20,0]

Example Four:

And finally, the following CREATE ARRAY Example_Four (which becomes the basis for much of the descriptive material that follows).

The following is a 2D array with dimensions

  1. having names I and J i with I ranging over 0 to 9 and J over 0 to 9, and
  2. fitting into a single chunk of size 10x10

Our example has two attributes, A and B, having types int32 and double.

 CREATE ARRAY Example_Four < A: int32, B: double > [ I=0:9,10,0, J=0:9,10,0 ]

CREATE ARRAY accepts two special array qualifiers to indicate the following special types of arrays -- empty and immutable.

CREATE ARRAY also supports the following special types of arrays with unbounded dimensions. The create array declarations for these special types of arrays are described below.


Special array types

Empty arrays

The EMPTY keyword in the CREATE statement indicates that some cells may be absent.

CREATE EMPTY ARRAY A_10 <x:double null, y:double null>[i=1:10,5,0]

SciDB creates an additional boolean attribute for each cell in arrays that were declared EMPTY, referred to as the emptyTag, to indicate whether a cell is present.

Note that EMPTY is fundamentally different than NULL. EMPTY is a property of an entire cell and indicates that the entire cell has been omitted, i.e. is not present. In contrast, NULL is a property of a cell value. Use cases for NULL values are described below. A cell that contains one or more null values is not considered empty. Here is an example of data output for the array A_10 above where cells at 2,4,7 and 10 contain nulls, and cells at 5 and 9 are empty:

iquery -o sparse -aq "scan(a_10)"
{1}[{1}(1,1,true),{2}(null,3,true),{3}(4,5,true),{4}(0,null,true)];{6}[{6}(3,4,true),{7}(null,null,true),{8}(6,2,true)];[{10}(null,6,true)]

If the array is not EMPTY, then unspecified cells within existing chunks assume the default value (which in the current version is 0 for number-like attributes and empty string for strings). However, chunks that are missing entirely are always treated as regions of empty cells.

Versioned arrays

By default all arrays in SciDB are updatable (as opposed to IMMUTABLE) and versioned. The SciDB storage manager uses a "no overwrite" model in which chunks are never overwritten. Rather, updates cause new versions of the array to be created.

Array chunks can be updated by commands that write to array storage, namely load, store, and redimension_store.

Array versions are user-visible and operators may obtain prior versions of an array using a version qualifier. Both of these are discussed in the sections on versions and scan.

SciDB does not allow the array type (EMPTY vs. non-EMPTY) and (IMMUTABLE vs. updatable) to be altered after array creation. However, array data can be copied from one type of array to another with suitable transformations. These are discussed in the store operator.

Immutable arrays

Immutable arrays in SciDB are used when data will not or cannot change. The example below shows how to create an immutable array. Immutable arrays are expected to be rare in SciDB, but have reduced storage overhead relative to updatable arrays. Immutable arrays do not have versions. Only one write is allowed per chunk for such arrays even if these chunks are loaded using multiple load commands.

CREATE IMMUTABLE ARRAY IM <a: double > [x=0:5,3,0, y=0:5,3,0]

An array that is both EMPTY and IMMUTABLE must be declared as:

CREATE IMMUTABLE EMPTY ARRAY IE <a: double> [x=0:5,3,0, y=0:5,3,0]

Special dimension types

Unbounded dimensions

An array dimension can be created as an unbounded dimension by declaring the high boundary as open using '*'. Examples of unbounded dimensions include I=0:*,10,0 or J=-7:*,100,0. In the first example, I has a low boundary of 0 and an open high boundary and J has a low boundary of -7 and an open high boundary.

A regular array does not allow chunks to be loaded into chunk addresses that fall outside the region defined by the create array definition. However, with an array having unbounded dimensions, chunks can be incrementally loaded into the array into new chunks of the array and there is no limit on the array dimensions, other than storage resources available in the system. The array boundaries are dynamically updated as new data is added to it.

CREATE ARRAY open_array <a:int64>[x=0:*,5,0]

An example of incremental loading into an unbounded immutable array is described in the load section. Loading data into an unbounded array causes new array versions to be created, whereas no additional array versions are created when loading an unbounded immutable array.

Non-integer dimensions

Regular arrays in SciDB use the int64 data type for dimensions. SciDB also supports arrays with non-integer dimensions. These arrays map dimension values of a declared type to an internal int64 array position. Mapping is done through special mapping arrays internal to SciDB.

Below is an example of an array with a non-integer dimension:

CREATE ARRAY non_int_array <a:int64>[ID(string)=10,5,0]

This command will create an array with a non-integer dimension, named ID, that has 10 unique string dimension values mapped internally to positions 0,.., 9, and uses chunks of 5 cells, each with no overlap.

The most common way to store data into such an array is to use a redimension_store. However, any output array that has non-integer dimensions compatible with this array can be stored into it. See the section on this command for more details.

Mapping arrays

SciDB stores the mapping from value to position for these special indices using a mapping array. The name of the mapping array is "array@VERSION:dimname". This mapping array is used by SciDB array operators to translate from dimension value to an integer dimension in the array. The mapping array is fully replicated at every node in a cluster SciDB configuration. They are not intended for general purpose use in AFL commands. However they do appear in the list('arrays') output with special array names.

For example, when data is stored into the above array, we see the following new mapping array created for it. This array maps each string value to a corresponding integer array position.

 show("non_int_array@1:ID");
[("non_int_array@1:ID<value:string NOT NULL> [no=0:9,10,0]")]

If the base array is created as a standard versioned array, the corresponding mapping arrays are also versioned. Hence, the mapping array shown above corresponds to array version 1, and array version 2 of non_int_array has mapping array non_int_array@2:ID.


2. Loading Data into a SciDB Array - load

SciDB currently supports the load operator for loading data into a SciDB array.

Signature:

load (array_name : array_identifier, data_file : string [, nodeid : int] )

array_name is the array to be loaded. data_file is the path to a file to use -- absolute or relative to the working directory of the SciDB server. For a description of nodeid, see below section on parallel load.

SciDB data load files must be organized according to the schema of the target array.

Given an array defined as follows:

CREATE ARRAY Load_Example < Val: int32 > [X=1:100,25,0, Y=1:100,25,0]

Load files divide the data to be loaded into chunks. Load file chunks correspond to array chunks as defined by the CREATE ARRAY statement. So, for the example 2-dimensional array with the following chunk layout

C11 C12 C13 C14
C21 C22 C23 C24
C31 C32 C33 C34
C41 C42 C43 C44

The chunks in the load file must appear in the following order:

C11; C12; C13; C14; C21; C22; ...; C41; C42; C43; C44

Each chunk may also be prefixed with an optional chunk header that lists the dimension values of the starting element in that chunk. If this chunk header is not present, the file is assumed to contain all of the chunks in the order described above.

Load formats

SciDB supports two chunk formats in load files corresponding to dense and sparse data sets. The sparse format is more efficient when a majority of the cells in the array are absent and do not contain attribute data. A SciDB array loaded from a data file that has the sparse load format generates sparse chunks in array storage, whereas one loaded from a dense file format creates chunks that use the dense storage format.

Dense load format

In the following example we illustrate how a multi-chunk, dense array is created and loaded.

CREATE ARRAY Two_Dim<a: int32, c: char>[I=0:7,4,0, J=0:7,4,0]
load(Two_Dim, '/tmp/2d-mc.txt')

In the dense representation, the dimension values for each cell are implicit in the representation. Data is divided up into chunks, with each chunk enclosed within a '[ ]' and separated by a semi-colon.

Cells within each chunk must appear in left to right dimension order (e.g., row-major order for a two-dimensional array, or generalized appropriately to higher dimensions). Each cell contains a comma-separated list of attribute values placed within (). This 'dense' representation can denote empty cells using '()'.

Contents of /tmp/2d-mc.txt:

[
[ (0, 'A'), (1, 'B'), (2, 'C'), (3, 'D')],
[ (8, 'I'), (9, 'J'), (10, 'K'), (11, 'L')],
[ (16, 'Q'), (17, 'R'), (18, 'S'), (19, 'T')],
[ (24, 'Y'), (25, 'Z'), (26, 'A'), (27, 'B')]
];
[
[ (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H')],
[ (12, 'M'), (13, 'N'), (14, 'O'), (15, 'P')],
[ (20, 'U'), (21, 'V'), (22, 'W'), (23, 'X')],
[ (28, 'C'), (29, 'D'), (30, 'E'), (31, 'F')]
];
[
[ (32, 'G'), (33, 'H'), (34, 'I'), (35, 'J')],
[ (40, 'O'), (41, 'P'), (42, 'Q'), (43, 'R')],
[ (48, 'W'), (49, 'X'), (50, 'Y'), (51, 'Z')],
[ (56, 'E'), (57, 'F'), (58, 'G'), (59, 'H')]
];
[
[ (36, 'K'), (37, 'L'), (38, 'M'), (39, 'N')],
[ (44, 'S'), (45, 'T'), (46, 'U'), (47, 'V')],
[ (52, 'A'), (53, 'B'), (54, 'C'), (55, 'D')],
[ (60, 'I'), (61, 'J'), (62, 'K'), (63, 'L')]
]

For the simpler case of a one-dimensional array, the dense load file format looks like this:

[ (36, 'K'), (37, 'L'), (38, 'M'), (39, 'N')]; 
[ (44, 'S'), (45, 'T'), (46, 'U'), (47, 'V')];
[ (52, 'A'), (53, 'B'), (54, 'C'), (55, 'D')];
[ (60, 'I'), (61, 'J'), (62, 'K'), (63, 'L')] 

Sparse data format

The sparse data format is created as convenience - so that a large number of cells can be unspecified.

The sparse load format lists the data by chunks--there is a semi-colon between the chunks--and within each chunk the data is organized as a list of comma-separated cells, where each cell includes the coordinates and the attributes of the cell.

For example:

create array sparse_example<a:double> [x=0:8,2,0,y=0:3,2,0]

This array is a 2-D array contained within several 2x2 chunks. The following is an example file '/tmp/sparse_load.txt':

cat  /tmp/sparse_load.txt
[[ 
{0,0} (11)
{1,0} (21)
{0,1} (12)
]];
[[ 
{0,2} (13)
]];
[[ 
{2,0} (31)
{3,0} (41)
{2,1} (32)
{3,1} (42)
]];
[[ 
{2,2} (33)
{3,3} (44)
]];
[[ 
{7,0} (81)
{6,1} (72)
{7,1} (82)
]];
[[ 
{6,2} (73)
{7,2} (83)
{7,3} (84)
]];
[[ 
{8,0} (91)
]];
[[ 
{8,2} (93)
{8,3} (94)
]]

You would use the following load syntax to load this file into the sparse_example array:

 load(sparse_example, '/tmp/sparse_load.txt')

load() is implemented in SciDB as a combination of input -> store. These operators are discussed in later sections of the document.

Notes:

  1. The last chunk in the load file should not have a ';' chunk delimiter.
  2. The SciDB loader ignores additional newline or space characters in the load file.
  3. Choosing a particular load file format has no bearing on how data is stored internally in SciDB.
  4. In addition to storing the data, load and store operators also return the data back to the client (or next operator in the query). When the data is voluminous, this may not be desired and the output of the query should be suppressed. For instance, the iquery executable that accompanies SciDB includes the "-n" option for this purpose.
  5. Note that unspecified cells can have various interpretations, depending on how the array is declared. See the section on empty arrays.

NULL attribute format

Both the dense and sparse array formats can include special codes for NULL attributes. For example, if a faulty instrument occasionally fails to report a reading, that attribute could be represented in a SciDB array as NULL. If an erroneous instrument reports readings that are out of valid bounds for an attribute, that may also be represented as NULL.

NULL must be represented using the token 'null' or '?' in place of the attribute value.

In addition, NULL values can be tagged with a "missing reason code" to help a SciDB application distinguish among different types of null values -- for example, assigning a unique code to the following types of errors: "instrument error", "cloud cover", or "not enough data for statistically significant result". Or, in the case of financial market data, data may be missing because "market closed", "trading halted", or "data feed down".

Missing reason codes allow an application to optionally treat each kind of null as a special case, e.g. to supply or calculate a context-sensitive default value.

The examples below show how to represent missing data in the load file. ? or null represent null values, and ?2 represents null value with a reason code of 2.

[[ ( 10, 4.5, "My String", 'C'), (10, 5.1, ?1, 'D'), (?2, 5.1, "Another String", ?) ...

or

[[ ( 10, 4.5, "My String", 'C'), (10, 5.1, ?1, 'D'), (?2, 5.1, "Another String", null) ...

The substitute operator described later in this document can be used to replace missing values with user-defined values looked up from another SciDB array.

Loading an unbounded array

An unbounded array is one declared to have one or more unbounded dimensions. With an unbounded dimension there is no declared limit on the high value. SciDB allows an array to be expanded along unbounded dimensions and new chunks can be appended to an unbounded array after the initial load beyond the current maximum size of the dimension.

CREATE ARRAY open_array <a:int64>[x=0:*,5,0]

Load data in three phases into open_array described in the create statement above.

load(open_array, "/tmp/load_1.txt")
load(open_array, "/tmp/load_2.txt")
load(open_array, "/tmp/load_3.txt")

The load files are shown below and described in more detail in the sections on input and load. Initial load from /tmp/load_1.txt

[(0), (1), (2), (3), (4)];[(5), (6), (7), (8), (9)]

Additional loads:

/tmp/load_2.txt
{10}[(15), (16), (17), (18), (19)]
/tmp/load_3.txt
{15}[(10), (11), (12), (13), (14)]

Loading in parallel

The optional nodeid parameter instructs the load command to open and load data from a particular instance of SciDB. Possible nodeid values are:

nodeidDescription
0Coordinator
1..NNode id of a node in the cluster, obtained form the system catalog table node
-1All nodes in the cluster

Specifying "-1" for the nodeid causes a parallel load to all nodes in the cluster. The file path of the load file must be the same on all nodes in the cluster, but each node must be given distinct chunks to load. If multiple nodes attempt to load the same chunk during the, the load command will fail. Conversely, if any particular node cannot open the file, the load will continue after logging a warning.

If your SciDB installation has 4 nodes (say, node1, node2, node3, and node4) and the load must load 20 chunks, you can place chunks 1-5 on node1, chunks 6-10 on node2, and so on. The following load command will simultaneously load all 20 chunks into the array in roughly 1/4th the time compared to a coordinator load.

load (Array, '/tmp/load.data', -1);

Loading through a Unix pipe

All the above methods of loading SciDB can also be done via a Unix pipe (instead of load files). Pipes avoid creating and storing SciDB load files from source files.

csv2scidb

CSV is a common data exchange format. Many systems produce data in CSV format. SciDB provides a utility called csv2scidb to convert a csv file into a 1-dimensional array. This array must later be transformed using other SciDB operators such as redimension and redimension_store discussed later in this document.

Consider the csv file /tmp/observations.csv:

String_One,15354,01-01-2005 10:11:32,31.7257
String_One,15354,01-01-2005 10:11:35,404.0464
String_One,15354,01-01-2005 10:11:38,926.4216
String_One,,01-01-2005 10:11:41,16.7285
...

csv2scidb can be used to re-format this file into a dense SciDB load file ready to be loaded into the 1-dimensional array Raw_Load.

iquery -anq "create array Raw_Load<s: string, dt: dateTime, r: double>[1:*, 134340,0]"
mkfifo /tmp/load_pipe
cat /tmp/observations.csv  | awk -F, '{print $1","$3","$4}' | csv2scidb -c 134340 -p SSN > /tmp/load_pipe & 
iquery -anq "load (Raw_Load, '/tmp/load_pipe')"

The following arguments to csv2scidb are shown here. -c is used to specify the chunk size, and -p is used to specify a parse format for the file (S = string, N = numerical, C = char). See csv2scidb --help for more details on usage.


3. Unload array data to a file - save

Use the save operator to unload the data to an external file on the file system. The format of the file is the same as the load file format described above. If the internal chunk storage format of the array is dense, this file is unloaded into a dense file format. And if the array chunk storage format is sparse, the save output is in sparse file format.

Signature:

save( array, file_path : string )

Array name refers to the array to be unloaded and saved. File path is the path to a file to use -- if a relative path name is used this is assumed to be relative to the working directory of the SciDB server.

In a cluster configuration, save operates as follows. For each node in a cluster save creates a file with the same file name containing only the chunks stored on that node. The entire array data is obtained by concatenating saved files from all nodes.


4. Building array data - build

Return a dense single-attribute array, assigning values to its attributes using an expression. Expressions can refer to one of the functions supported in SciDB, and to dimension values.

Signature:

build( array : array_identifier | anonymous_schema, expression )

The build operator's first argument is the name of an array to be used as the template for the operator's result. build proceeds through the array, cell by cell, using the value of the exp : expression to compute the value of each cell. The expression can use any of the array's dimension values. On a cluster configuration, build works in parallel on all nodes.

The first example creates an identity matrix:

CREATE ARRAY Build_Example_INT < A: int32 > [ X=0:4,5,0, Y=0:4,5,0 ]

build (Build_Example_INT, iif(X=Y, 1, 0))
[
[(1),(0),(0),(0),(0)],
[(0),(1),(0),(0),(0)],
[(0),(0),(1),(0),(0)],
[(0),(0),(0),(1),(0)],
[(0),(0),(0),(0),(1)]
]

Another example shows how to build an array of monotonically increasing values from an anonymous schema:

build (<val:int64> [x=1:4,4,0,y=1:4,4,0], x*4+y-5)"
[
[(0),(1),(2),(3)],
[(4),(5),(6),(7)],
[(8),(9),(10),(11)],
[(12),(13),(14),(15)]
]

NOTE: The build operator does not store data and does not change the given array - it only uses the given shape to generate and return a result. To store the result of the build operator you will need to use the store operator (see below).


5. Build a sparse array - build_sparse

build_sparse works similarly to build but returns an EMPTY type array and accepts a second expression that must be boolean.

Signature:

build_sparse( array : array_identifier | anonymous_schema, exp : expression, bexp : expression )

The output of build_sparse contains empty cells wherever bexp evaluates to false. Both expressions can access any of the array coordinates. The following example returns a diagonal matrix:

build_sparse(<a1: double> [x=0:2,3,0, y=0:2,3,0], 1.0*x+100.0*y, x=y)

6. Input - input

Input is used to read a file from the filesystem, interpret the contents as cells in an array and return the result. Input has the exact same signature as load.

Signature:

input (array_name : array_identifier, data_file : string [, nodeid : int] )

Input works exactly the same way as load, except it does NOT store the data. The store operator can be used to persist the output of input, as in:

store ( input (a, 'somefile' ), a) -- equivalent to -- load ( a, 'somefile' )

or even

store ( input (a, 'somefile' ), b)

7. Store data in an array - store

The store operator takes the result of an AFL expression, and stores the result in the target array. If the target array does not exist, a new array is created using the schema of the input.

Signature:

store( array, target_array : array_identifier )

The store operator takes the input array and stores the result in the target_array. If target_array already exists and the schema of the input matches that of the target, the result is stored to it. Otherwise, target_array is created using the schema inferred from the input.

For example:

store ( build ( < A: int32 > [ X=0:4,5,0, Y=0:4,5,0 ] , 5*X + Y ), New_Build_Array)

8. Updates

Store is one of the AFL commands that can update an array. Unless the array is immutable, each execution of store causes a new version to be created. When an array is removed, so are all its versions. The example below shows the contents of file /tmp/dense2.txt used in the AFL examples below.

{0,3}
[
[(140),  (150),  (160) ],
[(240),  (250),  (260) ],
[(340),  (350),  (360) ]
];
{3,0}
[
[ (410),  (420),  (430) ],
[ (510),  (520),  (530) ],
[ (610),  (620),  (630) ]
]

The input file shown above contains 2 chunks at {0,3} and {3,0} that can be used to update a previously loaded array using:

store(merge(input(updarr, '../tests/basic/data/dense2.txt'), updarr), updarr)

This creates a new version of the updarr array.

Array versions are accessible from the AFL scan operator (see description below) using a version identifier. Versions can be identified either using the version number integer, or a timestamp. By default, the array name without a version identifier refers to the latest version of the array.

scan(updarr)
scan(updarr, now()-5)

Enumerate the versions of the updatable array using the versions operator.

project(versions(updarr), version_id)

To access older versions of the array, attach the suffix @timestamp to the array name. Use datetime to create a valid SciDB timestamp from a string representation of the timestamp. SciDB timestamp format is "12/29/10 12:57:24".

The following example shows how to access array versions using timestamps. This query returns the chunks of the array as of the timestamp '02/20/2002'.

scan(updarr@datetime('02/20/2002'))

If the array is immutable, store may only be used to add new chunks in places where chunks do not already exist. Using store to write over already existing chunks in an immutable array causes an error.

For example:

create immutable array a <a:double, b:double> [x=1:40,10,0];

cat /tmp/data1
{1}[(1,2),(3,4),(5,6),(7,8)];

store (input (a, '/tmp/data1'), a);
[(1,2),(3,4),(5,6),(7,8),(0,0),(0,0),(0,0),(0,0),(0,0),(0,0)]
--Note: the tail end of the chunk is not specified; the array is not EMPTY, therefore the unspecified values assume 0.

cat /tmp/data2
{11}[(9,10),(11,12),(13,14),(15,16)];

store (input (a, '/tmp/data2'), a);
[(1,2),(3,4),(5,6),(7,8),(0,0),(0,0),(0,0),(0,0),(0,0),(0,0),(9,10),(11,12),(13,14),(15,16),(0,0),(0,0),(0,0),(0,0),(0,0),(0,0)]
--Add another chunk to the array

store (input (a, '/tmp/data2'), a);
--error: chunk already present

9. Remove an array - remove

Removes an array from the SciDB instance.

Signature:

remove( array_identifier )

The only argument to the remove operator is the name of the array to be removed. The base array and all its versions are removed. If mapping arrays were created for any non-integer dimensions, they are also removed.

For example:

list ('arrays')
["Load_and_Append"),("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]

remove (Load_and_Append)
Query was executed successfully

list ('arrays')
[("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]

NOTES:

  1. Deleted array names can be re-used.
  2. See below for AFL operators to interrogate the SciDB instance about the shape, attributes and contents of an array.
  3. The remove operator currently does not reclaim storage.

10. Rename an array - rename

Rename an existing array in the database.

Signature:

rename( old_name : array_identifier, new_name : array_identifier )

The first argument is the name of the array to be renamed, and the second argument is the new array name.

Example:

 store ( build (< A: int32 > [ X=0:4,5,0, Y=0:4,5,0 ], iif(X=Y,1,0)), rename_example);
 list ('arrays');
 rename (rename_example, new_name);
 project (new_name, A);

[[(1),(0),(0),(0),(0)],
[(0),(1),(0),(0),(0)],
[(0),(0),(1),(0),(0)],
[(0),(0),(0),(1),(0)],
[(0),(0),(0),(0),(1)]]

NOTE:

  1. Once a rename is done, the old array name can be re-used. Any mapping arrays are also renamed.
  2. remove and rename are unsafe when concurrently with other queries on the same arrays.

11. Change the attribute names, or dimension names of an array - cast

The input array and template arrays should have the same numbers and types of attributes and the same numbers and types of dimensions.

This operator allows renaming an array or any of its attributes and dimensions. A single cast invocation can be used to rename multiple items at once -- one or more attribute names and/or one or more dimension names.

Note that the template-array need not have actual data and is only used as a template from which attribute names and dimension names are used. Note that the input array is not modified, rather, a new array is created as a result of the cast operator. To store this newly created array use the store operator.

Signature:

cast ( array, template : array_identifier | anonymous_schema )

One important application of cast is the resolution of naming conflicts:

store(build (<a:int64> [x=0:10,1,0], x), vector)
[(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)]

store(build (<a:int64> [x=0:10,1,0], -x), vector2)
[(0),(-1),(-2),(-3),(-4),(-5),(-6),(-7),(-8),(-9),(-10)]

apply(join(vector,vector2), b, a+a)
-- ERROR: a is ambiguous

apply(join(vector, cast (vector2, <b:int64>[x=0:10,1,0])), c, a+b)
[(0,0,0),(1,-1,0),(2,-2,0),(3,-3,0),(4,-4,0),(5,-5,0),(6,-6,0),(7,-7,0),(8,-8,0),(9,-9,0),(10,-10,0)]

12. Change the name of an attribute - attribute_rename

Works similarly to cast but only changes the name of an attribute. The new name is used in the result array, and the source array is not changed.

Signature:

attribute_rename ( array, old_name : attribute_identifier, new_name : attribute_identifier )

Example:

attribute_rename(m3x3,val,foo)

Metadata Operators

AFL provides a number of mechanisms for getting information about the database it is managing. These facilities all interrogate the SciDB internal meta-data, and return an array object which can be subject to further query expressions.

1. List contents of SciDB database - list

The list operator allows users to get a list of operators, types or the names of arrays in the SciDB instance.

Signature:

list ( code : string )

The code : string parameter value is one of the following: 'arrays', 'functions', 'types', 'operators', or 'queries'. If called without any parameters, list will return a list of all arrays created in SciDB.

Example:

list() 
list('arrays')
[("Reads"),("Load_and_Append"),("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]
list('functions')

enumerate the list of functions available for use in expressions.

FunctionsDescription
<=, <>, <, =, >=, >, -, /, *, %, +, and, or, notarithmetic and logic operators
abs, acos, asin, atan, ceil, cos, exp, floor, log, log10, pow, sin, sqrt, tanmathematical functions
double, float, int16, int32, int64, int8, uint8, uint16, uint32, uint64, stringtype conversion
nowget current time (local time zone)
first, last, length, lowdimension interrogation
nodeidget current nodeid, useful in determining array distribution
regex, strchar, strftime, strlen, substrstring manipulation functions
list('operators')

enumerates the list of operators available on this server. User-defined operators appear along with built-in operators in this list.

list('types')

enumerates the list of known types. User-defined types registered with the server appear along with standard built-in types.

[("bool"),("char"),("double"),("float"),("indicator"),("int16"),("int32"),("int64"),("int8"),("string"),("uint16"),("uint32"),("uint64"),("uint8")]
list('queries')
name,query_id
"create array A <a:int64> [x=0:2,3,0]",3338107939
"remove(B)",3338227944
"select * from A a1 join A a2 on a1.a=a2.a",3338257950
"drop array simplearray",3352140153
"list('arrays')",3352150179
"list('arrays')",3352150185
"list('queries')",3371118518

Note that list returns a SciDB array. So it can be used as the input to other data management operators like filter, project.

The following query returns all of the functions in SciDB named 'regex', together with their function signatures:

filter(list ('functions'), name='regex');

The following query uses the SciDB regex() function to filter the functions to find only those that take as input a pair of strings:

filter (list ('functions'), regex(profile, '(.*)(string,string)(.*)'));

And finally, the following query filters out only functions that take a pair of strings as input and compute some boolean result (like "regex"). Then it sorts that result by the function's name (the first attribute produced by list('functions')), and finally projects out only the function's signature:

project (
    sort (
        filter (
            list ('functions'),
            regex(profile, '(.*)bool(.*)(string,string)(.*)')
        ),
    1),
    profile
);

2. Getting dimensions of an array - dimensions

Get a list of array dimensions.

Signature:

dimensions ( array_identifier )

The argument to the dimensions operator is the name of the array. It returns an array with the following attributes: name, start, length, chunk-interval, chunk-overlap, low-boundary, high-boundary.

Examples:

dimensions (Example_Three)

[("X",0,10,10,0,0,10),("Y",0,10,10,0,0,10),("Z",0,100,100,0,0,10)]

For a regular array defined using a create array statement with fixed dimensions the start and length are set to the values specified in the CREATE ARRAY statement.

The low and high boundaries represent the current occupancy of cells within the array. Low-boundary represents the lowest dimension value occupied by any element in the array and high-boundary represents the highest dimension value occupied by an array element.

NOTE:

  • The length field for an unbounded dimension is set to the maximal signed 64-bit integer, namely, 9223372036854775807.
  • The low and high boundaries for a newly created array that does not yet have any data is also set to Undefined values recorded as the following special constants (MAXINT64 - 1) and -(MAXINT64-1) respectively. These constants are interpreted as "Undefined" for array dimension values into which no data has been loaded.

3. Getting attributes of an array - attributes

Get a list of array attributes. Returns an array with the following attributes: name, type, and boolean flag. The boolean flag represents nullability of the attribute and is True if the attribute can be set to NULL.

Signature:

attributes ( array_identifier )

The argument to the attributes operator is the name of the array.

Examples:

 CREATE ARRAY Attr_Example_One < A: int32, B: string > [ I=0:9,10,0 ]
 attributes ( Attr_Example_One )
 [("A","int32",false),("B","string",false)]

CREATE ARRAY Attr_Example_Two < A: string, B: double, C: int32 > [ I=0:9,10,0 ]
attributes ( Attr_Example_Two )
[("A","string",false),("B","double",false),("C","int32",false)]

Attributes are only available for stored arrays. Result arrays returned by an AFL command may not be used as input to this operator.


4. Get public format of SciDB array - show

This operator combines the information about the named array's attributes and dimensions, formatting it's result as it would appear in a create array ... statement.

Signature:

show ( array_identifier )

Example:

CREATE ARRAY Attr_Example_Two < A: string, B: double, C: int32 > [ I=0:9,10,0 ];
show ( Attr_Example_Two )
[("Attr_Example_Two 8<A:string NOT NULL,B:double NOT NULL,C:int32 NOT NULL> [I=0:9,10,0]")]

5. Getting versions of an array - versions

Get a list of array versions for an array which has been declared as updatable.

Signature:

versions ( array_identifier )

Example:

versions(updarr)

[(1,"12/31/10 11:03:10"),(2,"12/31/10 12:22:17"),(3,"12/31/10 12:22:19")]

The output of the versions command is a list of versions, each of which has a version ID and a datestamp which is the date and time of creation of that version.


DML

In this section we review the syntax and the operators that make up the AFL Data Manipulation Language.

Overview of AFL DML

The DML portion of SciDB AFL adopts a functional programming language syntax over an underlying algebra of array operators. The general form of a DML expression in AFL DML will look like this.

  operator_B ( operator_A ( array_name ), constants and expressions )

AFL query expressions should be read from the inside out. In this example, the first operator executed is the inner-most--operator_A--and it is applied to an array stored in the SciDB database named 'array_name'. Most SciDB AFL operators take (at least one) array inputs, and produce (exactly one) array output. In this example, the output of 'operator_A ( array_name )' is another logical array that becomes the input to the outermost operator, operator_B. The output of B is then returned to the client issuing the query. Whenever an operator signature uses an array input - it implies that the input to the operator can be a stored array (using an array identifier) or the output of another operator. In addition to arrays, some operators in AFL can take other arguments - constants, expressions, or identifiers.

NOTE: DML operators leave the database unchanged. Only STORE and LOAD operators may add stored data to new or existing arrays.


1. Selecting all the data in the array - scan

Scan returns the contents of the array whose name is passed in as an argument.

Signature:

scan( array_identifier )

Example:

scan(my_array)

The 'scan' operator expects an array identifier as its only argument, and returns an array as a result of operation. This operator is only used to get a complete listing of the contents of an array stored in the SciDB database.


2. Selecting subset of data by dimension range - subarray

Subarray selects a block of cells from an input array. The result is an array whose shape is defined by the 'bounding box' specified by the subarray.

Signature:

subarray ( array, low-boundary-coordinate-1: value, ..., low-boundary-coordinate-N: value, high-boundary-coordinate-1: value, ..., high-boundary-coordinate-N: value )

The first argument to the subarray operator is the array from which the block is to be sampled. The second parameter section specifies which block is to be extracted. There are as many pairs of array index values in the parameter block as the input : array has dimensions. For each dimension, the operator requires the minimum index value for each dimension first, followed by the maximum index value for each dimension. In other words, if the input : array has two dimensions, then the subarray operator requires four start and end values.

Example:

Suppose 'my_array' is a 2-dimensional array with two integer attributes and dimensions x and y. The following table illustrates the contents of this array. This result would be produced by the scan('my_array') query introduced in the previous section.

[
[(1,3),(5,7),(9,11),(13,15),(17,19)],
[(21,23),(25,27),(29,31),(33,35),(37,39)],
[(41,43),(45,47),(49,51),(53,55),(57,59)],
[(61,63),(65,67),(69,71),(73,75),(77,79)],
[(81,83),(85,87),(89,91),(93,95),(97,99)]
]

The first input to the subarray operator is the source array. The rest of the inputs are the indexes of the subarray window. First, the lower bound indexes for all dimensions in the input array, are given, then the upper ones. The result of subarray is an array of smaller or equal size, whose dimensions always start at 0 and span only the length of the subarray region.

We can use the subarray operator as shown here:

  subarray( my_array, 1, 1, 2, 2)

This query will return a block of the input array where the initial (top-left) cell is [ 1,1 ] and the final (bottom-right) cell is at [ 2, 2 ].

y\x01
0(11,11.0)(12,12.0)
1(21,21.0)(22,22.0)

This query will return a block of the input array where the initial (top-left) cell is [ 4,4 ] and the final (bottom-right) cell is at [ 6,6 ].

 subarray ( my_array, 4, 4, 6, 6)
y\x012
0(44,44.0)(45,45.0)(46,46.0)
1(54,54.0)(55,55.0)(56,56.0)
2(64,64.0)(65,65.0)(66,66.0)
 subarray ( my_array, 4, 3, 6, 7)
y\x01234
0(43,43.0)(44,44.0)(45,45.0)(46,46.0)(47,47.0)
1(53,53.0)(54,54.0)(55,55.0)(56,56.0)(57,57.0)
2(63,63.0)(64,64.0)(65,65.0)(66,66.0)(67,67.0)

3. Selecting a subset of data within a specified region - between

The between operator accepts an input array and a set of coordinates specifying a region within the array. The output is an array of the same shape as input, where all cells outside of the given region are marked empty.

Signature:

between ( array, low-boundary-coordinate-1: value, ..., low-boundary-coordinate-N: value, high-boundary-coordinate-1: value, ..., high-boundary-coordinate-N: value )

Note that between is very similar to subarray, except that between does not change the shape of the array, and does not change the starting coordinates.

For example, consider the following four-by-four array of integers:

scan(fbf)
[[(0),(1)],[(4),(5)]];[[(2),(3)],[(6),(7)]];[[(8),(9)],[(12),(13)]];[[(10),(11)],[(14),(15)]]

Between can be used to set all cells outside of the (1,1)->(2,2) region to empty:

between(fbf,1,1,2,2)
[[(),()],[(),(5,true)]];[[(),()],[(6,true),()]];[[(),(9,true)],[(),()]];[[(10,true),()],[(),()]]

Note that a new boolean attribute referred to as the emptyTag is appended to each element of the between result. This boolean attribute is set to true for all elements that are included in the between result and false for the rest of the elements of the array.

If the input to subarray or between is an array with non-integer dimension, the range must be specified in terms of the dimension value. The following example shows the use of subarray with float dimensions.

create immutable empty array B <b:float>[a(float)=3,3,0];

-- Use redimension_store to add data --
...
-- Now, use unpack to inspect the coordinate values -- 
unpack(B,row);
[(1.05,5.08,true),(2.05,6.08,true),(3.05,7.08,true)]

subarray(B,float(1),float(3));
[(5.08,true),(6.08,true)]

4. Selecting a subset of data by boolean expression - filter

The filter operator 'filters' out data in the array based on an expression over the attribute and dimension values. Suppose we want to filter out all the data from my_array, where the value of attribute A doesn't equal 75. We would use the following query:

 filter( my_array, A <> 75)

The filter operator marks all cells in the input which do not satisfy the predicate expression to 'empty'.

Signature:

filter ( array, condition : expression )

where 'condition' is a boolean expression.

The following examples illustrate both the functionality of the filter operator, and also how operators in AFL can be combined into query expressions.

  filter ( subarray ( my_array, 4, 4, 6, 6), A > 46 )
y\x012
0(0,0.0,false)(0,0.0,false)(0,0.0,false)
1(54,54.0,true)(55,55.0,true)(56,56.0,true)
2(64,64.0,true)(65,65.0,true)(66,66.0,true)
filter ( subarray ( my_array, 4, 4, 6, 6), A > 46 and A < 56)

y\x012
0(0,0.0,false)(0,0.0,false)(0,0.0,false)
1(54,54.0,true)(55,55.0,true)(56,56.0,true)
2(0,0.0,false)(0,0.0,false)(0,0.0,false)

The output of the filter is an array with the EMPTY attribute.


5. Select from an array based on a pattern - lookup

Lookup maps elements from the second array using the attributes of the first array as coordinates into the second array. The result array has the same shape as first array and the same attributes as second array.

Signature:

lookup ( pattern : array, source: array )

Example: suppose the array foo has coordinates 0 to 7 and the values 1 to 8:

scan(foo)
[(1),(2),(3),(4),(5),(6),(7),(8)]

Suppose the array bar ranges from -10 to 10 and has values from 40 to 60:

scan(bar)
[(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60)]

Then lookup(foo,bar) returns 8 values, corresponding to bar positions from 1 to 8 :

lookup(foo,bar)
[(51),(52),(53),(54),(55),(56),(57),(58)]

6. Projecting (selecting) array attributes - project

Project the input array on the specified attributes, in the specified order. Attributes that are not specified are excluded from the output.

Signature:

project ( array, attribute1 : attribute_identifier [, attribute-2 : attribute_identifier ...] )

Example:

 project( subarray ( my_array, 4, 4, 6, 6 ), A)
y\x012
0(44)(45)(46)
1(54)(55)(56)
2(64)(65)(66)
 project( apply ( subarray ( my_array, 4, 4, 6, 6 ), C, A + 20 ), C)
y\x012
0(64.0)(65.0)(66.0)
1(74.0)(75.0)(76.0)
2(84.0)(85.0)(86.0)

7. Joining two arrays - join

Join combines the attributes of two input arrays at matching dimension values.

Signature:

join ( left : array, right : array )

The following example illustrates a join between two different subarray operators applied to the same underlying array. Note that join combines attributes from cells at dimension addresses on its immediate input arrays; not in the original inputs:

 subarray ( my_array AS M1, 4, 4, 6, 6 )
y\x012
0(44,44.0)(45,45.0)(46,46.0)
1(54,54.0)(55,55.0)(56,56.0)
2(64,64.0)(65,65.0)(66,66.0)
 subarray ( my_array AS M2, 6, 6, 8, 8 )
y\x012
0(66,66.0)(67,67.0)(68,68.0)
1(76,76.0)(77,77.0)(78,78.0)
2(86,86.0)(87,87.0)(88,88.0)
 join ( subarray ( my_array AS M1, 4, 4, 6, 6 ), subarray ( my_array AS M2, 6, 6, 8, 8 ) )
y\x012
0(44,44.0,66,66.0)(45,45.0,67,67.0)(46,46.0,68,68.0)
1(54,54.0,76,76.0)(55,55.0,77,77.0)(56,56.0,78,78.0)
2(64,64.0,86,86.0)(65,65.0,87,87.0)(66,66.0,88,88.0)

The join result has the same dimension names as the first input. The left and right arrays must have the same shape. If a cell in either the left or right array is empty, the corresponding cell in the result is also empty.


8. Computing new values - apply

Use the apply operator to compute new values from attributes and indexes of input arrays. The value(s) computed in the apply are appended to the attributes in the input array.

Signature:

apply( array, new_attribute : attribute_identifier, expr_to_apply : expression )

For example, the query below will compute a new attribute C, whose value is given by the expression A + x. The resulting array will have all the old attributes, as well as the new attribute C.

 apply ( subarray ( my_array, 4, 4, 6, 6 ), C, A + 20 )
y\x012
0(44,44.0,64.0)(45,45.0,65.0)(46,46.0,66.0)
1(54,54.0,74.0)(55,55.0,75.0)(56,56.0,76.0)
2(64,64.0,84.0)(65,65.0,85.0)(66,66.0,86.0)

Queries that need more than one apply must nest them.

 apply ( apply ( subarray ( my_array, 4, 4, 6, 6 ), C, A + 20 ), D, A + X)
y\x012
0(44,44.0,64.0,44.0)(45,45.0,65.0,45.0)(46,46.0,66.0,46.0)
1(54,54.0,74.0,55.0)(55,55.0,75.0,56.0)(56,56.0,76.0,57.0)
2(64,64.0,84.0,66.0)(65,65.0,85.0,67.0)(66,66.0,86.0,68.0)

9. Statistical summaries - aggregate

This operator takes a list of dimension names as a group-by list, groups the array by the specified dimension and computes an aggregate expression for each group. The result of aggregate is an array with the shape corresponding to the specified group-by dimensions and two attributes: one containing the result of the aggregate function and the second - a counter of elements evaluated by the aggregate function.

An aggregate expression is arbitrary expression over the attributes or dimensions of the input array and also two extra attributes: aggregate-attribute and count attributes.

Signature:

aggregate( array, dimension1 : dimension_identifier [... , dimensionN : dimension_identifier ], result-name : attribute_identifier, initial-value : value, aggregation : expression )

Example:

store ( build ( <val:double> [x=0:2,3,0,y=0:2,3,0], x*3+y), m3x3);
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

aggregate(m3x3, c, 0, c+1);
[(9,9)]

aggregate(m3x3, x, sum, 0, sum+val);
[(3,3),(12,3),(21,3)]

This aggregate operator does not support distributed execution.

Several of the common aggregate operations have built in operators, which support distributed execution. We discuss these below.

9.1. Arithmetic mean - avg

Calculate the average value of the specified attribute in the array. The result is an array with single element containing average value. If the input array contains only one attribute, then attribute name can be omitted.

Signature:

avg( array [ , attribute-name : attribute_identifier [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier ]]] )

The first argument is the array to be aggregated over. The second argument is the name of the attribute to use. Additional dimension arguments are optional. If present, the list of dimensions specified in the avg operator is used to perform a group-by average. The result is organized as an array with the remaining dimensions from the source array, after grouping has been performed based on the group-by dimensions.

Example: Support my_array is a 2 D array with attributes a: int32 and b: double, and dimensions x and y. The following command returns the average value of a over all elements of array. The second example below calculates the average value of b after grouping over all values of the dimension x. The second example below results in a one-dimensional array whose dimension is the remaining dimension y from the input.

avg(m3x3, val)
[(4)]

avg(m3x3, val, y)
[(3),(4),(5)]

9.2 Count of cells - count

Counts non-empty cells of the input array. When dimensions are provided they are used to do a group-by and a count per resulting group is returned.

Signature:

count( array [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier]] )

Example:

count(m3x3)
[(9)]

count(m3x3,x)
[(3),(3),(3)]

9.3 Maximum value - max

Calculate maximum of the specified attribute in the array. Result is an array with single element containing maximum of specified attribute.

If input array contains only one attribute, then attribute name can be omitted. Again, if dimensions are provided, they are used to produce groups and the maximum of each group is returned.

Signature:

max( array [ , attribute-name : attribute_identifier [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier ]]] )

Example:

max(m3x3)
[(8)]

max(m3x3,val,x)
[(2),(5),(8)]

9.4 Minimum value - min

Calculates the minimum value of the specified attribute in the array. Result is an array with single element containing minimum of specified attribute. If input array contains only one attribute, then attribute name can be omitted. If an attribute list is specified, the result is an array with the remaining dimensions from the source array, and the minimum is evaluated over groups, where each group is the set of all elements matching the group by dimension(s).

Signature:

min( array [ , attribute-name : attribute_identifier [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier ]]] )

Example:

min(m3x3)
[(0)]

min(m3x3,val,y)
[(0),(1),(2)]

9.5 Sum of attribute values - sum

Calculate sum of the specified attribute in the array. Result is an array with single element containing sum of specified attribute. If input array contains only one attribute, then attribute name can be omitted. Again, similar group by semantics as with the other aggregates, if a dimension list is present.

Signature:

sum( array [ , attribute-name : attribute_identifier [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier ]]] )

Example:

sum(m3x3)
[(36)]

sum(m3x3,val,x)
[(3),(12),(21)]

9.6 Variance of attribute values - var

Calculate the variance of the specified attribute in the array. Result is an array with single element containing the variance of specified attribute. If input array contains only one attribute, then attribute name can be omitted. Again, similar group by semantics as with the other aggregates is supported, if a dimension list is present.

Signature:

var( array [ , attribute-name : attribute_identifier [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier ]]] )

Example:

var(m3x3)
[(7.5)]

var(m3x3,val,x)
[(1),(1),(1)]

Variance accepts numeric attributes only. The result is the sample variance of the attribute values.

9.6 Standard deviation of attribute values - stdev

Calculate the standard deviation of the specified attribute in the array. Result is an array with single element containing the standard deviation of specified attribute. If input array contains only one attribute, then attribute name can be omitted. Again, similar group by semantics as with the other aggregates is supported, if a dimension list is present.

Signature:

stdev( array [ , attribute-name : attribute_identifier [, dimension1 : dimension_identifier [, dimension2 : dimension_identifier ]]] )

Example:

stdev(m3x3)
[(2.73861)]

stdev(m3x3,val,y)
[(3),(3),(3)]

Standard deviation accepts numeric attributes only. The result is the population standard deviation of the attribute values.


10. Reduce array by aggregating fixed subarrays - regrid

Partition (divide) the cells in the input array into blocks, and for each block, apply a specific aggregate operation over the value(s) of some attribute in each block. The result array contains two attributes: one containing result of aggregate function and second - counter of elements evaluated by aggregate function. Aggregate expression can access attributes or dimensions of the input array and also two extra attributes: aggregate-attribute and count.

Regrid does not allow grids to span array chunks and requires the chunk size to be a multiple of the grid size in each dimension.

Signature:

regrid( array, grid-interval-1 : integer, ..., grid-interval-N : integer, aggregate-attribute-name : attribute_identifier, initial-value : value, aggregate : expression )

Example:

store ( build ( <val:double> [x=0:3,4,0,y=0:3,4,0], x*4+y), m4x4);
[
[(0),(1),(2),(3)],
[(4),(5),(6),(7)],
[(8),(9),(10),(11)],
[(12),(13),(14),(15)]
]

regrid(m4x4, 2,2, sum, 0, sum+val);
[
[(10,4),(18,4)],
[(42,4),(50,4)]
]

11. Full cross product join of two arrays -- cross

Calculates the full cross product join of two arrays, say A (m-dimensional array) and B (n-dimensional array) such that the result is an m+n dimensional array in which each cell is computed as the concatenation of the attribute lists from corresponding cells in arrays A and B. For example, consider a 2-dimensional array A with dimensions i, j, and a 1-dimensional array B with dimension k. The cell at coordinate position {i, j, k} of the output is computed as the concatenation of cells {i, j} of A with cell at coordinate {k} of B.

Signature:

cross( left : array, right : array )

Example:

create array a<a1: double>[i=1:3,3,0, j=1:3,3,0];
create array b<b1: double>[k=1:2,2,0];

store(build(a, i+j), a);
[[(2),(3),(4)],
[(3),(4),(5)],
[(4),(5),(6)]]

store(build(b, 1.0/k), b);
[(1),(0.5)]

cross(a, b);
[[
[(2,1),(2,0.5)],
[(3,1),(3,0.5)],
[(4,1),(4,0.5)]
],
[
[(3,1),(3,0.5)],
[(4,1),(4,0.5)],
[(5,1),(5,0.5)]
],
[
[(4,1),(4,0.5)],
[(5,1),(5,0.5)],
[(6,1),(6,0.5)]
]]

The output is a 3-dimensional array with the following schema:

<a1:double NOT NULL, b1:double NOT NULL>[i=1:3,3,0, j=1:3,3,0, k=1:2,2,0]

12. Cross product join with equality predicates -- cross_join

Calculates the cross product join of two arrays, say A (m-dimensional array) and B (n-dimensional array) with equality predicates applied to pairs of dimensions, one from each input. Predicates can only be computed along dimension pairs that are aligned in their type, size, and chunking.

Assume p such predicates in the cross_join, then the result is an m+n-p dimensional array in which each cell is computed by concatenating the as follows.

For example, consider a 2-dimensional array A with dimensions i, j, and a 1-dimensional array B with dimension k. The result of a cross_join(A, B, j, k) results in a 2-dimensional array with coordinates {i, j} in which the cell at coordinate position {i, j} of the output is computed as the concatenation of cells {i, j} of A with cell at coordinate {k=j} of B.

Signature:

cross_join( left : array, right : array, left_dim_1: dimension_identifier, right_dim_1 : dimension_identifier [, left_dim_2: dimension_identifier, right_dim_2: dimension_identifier,... ] )

Example:

create array a<a1: double>[i=1:3,3,0, j=1:3,3,0];
create array b<b1: double>[k=1:3,3,0];

store(build(a, i+j), a);
[[(2),(3),(4)],
[(3),(4),(5)],
[(4),(5),(6)]]

store(build(b, 1.0/k), b);
[(1),(0.5), (0.333)]

cross_join(a, b, j, k);
[
[(2,1),(3,0.5),(4,0.333333)],
[(3,1),(4,0.5),(5,0.333333)],
[(4,1),(5,0.5),(6,0.333333)]
]

The result has schema:

<a:double NOT NULL, b:double NOT NULL>[i=1:3,3,0, j=1:3,3,0]

13. Random sample of cells/chunks from an array -- sample and bernoulli

SciDB supports two methods to sample a subset of elements from an array. They are the sample and bernoulli operators. In both, the user provides a sampling probability, which is the probability of inclusion.

In bernoulli, we evaluate each cell by generating a random number and seeing if it lies in the range (0, probability). If it does, the cell is included.

In sample, the selection is chunk-based. We evaluate each chunk for inclusion based on the probability.

Users often want samples with repeatable results. To achieve this, we use a seed value for the underlying random number generator.

Signature:

sample( array, probability : double [, seed : int64 ] )

bernoulli( array, probability : double [, seed : int64 ] )

Example:

bernoulli(m4x4,0.3)
[
[(0),(1),(0),(3)],
[(0),(0),(0),(0)],
[(8),(0),(0),(0)],
[(12),(13),(0),(0)]
]

14. Substitute NULL values -- substitute

Substitute null values in the input array, using their missing reason code, if using the "?code" format, as index in the second (substitution) array.

Signature:

substitute( array, mask: array )

For example, array foo contains two nulls (which is equivalent to ?0) and one ?1:

scan(foo)
[(1),(2),(null),(4),(5),(null),(7),(?1)]

We can use substitute to replace ?0 with 20 and ?1 with 30, for instance. We build an array where 0 maps to 20 and 1 maps to 30:

build(<val:double> [x=0:1,2,0], iif( x = 0, 20, 30))"
[(20),(30)]

We can use this build expression as part of substitute directly - without having to store an array:

substitute(foo,build(<val:double> [x=0:1,2,0], iif( x = 0, 20, 30)))"
[(1),(2),(20),(4),(5),(20),(7),(30)]

Both inputs to substitute must be single-attribute arrays.


Changing array shapes

List of AFL Operators that change the shape or dimensionality of an array.

In the case of some shape changing operators that require a fully qualified array schema, SciDB supports two ways to specify this schema. The first is by creating a new array with the desired output schema and referring to it - useful for storing the changed shape. The second method is to specify an inline anonymous schema. The inline schema is a string that specifies the type, attributes, and dimensions of the result array.

Example:

show(src);
[("a1<a:double NOT NULL> [i=1:3,3,0,j=1:3,3,0]")]

reshape(src, foo)
-- or --
reshape(src, <a: double>[k=1:9,9,0]);

1. Adding and Deleting dimensions - adddim and deldim

From time to time it is necessary to change the structure of an array by adding or deleting a dimension.

The adddim operator prepends the existing dimensions for an array with a new dimension, whose name is supplied as the second argument to the operator.

The new dimension will have start = 0, length = 1, chunkInterval = 1, overlap = 0.

Deldim deletes the left-most dimension from the array. Deleted dimension must have size 1.

Signature:

adddim( array, new_dimension : dimension_name )

deldim( array )

Examples:

adddim(matrix, timestamp)
deldim(subarray(matrix, 1, 100, 1, 200))

2. Concatenate two arrays - concat

Concatenate two arrays. Array should have the name number of dimensions. Concatenation is performed by the left-most dimension. All other dimensions of the input arrays must match. The left-most dimension of both arrays must have a fixed size (not unbounded) and same chunking schema. Both inputs must have the same attributes.

Signature:

concat( left: array, right : array )

Example:

store(build(<val:int64>[x=0:3,100,0,y=0:4,100,0],x*5+y),four_by_five)
[
[(0),(1),(2),(3),(4)],
[(5),(6),(7),(8),(9)],
[(10),(11),(12),(13),(14)],
[(15),(16),(17),(18),(19)]
]

store(build(<val:int64>[x=0:1,100,0,y=0:4,100,0],20+x*2+y),two_by_five)
[
[(20),(21),(22),(23),(24)],
[(22),(23),(24),(25),(26)]
]

concat(four_by_five,two_by_five)
[
[(0),(1),(2),(3),(4)],
[(5),(6),(7),(8),(9)],
[(10),(11),(12),(13),(14)],
[(15),(16),(17),(18),(19)],
[(20),(21),(22),(23),(24)],
[(22),(23),(24),(25),(26)]
]

Note that concat can be combined with transpose, adddim, deldim, subarray, slice and other ops for more complex transformations.


3. Merge attributes - merge

Merge two arrays. The two input arrays should have the same shape as one another: that is, the same attribute list and dimensions.

Merge combines elements from the input array the following way: for each cell in the two inputs, if the cell of first (left) array is non-empty, then the attributes from that cell are selected and placed in the output. If the cell in the first array is marked as empty, then the attributes of the corresponding cell in the second array are taken. If the cell is empty in both input arrays, the output's cell is set to empty.

Signature:

merge( left : array, right : array )

Example:

scan(vec1)
[[(1,true)],[()],[(3,true)],[()],[(5,true)]]

scan(vec2)
[[()],[(5,true)],[()],[(7,true)],[()]]

merge(vec1,vec2)
[[(1,true)],[(5,true)],[(3,true)],[(7,true)],[(5,true)]]

4. Reshape an array - reshape

Change the shape of an array to the rank and dimensions of another array.

Signature:

reshape( array, target : array_identifier | anonymous_schema )

NOTES:

  1. The target parameter should have the same number of attributes as the input array.
  2. The arrays must have fixed size dimensions. That is, reshape for unbounded arrays is not supported.
  3. Size of the input and target array should be the same. That is, both arrays should have the same number of cells. For example, it is possible to reshape 2x2x2 array to 4x2, but not 3x3 to 2x2.

Example:

show(m3x3)
[("m3x3<val:double NOT NULL> [x=0:2,3,0,y=0:2,3,0]")]

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

reshape(m3x3, <val:double>[i=0:8,2,0])
[(0),(1),(2),(3),(4),(5),(6),(7),(8)]

5. Re-partition an array - repart

Change partitioning (chunking) of the array. Target array must have the same attributes and dimensions, but chunk size may be different. Repart returns an array whose attributes are taken from the input array, with the dimensions of the target.

Signature:

repart( array, target : array_identifier | anonymous_schema )

Example:

show(m3x3)
[("m3x3<val:double NOT NULL> [x=0:2,3,0,y=0:2,3,0]")]

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

repart(m3x3, <val:double>[x=0:2,1,0,y=0:2,1,0])
[[(0)]];[[(1)]];[[(2)]];
[[(3)]];[[(4)]];[[(5)]];
[[(6)]];[[(7)]];[[(8)]]

6. Transform attributes to dimensions - redimension and redimension_store

Transform an input array by changing its dimensionality -- specifically convert some of its attributes to dimensions.

Signature:

redimension( array, target : array_identifier | anonymous_schema )

redimension_store( array, target : array_identifier )

redimension does not create or update array storage or metadata, and returns the transformed array result. redimension also only works when transforming int64 attributes into dimensions.

redimension_store updates the target_array storage and created additional mapping arrays if necessary.

For both variants of redimension, the target array must be empty-able.

When converting attributes to dimensions, duplicate cells that correspond to the same target position are dropped. Only one element is selected, and the choice is non deterministic.

The input and target arrays must have compatible schemas, and both commands determine the list of transformations (attribute to dimension) by matching names in the attribute and dimension lists of the two arrays. It is possible to omit certain attributes and dimensions from the target array.

For example, a one-dimensional array can be transformed into a higher dimensional array by changing some attributes into dimensions:

show(sd)
[("sd<num:double NOT NULL,label:string NOT NULL> [x=0:8,8,0]")]

scan(sd)
[(1,"a"),(2,"z"),(0,"b"),(3,"x"),(2,"f"),(6,"o"),(1.5,"e"),(2.7,"l")]

create empty array target <num:double>[x=0:8,8,0, label(string)=8,8,0]

redimension_store(sd, target)
[
[(1,true),(),(),(),(),(),(),()],
[(),(),(),(),(),(),(),(2,true)],
[(),(0,true),(),(),(),(),(),()],
[(),(),(),(),(),(),(3,true),()],
[(),(),(),(2,true),(),(),(),()],
[(),(),(),(),(),(6,true),(),()],
[(),(),(1.5,true),(),(),(),(),()],
[(),(),(),(),(2.7,true),(),(),()]
]

In the example above, redimension_store is used to create a two-dimensional array target from a one-dimensional array sd. redimension_store assigns the values of attribute label from sd to the new dimension in target. The resulting array target uses the non-integer dimension label and the mapping array target@1:label is created as part of the operation.


7. Expanding an element into a grid - xgrid

Use this operator to scale an input array by repeating cells of the original array specified number of times. This operator can be considered as the inverse of the regrid operator. While regrid splits an input array into "hypercubes" and calculates an aggregate function over them, xgrid produces a hypercube from the single element of the input array by repeating that element.

Signature:

xgrid( array, scale-1, ..., scale-N )

Example:

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

xgrid(m3x3,2,2)"
[
[(0),(0),(1),(1),(2),(2)],
[(0),(0),(1),(1),(2),(2)],
[(3),(3),(4),(4),(5),(5)],
[(3),(3),(4),(4),(5),(5)],
[(6),(6),(7),(7),(8),(8)],
[(6),(6),(7),(7),(8),(8)]
]

8. Select a subplane of the array - slice

Get a slice of the array. Result is a slice of the input array corresponding to the given coordinate value(s). Number of dimensions of the result array is equal to the number of dimensions of input array minus number of specified dimension, and the coordinate value should be a valid dimension value of the input array.

Signature:

slice( array, dimension1 : dimension_identifier, coordinate1 : value [ ,dimensionN : dimension_identifier, coordinate-value-N : value ] )

Example:

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

slice(m3x3,x,1)
[(3),(4),(5)]

9. Sort array by attribute - sort

Sort a one-dimensional array by one or more attributes. The sort attributes are specified using a 1-based attribute number.

Signature:

sort( array, [-] attribute-number-1 : integer [ , [-] attribute-number-I : integer ] )

Example:

show(sd)
[("sd<num:double NOT NULL,label:string NOT NULL> [x=0:8,8,0]")]

scan(sd)
[(1,"a"),(2,"z"),(0,"b"),(3,"x"),(2,"f"),(6,"o"),(1.5,"e"),(2.7,"l")]

sort(sd,1)
[(0,"b"),(1,"a"),(1.5,"e"),(2,"z"),(2,"f"),(2.7,"l"),(3,"x"),(6,"o")]

Unadorned attributes imply ascending order. Pre-fixing an attribute's position with a '-' symbol tells the sort operator to sort in descending order.

sort(sd,-2)
[(2,"z"),(3,"x"),(6,"o"),(2.7,"l"),(2,"f"),(1.5,"e"),(0,"b"),(1,"a")]

10. Transform into a single dimensional array - unpack

Unpack array into a single-dimensional array creating new attributes to represent source array dimension values. Result array has a single zero-based dimension and arguments combining attributes of the input array. The name for the new single dimension is passed to the operator as the second argument.

Signature:

unpack( array, attribute_name )

Example:

show(m3x3)
[("m3x3<val:double NOT NULL> [x=0:2,3,0,y=0:2,3,0]")]

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

unpack(m3x3,i)
[(0,0,0),(0,1,1),(0,2,2),(1,0,3),(1,1,4),(1,2,5),(2,0,6),(2,1,7),(2,2,8)]

The format for the output of the unpack operator is to begin by enumerating the dimensions of the input, and then to append the attribute values. Chunk size -- ie., the number of elements per chunk is preserved from input to output. The resulting one-dimensional array has chunk dimension equal to the count of cells in each input chunk.

NOTE: unpack only supports arrays whose size is evenly divisible by chunk size in every dimension. You cannot unpack an array A <a:int32> [x=0:100,30,0]; but you can unpack an array A <a:int32> [x=0:100,20,0].


11. Transpose an array - transpose

Transpose an array by inverting the order of its dimensions.

Signature:

transpose( array )

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

transpose(m3x3)
[
[(0),(3),(6)],
[(1),(4),(7)],
[(2),(5),(8)]
]

12. Reverse array elements - reverse

"Invert" or reverse the elements of an array by reversing the values of each dimension.

Signature:

reverse( array )

Assume that dense is a 2-dimensional array with the following definition and contents:

CREATE ARRAY dense <a: int32>[x=0:5,3,0, y=0:5,3,0]

scan(dense)
[
[(66),(65),(64)],
[(56),(55),(54)],
[(46),(45),(44)]
];
[
[(63),(62),(61)],
[(53),(52),(51)],
[(43),(42),(41)]
];
[
[(36),(35),(34)],
[(26),(25),(24)],
[(16),(15),(14)]
];
[
[(33),(32),(31)],
[(23),(22),(21)],
[(13),(12),(11)]
]

reverse returns an array in which each dimension is the reverse of the corresponding dimension of the source array.

reverse(dense)
[
[(11),(12),(13)],
[(21),(22),(23)],
[(31),(32),(33)]
];
[
[(14),(15),(16)],
[(24),(25),(26)],
[(34),(35),(36)]
];
[
[(41),(42),(43)],
[(51),(52),(53)],
[(61),(62),(63)]
];
[
[(44),(45),(46)],
[(54),(55),(56)],
[(64),(65),(66)]
]

Matrix and Vector Operations

Some operators accept arrays as inputs and interpret them as matrices. As such, the term matrix refers to an array that has two dimensions and one numeric attribute. Furthermore, matrix operators the first declared dimension as rows, and the second dimension as columns.

1. Multiply two matrices - multiply

Perform a linear algebraic cross product (multiply) on two matrices. Matrices should have two dimensions and a single double precision attribute.

Signature: multiply( left : array, right : array )

Example:

scan(m3x3)
[
[(0),(1),(2)],
[(3),(4),(5)],
[(6),(7),(8)]
]

multiply(m3x3,m3x3)
[
[(15),(18),(21)],
[(42),(54),(66)],
[(69),(90),(111)]
]

2. Normalize vector elements - norm

Calculate the L2-norm of a vector by dividing each element by the square root of the sum of squares of the elements.

Signature: norm( array, attribute : attribute_identifier )

Example:

scan(v10)
[(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)]

norm(v10)
[(0.0509647),(0.101929),(0.152894),(0.203859),(0.254824),(0.305788),(0.356753),(0.407718),(0.458682),(0.509647)]

Loading and Unloading Plugins

SciDB supports user-defined types, functions, and operators using SciDB's extension architecture. Operators are written using the UDO array API. For details on how to write a user-defined function or type for SciDB, please refer to the "SciDB Plugins" documentation for this release. Plugins are built as shared libraries and are loaded from the plugin directory specified in the config.ini file.

load_library( 'librational.so' );

Loading a plugin registers the library in the SciDB system catalog. Plugins registered with the SciDB instance will be loaded at system start time. Plugins will be re-loaded from the .so file on system restart.

To unload a library,

unload_library( 'librational' )

This command will unregister the library in the system catalog. The library will not be loaded on next restart, but will remain available in the sever until then.


Debug and Internal commands

1. Cancel a running query - cancel

Cancel a currently running query by query id.

Signature: cancel( query_id : integer )

Example:

cancel(12345)

The query id can be obtained from the scidb log or via the list() command. SciDB maintains query context information for each completed and in-progress query in the server. If the user issues a "ctrl-C" or abort from the client, the query is cancelled and its context is removed from the server.

In-progress queries are aborted by issuing a cancel() command with the query ID. Context information is also removed by this command.

This query context can also be removed for completed queries by issuing a cancel() command.

Query context information maintained by the server is not persistent and is lost on server restart.


2. Output a string - echo

Simply accepts a string and returns a single-element array containing the string.

Signature:

echo( string )

Example:

echo('Hello world')
[("Hello world")]

3. Explain plan - explain_logical and explain_physical

The operators explain_logical and explain_physical can be used to emit a human-readable plan string for a particular query without running the query itself. SciDB first constructs a logical plan, optimizes it and then translates it into a physical plan.

Signature:

explain_logical( query: string, language: string ) explain_physical ( query: string, language: string )

where language corresponds to the language of the query string and is either afl or aql.

Example:

explain_physical('store(join(a, b), c)','afl')

Note that the output of the explain operators is a single-element array that contains the plan string. The plan text format is intended for SciDB developers and experts and is subject to change in the future.


4. Get the signature of an operator - help

Accepts an operator name and returns an array containing a human-readable signature for that operator.

Signature: help ( operator_name: string )

Example:

help('multiply')
[("Operator: multiply
Usage: multiply(<input>, <input>)")]

5. Scatter/Gather an array - sg

The SG operator is used internally to redistribute or repartition the elements of an array between SciDB processing nodes. SG by itself does not perform any computation and its use in user queries can lead to inefficiencies and is discouraged.

sg redistributes chunks of the input array across cluster nodes according to a specified policy. sg is not a user-visible operator, but it is heavily used by the system internally. It is a physical operator for data movement within the system and users are not expected to invoke it directly. For example, the scidb optimizer inserts an sg operator after load to redistribute array data using round robin distribution. The optimizer may also insert sg operators to assist with aggregations, operators that don't preserve shape or operators that don't preserve distribution.

The signature of 'sg' is: sg ( array, [ partitioning schema [, node id [, result array name [, store_flag [, offset_x, offset_y, ... ] ] ] ] ])

The following table describes the arguments to sg:

arg name Description Comment
input_array array to be redistributed. May also be the output of another operator like scan, filter, etc.
partitioning schema integer constant specifying partitioning schema. Possible values are shown in the table below.
node number a node number to send all chunks. Used only with psLocalNode. Ignored for other partitioning schemas.
result array a name of result array. If omitted, the result of sg is nameless and temporary with query duration.
store_flag a boolean indicator used to specify whether sg should store the result array. By default, true if result array name is given. If store_flag is true, the result of sg is persisted as a new array with given name. If store_flag is false, the result is a temporary named array that can be referenced elsewhere in the query.
offset a distribution offset vector. For internal use.

The partition schema is one of the following types.

partition schema Description Comment
0 psReplication replicate array on every node of cluster
1 psRoundRobin distribute chunks by round robin
2 psLocalNode send every local chunk to specified node
3 psByRow distribute every chunk to node according to chunk coordinate of dimension 0
4 psByCol distribute every chunk to node according to chunk coordinate of dimension 1

Examples: Query to distribute array by round robin onto every node:

sg(seq_info, 1, -1, seq_info_sg)

Query to gather whole array on node 0 in memory array (will not be stored on disk):

sg(seq_info_sg, 2, 0)

6. Reduce the distribution of a replicated array - reduce_distro

Given an array that is distributed with psReplication - quickly reduce the distribution to a different schema by masking certain chunks. For internal use.

Signature:

reduce_distro( array, partitioning_schema: integer )

Example:

reduce_distro(\"target@1:label\",2)
[("a"),("b"),("e"),("f"),("l"),("o"),("x"),("z")]

Attachments