Contents:
- AFL: SciDB Array Functional Language
-
DDL
- 1. CREATE ARRAY command
- 2. Loading Data into a SciDB Array - load()
- 3. Unload array data to a file - save()
- 4. Building array data - build()
- 5. Build a sparse array - build_sparse()
- 6. Store data in an array - store()
- 7. Input - input()
- 8. Updates - store()
- 9. Remove an array - remove()
- 10. Rename an array - rename()
- 11. Change the attribute names, or dimension names of an array
- Metadata Operators
-
DML
- Overview of AFL DML
- 1. Selecting all the data in the array - scan()
- 2. Selecting subset of data by array index - subsample()
- 3. Selecting a subset of data within a specified region - between()
- 4. Selecting a subset of data by attribute value - filter()
- 5. Select from an array based on a pattern - lookup()
- 6. Projecting (selecting) array attributes - project()
- 7. Joining two arrays - join()
- 8. Computing new values - apply()
- 9. Statistical summaries - aggregate()
- 10. Using the regrid operator - regrid()
-
Changing array shapes
- 1. Adding and Deleting dimensions - adddim() and deldim()
- 2. Concatenate two arrays - concat()
- 3. Merge attributes - merge()
- 4. Reshape an array - reshape()
- 5. Re-partition an array - repart ()
- 6. Expanding an element into a grid - xgrid()
- 7. Remove dimension(s) - slice()
- 8. Sort array by attribute - sort()
- 9. Representing arrays as single dimensional array - unpack()
- 10. Transpose an array - transpose()
- 11. Reverse array elements - reverse()
- Debug and Internal commands
- Matrix and Vector Operations
AFL: SciDB Array Functional Language
SciDB provides two programming interfaces:
- AQL, an array query language that is similar to SQL.
- AFL, a functional language that provides the same capabilities as AQL but with a functional syntax. AFL also provides support for querying array metadata and additional array operators.
This page describes the features of AFL supported in R0.75.
While AFL may be revised over time as the implementation evolves and matures, the AQL interface will remain fixed with a commitment to backward compatibility.
AFL divides up into three sub-languages:
- Data Definition Language (DDL). The DDL allows users to create, load, remove and modify arrays and their structure in a SciDB instance.
- Data Manipulation Language (DML). DML allows users to query the contents of arrays in a SciDB instance.
- Metadata operators. These operators provide information about the state of the database, listing arrays, types, operators and so on.
This document also covers AFL operators used for debugging and Linear Algebra operators.
DDL
1. CREATE ARRAY command
The CREATE ARRAY statement is used to create new arrays. The statement specifies the shape (the number of dimensions and their sizes) of the array, and the attributes which appear in each cell. A CREATE ARRAY statement has the following form.
CREATE ARRAY array_name array_type < attribute_name : type_name [, ...] > [ dimension_name = start|* : end|*, chunk_interval, chunk_overlap [, ...] ]
SciDB arrays have three elements; a name, an attribute list, and a specification of the array's structure (dimensional organization).
| Array Feature | Optional | Description |
| array_name | No | The string name of the array, which is an AFL identifier. The array name uniquely identifies the array in the SciDB instance. That is, you can't use the same array name twice in one SciDB instance. |
| array_type | Yes | 'updatable' or 'empty'. Default type is not updatable nor empty. |
| attribute_name | Name of an attribute, an AFL identifier. No two attributes in the same array can share a name. | |
| type_name | No | Type identifier. One of the types supported by SciDB: int, float, double, bool, string, char |
| dimension_name | No | Each dimension has a name that is also an AFL identifier. Just like attributes, each dimension must be named, and dimension names cannot be repeated in the same array |
| start (integer) | No | The starting position of the coordinate in this dimension, or * if unknown |
| end(integer) | No | The ending position of the coordinate in this dimension, or * if unknown |
| chunk_interval | No | The length of data chunk in this dimension. The chunking parameters will be changed in V 1.0. |
| chunk_overlap | No | The length of overlap in this dimension. The chunking parameters will be changed in V 1.0. |
Basic examples
SciDB takes the logical array definition--the rank, or number of dimensions in the array, and the length or size of each dimension--and breaks the data in it into chunks. The size of the chunks must evenly divide the logical array into disjoint segments. Also - several of our 0.5 operators only work when applied to arrays that are decomposed into chunks of the same size. Chunks can overlap--an important feature of SciDB--although none of our operators depend on it.
The CREATE ARRAY takes the logical array definition--a list of dimensions of the array, and for each dimension, the length or size of that dimension, its minimum and maximum values. SciDB breaks the data in it into chunks. The size of the chunks must evenly divide the logical array into equal sized segments. Also - several of our AFL operators only work when applied to arrays that are decomposed into chunks of the same size. Chunks can overlap--an important feature of SciDB--although none of our operators depend on it.
What follows is several examples of how CREATE ARRAY statements can be used.
Example One:
For example, to create a 2D array with dimension indices
- having dimension names 'x' and 'y',
- each ranging between 0 and 49 (having length 50)
- and stored within 25 chunks each of size 10x10.
- with no chunk overlap
and a single integer attribute named 'Val'
CREATE ARRAY Example_One < Val: int32 > [ X=0:49,10,0, Y=0:49,10,0]
Example Two:
Or a 3D array with dimension indices
- having names X, Y and Z ,
- with values for the X and Y dimensions ranging between 0 and 9 (length 10) and the Z dimension ranging over 0:99 (length = 100)
- stored entirely within a single chunk of size 10x10x100.
- with no overlap.
and two attributes A and B being a string and a double precision value:
CREATE ARRAY Example_Two < A: string, B: double > [ X=0:9,10,0, Y=0:9,10,0, Z=0:99,100,0 ]
Example Three:
The following is a 2D array with dimensions
- having names I and J i with I ranging over 0 to 99 and J over 0 to 199, and
- broken into chunks of size 10x20
The array's cells contain a pair of attribute values, A and B, having types int32 and float.
CREATE ARRAY Example_Three <A:int32, B:float> [I=0:99,10,0, J=0:199,20,0]
Example Four:
And finally, the following CREATE ARRAY Example_Four (which becomes the basis for much of the descriptive material that follows).
The following is a 2D array with dimensions
- having names I and J i with I ranging over 0 to 9 and J over 0 to 9, and
- fitting into a single chunk of size 10x10
Our example has two attributes, A and B, having types int32 and double.
CREATE ARRAY Example_Four < A: int32, B: double > [ I=0:9,10,0, J=0:9,10,0 ]
This last array is used later in this document to describe how other operators work.
Null and default attributes
Each attribute specification in the CREATE ARRAY also contains two qualifiers -- the NULL or NOT NULL flag, and the default value. An attribute with the NULL flag set is allowed to take on the special value null. Null typically indicates that an attribute may be missing in the array. Null values have a special representation in the external load file format. This is discussed in detail in the load section of this document.
Each attribute specification can also have an optional default value, which is the value that the attribute takes on if no value is specified for that attribute.
The syntax for specifying a default value in the CREATE ARRAY statement is shown below.
CREATE ARRAY A <x: double NOT NULL DEFAULT 1.0, err : double> [i=0:99,10,0, j=0:99,10,0]
We support SQL semantics for NULL values in SciDB. Hence, any comparison or use of null values in expressions will evaluate to NULL, except for the ‘IS NULL’ or 'IS NOT NULL' predicates.
Array data model
Arrays in SciDB are immutable, i.e., once data is stored or loaded into an array, it can never be removed unless the array is removed. SciDB does support the addition of new data into special types of arrays, provided that they are appropriately declared at the time of creation using the CREATE ARRAY command. CREATE ARRAY accepts a special array type qualifier to denote the following two special types of arrays -- updatable arrays, empty (or sparse) arrays, and arrays with unbounded dimensions. The create array declaration is described below for these three special types of arrays.
Updatable arrays
The example below shows how to create an updatable array.
CREATE UPDATABLE ARRAY U1 <a: double > [x=0:5,3,0, y=0:5,3,0]
Updateable array is one whose cells can be updated by a subsequent update statement. Array updates do not modify data that is already present in the array. Instead, new data written to already occupied cells of an array cause a new version to be created. Each update statement creates a new version of the array to be created. For more information on how to access different versions of an array please see AFL input operator. In comparison, a regular array (one that has not be created as an updateable array) does not support the input operator.
SciDB does not allow the array type to be altered after array creation.
Empty or Sparse arrays
If the EMPTY flag is specified during array creation, SciDB will assume that the data being loaded is very sparse and use a sparse physical representation for storing array chunks, automatically switching to the dense physical representation for a given chunk when the density of that chunk exceeds a specified threshold. See the SciDB command line options for this threshold configuration and its default value.
In contrast, when the EMPTY flag is not specified, SciDB loads data into chunks with a dense physical representation. This is true even if the external file format uses a sparse representation.
CREATE EMPTY ARRAY A_10 <x:double null>[i=1:10,5,0]
Unbounded arrays
An array dimension can be created as an unbounded dimension by leaving either the low or the high boundary or both boundaries of the dimension as open using the special character '*'. Examples of unbounded dimensions include I=0:*,10,0 or J=*:*,100,0. In the first example, I has an open high boundary and J has open low and high boundaries.
A regular array does not allow chunks to be loaded into chunk addresses that fall outside the region defined by the create array definition.
With an unbounded array, chunks can be incrementally loaded into the array into new regions of the array and there is no limit on the array dimensions, other than the resources available in the system.
CREATE ARRAY open_array <a:int64>[x=0:*,5,0]
An example of incremental loading is described in the load() section.
Sparse and Dense arrays
SciDB does not use a array type qualifier to distinguish between sparse and dense arrays. A basic array when loaded from a data file that has the sparse load format generates sparse array chunks, whereas one loaded from a dense file format creates chunks that are of the dense format. Sparse and dense are an internal array attribute and do not need to be specified by the user.
2. Loading Data into a SciDB Array - load()
SciDB currently supports one AFL operation for loading data into a SciDB array. Unlike the case with SQL systems the format of the external load data is quite complex, SciDB data load files must organize the data into chunks of the size expected by the target array, and ordered appropriately.
The load() operation has the following signature.
Signature:
load ( array_name : string, data_file : string )
Given an array defined as follows:
CREATE ARRAY Load_Example < Val: int32 > [ X=1:100,25,0, Y=1:100,25,0]
Load files divide the data to be loaded into "chunks". Load file chunks correspond to array chunks as defined by the CREATE ARRAY statement.
Load file data must be organized according to the chunking scheme of the array.
| C11 | C12 | C13 | C14 |
| C21 | C22 | C23 | C24 |
| C31 | C32 | C33 | C34 |
| C41 | C42 | C43 | C44 |
The chunks in the load file need to be represented in the following order:
C 11
C 12
C 13
C 14
C 21
C 22
C 23
C 24
...
C 41
C 42
C 43
C 44
SciDB supports two external file formats intended to support sparse and dense representation of data. In a sparse array the majority of the cells in the array are EMPTY. That is, they contain no attribute data. So the most space efficient way to represent the data is to represent the array as a list of cell index values and the corresponding attribute data.
Sparse data format
The following illustrates sparse array data loading. Consider the Load_Example array above.
A (sparse) load file for this array looks like this. Note that the load file divides the data into chunks--there is a semi-colon between the chunks--and within each chunk the data is organized as a list of comma-separated cells, where each cell includes the dimension values and the attributes of the cell.
The example below shows the sparse array data format for Load_Example.
[[{4,13}(413),{24,18}(2418)]];
[[{16,47}(1647),{17,37}(1737),{23,39}(2339),{23,46}(2346)]];
[[{10,66}(1066),{19,74}(1974),{21,58}(2158),{23,71}(2371)]];
[[{1,91}(191),{5,95}(595),{19,94}(1994),{20,79}(2079),{23,83}(2383)]];
[[{29,12}(2912),{36,5}(3605),{49,16}(4916)]];
[[{33,47}(3347),{35,27}(3527),{37,28}(3728),{42,31}(4231),{44,25}(4425)]];
[[{37,53}(3753),{39,51}(3951),{40,56}(4056),{44,73}(4473),{46,70}(4670)]];
[[{45,76}(4576)]];
[[{56,3}(5603),{58,2}(5802),{60,14}(6014),{69,9}(6909),{73,1}(7301)]];
[[{50,55}(5055),{50,63}(5063),{60,52}(6052)]];
[[{53,95}(5395),{54,83}(5483),{56,88}(5688),{61,89}(6189)]];
[[ ]];
[[{85,52}(8552),{88,67}(8867)]];
[[{76,96}(7696),{82,85}(8285),{85,91}(8591),{86,93}(8693),{90,76}(9076)]]
You would use the following load() syntax to load the file into the array.
load (Example_One, '/tmp/Sparse_Data')
Notes:
- Chunks containing only 'empty' cells can be bracketed with '', or they can be omitted.
- The last chunk in the load file should not have a ';' chunk delimiter.
The following example illustrates how dense arrays are created and loaded.
For example:
CREATE ARRAY Example_Four < A: int32, B: double > [ I=0:9,10,0, J=0:9,10,0 ]
This array is a 2-D array contained within one 10x10 chunk. The load file contents for this array are in a regular file on the file system (say '/tmp/Dense_Data') which looks like this. This file contains the dense load file format.
[ [ (0, 0.0), (1,1.0), ..., (9,9.0)], [ (10, 10.0), (11, 11.0), ..., (19, 19.0)], [ (20, 20.0), (21, 21.0), ..., (29, 29.0)], [ (30, 30.0), (31, 31.0), ..., (39, 39.0)], [ (40, 40.0), (41, 41.0), ..., (49, 49.0)], [ (50, 50.0), (51, 51.0), ..., (59, 59.0)], [ (60, 60.0), (61, 61.0), ..., (69, 69.0)], [ (70, 70.0), (71, 71.0), ..., (79, 79.0)], [ (80, 80.0), (81, 81.0), ..., (89, 89.0)], [ (90, 90.0), (91, 91.0), ..., (99, 99.0)] ]
Dense data format
In the dense representation, the dimension values for each cell are implicit in the representation. Again, the data is divided up into chunks, with a chunk again bracketed by a '' pair. However, within each chunk the attribute data is organized in row / column order (or generalized appropriately to higher dimensions). This 'dense' representation can also handle 'empty' cells, which are indicated by a '()' pair containing no data values.
load (Example_Four, '/tmp/Dense_Data')
A multi-chunk array is created and loaded as shown below.
CREATE ARRAY Two_Dim<a: int32, c: char>[I=0:7,4,0, J=0:7,4,0] load(Two_Dim, '/tmp/2d-mc.txt')
Contents of /tmp/2d-mc.txt:
[ [ (0, 'A'), (1, 'B'), (2, 'C'), (3, 'D')], [ (8, 'I'), (9, 'J'), (10, 'K'), (11, 'L')], [ (16, 'Q'), (17, 'R'), (18, 'S'), (19, 'T')], [ (24, 'Y'), (25, 'Z'), (26, 'A'), (27, 'B')] ]; [ [ (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H')], [ (12, 'M'), (13, 'N'), (14, 'O'), (15, 'P')], [ (20, 'U'), (21, 'V'), (22, 'W'), (23, 'X')], [ (28, 'C'), (29, 'D'), (30, 'E'), (31, 'F')] ]; [ [ (32, 'G'), (33, 'H'), (34, 'I'), (35, 'J')], [ (40, 'O'), (41, 'P'), (42, 'Q'), (43, 'R')], [ (48, 'W'), (49, 'X'), (50, 'Y'), (51, 'Z')], [ (56, 'E'), (57, 'F'), (58, 'G'), (59, 'H')] ]; [ [ (36, 'K'), (37, 'L'), (38, 'M'), (39, 'N')], [ (44, 'S'), (45, 'T'), (46, 'U'), (47, 'V')], [ (52, 'A'), (53, 'B'), (54, 'C'), (55, 'D')], [ (60, 'I'), (61, 'J'), (62, 'K'), (63, 'L')] ]
For a singe dimensional array stored in multiple chunks, the load file format looks like this:
[ (36, 'K'), (37, 'L'), (38, 'M'), (39, 'N')]; [ (44, 'S'), (45, 'T'), (46, 'U'), (47, 'V')]; [ (52, 'A'), (53, 'B'), (54, 'C'), (55, 'D')]; [ (60, 'I'), (61, 'J'), (62, 'K'), (63, 'L')]
NOTES:
- The pathname of the load file is relative to the working directory of the scidb process on that server. So, it is best to use absolute pathnames in the load command, for example "/tmp/Dense_Data".
Null attribute format
Both the dense and sparse array formats can include null attributes for missing data.
The following formats are used to represent null values for attributes in chunk data loaded into an array. An attribute may be missing at the time of data loads into the array, for example, if a faulty instrument occasionally fails to report a reading for an attribute, that attribute will be loaded into a SciDB array as a null. If an erroneous instrument reports readings that are out of valid bounds for an attribute, that may be represented as a null within a SciDB array.
In addition, missing values can be tagged with a missing reason code. This code helps a SciDB application distinguish between different types of null values -- for example, assigning a unique code to the following types of errors -- ‘instrument error’, ‘cloud cover’, or ‘not enough data for result’.
The examples below show how to represent missing data in the load file. ? or null represent null values, and ?2 represents null value with a reason code of 2.
[[ ( 10, 4.5, "My String", 'C'), (10, 5.1, ?1, 'D'), (?2, 5.1, "Another String", ?) ... or [[ ( 10, 4.5, "My String", 'C'), (10, 5.1, ?1, 'D'), (?2, 5.1, "Another String", null) ...
Loading unbounded arrays
With an unbounded array, chunks can be incrementally loaded into the array into new regions of the array and there is no limit on the array dimensions, other than the resources available in the system.
CREATE ARRAY open_array <a:int64>[x=0:*,5,0]
An example of incremental loading is described in the load() section.
Load data in three increments into open_array described in the create statement above.
load(open_array, "/tmp/region_1.txt") load(open_array, "/tmp/region_2.txt") load(open_array, "/tmp/region_3.txt")
The load files are shown below and described in more detail in the sections on input and load. Initial load from /tmp/region_1.txt
[(0), (1), (2), (3), (4)];[(5), (6), (7), (8), (9)]
Additional loads: /tmp/region_2.txt
{15}[(10), (11), (12), (13), (14)]
/tmp/region_3.txt
{10}[(15), (16), (17), (18), (19)]
3. Unload array data to a file - save()
Use the save() operator to unload an ASCII formatted copy of the data to an external file on the file system.
The signature of the save() operator is as follows. save( array-name : string, file path : string )
Array name refers to the array to be unloaded and saved. File path is the path to a file to use -- if a relative path name is used this is assumed to be relative to the working directory of the SciDB server.
In a cluster configuration, save() operates as follows. For each node in a cluster save creates a file with the same file name containing only the chunks stored on that node. The entire array data is obtained by combining saved files from all nodes.
4. Building array data - build()
Build a dense single-attribute array, assigning values to its attributes using the specified expression(s). These expression can use any of the array's index coordinates.
Signature:
build( array-name : string, exp : expression )
The build() operator's first argument is the name of an array to be used as the template for the result. build() proceeds through the array, cell by cell, using the value of the exp : expression to compute the value of each cell. The expression can use any of the array's dimension values. On a cluster configuration, build works in parallel on all nodes.
The first example creates an identity matrix. The attribute of the matrix is the result of the build expression -- exp : expression. First example below initializes a matrix with values 1, 2, 3, etc.
Examples:
CREATE ARRAY Build_Example_INT < A: int32 > [ X=0:4,5,0, Y=0:4,5,0 ] build (Build_Example_INT, iif(X=Y, 1, 0))
| y\x | 0 | 1 | 2 | 3 | 4| |
| 0 | (1) | (0) | (0) | (0) | (0) |
| 1 | (0) | (1) | (0) | (0) | (0) |
| 2 | (0) | (0) | (1) | (0) | (0) |
| 3 | (0) | (0) | (0) | (1) | (0) |
| 4 | (0) | (0) | (0) | (0) | (1) |
The following build example
build (Build_Example_INT, 5*X + Y)
| y\x | 0 | 1 | 2 | 3 | 4 |
| 0 | (1) | (2) | (3) | (4) | (5) |
| 1 | (6) | (7) | (8) | (9) | (10) |
| 2 | (11) | (12) | (13) | (14) | (15) |
| 3 | (16) | (17) | (18) | (19) | (20) |
| 4 | (21) | (22) | (23) | (24) | (25) |
NOTES:
- The build() operator does not populate the array passed to it in the first argument. Rather, it uses the structure of that array to build it's result. To store the result of the build() operator you will need to use the store() operator (see below).
5. Build a sparse array - build_sparse()
Construct a single-attribute sparse array, assigning values to its elements selected by predicate using specified expression. This expression can access any of the array coordinates. The following example initializes a sparse diagonal matrix:
create array xyz <a1: double> [x=0:2,3,0, y=0:2,3,0]" build_sparse(xyz, 1.0*x+100.0*y, x=y)
6. Store data in an array - store()
The store() operator takes the result of an AFL expression, and creates a new array using the AFL query's results to populate it.
Signature:
store( input-array : aql_expression, array-name : string )
The first argument to the store() operator is an AFL expression. For example, an AFL query over existing arrays, or else a build() operator. The store() operator takes the results of this first query, and creates a new array named according to the value of the second argument array-name : string. The array built this way takes the same structure as the result of the AFL query.
For example:
CREATE ARRAY Build_Example_INT < A: int32 > [ X=0:4,5,0, Y=0:4,5,0 ] store ( build (Build_Example_INT, 5*X + Y ), New_Build_Array) project ( New_Build_Array, A)
| y\x | 0 | 1 | 2 | 3 | 4 |
| 0 | (1) | (2) | (3) | (4) | (5) |
| 1 | (6) | (7) | (8) | (9) | (10) |
| 2 | (11) | (12) | (13) | (14) | (15) |
| 3 | (16) | (17) | (18) | (19) | (20) |
| 4 | (21) | (22) | (23) | (24) | (25) |
NOTES:
- In the example above, the inner build() query is exactly the same one used in the section describing the build() operator. This is intentional. The result of a build() is not automatically stored in the database. Rather, it is generated as a result, and must be stored.
7. Input - input()
Input is used to add new data incrementally to a previously loaded array. Input accepts a source array and an external file containing additional chunks and returns an array that is the combination of the chunks of the source array and the newly added chunks. Note that input does not scan the chunks of the source array, nor does it store the resulting array. To persist the new array formed by combining the source array with the new chunks, a user must use the store command. This is an important difference between input and load.
The external file format for input is similar to the file format for load. However, since input is expected to be used to extend a previously loaded array it can optionally specify the chunk offsets for each chunk being added if it contains only a portion of the array being extended.
Input applied to a regular array (one not created as updatable) does not allow previously loaded chunks to be overwritten or updated. This is consistent with SciDB's immutable array data model.
8. Updates - store()
Store can also be used to update chunks of an updatable array. If an array has been declared as an updatable array in the CREATE ARRAY command, store accepts chunks at dimension offsets that contain previously stored data. Such updates cause new array versions to be created. Each AFL update statement causes a new version to be created. When an array is removed, so are all its versions.
The example below shows the contents of file /tmp/dense2.txt used in the AFL examples below.
{0,3}
[
[(140), (150), (160) ],
[(240), (250), (260) ],
[(340), (350), (360) ]
];
{3,0}
[
[ (410), (420), (430) ],
[ (510), (520), (530) ],
[ (610), (620), (630) ]
]
The input file shown above contains 2 chunks at {0,3} and {3,0} that can be added to a previously loaded array as shown in the following AFL command.
store(merge(input(updarr, '../tests/basic/data/dense2.txt'), updarr), updarr)
Array versions are accessible from AFL queries using a timestamp based version specification. By default, the array name refers to the most recent version of the array is used.
scan(updarr) scan(updarr, now()-5)
Enumerate the versions of the updatable array using the versions operator.
project(versions(updarr), version_id)
To access older versions of the array, attach the suffix @timestamp to the array name. Use datetime to create a valid SciDB timestamp from a string representation of the timestamp. SciDB timestamp format is "12/29/10 12:57:24".
The following example shows how to access array versions using timestamps. This query returns the chunks of the array as of the timestamp '02/20/2002'.
scan(updarr@datetime('02/20/2002'))
9. Remove an array - remove()
Removes an array from the SciDB instance.
Signature:
remove( array-name : string )
The only argument to the remove() operator is the name of the array to be removed.
For example:
list ('arrays')
["Load_and_Append"),("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]
remove (Load_and_Append)
Query was executed successfully
list ('arrays')
[("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]
NOTES:
- Deleted array names can be re-used.
- See below for AFL operators to interrogate the SciDB instance about the shape, attributes and contents of an array.
- The remove() operator doesn't currently reclaim the storage manager space.
10. Rename an array - rename()
Rename an existing array in the database.
Signature:
rename( old-array-name : string, new-array-name : string )
The first argument is the name of the array to be renamed, and the second argument is the new array name.
Example:
CREATE ARRAY my_template < A: int32 > [ X=0:4,5,0, Y=0:4,5,0 ]
store ( build (my_template,iif(X=Y,1,0)), rename_example)
list ('arrays')
| ("my_template", "rename_example") |
project (rename_example, A)
| y\x | 0 | 1 | 2 | 3 | 4| |
| 0 | (1) | (0) | (0) | (0) | (0) |
| 1 | (0) | (1) | (0) | (0) | (0) |
| 2 | (0) | (0) | (1) | (0) | (0) |
| 3 | (0) | (0) | (0) | (1) | (0) |
| 4 | (0) | (0) | (0) | (0) | (1) |
rename (rename_example, post_rename)
list ('arrays')
| ("my_template", "post_rename") |
project (post_rename, A)
| y\x | 0 | 1 | 2 | 3 | 4| |
| 0 | (1) | (0) | (0) | (0) | (0) |
| 1 | (0) | (1) | (0) | (0) | (0) |
| 2 | (0) | (0) | (1) | (0) | (0) |
| 3 | (0) | (0) | (0) | (1) | (0) |
| 4 | (0) | (0) | (0) | (0) | (1) |
NOTES:
- In keeping with the precedent of the remove() operator, once a rename is done, the old array name can be re-used.
11. Change the attribute names, or dimension names of an array
The input array and template arrays should have the same numbers and types of attributes and the same numbers and types of dimensions.
This operator allows renaming an array or any of its attributes and dimensions. A single cast invocation can be used to rename multiple items at once -- one or more attribute names and/or one or more dimension names.
Note that the template-array need not have actual data and is only used as a template from which attribute names and dimension names are used. Note that the input array is not modified, rather, a new array is created as a result of the cast operator. To store this newly created array use the store() operator.
Signature: cast ( input-array, template-array )
Example:
create array M4x4 <a:int32, b:int32> [x=0:3,3,0, y=0:3,3,0] create array M_4x4 <x:int32, y:int32> [i=0:3,3,0, j=0:3,3,0] cast(M4x4, M_4x4)
Metadata Operators
AFL provides a number of mechanisms for getting information about the database it is managing.
1. List contents of SciDB database list()
The list() operator allows users to get a list of operators, types or the names of arrays in the SciDB instance.
Signature:
list ( code : string )
The code : string parameter value is one of the following: 'arrays', 'types', 'operators'. If called without any parameters, list will return a list of all arrays, created in SciDB.
Example:
list()
[("Reads"),("Load_and_Append"),("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]
list('arrays')
[("Reads"),("Load_and_Append"),("Load_Two"),("Example_One"),("Example_Four"),("Build_Example_INT")]
list('operators')
[("adddim"),("aggregate"),("apply"),("attribute_rename"),("attributes"),("average2"),("avg"),("between"),("build"),("build_sparse"),("cast"),("concat"),("count"),("count2"),("create_array"),("cross"),("deldim"),("dimensions"),("explain_logical"),("explain_physical"),("filter"),("help"),("input"),("inverse"),("join"),("list"),("load"),("load_library"),("lookup"),("max"),("max2"),("merge"),("min"),("min2"),("multiply"),("norm"),("project"),("regrid"),("remove"),("rename"),("repart"),("reshape"),("reverse"),("save"),("scan"),("sg"),("slice"),("sort"),("sort2"),("store"),("subsample"),("sum"),("sum2"),("transpose"),("unload_library"),("unpack"),("versions"),("xgrid")]
list('types')
[("bool"),("char"),("double"),("float"),("indicator"),("int16"),("int32"),("int64"),("int8"),("string"),("uint16"),("uint32"),("uint64"),("uint8")]
2. Getting dimensions of an array - dimensions()
Get a list of array dimensions.
Signature:
dimensions ( array-name : string )
The argument to the dimensions() operator is the name of the array. It returns an array with the following attributes: name, start, length, chunk-interval, chunk-overlap, low-boundary, high-boundary.
Examples:
dimensions (Example_Three)
[("X",0,10,10,0,0,10),("Y",0,10,10,0,0,10),("Z",0,100,100,0,0,10)]
For a regular array defined using a create array statement with fixed dimensions the start and length are set to the values specified in the CREATE ARRAY statement.
The low and high boundaries represent the current occupancy of cells within the array. Low-boundary represents the lowest dimension value occupied by any element in the array and high-boundary represents the highest dimension value occupied by an array element.
NOTE:
- The length field for an unbounded dimension is set to the maximal signed 64-bit integer, namely, 9223372036854775807.
- The low and high boundaries for a newly created array that does not yet have any data is also set to Undefined values recorded as the following special constants (MAXINT64 - 1) and -(MAXINT64-1) respectively. These constants are interpreted as "Undefined" for array dimension values into which no data has been loaded.
3. Getting attributes of an array - attributes()
Get a list of array attributes. Returns an array with the following attributes: name, type, and boolean flag. The boolean flag represents nullability of the attribute and is True if the attribute can be set to NULL.
Signature:
attributes ( array-name : string )
The argument to the attributes() operator is the name of the array.
Examples:
CREATE ARRAY Attr_Example_One < A: int32, B: string > [ I=0:9,10,0 ]
attributes ( Attr_Example_One )
[("A","int32",false),("B","string",false)]
CREATE ARRAY Attr_Example_Two < A: string, B: double, C: int32 > [ I=0:9,10,0 ]
attributes ( Attr_Example_Two )
[("A","string",false),("B","double",false),("C","int32",false)]
Attributes are only available for stored arrays. Result arrays returned by an AFL command may not be used as input to this operator.
4. Getting versions of an array - versions()
Get a list of array versions for an array which has been declared as updatable.
Signature:
versions ( array-name : string ) Example:
versions(updarr) [(1,"12/31/10 11:03:10"),(2,"12/31/10 12:22:17"),(3,"12/31/10 12:22:19")]
The output of the versions command is a list of versions, each of which has a version ID and a datestamp which is the date and time of creation of that version.
DML
In this section we review the syntax and the operators that make up the AFL Data Manipulation Language.
Overview of AFL DML
The DML portion of SciDB AFL adopts a functional programming language syntax over an underlying algebra of array operators. The general form of a DML expression in AFL DML will look like this.
operator_B ( operator_A ( array_name ), constants and expressions )
AFL query expressions should be read from the inside out. In this example, the first operator executed is the inner-most--operator_A()--and it is applied to an array stored in the SciDB database named 'array_name'. All SciDB AFL operators take (at least one) array inputs, and produce (exactly one) array output. In this example, the output of 'operator_A ( array_name )' is another logical array that becomes the input to the outermost operator, operator_B(). In addition to arrays, some operators in AFL can take as additional arguments a variety of constant values and expressions which are appropriate to their semantics.
NOTE: DML operators leave the input array unchanged. Only STORE and LOAD operators write results into database producing new array.
1. Selecting all the data in the array - scan()
Scan() returns the contents of the array whose name is passed in as an argument.
scan(my_array)
The formal signature of scan is:
scan( array_name : string ) [ as alias ] -> array
The 'scan()' operator expects a scalar string value as its only argument, and returns an array as a result of operation. This operator is only used to get a complete listing of the contents of an array stored in the SciDB database.
2. Selecting subset of data by array index - subsample()
Subsample() selects a block of cells from an input array.
The signature of subsample is:
subsample( input : array [ ,start_0' { ,start_i } ], start_n, [,end_0 { ,end_i } ], end_n)
The first argument to the subsample() operator is the array from which the block is to be sampled. The second parameter section specifies which block is to be extracted. There are as many pairs of array index values in the parameter block as the input : array has dimensions. For each dimension, the operator requires the minimum index value for each dimension first, followed by the maximum index value for each dimension. In other words, if the input : array has two dimensions, then the subsample() operator requires four start and end values.
Example:
Suppose 'my_array' is a 2-dimensional array with attributes A : int and B : float, and indexes x and y (this is our running example). The following table illustrates what the contents of such an array may look like. This result would be produced by the scan('my_array') query introduced in the previous section.
'my_array'
| y\x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 0 | (0,0.0) | (1,1.0) | (2,2.0) | (3,3.0) | (4,4.0) | (5,5.0) | (6,6.0) | (7,7.0) | (8,8.0) | (9,9.0) |
| 1 | (10,10.0) | (11,11.0) | (12,12.0) | (13,13.0) | (14,14.0) | (15,15.0) | (16,16.0) | (17,17.0) | (18,18.0) | (19,19.0) |
| 2 | (20,20.0) | (21,21.0) | (22,22.0) | (23,23.0) | (24,24.0) | (25,25.0) | (26,26.0) | (27,27.0) | (28,28.0) | (29,29.0) |
| 3 | (30,30.0) | (31,31.0) | (32,32.0) | (33,33.0) | (34,34.0) | (35,35.0) | (36,36.0) | (37,37.0) | (38,38.0) | (39,39.0) |
| 4 | (40,40.0) | (41,41.0) | (42,42.0) | (43,43.0) | (44,44.0) | (45,45.0) | (46,46.0) | (47,47.0) | (48,48.0) | (49,49.0) |
| 5 | (50,50.0) | (51,51.0) | (52,52.0) | (53,53.0) | (54,54.0) | (55,55.0) | (56,56.0) | (57,57.0) | (58,58.0) | (59,59.0) |
| 6 | (60,60.0) | (61,61.0) | (62,62.0) | (63,63.0) | (64,64.0) | (65,65.0) | (66,66.0) | (67,67.0) | (68,68.0) | (69,69.0) |
| 7 | (70,70.0) | (71,71.0) | (72,72.0) | (73,73.0) | (74,74.0) | (75,75.0) | (76,76.0) | (77,77.0) | (78,78.0) | (79,79.0) |
| 8 | (80,80.0) | (81,81.0) | (82,82.0) | (83,83.0) | (84,84.0) | (85,85.0) | (86,86.0) | (87,87.0) | (88,88.0) | (89,89.0) |
| 9 | (90,90.0) | (91,91.0) | (92,92.0) | (93,93.0) | (94,94.0) | (95,95.0) | (96,96.0) | (97,97.0) | (98,98.0) | (99,99.0) |
The first input to the subsample() operator is the source array. The rest of the inputs are the indexes of the subsample window. First, the lower bound indexes for all dimensions in the input array, are given, then the upper ones.
We can use the subsample() operator in the following way(s):
subsample( my_array, 1, 1, 2, 2)
This query will return a block of the input array where the initial (top-left) cell is [ 1,1 ] and the final (bottom-right) cell is at [ 2,2 ].
| y\x | 0 | 1 |
| 0 | (11,11.0) | (12,12.0) |
| 1 | (21,21.0) | (22,22.0) |
This query will return a block of the input array where the initial (top-left) cell is [ 4,4 ] and the final (bottom-right) cell is at [ 6,6 ].
subsample ( my_array, 4, 4, 6, 6)
| y\x | 0 | 1 | 2 |
| 0 | (44,44.0) | (45,45.0) | (46,46.0) |
| 1 | (54,54.0) | (55,55.0) | (56,56.0) |
| 2 | (64,64.0) | (65,65.0) | (66,66.0) |
subsample ( my_array, 4, 3, 6, 7)
| y\x | 0 | 1 | 2 | 3 | 4 |
| 0 | (43,43.0) | (44,44.0) | (45,45.0) | (46,46.0) | (47,47.0) |
| 1 | (53,53.0) | (54,54.0) | (55,55.0) | (56,56.0) | (57,57.0) |
| 2 | (63,63.0) | (64,64.0) | (65,65.0) | (66,66.0) | (67,67.0) |
3. Selecting a subset of data within a specified region - between()
The between operator accepts an input array and a set of coordinates specifying a region within the array. The output is an array of the same shape as input, where all cells outside of the given region are marked empty.
between has the following signature: between ( input: array, low-boundary-coordinate-1: integer, ..., low-boundary-coordinate-N: integer, high-boundary-coordinate-1: integer, ..., high-boundary-coordinate-N: integer )
Note that between is very similar to subsample, except that between does not change the shape of the array and preserves the position of the elements.
For example, consider the following four-by-four array of integers:
scan(fbf) [[(0),(1)],[(4),(5)]];[[(2),(3)],[(6),(7)]];[[(8),(9)],[(12),(13)]];[[(10),(11)],[(14),(15)]]
Between can be used to set all cells outside of the (1,1)->(2,2) region to empty:
between(fbf,1,1,2,2) [[(),()],[(),(5,true)]];[[(),()],[(6,true),()]];[[(),(9,true)],[(),()]];[[(10,true),()],[(),()]]
Note that a new boolean attribute referred to as the emptyFlag is appended to each element of the between result. This boolean attribute is set to true for all elements that are included in the between result and false for the rest of the elements of the array. This flag is not to be confused with the CREATE EMPTY ARRAY declared -- the latter represents a different storage format within the SciDB storage engine, whereas the former is a logical property that indicates the presence or absence of array elements in a dense array.
4. Selecting a subset of data by attribute value - filter()
The filter() operator 'filters' out data in the array based on the attribute values and the expressions of AFL. Suppose we want to filter out all the data from my_array, where the value of attribute A doesn't equal 75. We would use the following query:
filter( my_array, A <> 75)
The filter() operator marks all cells in the input which do not satisfy the predicate expression to 'empty'.
The filter() operator does two things. It appends a boolean attribute to the cells of the input array; the value of this new attribute reflects the truth value of the condition expression. Second, if the condition expression is false, the operator sets the cell to empty.
The signature of filter() is:
filter ( input : array, condition : expression )
where the 'condition : expression' is an arbitrary expression of AFL. The expression can be a simple condition like the one we used in the example above, but it also can contain a full sub-query with AFL operators. [see examples below]
The following examples illustrate both the functionality of the filter() operator, and also how operators in AFL can be combined into query expressions.
filter ( subsample ( my_array, 4, 4, 6, 6), A > 46 )
| y\x | 0 | 1 | 2 |
| 0 | (0,0.0,false) | (0,0.0,false) | (0,0.0,false) |
| 1 | (54,54.0,true) | (55,55.0,true) | (56,56.0,true) |
| 2 | (64,64.0,true) | (65,65.0,true) | (66,66.0,true) |
filter ( subsample ( my_array, 4, 4, 6, 6), A > 46 and A < 56)
| y\x | 0 | 1 | 2 |
| 0 | (0,0.0,false) | (0,0.0,false) | (0,0.0,false) |
| 1 | (54,54.0,true) | (55,55.0,true) | (56,56.0,true) |
| 2 | (0,0.0,false) | (0,0.0,false) | (0,0.0,false) |
5. Select from an array based on a pattern - lookup()
Lookup maps elements from the second array using dimensions of the first array as coordinates in the second array. The result array has the same shape as first array and the same attributes as second array.
Signature of lookup is:
lookup ( pattern-array, source-array )
Example:
lookup( pattern, dense )
6. Projecting (selecting) array attributes - project()
Project the input array on the specified attributes. all other attributes will be removed by this operator.
Signature:
project ( input-array : array, attribute-1 : string {, attribute-2 : string } )
Example:
project( subsample ( my_array, 4, 4, 6, 6 ), A)
| y\x | 0 | 1 | 2 |
| 0 | (44) | (45) | (46) |
| 1 | (54) | (55) | (56) |
| 2 | (64) | (65) | (66) |
project( apply ( subsample ( my_array, 4, 4, 6, 6 ), C, A + 20 ), C)
| y\x | 0 | 1 | 2 |
| 0 | (64.0) | (65.0) | (66.0) |
| 1 | (74.0) | (75.0) | (76.0) |
| 2 | (84.0) | (85.0) | (86.0) |
7. Joining two arrays - join()
Sometimes we need to combine two arrays into one, for example in order to add all values of attributes of the two arrays. The join() operator performs this function. Consider two input arrays:
subsample ( my_array AS M1, 4, 4, 6, 6 )
| y\x | 0 | 1 | 2 |
| 0 | (44,44.0) | (45,45.0) | (46,46.0) |
| 1 | (54,54.0) | (55,55.0) | (56,56.0) |
| 2 | (64,64.0) | (65,65.0) | (66,66.0) |
and
subsample ( my_array AS M2, 6, 6, 8, 8 )
| y\x | 0 | 1 | 2 |
| 0 | (66,66.0) | (67,67.0) | (68,68.0) |
| 1 | (76,76.0) | (77,77.0) | (78,78.0) |
| 2 | (86,86.0) | (87,87.0) | (88,88.0) |
The join() operator takes two inputs, and creates an output array where the cells in the output contain the concatenated attributes of each input array, for corresponding cells.
join( first_array, second_array )
The following example illustrates a join() between two different subsample() operators applied to the same underlying array. Note that the join() combines attributes from cells at dimension addresses on its immediate input arrays; not in the original inputs.
join ( subsample ( my_array AS M1, 4, 4, 6, 6 ), subsample ( my_array AS M2, 6, 6, 8, 8 ) )
| y\x | 0 | 1 | 2 |
| 0 | (44,44.0,66,66.0) | (45,45.0,67,67.0) | (46,46.0,68,68.0) |
| 1 | (54,54.0,76,76.0) | (55,55.0,77,77.0) | (56,56.0,78,78.0) |
| 2 | (64,64.0,86,86.0) | (65,65.0,87,87.0) | (66,66.0,88,88.0) |
Signature:
join ( input_one : array, input_two : array )
The schema produced by the join has the same dimension names as the left-arg to the join() operator.
8. Computing new values - apply()
Use the apply operator to compute new values from attributes and indexes of input arrays. The value(s) computed in the apply are appended to the attributes in the input array. Examples include:
Signature:
apply( input : array, new_attribute_name : string, value : expression )
For example:
This query will compute a new attribute C, whose value is given by the expression A + x. The resulting array will have all the old attributes, as well as the new attribute C.
apply ( subsample ( my_array, 4, 4, 6, 6 ), C, A + 20 )
| y\x | 0 | 1 | 2 |
| 0 | (44,44.0,64.0) | (45,45.0,65.0) | (46,46.0,66.0) |
| 1 | (54,54.0,74.0) | (55,55.0,75.0) | (56,56.0,76.0) |
| 2 | (64,64.0,84.0) | (65,65.0,85.0) | (66,66.0,86.0) |
Queries that need more than one apply() must nest them.
apply ( apply ( subsample ( my_array, 4, 4, 6, 6 ), C, A + 20 ), D, A + X)
| y\x | 0 | 1 | 2 |
| 0 | (44,44.0,64.0,44.0) | (45,45.0,65.0,45.0) | (46,46.0,66.0,46.0) |
| 1 | (54,54.0,74.0,55.0) | (55,55.0,75.0,56.0) | (56,56.0,76.0,57.0) |
| 2 | (64,64.0,84.0,66.0) | (65,65.0,85.0,67.0) | (66,66.0,86.0,68.0) |
9. Statistical summaries - aggregate()
This operator takes a list of dimension names as a group-by list, groups the array by the specified dimension and computes an aggregate expression for each group. The result of aggregate is an array with the shape corresponding to the specified group-by dimensions and two attributes: one containing the result of the aggregate function and the second - a counter of elements evaluated by the aggregate function.
An aggregate expression is arbitrary expression over the attributes or dimensions of the input array and also two extra attributes: aggregate-attribute and count attributes.
Signature:
aggregate( input-array : array, dimension-name-1 : string { , dimension-name-N : string }, result-name : string, initial-value : value, aggregate-expression : expression )
Example:
aggregate( my_array, x, count, 0, count + 1)
Several of the more common aggregate operations have their own, dedicated operator.
9.1. Arithmetic mean - avg()
Calculate the average value of the specified attribute in the array. The result is an array with single element containing average value. If the input array contains only one attribute, then attribute name can be omitted.
Signature:
avg ( input-array : array, attribute-name : string [, dimension-name 1 : string [, dimension-name 2 : string]] )
The first argument is the array to be aggregated over. The second argument is the name of the attribute created by the avg() operator. Additional dimension arguments are optional. If present, the list of dimensions specified in the avg() operator is used to perform a group-by operation before applying the average. The result is organized as an array with the remaining dimensions from the source array, after grouping has been performed based on the group-by dimensions.
Example: Support my_array is a 2 D array with attributes a: int32 and b: double, and dimensions x and y. The following command returns the average value of a over all elements of array. The second example below calculates the average value of b after grouping over all values of the dimension x. The second example below results in a one-dimensional array whose dimension is the remaining dimension y from the input.
avg( my_array, a) avg( my_array, b, x)
9.2 Count of cells - count()
Counts non-empty cells of the input array. When dimensions are provided they are used to do a group-by and a count per resulting group is returned.
Signature: count ( input-array : array [, dimension-name 1 : string [, dimension-name 2 : string]] )
Example:
count( my_array ) count( my_array, y )
9.3 Maximum value - max()
Calculate maximum of the specified attribute in the array. Result is an array with single element containing maximum of specified attribute.
If input array contains only one attribute, then attribute name can be omitted. Again, if dimensions are provided, they are used to produce groups and the maximum of each group is returned.
Signature:
max( input-array : array [, attribute-name : string] [, dimension-name 1 : string [, dimension-name 2 : string]] )
Example:
max( my_array, a) max( my_array, b, y)
9.4 Minimum value - min()
Calculates the minimum value of the specified attribute in the array. Result is an array with single element containing minimum of specified attribute. If input array contains only one attribute, then attribute name can be omitted. If an attribute list is specified, the result is an array with the remaining dimensions from the source array, and the minimum is evaluated over groups, where each group is the set of all elements matching the group by dimension(s).
Signature:
min( input-array : array [ , attribute-name : string ] [, dimension-name 1 : string [, dimension-name 2 : string]] )
Example:
min( my_array, a) min( my_array, a, x)
9.5 Sum of attribute values - sum()
Calculate sum of the specified attribute in the array. Result is an array with single element containing sum of specified attribute. If input array contains only one attribute, then attribute name can be omitted. Again, similar group by semantics as with the other aggregates, if a dimension list is present.
Signature:
sum( input-array : array [ , attribute-name : string ] [, dimension-name 1 : string [, dimension-name 2 : string]])
Example:
sum( my_array, a) sum( my_array, a, x)
10. Using the regrid operator - regrid()
Partition (divide) the cells in the input array into blocks, and for each block, apply a specific aggregate operation over the value(s) of some attribute in each block. The result is a new array with as many cells as there are sub-divisions, and a single attribute containing the (newly-named) value of the aggregate.
Split the input array into specified grid cells and calculate an aggregate function for all elements of such a cell. Result array has the same number of dimensions but length of each dimension is divided by correspondent the grid interval. It is required that chunk size is multiple to the splitting interval. So each chunk can be proceeded locally.
Result array contains two attributes: one containing result of aggregate function and second - counter of elements evaluated by aggregate function. Aggregate expression is arbitrary expression accessing attributes or dimensions of the input array and also two extra attributes: aggregate-attribute and count attributes.
Signature: regrid( input-array : array, grid-interval-1 : value, ..., grid-interval-N : value, aggregate-attribute-name : string, initial-value : value, aggregate-expression : expression )
Example:
regrid( testarr, 2, 2, sum, 0, sum + A)
Changing array shapes
List of AFL Operators that change the shape or dimensionality of an array. Note that in the current version of SciDB these operators accept string names of attributes/dimensions/array names. In the future SciDB will support path expressions instead of string representation for these.
1. Adding and Deleting dimensions - adddim() and deldim()
From time to time it is necessary to change the structure of an array by adding or deleting a dimension.
The adddim() operator prepends the existing dimensions for an array with a new dimension, whose name is supplied as the second argument to the operator.
The new dimension will have start = 0, length = 1, chunkInterval = 1, overlap = 0.
Deldim() deletes the left-most dimension from the array. Deleted dimension should have size 1.
Signature:
adddim ( input-array : array, new-dimension-name : string )
deldim ( input-array : array)
Examples:
adddim(matrix, timestamp) deldim(subsample(matrix, 1, 100, 1, 200))
2. Concatenate two arrays - concat()
Concatenate two arrays. Array should have the name number of dimensions. Concatenation is performed by the left-most coordinate. All other dimensions should be the same. And left-most dimension of both arrays should have fixed size and same chunking schema. Also arrays should have the same attributes.
Signature: concat( left-array : array, right-array : array )
Example:
concat( testarr1, testarr2 )
3. Merge attributes - merge()
Merge two or more arrays. All arrays should have the same attributes and dimensions. Merge operator combines elements from the input array in such way: if element of first array is non empty, then it is taken, otherwise correspondent element of the second array is inspected and so on...
Signature: merge ( input-array-1 : array, input-array-2 : array , ... )
Example:
merge( testarr1, testarr2 )
4. Reshape an array - reshape()
Change the shape of an array to the rank and dimensions of another array.
Signature:
reshape ( input-array : array, target-array-name : string )
NOTES:
- The array referred to by the target-array-name : string parameter should have the same number of attributes as the array referred to as input-array : array.
- The arrays must have fixed size dimensions. That is, reshape() for unbounded arrays is not supported.
- Size of the input and target array should be the same. That is, both arrays should have the same number of cells. For example, it is possible to reshape 2x2x2 array to 4x2, but not 3x3 to 2x2.
- Target array should be previously created using "create array" operator.
Example:
reshape(testarr, testarr2)
5. Re-partition an array - repart ()
Change partitioning (chunking) of the array. Target array should have the same number of attributes and dimensions, but chunk size may be different. Target array should be previously created using "create array" operator and is treated as a template from which draw the array shape definition from. Repart returns as array whose attribute definition and content is taken from the first array and dimension definitions are from the second array.
The signature of the repart operator is:
repart ( input-array, target-array )
Example: repart(input, templ)
6. Expanding an element into a grid - xgrid()
The signature of the xgrid operator is:
xgrid ( input-array, scale-1, ..., scale-N ) Use this operator to scale an input array by repeating cells of the original array specified number of times. This operator can be considered as the inverse of the regrid operator. While regrid splits an input array into "hypercubes" and calculates an aggregate function over them, xgrid produces a hypercube from the single element of the input array by repeating that element.
Example:
xgrid(matrix, 2, 2)
7. Remove dimension(s) - slice()
Get a slice of the array. Result is a slice of the input array corresponding to the given coordinate value. Number of dimensions of the result array is equal to the number of dimensions of input array minus number of specified dimension, and the coordinate value should be a valid dimension value of the input array.
Signature: slice ( input-array : array, dimension-name-1 : string, coordinate-value-1 : value { ,dimension-name-N : string, coordinate-value-N : value } )
Example: Assuming that testarr is a 2-dimensional array with dimensions x and y, the following query returns a slice, ie., a 1-dimensional array at x = 2. This is equivalent to a filter with a filter expression of x = 2 and reduced dimensionality compared to the input array.
slice(testarr, x, 2)
8. Sort array by attribute - sort()
Sort one-dimensional array by one or more attributes.
Signature: sort ( input-array : array, [-] attribute-number-1 : integer { , [-] attribute-number-I : integer } )
Example:
sort(testarr, 1, 2)
Unadorned attributes imply ascending order. Pre-fixing an attribute's name with a '-' symbol tells the sort() operator to sort in descending order.
sort(testarr, -1, 2)
9. Representing arrays as single dimensional array - unpack()
Unpack array to single dimensional array creating new attributes to represent source array dimension values. Result array has a single zero-based dimension and arguments combining attributes of the input array. The single dimension of the result array is the second argument of the command.
Signature: unpack ( input-array : array, row )
Example:
With 'my_array' as a square matrix, defined as
CREATE ARRAY my_array <a:int32> [x=0:2,3,0, y=0:2,3,0]
and containing the following data
| y\x | 0 | 1 | 2 |
| 0 | (1) | (2) | (3) |
| 1 | (4) | (10) | (6) |
| 2 | (7) | (8) | (9) |
Then the unpack operator produces the following result.
unpack( my_array, row)
Output: (NOTE: the following CREATE ARRAY statement is meant to illustrate what the result array looks like. You do not need to use this CREATE ARRAY statement to use the unpack() operator.)
CREATE ARRAY output < x: int32, y: int32, a: int32 > [ I=0:8,9,0 ];
| I | (0,0,1) | (0,1,2) | (0,2,3) | (1,0,4) | (1,1,10) | (1,2,6) | (2,0,7) | (2,1,8) | (2,2,9) |
The format for the output of the unpack() operator is to begin by enumerating the dimensions of the input, and then to append the attribute values. Chunk size -- ie., the number of elements per chunk is preserved from input to output. The resulting one-dimensional array has chunk dimension equal to the count of cells in each input chunk.
NOTE:
- The unpack operator explains why arrays cannot have an attribute name and a dimension name that are the same.
- unpack only supports arrays whose size is evenly divisible by chunk size in every dimension. You cannot unpack an array A <a:int32> [x=0:100,30,0]; but you can unpack an array A <a:int32> [x=0:100,20,0].
- The result of unpack is an array with a single chunk. The resulting chunk contains as many cells as there are in the original array.
10. Transpose an array - transpose()
Transpose matrix by inverting the order of its dimensions. This is only supported for an input array with two dimensions. Signature: transpose ( input-array )
transpose(testarr)
11. Reverse array elements - reverse()
"Invert" or reverse the elements of an array by reversing the values of each dimension.
Signature: reverse ( input-array ) Assume that dense is a 2-dimensional array with the following definition and contents.
Array definition of dense is as follows.
CREATE ARRAY dense <a: int32>[x=0:5,3,0, y=0:5,3,0]
scan(dense) [ [(66),(65),(64)], [(56),(55),(54)], [(46),(45),(44)] ]; [ [(63),(62),(61)], [(53),(52),(51)], [(43),(42),(41)] ]; [ [(36),(35),(34)], [(26),(25),(24)], [(16),(15),(14)] ]; [ [(33),(32),(31)], [(23),(22),(21)], [(13),(12),(11)] ]
reverse returns an array in which each dimension is the reverse of the corresponding dimension of the source array.
reverse(dense) [ [(11),(12),(13)], [(21),(22),(23)], [(31),(32),(33)] ]; [ [(14),(15),(16)], [(24),(25),(26)], [(34),(35),(36)] ]; [ [(41),(42),(43)], [(51),(52),(53)], [(61),(62),(63)] ]; [ [(44),(45),(46)], [(54),(55),(56)], [(64),(65),(66)] ]
Debug and Internal commands
1. Explain plan - explain_logical() and explain_physical()
The operators explain_logical and explain_physical can be used to emit a human-readable plan string for a particular query without running the query itself. SciDB first constructs a logical plan, optimizes it and then translates it into a physical plan.
The signature for explain_logical and explain_physical are:
explain_logical( query: string, language: string) explain_physical ( query: string, language: string)
where language corresponds to the language of the query string and is either afl or aql.
Example:
explain_physical('store(join(a, b), c)','afl')
Note that the output of the explain operators is a single-element array that contains the plan string. The plan text format is intended for SciDB developers and experts and is subject to change in the future.
2. Scatter/Gather? an array - sg()
The SG operator is used internally to redistribute or repartition the elements of an array between SciDB processing nodes. SG by itself does not perform any computation and its use in user queries can lead to inefficiencies and is discouraged.
sg redistributes chunks of the input array across cluster nodes according to a specified policy. sg is not a user-visible operator, but it is heavily used by the system internally. It is a physical operator for data movement within the system and users are not expected to invoke it directly. For example, the scidb optimizer inserts an sg operator after load() to redistribute array data using round robin distribution. The optimizer may also insert sg operators to assist with aggregations, operators that don't preserve shape or operators that don't preserve distribution.
The signature of 'sg' is: sg ( input-array, [ partitioning schema [, node id [, result array name [, store_flag [, offset_x, offset_y, ... ] ] ] ] ])
The following table describes the arguments to sg().
| arg name | Description | Comment |
| input-array | array to be redistributed. | May also be the output of another operator like scan, filter, etc. |
| partitioning schema | integer constant specifying partitioning schema. | Possible values are shown in the table below. |
| node number | a node number to send all chunks. | Used only with psLocalNode. Ignored for other partitioning schemas. |
| result array | a name of result array. | If omitted, the result of sg is nameless and temporary with query duration. |
| store_flag | a boolean indicator used to specify whether sg should store the result array. | By default, true if result array name is given. If store_flag is true, the result of sg is persisted as a new array with given name. If store_flag is false, the result is a temporary named array that can be referenced elsewhere in the query. |
| offset | a distribution offset vector. | For internal use. |
The partition schema is one of the following types.
| partition schema | Description | Comment |
| 0 | psReplication | replicate array on every node of cluster |
| 1 | psRoundRobin | distribute chunks by round robin |
| 2 | psLocalNode | send every local chunk to specified node |
| 3 | psByRow | distribute every chunk to node according to chunk coordinate of dimension 0 |
| 4 | psByCol | distribute every chunk to node according to chunk coordinate of dimension 1 |
Examples: Query to distribute array by round robin onto every node:
sg(seq_info, 1, -1, seq_info_sg)
Query to gather whole array on node 0 in memory array (will not be stored on disk):
sg(seq_info_sg, 2, 0)
Matrix and Vector Operations
1. Multiply two matrices - multiply ()
Multiply two matrices. Matrices should have two dimensions and a single attribute of one of the integer or floating point data types.
The signature of multiply is: multiply ( left-matrix, right-matrix )
Example:
multiply(matrix1, matrix2)
2. Normalize vector elements - norm ()
Calculate the L2-norm of a vector by dividing each element by the square root of the sum of squares of the elements.
Signature: norm ( input-array, attribute-name )
Example:
scan(v10) [(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)] norm(v10) [(0.0509647),(0.101929),(0.152894),(0.203859),(0.254824),(0.305788),(0.356753),(0.407718),(0.458682),(0.509647)]