wiki:Docs/Release_11.06/ArrayQueryLanguage_11.06

AQL: Array Query Language Reference

AQL, the array query language, is a high-level declarative language for working with SciDB arrays. It is modeled after the popular SQL language for relational databases, but uses a different data model and query set compared with standard relational databases.

The AQL language includes two classes of queries:

  • DDL -- Queries to create array definitions and load data
  • DML -- Queries to access and operate on array data

The DDL commands in AQL and AFL are very similar, however AQL uses a declarative SQL-style invocation.

This document provides a summary of AQL's DDL queries along with a complete review of AQL's DML. Please refer to the AFL document for a detailed discussion of the array data model and data load formats.

As of this release, AQL represents a partial subset of SciDB capabilities available today whereas the AFL interface provides the entire set of operators available in SciDB since AFL covers both user and developer operations.

Both AQL and AFL statements are handled by the SciDB query compiler which translates and optimizes incoming statements into execution plan. The plan is expressed in terms of physical "operators" many of which have direct analogs in the AFL language.

DDL

1. CREATE ARRAY command

AQL and AFL use the same command syntax for the CREATE ARRAY command.

Refer to the AFL document or wiki page for a detailed review of the array data model and array types, and the CREATE ARRAY signature and parameters.

We will use the following CREATE ARRAY example throughout the rest of this document.

CREATE ARRAY A <x: double NOT NULL, err: double> [i=0:99,10,0, j=0:99,10,0];

2. Loading data

Loading data into arrays is done using the load command.

The syntax for LOAD is:

LOAD array_name : string FROM file_path : string;

Example:

LOAD A FROM 'A.txt';

The data file formats supported by SciDB are documented in the AFL document. The file path 'A.txt' is assumed to be a path name relative to the coordinator's working directory.

See the section on load() in the AFL document. AQL only supports coordinator load (corresponding to nodeid=0 in the AFL load() command).

3. Deleting arrays

Dropping arrays is done with the drop command.

DROP ARRAY array_name;

For example to drop array A we run the following command.

DROP ARRAY A;

The array and all it's versions are dropped. Any associated mapping arrays created for non-integer dimensions are also dropped. Most AQL commands do not create or accept arrays with non-integer dimensions in this release.

Note that in this release, we do not reclaim storage from deleted arrays.

DML Query Syntax

1. SELECT attributes FROM

As in SQL, AQL SELECT is used to select one or more attributes from an array.

Syntax: SELECT "expr_list | *" [INTO name] FROM "array_name | array_result"

For example:

SELECT * FROM A;

This is a basic query the selects all attributes from the whole array A. SELECT lists can also reference individual attributes and dimensions, as well as constants and expressions. Expressions are discussed later in this document.

SELECT a1, a2 FROM A, B;

This is a SELECT query that selects two attributes from an array result of 'A, B'. We will discuss the "," operator, used for array JOINs and their syntax in a later section.

SELECT a1 INTO C FROM A;

This query selects the attribute a1 from the array A and stores the result in array C. If the array C does not exist, an array with the appropriate schema is created for it. This command updates an existing array if one exists.

If an array exists in SciDB, but its schema is not identical to that of the result of the FROM, a redimension(), or repart() is applied to the result before storing it into the destination array. Any additional mapping arrays required for non-integer dimensions of C are automatically created.

In place of a table A or a join result, SELECT can also be used to select data from an AFL operator result. For a full list of AFL operators, please refer to the "SciDB Array Functional Language Reference" document. Shown below is such an example with the "show" operator which return an array containing the schema of A.

Example:

SELECT * FROM show(A);

2. WHERE Clause

A WHERE clause is used to filter cells from the input array.

Syntax: SELECT "expr_list | *" [INTO name] FROM "array_name | array_result" [ WHERE expr ]

For example:

SELECT * FROM A WHERE err < 0.025;

The above query selects all cells in the array whose error is less than 0.025. The shape of the array is preserved by the WHERE clause. The user will see a result with the same shape as A, except the cells that don’t match the condition of the WHERE clause will be empty.

Note that the where clause can be used to check if a particular attribute is NULL. This is done with the 'IS NULL' and 'IS NOT NULL' construct.

SELECT a FROM A WHERE a IS NOT null;

We support SQL semantics for null values in SciDB. Hence, any comparison or use of null values in expressions will evaluate to NULL, except for the ‘is null’ or 'is not null' predicates. For a description of NULL values in SciDB please refer to the section on "CREATE ARRAY" in the AFL document. The WHERE clause refers to the inputs of a SciDB array.

In contrast, EMPTY elements of an array are not evaluated by the WHERE clause.

3. AQL Expressions

AQL expressions in the SELECT list or the WHERE clause are standard expressions over the attributes and dimensions of the array. They can also refer to special built-in functions supported by SciDB. This list is documented in the AFL operators reference (See "Array Functional Language Reference") and shown here in the following example.

SELECT name, high('vector'), low('vector'), dim.high, dim.low FROM dimensions(vector) AS dim;

4. Natural JOIN

A join combines two or more arrays typically as a pre-processing step for subsequent operations. The following example illustrates SciDB JOIN.

Syntax:

SELECT "expr_list | *" [INTO name] FROM joinexpr [ WHERE expr ]

joinexpr := namedExpr ( ("," namedExpr) )*

Consider a second array B, that has the same shape as array A declared earlier:

CREATE ARRAY B <y: double, err2: double> [i=0:99,10,0, j=0:99,10,0];

The following query returns an array by joining arrays A and B.

SELECT * FROM A, B;

This join is a natural join of array A with B. This is only possible with two arrays with matching shape. The arrays must have the same number of dimensions, matching dimension start and endpoints, and same chunk size. But dimension names need not match.

The dimensions of the resulting array are the same dimensions as the inputs but with generated names. The resulting array also has combined attributes of all inputs.

Each cell in the resulting array in this example has 4 attributes, x, err from A and y, err2 from B. Natural join is useful when you need to merge two arrays with the same shape in order to perform further processing on their attributes.

The following example joins two arrays in order to compute the sum of two attributes for each element.

SELECT A.x + B.y FROM A,B;

5. JOIN ON

AQL JOIN ON is used to join two compatible arrays on specified dimensions or attributes.

The JOIN ON returns an array that is the cross product join of its inputs. If the JOIN ON predicates include dimensions, an afl cross join is used. Only JOIN ON equality predicates are supported. Predicates must use compatible dimensions from the input arrays -- size, types and chunking.

Each cell in the output array contains all elements from the join inputs from corresponding dimension values.

We currently support either dimension-dimension or attribute-attribute JOINs but not attribute-dimension JOINs. Furthermore, the join predicate list must contain only dimensions or only attributes - not both.

5.1. Dimension-dimension join

Here two or more arrays can be joined on dimensions. This is translated by AQL into AFL cross-joins. JOIN ON dimension predicates are compiled into the dimension pairs list for cross-join. Any aliasing required is performance by the AQL-to-AFL compiler.

Example:

set lang aql;
DROP array v1;
DROP array k;
CREATE array v1 <a:double, b:int64> [x=1:3,3,0, y=1:3,3,0];
CREATE empty ARRAY k <c: string> [x=1:3,3,0];


set lang afl;
store(join(build(<a:double> [x=1:3,3,0,y=1:3,3,0], x+y), build(<b:int64> [x=1:3,3,0,y=1:3,3,0], x*y)), v1);
[[(2,1),(3,2),(4,3)],[(3,2),(4,4),(5,6)],[(4,3),(5,6),(6,9)]]

store(filter(build(<c: string> [x=1:3,3,0], 'addr_'+string(x)+string(x)), 1), k);
[("addr_11",true),("addr_22",true),("addr_33",true)]
set lang AQL;
DROP array res;
SELECT * INTO res FROM v1 JOIN k ON v1.x = k.x;

[[(2,1,"addr_11",true),(3,2,"addr_11",true),(4,3,"addr_11",true)],[(3,2,"addr_22",true),(4,4,"addr_22",true),(5,6,"addr_22",true)],[(4,3,"addr_33",true),(5,6,"addr_33",true),(6,9,"addr_33",true)]]

The output of the join on statement above is:

AQL% SELECT * FROM show(res);
[("res<a:double NOT NULL,b:int64 NOT NULL,c:string NOT NULL,empty_indicator:indicator NOT NULL> [x=1:3,3,0,y=1:3,3,0]")]

5.2 Attribute-attribute JOIN

The JOIN ON predicate may have a list of equality predicates based on array attributes. If this is done, the inputs are first transformed into temporary stored arrays using the redimension_store() operator and then joined using AFL cross_join(). Any additional arrays created to execute the JOIN are removed at the end of query execution.

The following shows an example of attribute-attribute join using two 1-dimensional vectors.

CREATE array vector3 <a:int64, b:int64> [i=0:8,9,0];
# Load some data into it. 
SELECT * FROM vector3 A JOIN vector3 B ON A.a = B.a;

The above query produces a result with the following schema:

Result schema: tmp_5@1 <a:int64 NOT NULL, b:int64 NOT NULL, b_2:int64 NOT NULL, empty_indicator:indicator NOT NULL>[i=*:*,10,0, i_2=*:*,10,0]
SELECT * FROM vector3 A JOIN vector3 B ON A.a = B.b and A.b = B.a;

which produces a result with the following schema:

<a:int64 NOT NULL, b:int64 NOT NULL, empty_indicator:indicator NOT NULL>[i=*:*,10,0, i_2=*:*,10,0]

6. Aggregates

AQL supports the following built-in aggregate functions -- count, sum, avg, min, max, var, and stdev. The scalar aggregate functions in AQL take an input array and produce a scalar result value for it which is the aggregate function applied to the entire array. For example:

SELECT max (SELECT a1 FROM a);

Each array aggregate function (sum, min, max, avg, var, stdev) accepts a single attribute of appropriate type. For example if we want to compute the maximum value of an attribute, that attribute type must support inequality and equality functions. Examples of SciDB data types that can be used for these aggregate functions are integers and floating point types.

count here is an aggregate function that takes an array as input and returns a scalar value in its output.

Below is a query that counts all the measurements of array A:

SELECT count(A);

While all aggregates can be calculated in this manner over the entire array typically on one attribute, the count function is an exception. The attribute name is optional input into the count function. When count() is used without any arguments, the result is a count of the number of non-empty elements in the array. When supplied with an attribute name, count counts the number of non-null occurrences of that attribute within the array.

7. GROUP BY Aggregates

AQL aggregates can also be used with the GROUP BY clause. The syntax for a group by aggregate is as follows.

SELECT "namedExprList | *" [ INFO name] FROM joinexpr [ WHERE expr ] [ groupBy ];

groupBy := "group by" dimensionList "as" pathExpr

dimensionList := dim ("," dim)*

AQL GROUP BY takes a list of dimensions as a "group by" list, groups the array by the specified dimensions and computes the aggregate function for each group. The result of aggregate is an array with same dimensions as the source array without the group by dimensions. The result array has a single attribute which is the aggregate result per group.

The result of a "GROUP BY x, y" is an array with only x, y dimensions and a new attribute which is the corresponding partition. This partition can be accessed by sub-queries or expressions in the SELECT aggregate clause. The following examples explain the GROUP BY aggregate.

Example 1. count group by

Suppose we want to first group measurements by the first dimension, then calculate the count over each of the groups, the AQL query will look like this:

SELECT count(part)
FROM A
GROUP BY i AS part;

The group by clause here creates a nested array for each group and names the nested array part. Part is a one-dimensional array corresponding to a given value of dimension i and attributes of the input array A.

Example 2. group by avg over an attribute

Now suppose we need to compute the average measurement from A, grouped by i we issue the following AQL query which selects the attribute to be used as input to the average. AQL also supports standard SQL-like syntax for computing the aggregate over array groupings. In the current release of SciDB, the AQL GROUP BY must provide a name for the partition attribute created in the GROUP BY aggregate. See example below.

SELECT avg(x)
FROM A
GROUP BY i AS T;

8. Nested sub-queries

In this release nested queries can either be simple aggregate queries in the SELECT clause, or nested queries in the FROM clause. Other kinds of nested queries will be supported in forthcoming releases

In the example below, we select partitions whose average value is greater than 10:

SELECT * FROM
( 
SELECT avg (SELECT x FROM part)
FROM A
GROUP BY i AS part
)
WHERE avg > 10;

9. Updates and Versions

AQL arrays that are updatable can be updated using the following command:

UPDATE array SET "attr = expr", ... [ WHERE condition ];

For example if we want to add 1.0 to each measurement in array A and add 0.05 to each error, we can write the following query:

UPDATE A SET x = x+1, err = err + 0.05;

Update can be used with a WHERE clause, suppose we want to increase the measurement only if the error is smaller than 0.05:

UPDATE A SET x = x+1 WHERE err < 0.05;

Note that the UPDATE clause can be used to update non-null elements using the following example:

SELECT * FROM A;
[(), (2), (3)]
UPDATE A SET a = a + 1 WHERE a IS NOT NULL; 
[(), (3), (4)] 
UPDATE A SET a = 22222 WHERE a IS NULL;
[(22222),(3),(4)] 

For a definition of NULLs in the SciDB array data model please refer to the AFL documentation for this release.

SciDB saves all versions of the data, so the user has an option to query data as it was in the database at a specific time. This is done by appending a @ datetime prefix after the name of the array. For example, the following query returns data as of November 11 2010:

SELECT * FROM A @ datetime('11-11-2010 09:00:00');

If multiple versions have the same datetime, the most recent version, i.e., the one with the highest version number is returned. The timestamp of a version is the time on the SciDB server, reported in UTC timezone.

For convenience, SciDB includes a now() function, so one can write a query to retrieve all data that was in the array 1000 miliseconds prior to current date and time:

SELECT * FROM A @ now() - 1000;

Any expression yielding a datetime value is supported after the @ operator.

10. Aliases

AQL provides a way to refer to arrays in the query via aliases. These are useful when using the same array repeatedly in an AQL statement, or when abbreviating a long array name. Aliases are created by adding an "as" to the array name, followed by the alias. Future references to the array must use the alias.

Example:

SELECT * FROM vector3 AS A;

Once an alias has been assigned, all attributes and dimensions of the array can use the fully qualified name using the dotted naming convention.

SELECT A.x + 10 from vector3 AS A;

For example:

SELECT pow(sin(a1), 2) as p1, pow(cos(a1), 2) as p2 FROM A;