wiki:Docs/Release_0.8/ArrayQueryLanguage_0.8

AQL: Array Query Language Reference

AQL, the array query language, is a high-level declarative language for working with SciDB arrays. It is modeled after the popular SQL language for relational databases, but uses a different data model and query set compared with standard relational databases.

The AQL language includes two classes of queries:

  • DDL -- Queries to create array definitions and load data
  • DML -- Queries to access and operate on array data

The DDL commands in AQL and AFL are very similar, however AQL uses a declarative SQL-style invocation.

This document provides a summary of AQL's DDL queries along with a complete review of AQL's DML. Please refer to the AFL document for a detailed discussion of the array data model and data load formats.

As of SciDB version 0.8, AQL represents a partial subset of SciDB capabilities available today whereas the AFL interface provides the entire set of operators available in SciDB since AFL covers both user and developer operations.

Both AQL and AFL statements are handled by the SciDB query compiler which translates and optimizes incoming statements into execution plan. The plan is expressed in terms of physical "operators" many of which have direct analogs in the AFL language.

DDL

1. CREATE ARRAY command

AQL and AFL use the same command syntax for the CREATE ARRAY command.

Refer to the AFL document or wiki page for a detailed review of the array data model, data load formats, and the CREATE ARRAY signature and parameters.

We will use the following CREATE ARRAY example throughout the rest of this document.

CREATE ARRAY A <x: double NOT NULL DEFAULT 1.0, err: double> [i=0:99,10,0, j=0:99,10,0]

2. Loading data

Loading data into arrays is done using the load command.

The syntax for LOAD is:

LOAD array_name : string FROM file_path : string

Example:

LOAD A FROM "A.txt"

The data file formats supported by SciDB are documented in the AFL document. See the section on load() in the AFL document.

3. Deleting arrays

Dropping arrays is done with the drop command.

DROP ARRAY array_name

For example to drop array A we run the following command.

DROP ARRAY A

DML Query Syntax

1. SELECT attributes FROM

As in SQL, AQL SELECT is used to select one or more attributes from an array.

Syntax: SELECT "expr_list | *" [INTO name] FROM "array_name | array_result"

For example:

SELECT * from A

This is a basic query the selects all attributes from the whole array A.

SELECT a1, a2 from A, B

This is a SELECT query that selects two attributes from an array result of 'A, B'. We will discuss this syntax in a later section.

SELECT a1 INTO C FROM A

This query selects the attribute a1 from the array A and stores the result in array C.

2. WHERE Clause

A WHERE clause is used to filter cells from the input array.

Syntax: SELECT "expr_list | *" [INTO name] FROM "array_name | array_result" [ WHERE expr ]

For example:

SELECT * FROM A WHERE err < 0.025

The above query selects all cells in the array whose error is less than 0.025. The shape of the array is preserved by the WHERE clause. The user will see a result with the same shape as A, except the cells that don’t match the condition of the WHERE clause will be empty.

Note that the where clause can be used to check if a particular attribute is NULL. This is done with the 'IS NULL' and 'IS NOT NULL' construct.

SELECT a FROM A WHERE a IS NOT null

We support SQL semantics for null values in SciDB. Hence, any comparison or use of null values in expressions will evaluate to NULL, except for the ‘is null’ or 'is not null' predicates. For a detailed discussion of NULL values in SciDB please refer to the section on "Null and Default Attributes" in the AFL document.

3. JOIN Clause

A join combines two or more arrays typically as a pre-processing step for subsequent operations. The following example illustrates SciDB JOIN.

Syntax:

SELECT "expr_list | *" [INTO name] FROM joinexpr [ WHERE expr ]

joinexpr := namedExpr ( ("join" namedExpr) | ("," namedExpr))*

Consider a second array B, that has the same shape as array A declared earlier:

CREATE ARRAY B <y: double, err2: double> [i=0:99,10,0, j=0:99,10,0]

The following query returns an array by joining arrays A and B.

SELECT * FROM A, B

This join is a natural join of array A with B. The resulting array will have 2 dimensions, whose names are from the left most argument in the list of arrays being joined. Each cell in the resulting array has 4 attributes, x, err from A and y, err2 from B. Natural join is useful when you need to merge two arrays with the same shape in order to perform further processing on their attributes.

The following example joins two arrays in order to compute the sum of two attributes for each element.

SELECT A.x + B.y FROM A,B

4. Aggregates

AQL supports the following five built-in aggregate functions -- count, sum, avg, min, and max. The scalar aggregate functions in AQL take an input array and produce a scalar result value for it which is the aggregate function applied to the entire array. For example:

SELECT max (SELECT a1 FROM a)

Each array aggregate function (sum, min, max, avg) accepts a single attribute of appropriate type. For example if we want to compute the maximum value of an attribute, that attribute type must support inequality and equality functions. Examples of SciDB data types that can be used for these aggregate functions are integers and floating point types.

count here is an aggregate function that takes an array as input and returns a scalar value in its output.

Below is a query that counts all the measurements of array A:

SELECT count(A)

While all aggregates can be calculated in this manner over the entire array typically on one attribute, the count function is an exception. The attribute name is optional input into the count function. When count() is used without any arguments, the result is a count of the number of non-empty elements in the array. When supplied with an attribute name, count counts the number of non-null occurrences of that attribute within the array.

5. GROUP BY Aggregates

AQL aggregates can also be used with the GROUP BY clause. The syntax for a group by aggregate is as follows.

SELECT "namedExprList | *" [ INFO name] FROM joinexpr [ WHERE expr ] [ groupBy ]

groupBy := "group by" dimensionList "as" pathExpr

dimensionList := dim ("," dim)*

AQL GROUP BY takes a list of dimensions as a "group by" list, groups the array by the specified dimensions and computes the aggregate function for each group. The result of aggregate is an array with same dimensions as the source array without the group by dimensions. The result array has a single attribute which is the aggregate result per group.

Example 1. count group by

Suppose we want to first group measurements by the first dimension, then calculate the count over each of the groups, the AQL query will look like this:

SELECT count(part)
FROM A
GROUP BY i AS part

The group by clause here creates a nested array for each group and names the nested array part. Part is a one-dimensional array corresponding to a given value of dimension i and attributes of the input array A.

Example 2. group by avg over an attribute

Now suppose we need to compute the average measurement from A, grouped by i we issue the following AQL query which selects the attribute to be used as input to the average.

SELECT avg(SELECT x FROM part)
FROM A
GROUP BY i AS part

6. Nested subqueries

In Release 0.8 nested queries can either be simple aggregate queries in the SELECT clause, or nested queries in the FROM clause. Other kinds of nested queries will be supported in forthcoming releases

With Release 0.8 nested queries we select partitions where the average value of measurements is greater than 10:

SELECT * FROM
( 
SELECT avg (SELECT x FROM part)
FROM A
GROUP BY i AS part
)
WHERE avg > 10

NOTES:

  1. Queries that return nested arrays are not supported. For example, the following query returns nested partitions: SELECT * FROM mult_1 GROUP BY x

7. Updates and Versions

AQL arrays that are updatable can be updated using the following command:

UPDATE array SET "attr = expr", ... [ WHERE condition ]

For example if we want to add 1.0 to each measurement in array A and add 0.05 to each error, we can write the following query:

UPDATE A SET x = x+1, err = err + 0.05

Update can be used with a WHERE clause, suppose we want to increase the measurement only if the error is smaller than 0.05:

UPDATE A SET x = x+1 WHERE err < 0.05

Note that the UPDATE clause can be used to update non-null elements using the following example:

SELECT * FROM A
[(), (2), (3)]
UPDATE A SET a = a + 1 WHERE a IS NOT NULL 
[(), (3), (4)] 
UPDATE A SET a = 22222 WHERE a IS NULL 
[(22222),(3),(4)] 

For a definition of NULLs in the SciDB array data model please refer to the AFL documentation for this release.

SciDB saves all versions of the data, so the user has an option to query data as it was in the database at a specific time. This is done by appending a @ datetime prefix after the name of the array. For example, the following query returns data as of November 11 2010:

SELECT * FROM A @ datetime('11-11-2010')

For convenience, SciDB includes a now() function, so one can write a query to retrieve all data that was in the array 1000 miliseconds prior to current date and time:

SELECT * FROM A @ now() - 1000

Any expression yielding a datetime value is supported after the @ operator.