Contents:
AQL: Array Query Language Reference
The AQL language includes two classes of queries:
- Data Definition Language (DDL) : queries to define arrays and load data.
- Data Manipulation Language (DML) : queries to access and operate on array data.
This document provides reference material for AQL.
DDL
CREATE ARRAY
The CREATE ARRAY command creates an array with specified name and schema.
The rest of this document will refer to the following CREATE ARRAY example:
CREATE ARRAY A <x: double NOT NULL, err: double> [i=0:99,10,0, j=0:99,10,0];
This creates a 100-by-100 array with a chunk size of 10 in each dimension and no chunk overlap.
LOAD
Loading data into arrays is done using the load command.
The syntax for LOAD is:
LOAD array_name : string FROM file_path : string;
Example:
LOAD A FROM '/tmp/A.txt';
The data file formats supported by SciDB are documented in the AFL document. The file path 'A.txt' is assumed to be a path name relative to the coordinator's working directory. For this document, the file A.txt is a 100-by-100 file with two numerical values per cell.
Note: AQL only supports coordinator load (corresponding to nodeid=0 in the AFL load() command). See the section on load() in the AFL documentation.
DROP
Deleting arrays is done with the drop command.
DROP ARRAY array_name;
For example to drop array A, run the following command:
DROP ARRAY A;
The array and all its versions are dropped. Any associated mapping arrays created for noninteger dimensions are also dropped.
Note: Many AQL commands do not create or accept arrays with noninteger dimensions.
DML Query Syntax
SELECT attributes FROM
As in SQL, AQL SELECT is used to select one or more attributes from an array.
Syntax: SELECT "expr_list | *" [INTO name] FROM "array_name | array_result"
For example:
SELECT * FROM A;
This is a basic query the selects all attributes from the whole array A. SELECT lists can also reference individual attributes and dimensions, as well as constants and expressions. Expressions are discussed later in this document.
SELECT err FROM A;
This is a SELECT query that selects the attribute err from A. To store the value of the attribute err in and array called C:
SELECT err INTO C FROM A;
This query selects the attribute err from the array A and stores the result in array C. If the array C does not exist, an array with the appropriate schema is created for it. If C exists, its attribute values are updated.
If an array exists in SciDB, but its schema has different chunking and overlap specifications, or some dimensions in the schema are declared as non-integer dimensions, then the repart() or redimension() is applied to the result before storing it into the destination array.
In place of a table A or a join result, SELECT can also be used to select data from an AFL operator result. For a full list of AFL operators, please refer to the "SciDB Array Functional Language Reference" document. Shown below is such an example with the "show" operator which return an array containing the schema of A.
Example:
SELECT * FROM show(A);
WHERE Clause
A WHERE clause is used to filter cells from the input array.
Syntax: SELECT "expr_list | *" [INTO name] FROM "array_name | array_result" [ WHERE expr ]
For example:
SELECT * FROM A WHERE err < 2;
The above query selects all cells in the array whose error is less than 2. The shape of the array is preserved by the WHERE clause. The user will see a result with the same shape as A, except the cells that don’t match the condition of the WHERE clause will be empty.
Note that the where clause can be used to check if a particular attribute is NULL. This is done with the 'IS NULL' and 'IS NOT NULL' construct.
SELECT x FROM A WHERE x IS NOT null;
SciSB supports SQL semantics for null values. Hence, any comparison or use of null values in expressions will evaluate to NULL, except for the ‘is null’ or 'is not null' predicates. For a description of NULL values in SciDB please refer to the section on "CREATE ARRAY" in the AFL document.
In contrast, EMPTY elements of an array are not evaluated by the WHERE clause.
AQL Expressions
AQL expressions in the SELECT list or the WHERE clause are standard expressions over the attributes and dimensions of the array. They can also refer to special built-in functions supported by SciDB. The full list of built-in functions is available in the Alphabetical List of Functions appendix to this document.
SELECT x FROM A WHERE sin(x)<0;
returns all values of the attribute x (in radians) where sin(x) is negative.
SELECT sqrt(x) FROM A WHERE sin(x)<0;
returns the square root of all values of the attribute x for which sin(x) is negative.
Natural JOIN
A join combines two or more arrays typically as a preprocessing step for subsequent operations. The following example illustrates SciDB JOIN.
Syntax:
SELECT "expr_list | *" [INTO name] FROM joinexpr [ WHERE expr ]
joinexpr := expr ( ("," expr) )*
Consider an array B, that has the same shape as array A declared earlier:
CREATE ARRAY B <y: double, err2: double> [i=0:99,10,0, j=0:99,10,0];
The following query returns an array by joining arrays A and B.
SELECT * FROM A, B;
This join is a natural join of array A with B. This is only possible with two arrays with matching shape. The arrays must have the same number of dimensions, matching dimension start and endpoints, and same chunk size. But dimension names need not match. The dimensions of the resulting array are the same dimensions as the inputs but with the dimension names from the first array.
The resulting array also has combined attributes of all inputs. Each cell in the resulting array in this example has 4 attributes, x, err from A and y, err2 from B. Natural join is useful when you need to merge two arrays with the same shape in order to perform further processing on their attributes.
The following example joins two arrays in order to compute the sum of two attributes for each element.
SELECT A.x + B.y FROM A,B;
JOIN ON
AQL JOIN ON is used to join two compatible arrays.
The JOIN ON returns an array that is the cross product join of its inputs. Only JOIN ON equality predicates are supported. Predicates must use compatible dimensions from the input arrays -- size, types and chunking.
Each cell in the output array contains all elements from the join inputs from corresponding dimension values.
AQL currently supports dimension-dimension, attribute-attribute and attribute-dimension JOINs.
Dimension-dimension join
In a dimension-dimension join, two or more arrays can be joined on dimensions. This is translated by AQL into AFL cross-joins. JOIN ON dimension predicates are compiled into the dimension pairs list for cross-join. Any aliasing required is performed by the AQL-to-AFL compiler.
Example: Consider the following example 3-by-3 arrays v1 and k.
CREATE array v1 <a:double, b:int64> [x=1:3,3,0, y=1:3,3,0];
CREATE ARRAY k <c: string, d:double> [x=1:3,3,0, y=1:3,3,0];
scan(v1)
[
[(1,1),(1,2),(1,3)],
[(2,1),(2,2),(2,3)],
[(3,1),(3,2),(3,3)]
]
scan(k)
[
[("addr_11",11),("addr_12",12),("addr_13",13)],
[("addr_21",21),("addr_22",22),("addr_23",23)],
[("addr_31",31),("addr_32",32),("addr_33",33)]
]
Performing a join on the first dimension creates a 3-by-3-by-3 array:
set lang AQL; DROP array res_dim; SELECT * INTO res_dim FROM v1 JOIN k ON v1.x = k.x; [ [ [(1,1,"addr_11",11),(1,1,"addr_12",12),(1,1,"addr_13",13)], [(1,2,"addr_11",11),(1,2,"addr_12",12),(1,2,"addr_13",13)], [(1,3,"addr_11",11),(1,3,"addr_12",12),(1,3,"addr_13",13)] ], [ [(2,1,"addr_21",21),(2,1,"addr_22",22),(2,1,"addr_23",23)], [(2,2,"addr_21",21),(2,2,"addr_22",22),(2,2,"addr_23",23)], [(2,3,"addr_21",21),(2,3,"addr_22",22),(2,3,"addr_23",23)] ], [ [(3,1,"addr_31",31),(3,1,"addr_32",32),(3,1,"addr_33",33)], [(3,2,"addr_31",31),(3,2,"addr_32",32),(3,2,"addr_33",33)], [(3,3,"addr_31",31),(3,3,"addr_32",32),(3,3,"addr_33",33)] ] ]
The output of the join on statement above is an array with the following schema:
<a:double NOT NULL,b:int64 NOT NULL,c:string NOT NULL,d:double NOT NULL>
[x=1:3,3,0,y=1:3,3,0,y_2=1:3,3,0]
The new generated dimension name is y_2.
Attribute-attribute JOIN
The JOIN ON predicate may have a list of equality predicates based on array attributes. If this is done, the inputs are first transformed into temporary stored arrays using the redimension_store() operator and then joined using AFL cross_join(). Any additional arrays created to execute the JOIN are removed at the end of query execution.
The following shows an example of attribute-attribute join:
SELECT * INTO res_attr FROM v1 JOIN k ON v1.a = k.d;
The above query produces a result with the following schema:
<a:double NOT NULL,b:int64 NOT NULL,c:string NOT NULL,
empty_indicator:indicator NOT NULL> [x=1:3,3,0,y=1:3,3,0]
SELECT * FROM vector3 A JOIN vector3 B ON A.a = B.b and A.b = B.a;
produces a result with the following schema:
<a:int64 NOT NULL, b:int64 NOT NULL, empty_indicator:indicator NOT NULL> [i=*:*,10,0, i_2=*:*,10,0]
Dimension-Attribute Join
A dimension-attribute join can be performed between attributes and dimensions that have the same data type.
SELECT * INTO result FROM v1 JOIN k ON v1.b = k.x; [ [ [(2,1,"addr_11"),(),()], [(),(3,2,"addr_22"),()], [(),(),(4,3,"addr_33")] ], [ [(),(3,2,"addr_22"),()], [(),(),()],[(),(),()]], [ [(),(),(4,3,"addr_33")], [(),(),()],[(),(),()] ] ]
Aggregates
AQL supports the following built-in aggregate functions:
| Name | Operation Performed |
|---|---|
| avg | Average value |
| count | Number of nonempty elements |
| max | Largest value |
| min | Smallest value |
| sum | Sum of all the elements |
| stdev | Standard deviation |
| var | Variance |
The scalar aggregate functions in AQL take an input array and produce a scalar result. The value for the result is the aggregate function applied to the entire array.
For example:
SELECT max (SELECT a1 FROM a);
Each array aggregate function accepts one attribute of appropriate type. For example, to compute the maximum value of an attribute, that attribute type must support inequality and equality functions. So, for the max aggregate, the data types that can be used are
integers (int8, int16, int32, int64, uint8, uint16, uint32, uint64) and floating point (double,float) types.
count here is an aggregate function that takes an array as input and returns a scalar value in its output.
Below is a query that counts all the measurements of array A:
SELECT count(A);
Note: While all aggregates can be calculated in this manner over the entire array typically on one attribute, the count function is an exception. In the AFL version, you can use a dimension name as an optional input into the count function.
- When count() is used without any arguments, the result is a
count of the number of non-empty elements in the array.
- When count is supplied
with a dimension name, count counts the number of non-null occurrences of that attribute within the array.
GROUP BY Aggregates
AQL aggregates can also be used with the GROUP BY clause. The syntax for a group by aggregate is as follows.
SELECT "namedExprList | *" [ INFO name] FROM joinexpr [ WHERE expr ] [ groupBy ];
groupBy := "group by" dimensionList "as" pathExpr
dimensionList := dim ("," dim)*
AQL GROUP BY takes a list of dimensions as a "group by" list, groups the array by the specified dimensions and computes the aggregate function for each group. The result of aggregate is an array with same dimensions as the source array without the group by dimensions. The result array has a single attribute which is the aggregate result per group.
The result of a "GROUP BY x, y" is an array with only x, y dimensions and a new attribute which is the corresponding partition. This partition can be accessed by sub-queries or expressions in the SELECT aggregate clause. The following examples explain the GROUP BY aggregate.
Example 1. count group by
Suppose we want to first group measurements by the first dimension, then calculate the count over each of the groups, the AQL query will look like this:
SELECT count(part) FROM A GROUP BY i AS part;
The group by clause here creates a nested array for each group and names the nested array part. Part is a one-dimensional array corresponding to a given value of dimension i and attributes of the input array A.
Example 2. group by avg over an attribute
Now suppose we need to compute the average measurement from A, grouped by i we issue the following AQL query which selects the attribute to be used as input to the average. AQL also supports standard SQL-like syntax for computing the aggregate over array groupings. In the current release of SciDB, the AQL GROUP BY must provide a name for the partition attribute created in the GROUP BY aggregate. See example below.
SELECT avg(x) FROM A GROUP BY i AS T;
Nested sub-queries
In this release nested queries can either be simple aggregate queries in the SELECT clause, or nested queries in the FROM clause. Other kinds of nested queries will be supported in forthcoming releases
In the example below, we select partitions whose average value is greater than 10:
SELECT * FROM ( SELECT avg (SELECT x FROM part) FROM A GROUP BY i AS part ) WHERE avg > 10;
Updates and Versions
AQL arrays that are updatable can be updated using the following command:
UPDATE array SET "attr = expr", ... [ WHERE condition ];
For example, to add 1.0 to each measurement in array A and add 0.05 to each error, we can write the following query:
UPDATE A SET x = x+1, err = err + 0.05;
Update can be used with a WHERE clause. To increase the measurement only if the error is smaller than 0.05:
UPDATE A SET x = x+1 WHERE err < 0.05;
Note that the UPDATE clause can be used to update non-null elements using the following example:
SELECT * FROM A; [(), (2), (3)]
UPDATE A SET a = a + 1 WHERE a IS NOT NULL; [(), (3), (4)]
UPDATE A SET a = 22222 WHERE a IS NULL; [(22222),(3),(4)]
For a definition of NULLs in the SciDB array data model please refer to the AFL documentation for this release.
SciDB saves all versions of the data, so the user has an option to query data as it was in the database at a specific time. This is done by appending a @ datetime prefix after the name of the array. For example, the following query returns data as of November 11 2010:
SELECT * FROM A @ datetime('2011-12-15 22:33:12');
If multiple versions have the same datetime, the most recent version, i.e., the one with the highest version number is returned. The timestamp of a version is the time on the SciDB server, which is by default reported in UTC timezone.
To set the time zone, use the datetimetz argument.
For convenience, SciDB includes a now() function, so one can write a query to retrieve all data that was in the array 1000 miliseconds prior to current date and time:
SELECT * FROM A @ now() - 1000;
Any expression yielding a datetime value is supported after the @ operator.
Aliases
AQL provides a way to refer to arrays in the query via aliases. These are useful when using the same array repeatedly in an AQL statement, or when abbreviating a long array name. Aliases are created by adding an "as" to the array name, followed by the alias. Future references to the array can then use the alias.
Example:
SELECT * FROM vector3 AS A;
Once an alias has been assigned, all attributes and dimensions of the array can use the fully qualified name using the dotted naming convention.
SELECT A.x + 10 from vector3 AS A;
For example:
SELECT pow(sin(a1), 2) as p1, pow(cos(a1), 2) as p2 FROM A;