User-Defined Types and User-Defined Functions in SciDB
Starting with the 0.8 release, SciDB supports user-defined functions (UDFs) and user-defined types (UDTs). User-defined extensions to SciDB are generally referred to as plugins. In the near future, users will also be able to extend SciDB with user-defined aggregates (UDAs) and user-defined operators (UDOs). Here we focus on UDTs and UDFs. We begin by describing what they are and how they appear in the context of SciDB's data model. Then we describe how to write them and how to integrate them into a SciDB instance.
Extensibility: Types and Functions
Out of the box, SciDB provides users with the expected small set of data types: integer, double, and string. Scientific and large-scale analytic applications often require other data types: complex numbers, rational numbers, two-dimensional points, etc. Also, many applications call for specific mathematical functions that SciDB does not provide by default, for example a function that computes the greatest common factor of two integers, or a random number generator for some non-uniform distribution. SciDB's extensibility mechanism allows users to add their own implementations of types and functions to the SciDB engine.
Suppose a SciDB application requires a rational number data type. Rather than use a double-precision data type, the user wants to store a paired integer numerator and denominator. As part of the new type's functionality, users will also want to support basic math operations ( +, -, *, / ) together with comparison operations (<, <=, =, >=, >, !=).
At the level of the AQL query language, the new type can be used as follows:
create array rational_example < N : rational > [ I=0:99,10,0, J=0:99,10,0 ]
# Q1:
SELECT COUNT(*) FROM rational_example AS R WHERE R.N = rational(1,2);
# Q2:
SELECT str(R.N) FROM rational_example AS R WHERE R.N + rational(1,4) > rational(1,2);
So far as a user's queries are concerned, there will be no difference between the way a built-in type and a user-defined type (or function) works. There are, however, a few things to be aware of:
- All type conversions need to be explicit. SciDB does not (yet) support implicit casting.
- Client applications can only accommodate a limited set of types: doubles, integers, and strings. When you write queries (using iquery, say) the query's result needs to explicitly convert result types into something that the client understands.
- While it is quite possible to write complex and computationally expensive UDFs (we include an example of a prime number factorization) it's generally a better practice to build UDFs as small, self-contained units of functionality and then to combine them using the SciDB query language's facilities.
- We do not (yet) support features like embedding queries within UDFs, or plugins that do anything more sophisticated than take a vector of scalar values and return a single scalar type result.
SciDB Plugins
In this document we use the term plugins to refer to several kinds of extensions.
- User-defined types (UDTs), which can be used when users create arrays.
- User-defined scalar functions (UDFs) for adding new functionality to SciDB or for working with UDTs.
We plan to support user-defined operators (UDOs) in the near future. As of release 0.8, however, our focus is on user-defined types and functions as described on this page.
We include multiple example extensions in the ~/examples folder located beneath the SciDB root. These examples are:
| Name | Description |
| complex | Complex Number UDT (a + b.i), together with the associated algebraic operations, and equality. |
| rational | Rational Number UDT (int64 numerator and denominator) together with the associated algebra operations and ordering comparisons |
| point | 2-D Point UDT. Double precision X and Y |
| more_math | A selection of user-defined functions which perform useful mathematical operations. |
SciDB Plugins Architecture
The basic architecture of a SciDB Plugin works as follows. The algorithms implemented within the SciDB engine are designed to treat instances of data type values as black box memory segments. For example, all that the SciDB engine "knows" about the contents of the Complex Number data type is that it is 16 bytes long. The code that needs to know about the contents of these 16 bytes is implemented by the user in their own C/C++, which they compile into a shared library.
At run-time, the SciDB engine dynamically links this shared library and calls the functions it contains to perform the operations specified in the query.
For example, the user-written 'C' code to add two complex numbers looks like this:
//
// This is the struct that describes how the 16 bytes of data that makes up an instance
// of a SciDB Complex UDT is organized.
struct Complex
{
    double re;
    double im;
};
//
// This is the code that takes data from the SciDB engine, performs the addition, and deposits
// the return result in an appropriately sized "black box" of bytes. The SciDB engine takes
// this return result and stores it, or passes it on to another function.
static void addComplex(const typesystem::Value* args, typesystem::Value& res, void*)
{
    Complex& a = *(Complex*)args[0].data();
    Complex& b = *(Complex*)args[1].data();
    Complex& c = *(Complex*)res.data();
    c.re = a.re + b.re;
    c.im = a.im + b.im;
}
When it parses a query like the one labelled "Q2" above, the SciDB engine checks that it has been provided with a shared library containing code to perform the plus ( Type, Type ) -> Type operation. For the complex type, SciDB would look for a function named "+" that takes two arguments of the appropriate type (here, a pair of complex number instances). Then at run time the SciDB engine would assemble the necessary 16-byte "black boxes", invoke the 'C' function addComplex(), and deal with the value it computed.
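To make the "black box" idea concrete, here is a minimal, self-contained sketch; it is not SciDB code. StubValue is a hypothetical stand-in for typesystem::Value that holds nothing but a byte buffer, and invokeBinary(), packComplex() and unpackComplex() are illustrative helpers that do not exist in SciDB. The only point is that the calling side deals in sizes and bytes, while addComplex() alone knows the layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for typesystem::Value: an opaque, fixed-size byte
// buffer. Only data() and size() are mimicked; this is not the SciDB class.
class StubValue {
public:
    explicit StubValue(std::size_t size) : bytes_(size, 0) {}
    void* data() const { return const_cast<unsigned char*>(bytes_.data()); }
    std::size_t size() const { return bytes_.size(); }
private:
    std::vector<unsigned char> bytes_;
};

// Layout of the 16 bytes, known only to the plugin code.
struct Complex {
    double re;
    double im;
};

// Same shape as the UDF signature in the text, with StubValue swapped in.
using UdfPtr = void (*)(const StubValue* args, StubValue& res, void* scratch);

static void addComplex(const StubValue* args, StubValue& res, void*) {
    const Complex& a = *static_cast<const Complex*>(args[0].data());
    const Complex& b = *static_cast<const Complex*>(args[1].data());
    Complex& c = *static_cast<Complex*>(res.data());
    c.re = a.re + b.re;
    c.im = a.im + b.im;
}

// "Engine" side: handles only byte buffers and sizes, never Complex itself.
StubValue invokeBinary(UdfPtr fn, const StubValue& lhs, const StubValue& rhs,
                       std::size_t resultSize) {
    assert(fn != nullptr);
    const StubValue args[2] = {lhs, rhs};
    StubValue res(resultSize);
    fn(args, res, nullptr);
    return res;
}

// Illustrative helpers that pack and unpack a Complex into a buffer.
StubValue packComplex(double re, double im) {
    StubValue v(sizeof(Complex));
    const Complex c = {re, im};
    std::memcpy(v.data(), &c, sizeof c);
    return v;
}

Complex unpackComplex(const StubValue& v) {
    Complex c;
    std::memcpy(&c, v.data(), sizeof c);
    return c;
}
```

Calling invokeBinary(addComplex, packComplex(1.0, 2.0), packComplex(0.5, -1.0), sizeof(Complex)) and unpacking the result yields re = 1.5 and im = 1.0.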
User-Defined Functions: How SciDB gives you Data Type Instances
As you can see from the example code above, SciDB uses a typesystem::Value class to encapsulate information about all type value instances. The Value class provides a set of methods for getting and setting the "value" of the class for each of the SciDB built-in types. These follow the pattern getType() and setType(), where Type is the name of the built-in type; for a typesystem::Value val; instance that contains a string, for example, the methods are val.getString() and val.setString().
From the perspective of the SciDB engine, all UDFs have the same basic signature:
void functionUDF ( const typesystem::Value * args, typesystem::Value& res, void * scratch)
{
}
Each function must "know" how many arguments it is to receive. These arguments are obtained from the array of typesystem::Value objects that the first argument points to. Each UDF (currently) returns a single result, and the location where this result is to be placed is passed in by reference in the second argument. The final argument is a pointer to a data structure that conveys information about the state of the engine, and is a means of passing data between repeated calls to the UDF within the same query.
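As an illustration of this signature, here is a sketch of a two-argument UDF that computes the greatest common factor mentioned earlier. It compiles on its own because StubValue is a hypothetical stand-in for typesystem::Value. The accessor names getInt64() and setInt64() are assumed by analogy with getString() and setString() and should be checked against the SciDB headers; callGcd() is an illustrative helper only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for typesystem::Value holding one 64-bit integer.
// getInt64()/setInt64() are assumed names, not confirmed SciDB methods.
class StubValue {
public:
    int64_t getInt64() const { return i_; }
    void setInt64(int64_t v) { i_ = v; }
private:
    int64_t i_ = 0;
};

// Euclid's algorithm on absolute values. INT64_MIN is excluded because
// negating it would overflow.
static int64_t gcd64(int64_t a, int64_t b) {
    assert(a != INT64_MIN && b != INT64_MIN);
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) {
        const int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// The UDF itself: it "knows" it receives exactly two arguments, reads them
// from args[0] and args[1], and writes its single result into res.
static void gcdUDF(const StubValue* args, StubValue& res, void* /*scratch*/) {
    res.setInt64(gcd64(args[0].getInt64(), args[1].getInt64()));
}

// Illustrative caller that plays the role of the engine.
int64_t callGcd(int64_t x, int64_t y) {
    StubValue args[2];
    args[0].setInt64(x);
    args[1].setInt64(y);
    StubValue res;
    gcdUDF(args, res, nullptr);
    return res.getInt64();
}
```

With this sketch, callGcd(12, 18) returns 6. The scratch pointer is ignored here because its structure is not described in this document.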
Loading a Plugin
Each of the example plugins included with the SciDB distribution is built (by default) at the time we build the core engine. However, SciDB does not load unregistered plugins when it starts up. To use one of the examples you need to load it into the SciDB instance. The following listing shows how to load shared libraries containing plugins into SciDB using first the AFL interface and second the AQL query language.
--
-- AFL load_library() operation
load_library ( 'librational.so' )
--
-- AQL 'load library' syntax
load library 'librational.so';
The act of loading a plugin shared library first registers the library in the SciDB system catalogs. Then it opens and examines the shared library and stores its contents with SciDB's internal extension management subsystem. Shared library modules that are registered with the SciDB instance will be loaded at system start time.
To unload a library, run:
--
-- AFL unload_library() operation
unload_library('libpoint1')
--
-- AQL 'unload library' syntax
unload library 'libpoint1'
This command unregisters the library in the system catalog. The library will not be loaded on subsequent restarts, but it might not be unloaded immediately because some queries may still be using it.
Tutorial: Creating SciDB Plugins
Let's consider the steps needed to create a new plugin for SciDB.
Designing your UDT
Your UDT will need the following kinds of UDFs.
- UDFs that construct instances of your new type based on the values of other types. In general the types you will use as input to these UDFs will be built-in types. It is typical, for instance, to use a string as a source for a new data type's contents. The following UDF converts a string with a particular format into an instance of a rational number UDT.
//
// This is the struct used to store the data inside SciDB.
typedef struct
{
    int64_t num;
    int64_t denom;
} SciDB_Rational;
// SCNd64 (from <cinttypes>) is the portable scanf conversion for int64_t;
// "%lld" is only correct where int64_t happens to be long long.
void str2Rational(const typesystem::Value* args, typesystem::Value& res, void*)
{
    int64_t n, d;
    SciDB_Rational* r = (SciDB_Rational*)res.data();
    if (sscanf(args[0].getString(), "(%" SCNd64 "/%" SCNd64 ")", &n, &d) != 2)
        throw USER_EXCEPTION(SCIDB_E_INVALID_OPERAND, "Can't convert str to rational - expected '( int / int )'");
    // boost::rational reduces the fraction to lowest terms.
    boost::rational<int64_t> rp0(n, d);
    r->num = rp0.numerator();
    r->denom = rp0.denominator();
}
Note that the "string to UDT" conversion function is particularly important. The type ( string ) -> type UDF is the one used by the load() operation to bulk ingest data into SciDB.
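For readers who want to experiment with the parsing step outside SciDB, the following is a standalone sketch of the same idea. It makes several substitutions that are not part of the original plugin: a hand-written gcd64() replaces boost::rational for reducing the fraction, std::invalid_argument replaces USER_EXCEPTION, and an explicit zero-denominator check is added. The function name parseRational() is hypothetical.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <stdexcept>

struct SciDB_Rational {
    int64_t num;
    int64_t denom;
};

// Euclid's algorithm on absolute values (extreme values are not handled).
static int64_t gcd64(int64_t a, int64_t b) {
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) {
        const int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Parses text of the form "(n/d)" into a reduced rational with a positive
// denominator. Throws std::invalid_argument on malformed input or d == 0.
SciDB_Rational parseRational(const char* text) {
    assert(text != nullptr);
    int64_t n = 0;
    int64_t d = 0;
    if (std::sscanf(text, "(%" SCNd64 "/%" SCNd64 ")", &n, &d) != 2)
        throw std::invalid_argument("expected '(int/int)'");
    if (d == 0)
        throw std::invalid_argument("denominator must not be zero");
    if (d < 0) {  // keep the sign in the numerator
        n = -n;
        d = -d;
    }
    const int64_t g = gcd64(n, d);  // g >= 1 because d > 0
    return SciDB_Rational{n / g, d / g};
}
```

Here parseRational("(3/6)") gives numerator 1 and denominator 2, and parseRational("(2/-4)") gives numerator -1 and denominator 2.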
- UDFs that convert your UDT back into a built-in type, or a number of built-in types. In the case of the Complex type, for example, you can either write a UDF that composes the 16 bytes into a string, or else a pair of UDFs that extract the real and imaginary portions of the type.
static void reComplex(const typesystem::Value* args, typesystem::Value& res, void*)
{
    Complex& a = *(Complex*)args[0].data();
    res.setDouble(a.re);
}
static void imComplex(const typesystem::Value* args, typesystem::Value& res, void*)
{
    Complex& a = *(Complex*)args[0].data();
    res.setDouble(a.im);
}
Remember that SciDB will not perform any implicit casting. You need to include these UDFs in any queries that pull these values out of the database.
- UDFs that perform common type operations, such as simple algebra ops, comparisons and so on. Note that not all types will need all of these functions. While it makes sense to support the full set of relational operators for an orderable type such as Rational Number, values in the Complex domain cannot be ordered (for sorting, say). All SciDB needs for ordering are two UDFs: one to return true when two type values are equal, and a second to return true when one type value is less than another.
void rationalLT(const typesystem::Value* args, typesystem::Value& res, void*)
{
    SciDB_Rational* r0 = (SciDB_Rational*)args[0].data();
    SciDB_Rational* r1 = (SciDB_Rational*)args[1].data();
    boost::rational<int64_t> rp0(r0->num, r0->denom);
    boost::rational<int64_t> rp1(r1->num, r1->denom);
    res.setBool(rp0 < rp1);
}
void rationalEQ(const typesystem::Value* args, typesystem::Value& res, void*)
{
    SciDB_Rational* r0 = (SciDB_Rational*)args[0].data();
    SciDB_Rational* r1 = (SciDB_Rational*)args[1].data();
    boost::rational<int64_t> rp0(r0->num, r0->denom);
    boost::rational<int64_t> rp1(r1->num, r1->denom);
    res.setBool(rp0 == rp1);
}
- UDFs that are necessary to support the integration of the UDT with other facilities: aggregates like AVG(), MAX() and MIN(), for example. MAX() and MIN() use the UDFs that order instance values. If your type has peculiar requirements for MAX() and MIN(), it might be reasonable to add these UDFs.
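The way ordering functions can drive MIN() and MAX() is sketched below in plain C++. This is an assumption about the general technique, not a description of how SciDB implements its aggregates. The names rationalLess(), rationalEqual(), minRational() and maxRational() are hypothetical, and the comparison uses cross-multiplication instead of boost::rational, which assumes positive denominators and products that do not overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct SciDB_Rational {
    int64_t num;
    int64_t denom;
};

// a < b by cross-multiplication. Assumes both denominators are positive
// and that the products fit in int64_t.
bool rationalLess(const SciDB_Rational& a, const SciDB_Rational& b) {
    return a.num * b.denom < b.num * a.denom;
}

// a == b under the same assumptions; 1/2 and 2/4 compare equal.
bool rationalEqual(const SciDB_Rational& a, const SciDB_Rational& b) {
    return a.num * b.denom == b.num * a.denom;
}

// A MIN-style aggregate needs nothing beyond the "less than" function.
SciDB_Rational minRational(const std::vector<SciDB_Rational>& xs) {
    assert(!xs.empty());
    return *std::min_element(xs.begin(), xs.end(), rationalLess);
}

// A MAX-style aggregate reuses the same comparison.
SciDB_Rational maxRational(const std::vector<SciDB_Rational>& xs) {
    assert(!xs.empty());
    return *std::max_element(xs.begin(), xs.end(), rationalLess);
}
```

For the values 1/2, 1/3 and 3/4, minRational() returns 1/3 and maxRational() returns 3/4.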
Exceptions and Error Handling
Your UDFs will often need to check for errors and exceptions in their code. In SciDB, we provide facilities to report to the SciDB engine that your UDF has encountered an error, and what kind of error. Doing this allows the SciDB engine to terminate the query and report some useful status information to the log file. Errors and exceptions are thrown using a macro USER_EXCEPTION( error_code, description : string ).
if (... some error condition test ... )
    throw USER_EXCEPTION(SCIDB_E_INVALID_OPERAND, "Useful error message string");
For a full list of the terse error codes that you can throw from within a UDF, consult the '~/include/system/ErrorCodes.h' file.
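The check-then-throw pattern can be tried outside SciDB with the sketch below. Everything in it is a stand-in: StubUserException and STUB_E_INVALID_OPERAND are hypothetical substitutes for what USER_EXCEPTION and SCIDB_E_INVALID_OPERAND provide, and runDivision() only imitates an engine that aborts the query and reports the message.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical error code and exception type; not the SciDB definitions.
enum StubErrorCode { STUB_E_INVALID_OPERAND = 1 };

class StubUserException : public std::runtime_error {
public:
    StubUserException(StubErrorCode code, const std::string& message)
        : std::runtime_error(message), code_(code) {}
    StubErrorCode code() const { return code_; }
private:
    StubErrorCode code_;
};

struct SciDB_Rational {
    int64_t num;
    int64_t denom;
};

// Computes a / b without reducing the result. The UDF-style check rejects
// a zero divisor before doing any arithmetic.
SciDB_Rational divideRational(const SciDB_Rational& a, const SciDB_Rational& b) {
    assert(a.denom != 0 && b.denom != 0);
    if (b.num == 0)
        throw StubUserException(STUB_E_INVALID_OPERAND,
                                "Can't divide rational by zero");
    SciDB_Rational q = {a.num * b.denom, a.denom * b.num};
    if (q.denom < 0) {  // keep the sign in the numerator
        q.num = -q.num;
        q.denom = -q.denom;
    }
    return q;
}

// Imitates the caller: on error, stop and report the message.
std::string runDivision(const SciDB_Rational& a, const SciDB_Rational& b) {
    try {
        const SciDB_Rational q = divideRational(a, b);
        return std::to_string(q.num) + "/" + std::to_string(q.denom);
    } catch (const StubUserException& e) {
        return std::string("query aborted: ") + e.what();
    }
}
```

Dividing 1/2 by 3/4 returns "4/6", while dividing by 0/5 returns "query aborted: Can't divide rational by zero".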
Registering Your 'C' Functions as UDFs
Once you have implemented your functions, the next step involves registering them with the SciDB facilities for extracting information from a shared library. The SciDB install provides a set of 'C' macros to do this. These macros are:
| Macro Name | Description | Example |
| REGISTER_TYPE ( name, length ) | Instructs SciDB to register a new UDT in its catalogs with the name provided (note that this argument to the macro is not a string) and the length, in bytes, of the type instance values. | REGISTER_TYPE ( complex, 16 ) |
| REGISTER_FUNCTION ( name, input argument types, output argument type, function pointer) | Instructs SciDB to register a new UDF in its catalogs. The new UDF can be called in AQL or AFL using the first argument name (again, not a string). The function is expected to take a list of argument types as input and return a value of the type provided. The actual reference to the function you want SciDB to call is the last argument to the macro. | REGISTER_FUNCTION(+, ("complex", "complex"), "complex", addComplex); |
| REGISTER_CONVERTER(input type, output type, conversion cost, function pointer) | From time to time SciDB needs to convert types, and it can require UDFs to perform this operation. This macro is how you register conversions. | REGISTER_CONVERTER(string, complex, EXPLICIT_CONVERSION_COST, string2complex); |
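One way to picture what registration makes possible is a table keyed by function name and argument types, which the engine consults when it resolves an operator such as "+". The sketch below is a mental model only. StubFunctionRegistry, FunctionEntry and StubValue are hypothetical, and nothing here reflects how the SciDB macros or catalogs are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for typesystem::Value holding a single double.
struct StubValue {
    double d;
};

using UdfPtr = void (*)(const StubValue* args, StubValue& res, void* scratch);

struct FunctionEntry {
    std::string resultType;
    UdfPtr fn;
};

// Maps "name(type,type,...)" to the registered function and result type.
class StubFunctionRegistry {
public:
    void registerFunction(const std::string& name,
                          const std::vector<std::string>& argTypes,
                          const std::string& resultType, UdfPtr fn) {
        assert(fn != nullptr);
        entries_[makeKey(name, argTypes)] = FunctionEntry{resultType, fn};
    }

    // Returns nullptr when no function matches the name and argument types.
    const FunctionEntry* find(const std::string& name,
                              const std::vector<std::string>& argTypes) const {
        const auto it = entries_.find(makeKey(name, argTypes));
        return it == entries_.end() ? nullptr : &it->second;
    }

private:
    static std::string makeKey(const std::string& name,
                               const std::vector<std::string>& argTypes) {
        std::string key = name + "(";
        for (std::size_t i = 0; i < argTypes.size(); ++i) {
            if (i > 0) key += ",";
            key += argTypes[i];
        }
        return key + ")";
    }

    std::map<std::string, FunctionEntry> entries_;
};

// Example UDF with the usual three-argument shape.
static void addDouble(const StubValue* args, StubValue& res, void*) {
    res.d = args[0].d + args[1].d;
}
```

After registering addDouble under "+" with argument types ("double", "double"), a lookup for "+" on two doubles finds it, while a lookup for "+" on two "complex" arguments returns nullptr until a matching function is registered.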
A Simple Recipe
The simplest way to implement your own plugin library is to copy the style of the examples.
- Create a new directory in parallel to the one that implements one of the examples, say the complex type, and copy that example's files into it.
In the ~/examples/CMakeLists.txt file, add a new line with the name of the new directory. Let's say the new directory is named "point1":
add_subdirectory("complex") <-- already exists.
add_subdirectory("point1") <-- the reference to your new plugin directory
It is best to rename point1/complex.cpp to something more appropriate for the purposes of the library, such as point1/point.cpp.
- Change the contents of the new "~/examples/point1/CMakeLists.txt" file to get the server to build your new plugin library.
- Make your modifications in the new source code file ~/examples/point1/point.cpp
- Using "make", build "libpoint1.so". It will be placed in the plugins directory alongside "libcomplex.so".
- Load your new library module.
And you're done!