
SciDB Plugins: Extending SciDB Functionality

User-defined extensions to SciDB functionality are referred to as plugins. SciDB supports user-defined functions (UDFs), user-defined types (UDTs), and user-defined operators (UDOs).

Extensibility: Types and Functions

Out of the box, SciDB provides users with a standard set of data types: integer, float/double, and string. Scientific and large-scale analytic applications often require other data types such as complex numbers, rational numbers, two-dimensional points, or others. Some applications call for specific mathematical functions (such as greatest common factor of two integers or non-uniform random number generation) that SciDB does not provide by default. SciDB's extensibility mechanism allows users to add their own implementation of types and functions to the SciDB engine.

Suppose a SciDB application requires a rational number datatype. Rather than use double precision, the user wants to store an integer-type numerator and denominator pair. As part of the new type's functionality, users will also want to support basic arithmetic ( +, -, *, / ) and relational (<, <=, =, >=, >, <>) operations.

At the level of the AQL query language, the new type can be used as follows:

create array rational_example < N : rational > [ I=0:99,10,0, J=0:99,10,0 ]

# Q1: 

SELECT COUNT(*)
   FROM rational_example AS R
 WHERE R.N = rational(1,2);

# Q2: 

SELECT str(R.N) 
  FROM rational_example AS R
 WHERE R.N + rational(1,4) > rational(1,2);
  

So far as a user's queries are concerned, there will be no difference between the way a built-in type and a user-defined type (or function) works. There are, however, a few things to be aware of:

  1. All type conversions need to be explicit. SciDB does not (yet) support implicit casting.
  2. Client applications can only accommodate a limited set of types: doubles, integers, and strings. When you write queries (using iquery, say), the query needs to explicitly convert its result types into something that the client understands.
  3. While it is quite possible to write complex and computationally expensive UDFs (we include an example of a prime number factorization) it's generally a better practice to build UDFs as small, self-contained units of functionality and then to combine them using the SciDB query language's facilities.
  4. We do not (yet) support features like embedding queries within UDFs, or plugins that do anything more sophisticated than take a vector of scalar values and return a single scalar type result.

SciDB Plugin Examples

SciDB includes multiple example extensions in the ~/examples folder located beneath the SciDB root directory. These examples are:

Name        Description
complex     Complex Number UDT (a + b.i), together with the associated algebraic operations and equality.
rational    Rational Number UDT (int64 numerator and denominator), together with the associated algebraic operations and ordering comparisons.
point       2-D Point UDT. Double precision X and Y.
more_math   A selection of user-defined functions that perform useful mathematical operations.

SciDB Plugins Architecture

The basic architecture of a SciDB Plugin works as follows. The algorithms implemented within the SciDB engine are designed to treat instances of data type values as black box memory segments. For example, all that the SciDB engine "knows" about the contents of the Complex Number data type is that it is 16 bytes long. The code that needs to know about the contents of these 16 bytes is implemented by the user in their own C/C++, which they compile into a shared library.

At run-time, the SciDB engine dynamically links this shared library and calls the functions it contains to perform the operations specified in the query.

For example, the user-written 'C' code to add two complex numbers looks like this:

//
// This is the struct that describes how the 16 bytes of data that makes up an instance 
// of a SciDB Complex UDT is organized.
struct Complex
{
    double re;
    double im;
};

//
// This is the code that takes data from the SciDB engine, performs the addition, and deposits
// the return result in an appropriately sized "black box" of bytes. The SciDB engine takes 
// this return result and stores it, or passes it on to another function. 
static void addComplex(const Value** args, Value* res, void*)
{
    Complex& a = *(Complex*)args[0]->data();
    Complex& b = *(Complex*)args[1]->data();
    Complex& c = *(Complex*)res->data();

    c.re = a.re + b.re;    
    c.im = a.im + b.im;
}

When it parses a query like the one labeled "Q2" above (written against the complex type rather than rational), the SciDB engine checks that it has been provided with a shared library containing code to perform the plus ( Type, Type ) -> Type operation. In this case, SciDB looks for a function named "+" that takes two arguments of the appropriate type (here, a pair of complex number instances). Then at run time the SciDB engine assembles the necessary 16-byte "black boxes", invokes the 'C' function addComplex(), and deals with the value it computes.
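The lookup-then-invoke sequence described above can be sketched in isolation. This is only an illustration under stated assumptions: the Value class below is a stand-in for SciDB's own (an opaque byte buffer and nothing more), and FunctionRegistry, Signature, and addViaRegistry are hypothetical names invented for the sketch, not part of SciDB. The sketch also copies bytes with memcpy rather than casting as the listing above does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for SciDB's Value: an opaque, fixed-size "black box" of bytes.
class Value {
public:
    explicit Value(std::size_t size) : _bytes(size, 0) {}
    void* data() { return _bytes.data(); }
    const void* data() const { return _bytes.data(); }
private:
    std::vector<unsigned char> _bytes;
};

struct Complex { double re; double im; };

using UdfPointer = void (*)(const Value** args, Value* res, void* state);

// Same job as addComplex() above, using memcpy to move bytes in and out.
static void addComplex(const Value** args, Value* res, void*)
{
    Complex a, b, c;
    std::memcpy(&a, args[0]->data(), sizeof a);
    std::memcpy(&b, args[1]->data(), sizeof b);
    c.re = a.re + b.re;
    c.im = a.im + b.im;
    std::memcpy(res->data(), &c, sizeof c);
}

// Hypothetical registry keyed on (function name, argument type names).
using Signature = std::pair<std::string, std::vector<std::string>>;

class FunctionRegistry {
public:
    void add(const std::string& name, const std::vector<std::string>& argTypes, UdfPointer fn) {
        _functions[Signature(name, argTypes)] = fn;
    }
    // Returns nullptr when no function matches the name and argument types.
    UdfPointer find(const std::string& name, const std::vector<std::string>& argTypes) const {
        auto it = _functions.find(Signature(name, argTypes));
        return it == _functions.end() ? nullptr : it->second;
    }
private:
    std::map<Signature, UdfPointer> _functions;
};

// Packs two complex numbers into black boxes, resolves "+", and calls it.
Complex addViaRegistry(const FunctionRegistry& registry, Complex lhs, Complex rhs)
{
    UdfPointer plus = registry.find("+", {"complex", "complex"});
    assert(plus != nullptr);
    Value a(sizeof(Complex)), b(sizeof(Complex)), result(sizeof(Complex));
    std::memcpy(a.data(), &lhs, sizeof lhs);
    std::memcpy(b.data(), &rhs, sizeof rhs);
    const Value* args[] = { &a, &b };
    plus(args, &result, nullptr);
    Complex out;
    std::memcpy(&out, result.data(), sizeof out);
    return out;
}
```

The caller never interprets the 16 bytes itself; only addComplex() knows they hold two doubles, which is the division of labour the architecture relies on.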

User-Defined Functions: How SciDB Provides Datatype Instances

As you can see from the example code above, SciDB uses a typesystem::Value class to encapsulate information about all type value instances. The Value class provides a set of methods for getting and setting the "value" of the class for each of the SciDB built-in types: getType() and setType(), where Type stands for the name of a built-in type. For example, for a typesystem::Value val; instance that contains a string, these are val.getString() and val.setString().

From the perspective of the SciDB engine, all UDFs have the same basic signature:

void functionUDF(const Value** args, Value* res, void*)
{

}

Each function must "know" how many arguments it is to receive. These arguments are obtained from the vector of typesystem::Value pointers that makes up the first argument. Each UDF (currently) returns a single result and the location where this result is to be placed is passed in by reference in the second argument. The final argument is a pointer to a data structure that conveys information about the state of the engine, and is a means of passing data between repeated calls to the UDF within the same query.
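As an illustration of this calling convention, here is a minimal sketch of a two-argument UDF that computes a greatest common factor. The Value class is a deliberately tiny stand-in that holds a single int64 (the real class is far more general), and callGcd is a hypothetical helper that plays the role of the engine: it builds the argument vector and supplies the result slot.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for SciDB's Value, reduced to a single int64 payload.
class Value {
public:
    Value() : _v(0) {}
    explicit Value(int64_t v) : _v(v) {}
    int64_t getInt64() const { return _v; }
    void setInt64(int64_t v) { _v = v; }
private:
    int64_t _v;
};

// gcd(int64, int64) -> int64. The function itself "knows" that it receives
// exactly two arguments; nothing in the signature says so.
static void gcdInt64(const Value** args, Value* res, void* /*state, unused here*/)
{
    int64_t a = args[0]->getInt64();
    int64_t b = args[1]->getInt64();
    if (a < 0) a = -a;   // work on magnitudes (INT64_MIN is not handled in this sketch)
    if (b < 0) b = -b;
    while (b != 0) {     // Euclid's algorithm
        int64_t t = a % b;
        a = b;
        b = t;
    }
    res->setInt64(a);
}

// Hypothetical helper acting as the engine: assembles args and the result slot.
int64_t callGcd(int64_t x, int64_t y)
{
    Value first(x), second(y), result;
    const Value* args[] = { &first, &second };
    gcdInt64(args, &result, nullptr);
    assert(result.getInt64() >= 0);
    return result.getInt64();
}
```

With these definitions, callGcd(12, 18) yields 6 and callGcd(0, 5) yields 5.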

Loading a Plugin

Each of these example plugins included with the SciDB distribution is built (by default) at the time we build the core engine. However, SciDB does not load unregistered plugins when it starts up. To use one of the examples you need to load it into the SciDB instance. The following figure illustrates how to load shared libraries containing plugins into SciDB using first, the AFL interface, and second, our AQL query language.

--
-- AFL load_library() operation
load_library ( 'librational.so' )

--
-- AQL 'load library' syntax
load library 'librational.so';

The act of loading a plugin shared library first registers the library in the SciDB system catalogs. Then it opens and examines the shared library to store its contents with SciDB's internal extension management subsystem. Shared library modules that are registered with the SciDB instance will be loaded at system start time.

To unload a library, run:

--
-- AFL unload_library() operation
unload_library('libpoint1')

--
-- AQL 'unload library' syntax
unload library 'libpoint1'

This command unregisters the library in the system catalog. The library will not be loaded on subsequent restarts, but it might not be unloaded immediately because running queries may still be using it.

Tutorial: Creating SciDB Plugins

This section explains the steps needed for creating a new plugin for SciDB.

Designing your UDT

Your UDT will need the following kinds of UDFs.

  1. UDFs that construct instances of your new type based on the values of other types. In general the types you will use as input to these UDFs will be built-in types. It is typical, for instance, to use a string as a source for a new data type's contents. The following UDF converts a string with a particular format into an instance of a rational number UDT.
//
// This is the struct used to store the data inside SciDB. 
typedef struct
{
    int64_t num;
    int64_t denom;
} SciDB_Rational;

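void str2Rational(const Value** args, Value* res, void*)
{
    int64_t n, d;
    SciDB_Rational* r = (SciDB_Rational*)res->data();

    // SCNi64 is the <cinttypes> scan macro for int64_t. The spaces around it keep
    // C++11 compilers from reading the macro name as a user-defined literal suffix.
    if (sscanf(args[0]->getString(), "(%" SCNi64 "/%" SCNi64 ")", &n, &d) != 2)
        throw PLUGIN_USER_EXCEPTION("librational", SCIDB_SE_UDO, RATIONAL_E_CANT_CONVERT_TO_RATIONAL)
            << args[0]->getString();

    // Store "(0/0)" as 0/1 so that boost::rational is not given a zero denominator.
    if ((0 == d) && (0 == n))
        d = 1;

    boost::rational<int64_t> rp0(n, d);
    r->num   = rp0.numerator();
    r->denom = rp0.denominator();
}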

Note that the "string to UDT" conversion function is particularly important: the type ( string ) -> type UDF is the one used by the load() operation to bulk ingest data into SciDB.
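To experiment with the parsing and normalization logic outside SciDB, the conversion can be sketched as a plain function. The names Rational and parseRational are hypothetical, std::gcd (C++17) stands in for boost::rational, and a standard exception stands in for the plugin exception macro. Rejecting a zero denominator explicitly is a choice made for this sketch.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <stdexcept>
#include <string>

struct Rational {
    int64_t num;
    int64_t denom;
};

// Parses text of the form "(n/d)" and returns the value in lowest terms
// with a positive denominator.
Rational parseRational(const std::string& text)
{
    int64_t n = 0, d = 0;
    if (std::sscanf(text.c_str(), "(%" SCNi64 "/%" SCNi64 ")", &n, &d) != 2)
        throw std::invalid_argument("cannot convert to rational: " + text);

    if (n == 0 && d == 0)
        d = 1;                       // same convention as str2Rational above
    if (d == 0)
        throw std::invalid_argument("zero denominator: " + text);

    int64_t g = std::gcd(n, d);      // non-negative, and non-zero because d != 0
    if (d < 0)
        g = -g;                      // dividing by -g moves the sign to the numerator
    Rational result = { n / g, d / g };
    assert(result.denom > 0);
    return result;
}
```

For example, parseRational("(3/6)") produces 1/2 and parseRational("(2/-4)") produces -1/2.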

  2. UDFs that convert your UDT back into a built-in type, or a number of built-in types. In the case of the complex type, for example, you can either write a UDF that composes the 16 bytes into a string, or else a pair of UDFs that extract the real and imaginary portions of the type.
static void reComplex(const Value** args, Value* res, void*)
{
   Complex& a = *(Complex*)args[0]->data();
   res->setDouble(a.re);
}

static void imComplex(const Value** args, Value* res, void*)
{
   Complex& a = *(Complex*)args[0]->data();
   res->setDouble(a.im);
}

Remember that SciDB will not perform implicit casting. You need to include these UDFs in any queries that pull these values out of the database.
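The other option mentioned above, composing the 16 bytes into a string, might look like the following sketch. The Value class here is a stand-in that carries either the raw bytes of a Complex or a string result, and complex2Str and toText are hypothetical names. The "(re+imi)" output format is an arbitrary choice for the illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

struct Complex { double re; double im; };

// Stand-in for SciDB's Value: raw bytes for a Complex, plus a string slot.
class Value {
public:
    Value() { std::memset(_bytes, 0, sizeof _bytes); }
    explicit Value(const Complex& c) { std::memcpy(_bytes, &c, sizeof c); }
    const void* data() const { return _bytes; }
    void setString(const std::string& s) { _text = s; }
    const std::string& getString() const { return _text; }
private:
    unsigned char _bytes[sizeof(Complex)];
    std::string _text;
};

// string ( complex ) -> string, following the common UDF signature.
static void complex2Str(const Value** args, Value* res, void*)
{
    assert(args[0] != nullptr);
    Complex c;
    std::memcpy(&c, args[0]->data(), sizeof c);
    char buffer[64];
    // "%+g" always prints the sign, so the imaginary part supplies its own '+' or '-'.
    std::snprintf(buffer, sizeof buffer, "(%g%+gi)", c.re, c.im);
    res->setString(buffer);
}

// Hypothetical helper that plays the role of the engine for one call.
std::string toText(const Complex& c)
{
    Value input(c), result;
    const Value* args[] = { &input };
    complex2Str(args, &result, nullptr);
    return result.getString();
}
```

With this format, the value 1.5 - 2i is rendered as "(1.5-2i)".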

  3. UDFs that perform common type operations, such as simple arithmetic or relational operations. These do not need to be supported for every datatype. While it makes sense to support the full set of relational operators for a datatype that can be ordered (such as rational numbers), values in the complex domain cannot be ordered (for sorting, say). All SciDB needs for ordering are two UDFs: one to return TRUE when two type values are equal, and a second to return TRUE when one type value is less than another.
void rationalLT(const Value** args, Value* res, void* v)
{
    SciDB_Rational* r0 = (SciDB_Rational*)args[0]->data();
    SciDB_Rational* r1 = (SciDB_Rational*)args[1]->data();

    check_zero(r0);
    check_zero(r1);

    boost::rational<int64_t> rp0(r0->num, r0->denom);
    boost::rational<int64_t> rp1(r1->num, r1->denom);

    res->setBool(rp0 < rp1);
}

void rationalEQ(const Value** args, Value* res, void* v)
{
    SciDB_Rational* r0 = (SciDB_Rational*)args[0]->data();
    SciDB_Rational* r1 = (SciDB_Rational*)args[1]->data();

    check_zero(r0);
    check_zero(r1);

    boost::rational<int64_t> rp0(r0->num, r0->denom);
    boost::rational<int64_t> rp1(r1->num, r1->denom);

    res->setBool(rp0 == rp1);
}

  4. UDFs that are necessary to support the integration of the UDT with other facilities: aggregates like AVG(), MAX() and MIN(), for example. MAX() and MIN() use the UDFs that order instance values. If your type has peculiar requirements for MAX() and MIN(), it might be reasonable to add dedicated UDFs for them.
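To see why "equal" and "less than" are enough, the sketch below derives the remaining relational operators and MIN/MAX from just those two functions. It is not SciDB's implementation: the functions work on a plain Rational struct rather than on Value instances, they assume positive denominators, and the cross-multiplication ignores overflow for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Rational {
    int64_t num;
    int64_t denom;   // assumed > 0 throughout this sketch
};

// The two primitives. Cross-multiplication avoids division; overflow is ignored here.
bool rationalLT(const Rational& a, const Rational& b) { return a.num * b.denom <  b.num * a.denom; }
bool rationalEQ(const Rational& a, const Rational& b) { return a.num * b.denom == b.num * a.denom; }

// Everything else follows from the two primitives.
bool rationalGT(const Rational& a, const Rational& b) { return rationalLT(b, a); }
bool rationalLE(const Rational& a, const Rational& b) { return rationalLT(a, b) || rationalEQ(a, b); }
bool rationalGE(const Rational& a, const Rational& b) { return !rationalLT(a, b); }
bool rationalNE(const Rational& a, const Rational& b) { return !rationalEQ(a, b); }

// MIN and MAX need nothing beyond "less than".
Rational rationalMin(const std::vector<Rational>& values)
{
    assert(!values.empty());
    return *std::min_element(values.begin(), values.end(), rationalLT);
}

Rational rationalMax(const std::vector<Rational>& values)
{
    assert(!values.empty());
    return *std::max_element(values.begin(), values.end(), rationalLT);
}
```

A type with unusual ordering requirements would replace only rationalLT and rationalEQ; the derived functions would follow automatically.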

Exceptions and Error Handling

Your UDFs will often need to check for errors and exceptions in their code. SciDB provides facilities to report to the engine that your UDF has encountered an error, and what kind of error. Doing this allows the SciDB engine to terminate the query and report some useful status information to the log file. Within a plugin, errors and exceptions are thrown using the PLUGIN_USER_EXCEPTION macro:

    throw PLUGIN_USER_EXCEPTION(<plugin name>, SCIDB_SE_UDO, \
    <plugin error code>) << <args>;

For example,

void str2Rational(const Value** args, Value* res, void*)
{
    int64_t n, d;
    SciDB_Rational* r = (SciDB_Rational*)res->data();

    // Report unparseable input to the engine, attaching the offending string.
    // SCNi64 is the <cinttypes> scan macro for int64_t.
    if (sscanf(args[0]->getString(), "(%" SCNi64 "/%" SCNi64 ")", &n, &d) != 2)
        throw PLUGIN_USER_EXCEPTION("librational", SCIDB_SE_UDO, RATIONAL_E_CANT_CONVERT_TO_RATIONAL)
            << args[0]->getString();

    if ((0 == d) && (0 == n))
        d = 1;

    boost::rational<int64_t> rp0(n, d);
    r->num   = rp0.numerator();
    r->denom = rp0.denominator();
}

For a full list of the terse error codes that you can throw from within a UDF, consult the '~/include/system/ErrorCodes.h' file.
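The "throw MACRO(...) << args" idiom works because the object being thrown accepts streamed arguments before the throw takes effect. The sketch below shows one way such an exception type could be written. PluginError, DEMO_E_DIVIDE_BY_ZERO, safeDivide, and errorTextFor are all hypothetical names for this illustration and say nothing about how SciDB implements its macro.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>

// Hypothetical exception type that collects streamed arguments into its message.
class PluginError : public std::exception {
public:
    PluginError(const std::string& plugin, int errorCode)
        : _message(plugin + " error " + std::to_string(errorCode) + ": ") {}

    // Appends any streamable argument and returns *this so calls can be chained.
    template <typename T>
    PluginError& operator<<(const T& argument) {
        std::ostringstream out;
        out << argument;
        _message += out.str();
        return *this;
    }

    const char* what() const noexcept override { return _message.c_str(); }

private:
    std::string _message;
};

const int DEMO_E_DIVIDE_BY_ZERO = 1;   // hypothetical plugin error code

// A UDF-like function that reports a bad argument instead of computing garbage.
double safeDivide(double numerator, double denominator)
{
    if (denominator == 0.0)
        throw PluginError("libdemo", DEMO_E_DIVIDE_BY_ZERO) << numerator << "/" << denominator;
    return numerator / denominator;
}

// Returns the error text for a failing call, or an empty string on success.
std::string errorTextFor(double numerator, double denominator)
{
    try {
        double quotient = safeDivide(numerator, denominator);
        assert(quotient == quotient);   // a successful call never yields NaN here
        return "";
    } catch (const std::exception& e) {
        return e.what();
    }
}
```

Because throw binds more loosely than <<, the whole chain is evaluated first and the fully described exception is what gets thrown.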

Registering Your 'C' Functions as UDFs

Once you have implemented your functions, you should register them with the SciDB facilities for extracting information from a shared library. The SciDB install provides a set of 'C' macros to do this. These macros are:

Macro Name:  REGISTER_TYPE ( name, length )
Description: Instructs SciDB to register a new UDT in its catalogs with the name provided (note that this argument to the macro is not a string) and the length, in bytes, of the type instance values.
Example:     REGISTER_TYPE ( complex, 16 )

Macro Name:  REGISTER_FUNCTION ( name, input argument types, output argument type, function pointer )
Description: Instructs SciDB to register a new UDF in its catalogs. The new UDF can be called in AQL or AFL using the first argument name (again, not a string). The function is expected to take a list of argument types as input, and return a value of the type provided. The actual reference to the function you want SciDB to call is the last argument to the macro.
Example:     REGISTER_FUNCTION(+, ("complex", "complex"), "complex", addComplex);

Macro Name:  REGISTER_CONVERTER ( input type, output type, conversion cost, function pointer )
Description: From time to time SciDB needs to convert types, and it can require UDFs to perform this operation. This macro is how you register conversions.
Example:     REGISTER_CONVERTER(string, complex, EXPLICIT_CONVERSION_COST, string2complex);
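Registration macros of this kind are commonly built on static objects whose constructors run when the shared library is loaded. The sketch below illustrates that general technique only. DEMO_REGISTER_FUNCTION, Registrar, registry, and callByName are hypothetical names, the functions take plain doubles rather than Value instances, and none of it is taken from SciDB's headers.

```cpp
#include <cassert>
#include <map>
#include <string>

using BinaryFunction = double (*)(double, double);

// A function-local static guarantees the map exists before any registrar uses it.
std::map<std::string, BinaryFunction>& registry()
{
    static std::map<std::string, BinaryFunction> functions;
    return functions;
}

// Each Registrar object adds one entry to the registry when it is constructed.
struct Registrar {
    Registrar(const char* name, BinaryFunction fn) { registry()[name] = fn; }
};

#define DEMO_CONCAT_(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_(a, b)

// #name turns the bare token (for example +) into the string "+", which is why
// the name argument is not written as a string. __LINE__ keeps variable names unique.
#define DEMO_REGISTER_FUNCTION(name, fn) \
    static Registrar DEMO_CONCAT(demoRegistrar_, __LINE__)(#name, fn);

static double addDoubles(double a, double b) { return a + b; }
static double mulDoubles(double a, double b) { return a * b; }

DEMO_REGISTER_FUNCTION(+, addDoubles)
DEMO_REGISTER_FUNCTION(*, mulDoubles)

// Looks a function up by its registered name and calls it.
double callByName(const std::string& name, double a, double b)
{
    auto it = registry().find(name);
    assert(it != registry().end());
    return it->second(a, b);
}
```

Nothing calls the registration code explicitly: the static Registrar objects are initialized before main() runs (or when a shared library is opened), so by the time a lookup happens the entries are already in place.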

A Simple Recipe

The simplest way to implement your own plugin library is to copy the style of the examples.

  1. Create a new directory in parallel to the one that implements one of the examples, say the complex type, and copy that example's files into it.

In the ~/examples/CMakeLists.txt file, add a new line with the name of the new directory. Let's say the new directory is named "point1":

add_subdirectory("complex")   # already exists
add_subdirectory("point1")    # the reference to your new plugin directory

At this point you will want to rename point1/complex.cpp to something more appropriate for the purposes of the library.

  2. Change the contents of the new ~/examples/point1/CMakeLists.txt file to get the server to build your new plugin library.
  3. Make your modifications in the new source code file: ~/examples/point1/point.cpp.
  4. Using "make", build "libpoint1.so". It will be placed into the plugins directory alongside "libcomplex.so".
  5. Load your new library module.

User-Defined Operators

The most complicated user-defined objects are user-defined operators. Every operator in SciDB is a pair of objects:

  • A logical operator class, and
  • A physical operator class.

The main purposes of the logical operator are:

  • to infer an array schema, and
  • to provide information about expected inputs and parameters of the operator.

Ideally, the logical operator is common to every operator of the same class. However, the logical operator can have several implementations called physical operators. The main purpose of a physical operator is to execute the operator's implementation.

Every operator, logical or physical, can have a state. States are created by special factory methods. Every instance of an operator is a new instance of the class. This means that you can add a new field to inherited classes.
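The relationship between one logical operator, several physical implementations, factories, and per-instance state can be sketched as follows. All class and function names below (the simplified PhysicalOperator, PhysicalSumForward, PhysicalSumBackward, OperatorLibrary, makeDemoLibrary) are hypothetical, and the interfaces are far simpler than SciDB's real base classes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified base class for physical operators.
class PhysicalOperator {
public:
    virtual ~PhysicalOperator() = default;
    virtual double execute(const std::vector<double>& input) = 0;
    int executions() const { return _executions; }
protected:
    int _executions = 0;   // per-instance state
};

// Two physical implementations of a single logical "sum" operator.
class PhysicalSumForward : public PhysicalOperator {
public:
    double execute(const std::vector<double>& input) override {
        ++_executions;
        double total = 0.0;
        for (double v : input) total += v;
        return total;
    }
};

class PhysicalSumBackward : public PhysicalOperator {
public:
    double execute(const std::vector<double>& input) override {
        ++_executions;
        double total = 0.0;
        for (auto it = input.rbegin(); it != input.rend(); ++it) total += *it;
        return total;
    }
};

using PhysicalFactory = std::function<std::unique_ptr<PhysicalOperator>()>;
using OperatorKey = std::pair<std::string, std::string>;   // (logical name, physical name)

// Factories are stored by name; every create() call returns a brand-new instance.
class OperatorLibrary {
public:
    void addFactory(const std::string& logicalName, const std::string& physicalName,
                    PhysicalFactory factory) {
        _factories[OperatorKey(logicalName, physicalName)] = std::move(factory);
    }
    std::unique_ptr<PhysicalOperator> create(const std::string& logicalName,
                                             const std::string& physicalName) const {
        auto it = _factories.find(OperatorKey(logicalName, physicalName));
        assert(it != _factories.end());
        return it->second();
    }
private:
    std::map<OperatorKey, PhysicalFactory> _factories;
};

OperatorLibrary makeDemoLibrary()
{
    OperatorLibrary library;
    library.addFactory("sum", "forward",
        [] { return std::unique_ptr<PhysicalOperator>(new PhysicalSumForward()); });
    library.addFactory("sum", "backward",
        [] { return std::unique_ptr<PhysicalOperator>(new PhysicalSumBackward()); });
    return library;
}
```

Because each instance comes fresh from a factory, a field such as _executions belongs to one use of the operator and is never shared between instances.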

Creating a User-Defined Operator

The easiest way to create a new operator is to find the closest built-in operator, copy-and-paste it into a separate folder, and change the existing implementation into the desired implementation.

In the example/operators directory in your SciDB build you can find a stub example for creating a plugin with user-defined operators. You can replace the example stubs with the built-in operator implementation that is closest to what you want your new operator to do, and then rename the internal classes and operator names.

The following sections provide short descriptions of base classes for logical and physical operators and descriptive comments about class members.

Logical Operator Example

The logical operator must inherit from the LogicalOperator class and implement a constructor and the inferSchema method:

class LogicalStub : public LogicalOperator
{
public:
    LogicalStub(const std::string& logicalName, const std::string& alias):
        LogicalOperator(logicalName, alias)
    {
        /**
         * See built-in operators implementation for example
         */
    }

    ArrayDesc inferSchema(std::vector<ArrayDesc> schemas, boost::shared_ptr<Query> query)
    {
        /**
         * See built-in operators implementation for example
         */
        return ArrayDesc();
    }

};

The constructor contains code for the declaration of possible inputs and parameters. For example, the APPLY operator has the following constructor:

    Apply(const std::string& logicalName, const std::string& alias):
        LogicalOperator(logicalName, alias)
    {
        _properties.tile = true;
        ADD_PARAM_INPUT()
        ADD_PARAM_OUT_ATTRIBUTE_NAME("void")//0
        ADD_PARAM_EXPRESSION("void")        //1
        ADD_PARAM_VARIES()
    }
  • _properties.tile is true if the operator can work in tile mode.
  • ADD_PARAM_INPUT() says that the operator expects one more input (in this case, an input array).
  • ADD_PARAM_OUT_ATTRIBUTE_NAME("void") says that the operator will add a new attribute with the given data type ("void" means "any"). inferSchema will produce the real data types based on the input schema.
  • ADD_PARAM_EXPRESSION("void") says that the operator expects one expression with "any" ("void") data type. You may add other attributes and attribute kinds as well.
  • ADD_PARAM_VARIES() means that APPLY can have a variable number of parameters. In this case you need to implement one more virtual method, nextVaryParamPlaceholder. See the APPLY implementation for an example.
  • inferSchema provides the schema for the resultant array.
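For an APPLY-like operator, schema inference amounts to copying the input schema and appending the new attribute with the type of its expression. The sketch below shows that idea with drastically simplified stand-ins: ArrayDesc and AttributeDesc here are plain structs, inferApplySchema is a hypothetical free function, and rejecting a duplicate attribute name is an assumption of the sketch rather than documented APPLY behaviour.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Simplified stand-ins for the schema classes.
struct AttributeDesc {
    std::string name;
    std::string type;
};

struct ArrayDesc {
    std::string name;
    std::vector<AttributeDesc> attributes;
};

// Output schema = input schema + one new attribute. The new attribute's real
// type comes from the expression, resolving the "void" ("any") placeholder.
ArrayDesc inferApplySchema(const std::vector<ArrayDesc>& schemas,
                           const std::string& newAttributeName,
                           const std::string& expressionType)
{
    assert(schemas.size() == 1);   // one ADD_PARAM_INPUT() means exactly one input array

    ArrayDesc result = schemas[0];
    for (const AttributeDesc& attribute : result.attributes) {
        if (attribute.name == newAttributeName)
            throw std::invalid_argument("duplicate attribute name: " + newAttributeName);
    }
    result.attributes.push_back(AttributeDesc{newAttributeName, expressionType});
    return result;
}
```

Note that nothing is computed over the data at this stage: the function only describes what the result array will look like.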

Physical operators

Physical operators must inherit from the PhysicalOperator class and implement the execute method:

class PhysicalStub: public PhysicalOperator
{
public:
    PhysicalStub(const std::string& logicalName, const std::string& physicalName,
                 const Parameters& parameters, const ArrayDesc& schema):
        PhysicalOperator(logicalName, physicalName, parameters, schema)
    {
    }

    shared_ptr<Array> execute(std::vector<shared_ptr<Array> >& inputArrays,
                              shared_ptr<Query> query)
    {
        /**
         * See built-in operators implementation for example
         */
        return shared_ptr<Array>();
    }
};

For example, here is the APPLY operator:

    boost::shared_ptr<Array> execute(vector< boost::shared_ptr<Array> >& inputArrays,
                                     boost::shared_ptr<Query> query)
    {
        assert(inputArrays.size() == 1);
        assert(_parameters.size()%2 == 0);

        vector<shared_ptr<Expression> > expressions(0);

        size_t currentParam = 0;
        for(size_t i =0; i< _schema.getAttributes().size(); i++)
        {
            assert(_parameters[currentParam]->getParamType() == PARAM_ATTRIBUTE_REF);
            assert(_parameters[currentParam+1]->getParamType() == PARAM_PHYSICAL_EXPRESSION);

            string const& schemaAttName = _schema.getAttributes()[i].getName();
            string const& paramAttName =
                ((boost::shared_ptr<OperatorParamReference>&)_parameters[currentParam])->getObjectName();

            if(schemaAttName!=paramAttName)
            {
                expressions.push_back(shared_ptr<Expression>());
            }
            else
            {
                expressions.push_back(
                    ((boost::shared_ptr<OperatorParamPhysicalExpression>&)_parameters[currentParam+1])->getExpression());
                currentParam+=2;
            }

            if(currentParam == _parameters.size())
            {
                for (size_t j = i+1; j< _schema.getAttributes().size(); j++)
                {
                    expressions.push_back( shared_ptr<Expression> () );
                }
                break;
            }
        }

        assert(currentParam == _parameters.size());
        assert(expressions.size() == _schema.getAttributes().size());

        boost::shared_ptr<Array> input = inputArrays[0];
        return boost::shared_ptr<Array>(new ApplyArray(_schema, input,
                                                       expressions, query, _tileMode));
    }

The execute() method takes a number of input arrays and the query context. It can use all methods of the input arrays and perform any evaluations. The result must be a new array instance.

It is also possible to create a pipelined array instance that performs evaluations only when data is requested. For example, you may want to evaluate a chunk only when the getChunk method is called. ApplyArray in the above code is an example of such an array.
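The pipelining idea can be sketched with a wrapper array that holds its input and an expression, and does no work until a chunk is requested. The classes below (a simplified Array, MemArray, LazyApplyArray, and the Chunk alias) are hypothetical stand-ins. SciDB's real Array and chunk interfaces are considerably richer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Chunk = std::vector<double>;

// Simplified stand-in for an array interface: a sequence of chunks.
class Array {
public:
    virtual ~Array() = default;
    virtual std::size_t chunkCount() const = 0;
    virtual Chunk getChunk(std::size_t chunkNo) = 0;
};

// A materialized array that simply stores its chunks.
class MemArray : public Array {
public:
    explicit MemArray(std::vector<Chunk> chunks) : _chunks(std::move(chunks)) {}
    std::size_t chunkCount() const override { return _chunks.size(); }
    Chunk getChunk(std::size_t chunkNo) override { return _chunks.at(chunkNo); }
private:
    std::vector<Chunk> _chunks;
};

// A pipelined array: construction is cheap, evaluation happens in getChunk().
class LazyApplyArray : public Array {
public:
    LazyApplyArray(std::shared_ptr<Array> input, std::function<double(double)> expression)
        : _input(std::move(input)), _expression(std::move(expression)), _chunksEvaluated(0)
    {
        assert(_input != nullptr);
    }

    std::size_t chunkCount() const override { return _input->chunkCount(); }

    Chunk getChunk(std::size_t chunkNo) override {
        Chunk chunk = _input->getChunk(chunkNo);   // pull from upstream on demand
        for (double& cell : chunk) cell = _expression(cell);
        ++_chunksEvaluated;
        return chunk;
    }

    std::size_t chunksEvaluated() const { return _chunksEvaluated; }

private:
    std::shared_ptr<Array> _input;
    std::function<double(double)> _expression;
    std::size_t _chunksEvaluated;
};
```

An execute() method written in this style returns immediately with the wrapper, and a consumer that reads only one chunk pays for only one chunk's worth of evaluation.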