Posted to dev@drill.apache.org by Shadi Khalifa <kh...@cs.queensu.ca> on 2015/04/01 13:06:23 UTC

Re: Passing multiple columns to a UDAF

Thanks Jason for all this information!! Really appreciate it!!! 
Regards
Shadi Khalifa
PhD Candidate
School of Computing, Queen's University, Canada
I'm just a neuron in the society collective brain

01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101 01100111 01111001 01110000 01110100 
Please consider your environmental responsibility before printing this e-mail

 


On Tuesday, March 31, 2015 4:04 PM, Jason Altekruse <al...@gmail.com> wrote:

 Hi Shadi,

Unfortunately, that isn't going to be a good strategy. We actually removed the RecordBatch entirely from the UDF interfaces recently, to prevent exposing so much internal information to UDFs. To do something like this, we would want to define a new interface for UDFs.

One shortcoming, which I believe is related to what you are trying to do, is that we cannot treat the top-level schema of a Drill record the same way we currently treat the complex map type. Drill supports passing non-scalar values, in the form of maps and repeated types, into UDFs (these maps and lists can be nested within one another to build nearly arbitrarily complex data structures). The interface for passing in these structures is the FieldReader, which works much like an iterator/visitor over a tree structure.

The two functions that use this interface today are convertTo_JSON and kvgen (also called mappify). Both take a complex object as input: convertTo_JSON produces a VarChar with the JSON representation, and kvgen applies a transformation that makes the key/value pairs in a map queryable (more information in the wiki link below [1]).
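To make the kvgen transformation concrete, here is roughly what it does to a small map (sketched from the wiki page linked below; the exact output formatting may differ slightly):

kvgen input:
----------------
{ "data" : { "a" : 1, "b" : 2 } }

kvgen(data) output:
----------------
[ { "key" : "a", "value" : 1 }, { "key" : "b", "value" : 2 } ]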

The important thing to note is that these functions can only be invoked on a particular field in the schema. It would make sense to allow them to be invoked on the entire root schema, treating it like a map itself, possibly with syntax like convertTo_JSON(*). (NOTE: this is not supported right now, and hasn't even been in a design doc; it will not work today.)

For example, these two datasets:

flat schema:
----------------
{
    "a" : 1,
    "b" : 2
}

complex schema:
-----------------------
{
    "data" : {
          "a" : 1,
          "b"
    }
}

With the first dataset, you can only access the individual data members, using syntax like: table_name.a

With the second, however, you can pass multiple fields into a function for processing, because the data is stored under a map at the root of the schema; for example, you can produce the JSON representation in a VarChar using: convertTo_JSON(data)
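For example, a query along these lines should hand the whole map to the function (the file path here is just an illustration; substitute your own storage plugin and file):

    select convertTo_JSON(t.data) from dfs.`/path/to/complex.json` t;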

If you are willing to change the structure of your incoming data, I think this might be a viable strategy for passing a variable number of arguments into a function. Today it comes with the restriction that any list you use can hold only a single data type. But if there is a discrete number of possible traits, you can use a map instead of a list, because fields nested within a map can have different data types. That is, you cannot currently have a mixed-type array like [1, true, "a string"], but you could put the values in their own fields:

    { "a_number" : 1, "a_bool" : true, "a_str" : "a string" }

or keep one list per type nested inside the map:

    { "list_numbers" : [1], "list_bools" : [true], "list_strings" : ["a string"] }

As long as I've written this much, I should say that this alternate strategy will currently only work if you change the source data. We do not support the concept of re-nesting data within a query. Say you wanted to use an array to pass a variable number of arguments: if the source data had the values in separate fields, we currently *do not* support something like select field_1 as new_list[0], field_2 as new_list[1]. As before, this hasn't even been fully discussed, so it will not work today, and it isn't a declaration of how this may work in Drill in the future; it's just to demonstrate what we don't do today. If this feature existed, you could use the new list in an outer query and pass it in as the variable-length argument to your function. To do something like that today, you have to modify the source data to put it in this form.
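To sketch what I have in mind once the data is shaped that way: the map column would arrive in your aggregate as a FieldReader. The following is only a rough, untested sketch; I haven't verified that FieldReader params are accepted in POINT_AGGREGATE functions today, and the holder types, the iteration calls, and the model-update placeholder are all assumptions to check against the source:

    import org.apache.drill.exec.expr.DrillAggFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.annotations.Workspace;
    import org.apache.drill.exec.expr.holders.Float8Holder;
    import org.apache.drill.exec.vector.complex.reader.FieldReader;

    public class MyAggrFnFunctions {

      @FunctionTemplate(name = "myAggrFn",
                        scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
      public static class MyAggrFn implements DrillAggFunc {
        @Param FieldReader traits;    // the map column, e.g. `data` above
        @Workspace Float8Holder sum;  // whatever running state the model needs
        @Output Float8Holder out;

        public void setup() {
          sum = new Float8Holder();
          sum.value = 0;
        }

        public void add() {
          // a map FieldReader is iterable over its child field names,
          // and each child is reachable as its own FieldReader
          for (String child : traits) {
            Object v = traits.reader(child).readObject();
            // feed v into the model update here (placeholder)
          }
        }

        public void output() {
          out.value = sum.value;
        }

        public void reset() {
          sum.value = 0;
        }
      }
    }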

To see how the FieldReader is used, check out this function definition in the Drill source:
org.apache.drill.exec.expr.fn.impl.Mappify
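From memory, its declaration has roughly this shape (paraphrased rather than copied from the source tree, with the method bodies elided to stubs, so check the actual file):

    import io.netty.buffer.DrillBuf;
    import javax.inject.Inject;
    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.vector.complex.reader.FieldReader;
    import org.apache.drill.exec.vector.complex.writer.BaseWriter;

    public class Mappify {

      @FunctionTemplate(names = {"mappify", "kvgen"},
                        scope = FunctionTemplate.FunctionScope.SIMPLE)
      public static class ConvertMapToKeyValuePairs implements DrillSimpleFunc {
        @Param  FieldReader reader;   // the complex input, visited as a tree
        @Inject DrillBuf buffer;      // buffer for the generated output
        @Output BaseWriter.ComplexWriter writer;

        public void setup() { }

        public void eval() {
          // the real implementation walks `reader` and uses `writer`
          // to emit one { "key", "value" } map per child field
        }
      }
    }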

Documentation on its usage in queries
[1] https://cwiki.apache.org/confluence/display/DRILL/KVGEN+Function

On Tue, Mar 31, 2015 at 10:24 AM, Shadi Khalifa <kh...@cs.queensu.ca> wrote:

I wonder if I can extract this data from the RecordBatch? Any ideas?
Regards
Shadi Khalifa
PhD Candidate
School of Computing, Queen's University, Canada
I'm just a neuron in the society collective brain

01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101 01100111 01111001 01110000 01110100 
Please consider your environmental responsibility before printing this e-mail




On Tuesday, March 31, 2015 1:16 PM, Jacques Nadeau <ja...@apache.org> wrote:


It isn't yet supported, but it is something I think a lot of people would
find useful. Depending on how ambitious you are, maybe you could pick it up?

On Tue, Mar 31, 2015 at 10:05 AM, Shadi Khalifa <kh...@cs.queensu.ca>
wrote:

> Hello everyone,
> I wonder if there is a way to send a variable number (array) of attributes
> (columns) to a custom user-defined aggregate function.
> I want to be able to have something like:
>
>     Select myAggrFn(col1, col2, ..., coln) from mytable;
>
> I wonder if there is something like the following, or anything else that
> can handle this case:
>
>     @FunctionTemplate(name = "myAggrFn",
>                       scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
>     public static class MyAggrFn implements DrillAggFunc {
>         @Param ObjectHolder[] in;
>
> I know it's weird to have a function like that, but I'm implementing
> machine learning in Drill and need to pass some columns, or maybe the
> whole row, to the aggregate function to train and use the model.
> Regards
> Shadi Khalifa
> PhD Candidate
> School of Computing, Queen's University, Canada
> I'm just a neuron in the society collective brain
>
> 01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
> 01100111 01111001 01110000 01110100
> Please consider your environmental responsibility before printing this
> e-mail
>
>