Posted to dev@drill.apache.org by Jim Bates <jb...@maprtech.com> on 2015/07/04 19:29:16 UTC

Some questions on UDFs

I have working aggregation and simple UDFs. I've been trying to document
and understand each of the options available in a Drill UDF: the different
FunctionScopes (which ones are allowed and which are not), the impact of
the different cost categories, and the steps needed to handle each of the
supported data types and structures in Drill.

Here are a few of my current roadblocks. Any pointers would be greatly
appreciated.


   1. I've been trying to understand how to correctly use RepeatedHolders
   of whatever type. For this discussion let's start with a
   RepeatedBigIntHolder. I'm trying to figure out the best way to create a new
   one. I have not figured out where in the existing Drill code someone does
   this. If I use a RepeatedBigIntHolder as a Workspace object it is null to
   start with. I created a new one in the startup section of the UDF but the
   vector was null. I can find no reference for creating a new BigIntVector.
   There is a way to create a BigIntVector and I did find an example of
   creating a new VarCharVector, but I can't do that using the Drill jar files
   from 1.0. The org.apache.drill.common.types.TypeProtos and
   the org.apache.drill.common.types.TypeProtos.MinorType classes do not
   appear to be accessible from the Drill jar files.
   2. What is the best way to close out a UDF in the event it generates an
   exception? Are there specific steps one should follow to make a clean exit
   in a catch block that are beneficial to Drill?

Re: Some questions on UDFs

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
I'm not sure, but I don't think you can/should create a BufferAllocator
inside a UDF.

On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:

> I did get a new RepeatedBigIntHolder built and added a BigIntVector
> to it. I'll try it in the UDF tomorrow and see if there is a difference in
> the ways I found to get a BufferAllocator.
>
> .
> .
> .
> @Inject DrillBuf buffer;
> @Workspace RepeatedBigIntHolder yList;
> .
> .
> .
> @Override
> public void setup() {
> .
> .
> .
> //org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> org.apache.drill.exec.memory.BufferAllocator allocator =
>     new org.apache.drill.exec.memory.TopLevelAllocator();
> yList = new RepeatedBigIntHolder();
> yList.vector = new org.apache.drill.exec.vector.BigIntVector(
>     org.apache.drill.exec.record.MaterializedField.create(
>         new org.apache.drill.common.expression.SchemaPath("bigints",
>             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
>         org.apache.drill.common.types.Types.optional(
>             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
>     allocator);
> .
> .
> .
> }
>
>
>
> On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com> wrote:
>
> > I still have issues finding the correct way to create and use a
> > RepeatedHolder, and Writers are a non-starter for Workspace values. I can
> > make do with creating a concatenated string in a VarCharHolder for small
> > data sets to get past this in the short term and finish testing the
> > output values I expect, but won't be able to do anything at scale till I
> > figure out how to make a repeated list.
> >
> > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com> wrote:
> >
> >> Well... Converting from string to integers anyway... Too many 4th of July
> >> hot dogs. Going into nitrate overload. :)
> >>
> >> I am pulling an array of string values from json data. The string values
> >> are actually integers. I am converting to integers and summing each
> >> array entry to the final tally.
> >>
> >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com> wrote:
> >>
> >>> Ted,
> >>>
> >>> Yes, I started out just getting a basic count to work. I am trying to
> >>> keep the workflow as close to a basic user as possible. As such, I am
> >>> building and using the MapR Apache Drill sandbox to test.
> >>>
> >>>
> >>>    1. Always look at the drillbit.log file to see if Drill had any
> >>>    issues loading your UDF. That was where I learned that all
> >>>    workspace values needed to be holders:
> >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> >>>       function class
> >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> >>>       field xList. Aggregate function 'MyLinearRegression1' workspace
> >>>       variable 'xList' is of type 'interface
> >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> >>>       Please change it to Holder type.
> >>>    2. Error messages:
> >>>       - If you get an error in this format it means that Drill cannot
> >>>       find your function, so it probably didn't load it. Back to step 1:
> >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> >>>          match found for function signature MyFunctionName(<ANY>)
> >>>       - If you get an error in this format it means that the function
> >>>       is there but Drill could not find a signature that matched the
> >>>       param types or param numbers you were passing it. The exact
> >>>       wording will change, but "Missing function implementation" is the
> >>>       key phrase to look for:
> >>>          - Error: SYSTEM ERROR:
> >>>          org.apache.drill.exec.exception.SchemaChangeException: Failure
> >>>          while trying to materialize incoming schema.  Errors:
> >>>          - Error in expression at index -1.  Error: Missing function
> >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> >>>          expression: --UNKNOWN EXPRESSION--
> >>>    3. In your function definition for aggregate functions you need
> >>>    to set null processing to internal and your isRandom to false.
> >>>    Example below:
> >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> >>>       isBinaryCommutative = false, costCategory =
> >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> >>>
> >>> Below is an example from the Apache Drill tutorial data sets contained
> >>> in the MapR Apache Drill sandbox. I am pulling an array of string values
> >>> from json data. The string values are actually integers. I am converting
> >>> to string and summing each array entry to the final tally. This in no way
> >>> represents what this data was for but it did become a handy way for me
> >>> to peck out the "correct" way to build an aggregation UDF function
> >>>
> >>> @FunctionTemplate(name = "MyArraySum",
> >>>     scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
> >>>     nulls = FunctionTemplate.NullHandling.INTERNAL,
> >>>     isRandom = false,
> >>>     isBinaryCommutative = false,
> >>>     costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
> >>> public static class MyArraySum implements DrillAggFunc {
> >>>
> >>>   @Param RepeatedVarCharHolder listToSearch;
> >>>   @Workspace NullableBigIntHolder count;
> >>>   @Workspace NullableBigIntHolder sum;
> >>>   @Workspace NullableVarCharHolder vc;
> >>>   @Output BigIntHolder out;
> >>>
> >>>   @Override
> >>>   public void setup() {
> >>>     count.value = 0;
> >>>     sum.value = 0;
> >>>   }
> >>>
> >>>   @Override
> >>>   public void add() {
> >>>     int c = listToSearch.end - listToSearch.start;
> >>>     int val = 0;
> >>>     try {
> >>>       for (int i = 0; i < c; i++) {
> >>>         listToSearch.vector.getAccessor().get(i, vc);
> >>>         String inputStr =
> >>>             org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
> >>>                 .toStringFromUTF8(vc.start, vc.end, vc.buffer);
> >>>         val = Integer.parseInt(inputStr);
> >>>         sum.value = sum.value + val;
> >>>       }
> >>>     } catch (Exception e) {
> >>>       val = 0;
> >>>     }
> >>>     count.value = count.value + 1;
> >>>   }
> >>>
> >>> Example select statement:
> >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
> >>>
> >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>
> >>>> Jim,
> >>>>
> >>>> I think that you may be having trouble with aggregators in general.
> >>>>
> >>>> Have you been able to build *any* aggregator of anything?  I haven't.
> >>>>
> >>>> When I try to build an aggregator of int's or doubles, I get a very
> >>>> persistent problem with Drill even seeing my aggregates:
> >>>>
> >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> >>>> cp.`employee.json`;*
> >>>>
> >>>> Jul 04, 2015 4:19:35 PM
> >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> >>>> found for function signature sum_int(<ANY>)
> >>>>
> >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
> >>>> <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> >>>> column 8 to line 1, column 27: No match found for function signature
> >>>> sum_int(<ANY>)
> >>>>
> >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
> >>>> found for function signature sum_int(<ANY>)*
> >>>>
> >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010]
> >>>> (state=,code=0)*
> >>>>
> >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
> >>>> cp.`employee.json`*;
> >>>>
> >>>> Jul 04, 2015 4:19:45 PM
> >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> >>>> found for function signature sum_int(<NUMERIC>)
> >>>>
> >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
> >>>> <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> >>>> column 8 to line 1, column 40: No match found for function signature
> >>>> sum_int(<NUMERIC>)
> >>>>
> >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
> >>>> found for function signature sum_int(<NUMERIC>)*
> >>>>
> >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010]
> >>>> (state=,code=0)*
> >>>>
> >>>> 0: jdbc:drill:zk=local>
> >>>>
> >>>>
> >>>> It looks like there is some undocumented subtlety about how to register
> >>>> an aggregator.
> >>>>
> >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com> wrote:
> >>>>
> >>>> > I'm working on the same thing. I want to aggregate a list of values.
> >>>> > It has been a search and guess game for the most part. I'm still stuck
> >>>> > in the process of getting the values all into a list. The writers look
> >>>> > interesting, but for aggregation functions it looks like the input is
> >>>> > the param and output objects can't hold the aggregation steps. The
> >>>> > Workspace is where that happens. If I try to use a Writer in a
> >>>> > workspace it won't load and tells me to change it to Holders, which was
> >>>> > why I was using them to start with. Maybe I'm missing the architecture
> >>>> > of the agg function. It looked like it was....
> >>>> >
> >>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
> >>>> > through @Workspace vars in add -> finalize @Output in output.
> >>>> >
> >>>> > So I'm back to trying to figure out how to create a
> >>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <te...@gmail.com> wrote:
> >>>> >
> >>>> > > I am working on trying to build any kind of list constructing
> >>>> > > aggregator and having absolute fits.
> >>>> > >
> >>>> > > To simplify life, I decided to just build a generic list builder
> >>>> > > that is a scalar function that returns a list containing its
> >>>> > > argument.  Thus zoop(3) => [3], zoop('abc') => ['abc'] and
> >>>> > > zoop([1,2,3]) => [[1,2,3]].
> >>>> > >
> >>>> > > The ComplexWriter looks like the place to go. As usual, the
> >>>> > > complete lack of comments in most of Drill makes this very hard
> >>>> > > since I have to guess what works and what doesn't.
> >>>> > >
> >>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
> >>>> > > method.  I used this in zip and it works nicely to construct lists
> >>>> > > for output.  I note that the resulting ListWriter has a method
> >>>> > > copyReader(FieldReader var1) which looks really good.
> >>>> > >
> >>>> > > Unfortunately, the only implementation of copyReader() is in
> >>>> > > AbstractFieldWriter and it looks like this:
> >>>> > >
> >>>> > > public void copyReader(FieldReader reader) {
> >>>> > >     this.fail("Copy FieldReader");
> >>>> > > }
> >>>> > >
> >>>> > > I would like to formally say at this point "WTF"?
> >>>> > >
> >>>> > > In digging in further, I see other methods that look handy like
> >>>> > >
> >>>> > > public void write(IntHolder holder) {
> >>>> > >     this.fail("Int");
> >>>> > > }
> >>>> > >
> >>>> > > And then in looking at implementations, it looks like there is a
> >>>> > > combinatorial explosion because every type seems to need a write
> >>>> > > method for every other type.
> >>>> > >
> >>>> > > What is the thought here?  How can I copy an arbitrary value into a
> >>>> > > list?
> >>>> > >
> >>>> > > My next thought was to build code that dispatches on type.  There
> >>>> > > is a method called getType() on the FieldReader.  Unfortunately,
> >>>> > > that drives into code generated by protoc and I see no way to
> >>>> > > dispatch on the type of an incoming value.
> >>>> > >
> >>>> > >
> >>>> > > How is this supposed to work?
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <baid.mehant@gmail.com> wrote:
> >>>> > >
> >>>> > > > For a detailed example on using the ComplexWriter interface you can
> >>>> > > > take a look at the Mappify
> >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
> >>>> > > > (kvgen) function. The function itself is very simple; however, it
> >>>> > > > makes use of the utility methods in MappifyUtility
> >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
> >>>> > > > and MapUtility
> >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
> >>>> > > > which perform most of the work.
> >>>> > > >
> >>>> > > > Currently we don't have a generic infrastructure to handle errors
> >>>> > > > coming out of functions. However there is UserException, which when
> >>>> > > > raised will make sure that Drill does not gobble up the error
> >>>> > > > message in that exception. So you can probably throw a
> >>>> > > > UserException with the failing input in your function to make sure
> >>>> > > > it propagates to the user.
> >>>> > > >
> >>>> > > > Thanks
> >>>> > > > Mehant
> >>>> > > >
> >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <jacques@apache.org> wrote:
> >>>> > > >
> >>>> > > > > *Holders are for both input and output.  You can also use
> >>>> > > > > ComplexWriter for output and FieldReader for input if you want to
> >>>> > > > > write or read a complex value.
> >>>> > > > >
> >>>> > > > > I don't think we've provided a really clean way to construct a
> >>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
> >>>> > > > > reaching into a bunch of internal interfaces in Drill.  However, I
> >>>> > > > > would recommend using the ComplexWriter output pattern for now.
> >>>> > > > > This will be a little less efficient but substantially less
> >>>> > > > > brittle.  I suggest you open up a JIRA for using a
> >>>> > > > > Repeated*Holder as an output.
> >>>> > > > >
> >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>> > > > >
> >>>> > > > > > Holders are for input, I think.
> >>>> > > > > >
> >>>> > > > > > Try the different kinds of writers.
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jbates@maprtech.com> wrote:
> >>>> > > > > >
> >>>> > > > > > > I've got using a RepeatedHolder as a @Param working. I was
> >>>> > > > > > > working on a custom aggregator function using DrillAggFunc. In
> >>>> > > > > > > this I can do simple things, but if I want to build a list of
> >>>> > > > > > > values and do something with it in the final output method I
> >>>> > > > > > > think I need to use RepeatedHolders in the @Workspace. To do
> >>>> > > > > > > that I need to create a new one in the setup method. I can't
> >>>> > > > > > > get one built. They all require a BufferAllocator to be passed
> >>>> > > > > > > in to build it. I have not found a way to get an allocator
> >>>> > > > > > > yet. Any suggestions?
> >>>> > > > > > >
> >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>> > > > > > >
> >>>> > > > > > > > If you look at the zip function in
> >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
> >>>> > > > > > > > have an example of building a structure.
> >>>> > > > > > > >
> >>>> > > > > > > > The basic idea is that your output is denoted as
> >>>> > > > > > > >
> >>>> > > > > > > >         @Output
> >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> >>>> > > > > > > >
> >>>> > > > > > > > The pattern for building a list of lists of integers is like
> >>>> > > > > > > > this:
> >>>> > > > > > > >
> >>>> > > > > > > >         writer.setValueCount(n);
> >>>> > > > > > > >         ...
> >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> >>>> > > > > > > >         outer.start(); // [ outer list
> >>>> > > > > > > >         ...
> >>>> > > > > > > >         // for each inner list {
> >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> >>>> > > > > > > >             inner.start(); // [ inner list
> >>>> > > > > > > >             // for each inner list element {
> >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> >>>> > > > > > > >             }
> >>>> > > > > > > >             inner.end();   // ] inner list
> >>>> > > > > > > >         }
> >>>> > > > > > > >         outer.end(); // ] outer list
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > >



-- 

Abdelhakim Deneche

Software Engineer


Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
Well... After much guesswork I got it to work... at least as far as
intellectual curiosity goes; real-world use is a different story. Thanks to
those who attempted to assist.

Query:
SELECT MyList(test_field1)  FROM (SELECT test_field1 FROM
`hive.default`.`my_hive_table` limit 10);
+----------------------------------------------------------+
|                          EXPR$0                          |
+----------------------------------------------------------+
| [18108,19719,15559,14152,18577,17170,13010,11603,16028]  |
+----------------------------------------------------------+

Function:
@FunctionTemplate(name = "MyList",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL,
    isRandom = false,
    isBinaryCommutative = false,
    costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
public static class MyList implements DrillAggFunc {

  @Param NullableBigIntHolder xValue;
  @Inject DrillBuf buffer;
  @Workspace IntHolder count;
  @Workspace RepeatedBigIntHolder xList;
  @Output RepeatedBigIntHolder out;

  @Override
  public void setup() {
    count = new IntHolder();
    count.value = 0;
    org.apache.drill.exec.memory.BufferAllocator allocator =
        new org.apache.drill.exec.memory.TopLevelAllocator();
    xList = new RepeatedBigIntHolder();
    xList.vector = new org.apache.drill.exec.vector.BigIntVector(
        org.apache.drill.exec.record.MaterializedField.create(
            new org.apache.drill.common.expression.SchemaPath("bigints",
                org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
            org.apache.drill.common.types.Types.optional(
                org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
        allocator);
    org.apache.drill.exec.vector.AllocationHelper.allocate(xList.vector, 100, 50);
    xList.vector.getMutator().generateTestData(100);
    xList.vector.getMutator().setValueCount(100);
  }

  @Override
  public void add() {
    if (xValue == null) {
      return;
    }
    int size = xList.end - xList.start;
    if (count.value + 1 > size) {
      xList.vector.getMutator().setValueCount(count.value + 1);
    }
    xList.vector.getMutator().setSafe(count.value, xValue.value);
    xList.end = count.value;
    count.value = count.value + 1;
  }

  @Override
  public void output() {
    out.vector = xList.vector;
    out.start = xList.start;
    out.end = xList.end;
  }

  @Override
  public void reset() {
  }
}
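
For anyone following along outside Drill, the setup -> add -> output flow
that MyList relies on can be modelled in plain Java. This is only a sketch:
RangeHolder and ListAggSketch are made-up stand-ins, not Drill classes, and
nothing here touches Drill's allocator or vectors. The one assumption carried
over from the thread is that a repeated holder's start/end pair is a
half-open range, which is how MyArraySum computes its element count
(end - start). Under that reading, end has to be moved past a value after the
value is written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Repeated*Holder: a backing store plus a
// half-open [start, end) range. This is NOT Drill's class.
final class RangeHolder {
    final List<Long> values = new ArrayList<>();
    int start = 0;
    int end = 0; // exclusive, so the element count is end - start
}

// Hypothetical aggregator that follows the setup -> add -> output -> reset
// sequence described in the thread.
final class ListAggSketch {
    private RangeHolder workspace;

    // Runs once before any rows arrive: create the workspace state.
    void setup() {
        workspace = new RangeHolder();
    }

    // Runs once per input row; null inputs are skipped.
    void add(Long value) {
        if (value == null) {
            return;
        }
        workspace.values.add(value);
        // Advance end only after the write so the new value is in range.
        workspace.end = workspace.values.size();
    }

    // Runs once at the end: copy out exactly the [start, end) slice.
    List<Long> output() {
        return new ArrayList<>(
            workspace.values.subList(workspace.start, workspace.end));
    }

    // Clears state so the instance can aggregate another group.
    void reset() {
        setup();
    }

    // Convenience driver that plays the role of the engine calling the hooks.
    static List<Long> aggregate(Long... rows) {
        ListAggSketch agg = new ListAggSketch();
        agg.setup();
        for (Long row : rows) {
            agg.add(row);
        }
        return agg.output();
    }
}
```

Feeding ten non-null rows through ListAggSketch.aggregate(...) returns all
ten. In MyList, xList.end is assigned count.value before the increment, which
under the same half-open reading would leave the newest value outside the
range; that may be why the LIMIT 10 query shows nine values, though nothing
in the thread confirms it.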

On Sun, Jul 5, 2015 at 2:43 PM, Jim Bates <jb...@maprtech.com> wrote:

> I agree. I've gone way too deep into Drill to try and get this done. While
> it became clear to me that this is most likely not the way to do it, it
> has been a good learning experience around aggregation UDFs. I should be
> able to put a lot back into the docs on this as soon as I figure out how to
> get access to contribute to the docs. I'll file a JIRA on this but in the
> short term I would still like to finish off what I started. I have my
> RepeatedBigIntHolders that now no longer blow up when I try and push data
> into them. Now I'm trying to see if I can get anything back out.
>
> FunFunFun.
>
> On Sun, Jul 5, 2015 at 1:50 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
>> It isn't obvious because you shouldn't do it.  Please file a JIRA to add
>> real support for this type of output.
>>
>> Your current function would leak large amounts of memory that would
>> ultimately crash the node.
>>
>> Realistically, there are very few internal Drill APIs that you should
>> access via a UDF (injectables, holders, complexwriter, fieldreader and
>> helpers).  A post 1.0 goal was to provide a UDF interface JAR to ensure
>> people don't accidentally reach into Drill's internals.  (A later
>> possibility is bytecode weaving to completely protect against it).
>>
>> J
>>
>> On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > That was impressively non-obvious.
>> >
>> >
>> >
>> > On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:
>> >
>> > > I did get a new RepeatedBigIntHolder built and added a BigIntVector
>> added
>> > > to it. I'll try it in the UDF tomorrow and see if there is a
>> difference
>> > in
>> > > the ways I found to get a BufferAllocator.
>> > >
>> > > .
>> > > .
>> > > .
>> > > @Inject DrillBuf buffer;
>> > > @Workspace RepeatedBigIntHolder yList;
>> > > .
>> > > .
>> > > .
>> > > @Override
>> > > public void setup() {
>> > > .
>> > > .
>> > > .
>> > > //org.apache.drill.exec.memory.BufferAllocator allocator =
>> > > buffer.getAllocator();
>> > > org.apache.drill.exec.memory.BufferAllocator allocator =  new
>> > > org.apache.drill.exec.memory.TopLevelAllocator();
>> > > yList = new RepeatedBigIntHolder();
>> > > yList.vector = new
>> > >
>> > >
>> >
>> org.apache.drill.exec.vector.BigIntVector(org.apache.drill.exec.record.MaterializedField.create(new
>> > >
>> > >
>> >
>> org.apache.drill.common.expression.SchemaPath("bigints",org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
>> > >
>> > >
>> >
>> org.apache.drill.common.types.Types.optional(org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
>> > > allocator);
>> > > .
>> > > .
>> > > .
>> > > }
>> > >
>> > >
>> > >
>> > > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com>
>> wrote:
>> > >
>> > > > I still have issues finding the correct way to create and use a
>> > > > RepeatedHolder and Writers are a non starter for Workspace values. I
>> > can
>> > > > make do with creating a concatenated string in a VarCharHolder for
>> > small
>> > > > data sets to get past this in the short term and finish testing the
>> > > output
>> > > > values I expect but won't be able to do any scale till I figure out
>> how
>> > > to
>> > > > make a repeated list.
>> > > >
>> > > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com>
>> wrote:
>> > > >
>> > > >> Well... Converting from string to integers anyway... To many 4th of
>> > July
>> > > >> Hot Dogs. going into nitrate overload. :)
>> > > >>
>> > > >> I am pulling an array of string values from json data. The string
>> > values
>> > > >> are actually integers. I am converting to integers and summing each
>> > > >> array entry to the final tally.
>> > > >>
>> > > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com>
>> > wrote:
>> > > >>
>> > > >>> Ted,
>> > > >>>
>> > > >>> Yes, I started out just getting a basic count to work. I am
>> trying to
>> > > >>> keep the workflow as close to a basic user as possible. As such,
>> I am
>> > > >>> building and using the MapR Apache Drill sandbox to test.
>> > > >>>
>> > > >>>
>> > > >>>    1. Always look at the drillbits.log file to see if drill had
>> any
>> > > >>>    issues loading your UDF. That was where I learned that all
>> > > workspace values
>> > > >>>    needed to be holders
>> > > >>>       -
>> > > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure
>> loading
>> > > >>>       function class
>> > > >>>
>> > >  com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
>> > field
>> > > >>>       xList. Aggregate function 'MyLinearRegression1' workspace
>> > > variable 'xList'
>> > > >>>       is of type 'interface
>> > > >>>
>> > >
>> org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
>> > > >>>       Please change it to Holder type.
>> > > >>>    2. Error messages:
>> > > >>>       - If you get an error in this format it means that Drill cannot
>> > > >>>       find your function, so it probably didn't load it. Back to step 1:
>> > > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
>> > > >>>          match found for function signature MyFunctionName(<ANY>)
>> > > >>>       - If you get an error in this format it means that the function
>> > > >>>       is there but Drill could not find a signature that matched the
>> > > >>>       param types or param numbers you were passing it. The exact
>> > > >>>       wording will change but "Missing function implementation" is
>> > > >>>       the key phrase to look for:
>> > > >>>          - Error: SYSTEM ERROR:
>> > > >>>          org.apache.drill.exec.exception.SchemaChangeException:
>> > > >>>          Failure while trying to materialize incoming schema.  Errors:
>> > > >>>          - Error in expression at index -1.  Error: Missing function
>> > > >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
>> > > >>>          expression: --UNKNOWN EXPRESSION--
>> > > >>>    3. In your function definition for aggregate functions you need
>> > > >>>    to set null processing to internal and your isRandom to false.
>> > > >>>    Example below:
>> > > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
>> > > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>> > > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>> > > >>>       isBinaryCommutative = false, costCategory =
>> > > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
>> > > >>>
>> > > >>> Below is an example from the Apache Drill tutorial data sets contained
>> > > >>> in the MapR Apache Drill sandbox. I am pulling an array of string
>> > > >>> values from json data. The string values are actually integers. I am
>> > > >>> converting to string and summing each array entry to the final tally.
>> > > >>> This in no way represents what this data was for but it did become a
>> > > >>> handy way for me to peck out the "correct" way to build an aggregation
>> > > >>> UDF function.
>> > > >>>
>> > > >>> @FunctionTemplate(name = "MyArraySum", scope =
>> > > >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>> > > >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>> > > >>> isBinaryCommutative = false, costCategory =
>> > > >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
>> > > >>> public static class MyArraySum implements DrillAggFunc {
>> > > >>>
>> > > >>> @Param RepeatedVarCharHolder listToSearch;
>> > > >>> @Workspace NullableBigIntHolder count;
>> > > >>> @Workspace NullableBigIntHolder sum;
>> > > >>> @Workspace NullableVarCharHolder vc;
>> > > >>> @Output BigIntHolder out;
>> > > >>>
>> > > >>> @Override
>> > > >>> public void setup() {
>> > > >>> count.value=0;
>> > > >>> sum.value = 0;
>> > > >>> }
>> > > >>>
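>> > > >>> @Override
>> > > >>> public void add() {
>> > > >>> int val = 0;
>> > > >>> try {
>> > > >>> // start and end are offsets into the shared data vector, so index
>> > > >>> // from listToSearch.start rather than from 0.
>> > > >>> for(int i=listToSearch.start; i<listToSearch.end; i++){
>> > > >>> listToSearch.vector.getAccessor().get(i, vc);
>> > > >>> String inputStr =
>> > > >>> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(vc.start,
>> > > >>> vc.end, vc.buffer);
>> > > >>> val = Integer.parseInt(inputStr);
>> > > >>> sum.value = sum.value + val;
>> > > >>> }
>> > > >>> } catch (Exception e) {
>> > > >>> // A non-numeric entry abandons the rest of this row's list.
>> > > >>> val = 0;
>> > > >>> }
>> > > >>> count.value = count.value + 1;
>> > > >>> }
>> > > >>>
>> > > >>> // output() and reset() were not shown in the original post; these
>> > > >>> // minimal versions are an assumption about what was intended.
>> > > >>> @Override
>> > > >>> public void output() {
>> > > >>> out.value = sum.value;
>> > > >>> }
>> > > >>>
>> > > >>> @Override
>> > > >>> public void reset() {
>> > > >>> count.value = 0;
>> > > >>> sum.value = 0;
>> > > >>> }
>> > > >>> }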
>> > > >>>
>> > > >>> Example select statement:
>> > > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
>> > > >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t
>> limit
>> > 5);
>> > > >>>
>> > > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <
>> ted.dunning@gmail.com>
>> > > >>> wrote:
>> > > >>>
>> > > >>>> Jim,
>> > > >>>>
>> > > >>>> I think that you may be having trouble with aggregators in
>> general.
>> > > >>>>
>> > > >>>> Have you been able to build *any* aggregator of anything?  I
>> > haven't.
>> > > >>>>
>> > > >>>> When I try to build an aggregator of int's or doubles, I get a
>> very
>> > > >>>> persistent problem with Drill even seeing my aggregates:
>> > > >>>>
>> > > >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
>> > > >>>> cp.`employee.json`;*
>> > > >>>>
>> > > >>>> Jul 04, 2015 4:19:35 PM
>> > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>> > > >>>>
>> > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
>> > > match
>> > > >>>> found for function signature sum_int(<ANY>)
>> > > >>>>
>> > > >>>> Jul 04, 2015 4:19:35 PM
>> org.apache.calcite.runtime.CalciteException
>> > > >>>> <init>
>> > > >>>>
>> > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
>> > line
>> > > 1,
>> > > >>>> column 8 to line 1, column 27: No match found for function
>> signature
>> > > >>>> sum_int(<ANY>)
>> > > >>>>
>> > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27:
>> No
>> > > >>>> match
>> > > >>>> found for function signature sum_int(<ANY>)*
>> > > >>>>
>> > > >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on
>> 10.0.1.2:31010
>> > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
>> > > >>>>
>> > > >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int))
>> > from
>> > > >>>> cp.`employee.json`*;
>> > > >>>>
>> > > >>>> Jul 04, 2015 4:19:45 PM
>> > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>> > > >>>>
>> > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
>> > > match
>> > > >>>> found for function signature sum_int(<NUMERIC>)
>> > > >>>>
>> > > >>>> Jul 04, 2015 4:19:45 PM
>> org.apache.calcite.runtime.CalciteException
>> > > >>>> <init>
>> > > >>>>
>> > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
>> > line
>> > > 1,
>> > > >>>> column 8 to line 1, column 40: No match found for function
>> signature
>> > > >>>> sum_int(<NUMERIC>)
>> > > >>>>
>> > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40:
>> No
>> > > >>>> match
>> > > >>>> found for function signature sum_int(<NUMERIC>)*
>> > > >>>>
>> > > >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on
>> 10.0.1.2:31010
>> > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
>> > > >>>>
>> > > >>>> 0: jdbc:drill:zk=local>
>> > > >>>>
>> > > >>>>
>> > > >>>> It looks like there is some undocumented subtlety about how to
>> > > register
>> > > >>>> an
>> > > >>>> aggregator.
>> > > >>>>
>> > > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com>
>> > > wrote:
>> > > >>>>
>> > > >>>> > I'm working on the same thing. I want to aggregate a list of
>> > values.
>> > > >>>> It has
>> > > >>>> > been a search and guess game for the most part. I'm still
>> stuck in
>> > > the
>> > > >>>> > process of getting the values all into a list. The writers look
>> > > >>>> interesting
>> > > >>>> > but for aggregation functions  it looks like the input is the
>> > param
>> > > >>>> and
>> > > >>>> > output objects can't hold the aggregations steps. The
>> Workspace is
>> > > >>>> where
>> > > >>>> > that happens. If I try and use a Writer in a workspace it won't
>> > load
>> > > >>>> and
>> > > >>>> > tells me to change it to Holders which was why I was using
>> them to
>> > > >>>> start
>> > > >>>> > with. Maybe I'm missing the architecture of the agg function.
>> It
>> > > >>>> looked
>> > > >>>> > like it was....
>> > > >>>> >
>> > > >>>> > @Param comes in -> initialize @Workspace vars in setup ->
>> process
>> > > data
>> > > >>>> > through @Workspace vars in add -> finalize @Output in output.
>> > > >>>> >
>> > > >>>> > So I'm back to trying to figure out how to create a
>> > > >>>> RepeatedBigIntHolder or
>> > > >>>> > a RepeatedVarCharHolder...
>> > > >>>> >
>> > > >>>> >
>> > > >>>> >
>> > > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <
>> > ted.dunning@gmail.com>
>> > > >>>> wrote:
>> > > >>>> >
>> > > >>>> > > I am working on trying to build any kind of list constructing
>> > > >>>> aggregator
>> > > >>>> > > and having absolute fits.
>> > > >>>> > >
>> > > >>>> > > To simplify life, I decided to just build a generic list
>> builder
>> > > >>>> that is
>> > > >>>> > a
>> > > >>>> > > scalar function that returns a list containing its argument.
>> > Thus
>> > > >>>> > zoop(3)
>> > > >>>> > > => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) => [[1,2,3]].
>> > > >>>> > >
>> > > >>>> > > The ComplexWriter looks like the place to go. As usual, the
>> > > >>>> complete lack
>> > > >>>> > > of comments in most of Drill makes this very hard since I
>> have
>> > to
>> > > >>>> guess
>> > > >>>> > > what works and what doesn't.
>> > > >>>> > >
>> > > >>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
>> > > >>>> method.  I
>> > > >>>> > > used this in zip and it works nicely to construct lists for
>> > > >>>> output.  I
>> > > >>>> > note
>> > > >>>> > > that the resulting ListWriter has a method
>> > copyReader(FieldReader
>> > > >>>> var1)
>> > > >>>> > > which looks really good.
>> > > >>>> > >
>> > > >>>> > > Unfortunately, the only implementation of copyReader() is in
>> > > >>>> > > AbstractFieldWriter and it looks like this:
>> > > >>>> > >
>> > > >>>> > > public void copyReader(FieldReader reader) {
>> > > >>>> > >     this.fail("Copy FieldReader");
>> > > >>>> > > }
>> > > >>>> > >
>> > > >>>> > > I would like to formally say at this point "WTF"?
>> > > >>>> > >
>> > > >>>> > > In digging in further, I see other methods that look handy
>> like
>> > > >>>> > >
>> > > >>>> > > public void write(IntHolder holder) {
>> > > >>>> > >     this.fail("Int");
>> > > >>>> > > }
>> > > >>>> > >
>> > > >>>> > > And then in looking at implementations, it looks like there
>> is a
>> > > >>>> > > combinatorial explosion because every type seems to need a
>> write
>> > > >>>> method
>> > > >>>> > for
>> > > >>>> > > every other type.
>> > > >>>> > >
>> > > >>>> > > What is the thought here?  How can I copy an arbitrary value
>> > into
>> > > a
>> > > >>>> list?
>> > > >>>> > >
>> > > >>>> > > My next thought was to build code that dispatches on type.
>> > There
>> > > >>>> is a
>> > > >>>> > > method called getType() on the FieldReader.  Unfortunately,
>> that
>> > > >>>> drives
>> > > >>>> > > into code generated by protoc and I see no way to dispatch on
>> > the
>> > > >>>> type of
>> > > >>>> > > an incoming value.
>> > > >>>> > >
>> > > >>>> > >
>> > > >>>> > > How is this supposed to work?
>> > > >>>> > >
>> > > >>>> > >
>> > > >>>> > >
>> > > >>>> > >
>> > > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
>> > > baid.mehant@gmail.com>
>> > > >>>> > wrote:
>> > > >>>> > >
>> > > >>>> > > > For a detailed example on using ComplexWriter interface you
>> > can
>> > > >>>> take a
>> > > >>>> > > look
>> > > >>>> > > > at the Mappify
>> > > >>>> > > > <
>> > > >>>> > > >
>> > > >>>> > >
>> > > >>>> >
>> > > >>>>
>> > >
>> >
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
>> > > >>>> > > > >
>> > > >>>> > > > (kvgen) function. The function itself is very simple
>> however
>> > it
>> > > >>>> makes
>> > > >>>> > use
>> > > >>>> > > > of the utility methods in MappifyUtility
>> > > >>>> > > > <
>> > > >>>> > > >
>> > > >>>> > >
>> > > >>>> >
>> > > >>>>
>> > >
>> >
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
>> > > >>>> > > > >
>> > > >>>> > > > and MapUtility
>> > > >>>> > > > <
>> > > >>>> > > >
>> > > >>>> > >
>> > > >>>> >
>> > > >>>>
>> > >
>> >
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
>> > > >>>> > > > >
>> > > >>>> > > > which perform most of the work.
>> > > >>>> > > >
>> > > >>>> > > > Currently we don't have a generic infrastructure to handle
>> > > errors
>> > > >>>> > coming
>> > > >>>> > > > out of functions. However there is UserException, which
>> when
>> > > >>>> raised
>> > > >>>> > will
>> > > >>>> > > > make sure that Drill does not gobble up the error message
>> in
>> > > that
>> > > >>>> > > > exception. So you can probably throw a UserException with
>> the
>> > > >>>> failing
>> > > >>>> > > input
>> > > >>>> > > > in your function to make sure it propagates to the user.
>> > > >>>> > > >
>> > > >>>> > > > Thanks
>> > > >>>> > > > Mehant
>> > > >>>> > > >
>> > > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
>> > > >>>> jacques@apache.org>
>> > > >>>> > > wrote:
>> > > >>>> > > >
>> > > >>>> > > > > *Holders are for both input and output.  You can also use
>> > > >>>> > > > > ComplexWriter for output and FieldReader for input if you want
>> > > >>>> > > > > to write or read a complex value.
>> > > >>>> > > > >
>> > > >>>> > > > > I don't think we've provided a really clean way to
>> > construct a
>> > > >>>> > > > > Repeated*Holder for output purposes.  You can probably
>> do it
>> > > by
>> > > >>>> > > reaching
>> > > >>>> > > > > into a bunch of internal interfaces in Drill.  However, I
>> > > would
>> > > >>>> > > recommend
>> > > >>>> > > > > using the ComplexWriter output pattern for now.  This
>> will
>> > be
>> > > a
>> > > >>>> > little
>> > > >>>> > > > less
>> > > >>>> > > > > efficient but substantially less brittle.  I suggest you
>> > open
>> > > >>>> up a
>> > > >>>> > jira
>> > > >>>> > > > for
>> > > >>>> > > > > using a Repeated*Holder as an output.
>> > > >>>> > > > >
>> > > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
>> > > >>>> ted.dunning@gmail.com>
>> > > >>>> > > > wrote:
>> > > >>>> > > > >
>> > > >>>> > > > > > Holders are for input, I think.
>> > > >>>> > > > > >
>> > > >>>> > > > > > Try the different kinds of writers.
>> > > >>>> > > > > >
>> > > >>>> > > > > >
>> > > >>>> > > > > >
>> > > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
>> > > >>>> jbates@maprtech.com>
>> > > >>>> > > > wrote:
>> > > >>>> > > > > >
>> > > >>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was working on a
>> > > >>>> > > > > > > custom aggregator function using DrillAggFunc. In this I can do simple
>> > > >>>> > > > > > > things, but if I want to build a list of values and do something with it
>> > > >>>> > > > > > > in the final output method I think I need to use RepeatedHolders in the
>> > > >>>> > > > > > > @Workspace. To do that I need to create a new one in the setup method. I
>> > > >>>> > > > > > > can't get one built. They all require a BufferAllocator to be passed in
>> > > >>>> > > > > > > to build it. I have not found a way to get an allocator yet. Any
>> > > >>>> > > > > > > suggestions?
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
>> > > >>>> > ted.dunning@gmail.com
>> > > >>>> > > >
>> > > >>>> > > > > > wrote:
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > > If you look at the zip function in
>> > > >>>> > > > > > > >
>> https://github.com/mapr-demos/simple-drill-functions
>> > > you
>> > > >>>> can
>> > > >>>> > > have
>> > > >>>> > > > an
>> > > >>>> > > > > > > > example of building a structure.
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > > The basic idea is that your output is denoted as
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > >         @Output
>> > > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > > The pattern for building a list of lists of
>> integers
>> > is
>> > > >>>> like
>> > > >>>> > > this:
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > >         writer.setValueCount(n);
>> > > >>>> > > > > > > >         ...
>> > > >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
>> > > >>>> > > > > > > >         outer.start(); // [ outer list
>> > > >>>> > > > > > > >         ...
>> > > >>>> > > > > > > >         for (...) { // for each inner list
>> > > >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
>> > > >>>> > > > > > > >             inner.start(); // [ inner list
>> > > >>>> > > > > > > >             for (...) { // for each inner list element
>> > > >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
>> > > >>>> > > > > > > >             }
>> > > >>>> > > > > > > >             inner.end();   // ] inner list
>> > > >>>> > > > > > > >         }
>> > > >>>> > > > > > > >         outer.end(); // ] outer list
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
>> > > >>>> > jbates@maprtech.com>
>> > > >>>> > > > > > wrote:
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been trying to document
>> > > >>>> > > > > > > > > and understand each of the options available in a Drill UDF: understanding
>> > > >>>> > > > > > > > > the different FunctionScopes, the ones that are allowed, the ones that are
>> > > >>>> > > > > > > > > not; the impact of different cost categories; and the different steps needed
>> > > >>>> > > > > > > > > to handle any of the supported data types and structures in Drill.
>> > > >>>> > > > > > > > >
>> > > >>>> > > > > > > > > Here are a few of my current road blocks. Any pointers would be greatly
>> > > >>>> > > > > > > > > appreciated.
>> > > >>>> > > > > > > > >
>> > > >>>> > > > > > > > >
>> > > >>>> > > > > > > > >    1. I've been trying to understand how to correctly use RepeatedHolders
>> > > >>>> > > > > > > > >    of whatever type. For this discussion let's start with a
>> > > >>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the best way to create a
>> > > >>>> > > > > > > > >    new one. I have not figured out where in the existing Drill code someone
>> > > >>>> > > > > > > > >    does this. If I use a RepeatedBigIntHolder as a Workspace object it is
>> > > >>>> > > > > > > > >    null to start with. I created a new one in the startup section of the UDF
>> > > >>>> > > > > > > > >    but the vector was null. I can find no reference to creating a new
>> > > >>>> > > > > > > > >    BigIntVector. There is a way to create a BigIntVector and I did find an
>> > > >>>> > > > > > > > >    example of creating a new VarCharVector but I can't do that using the
>> > > >>>> > > > > > > > >    Drill jar files from 1.0. The org.apache.drill.common.types.TypeProtos
>> > > >>>> > > > > > > > >    and the org.apache.drill.common.types.TypeProtos.MinorType classes do
>> > > >>>> > > > > > > > >    not appear to be accessible from the Drill jar files.
>> > > >>>> > > > > > > > >    2. What is the best way to close out a UDF in the event it generates an
>> > > >>>> > > > > > > > >    exception? Are there specific steps one should follow to make a clean
>> > > >>>> > > > > > > > >    exit in a catch block that are beneficial to Drill?
>> > > >>>> > > > > > > > >
>> > > >>>> > > > > > > >
>> > > >>>> > > > > > >
>> > > >>>> > > > > >
>> > > >>>> > > > >
>> > > >>>> > > >
>> > > >>>> > >
>> > > >>>> >
>> > > >>>>
>> > > >>>
>> > > >>>
>> > > >>
>> > > >
>> > >
>> >
>>
>
>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
I agree. I've gone way too deep into Drill to try and get this done. While
it became clear to me that this is most likely not the way to do it, it
has been a good learning experience around aggregation UDFs. I should be
able to put a lot back into the docs on this as soon as I figure out how to
get access to contribute to the docs. I'll file a JIRA on this, but in the
short term I would still like to finish off what I started. I have my
RepeatedBigIntHolders, which no longer blow up when I try to push data
into them. Now I'm trying to see if I can get anything back out.

FunFunFun.
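
For anyone following along, here is a rough plain-Java sketch of the aggregate lifecycle discussed further down this thread (setup -> add per row -> output, plus reset). It is not Drill code: the class name ArraySumSketch and its fields are made up for illustration, and a List<String> stands in for the repeated VARCHAR input so that it runs without any Drill jars. Unlike the MyArraySum example quoted below, it skips only the malformed entry rather than abandoning the rest of the row, which is a design choice of the sketch and not something Drill prescribes.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the setup/add/output/reset lifecycle; not Drill code. */
public class ArraySumSketch {
    // "Workspace" state: survives across add() calls within one group.
    private long sum;
    private long count;

    /** Runs once before the first row of a group is seen. */
    public void setup() {
        sum = 0;
        count = 0;
    }

    /** Runs once per input row; each row carries a list of numeric strings. */
    public void add(List<String> row) {
        for (String s : row) {
            try {
                sum += Long.parseLong(s.trim());
            } catch (NumberFormatException e) {
                // Skip just this entry and keep summing the rest of the row.
            }
        }
        count++;
    }

    /** Runs once per group after the last add(); yields the final tally. */
    public long output() {
        return sum;
    }

    /** Clears the workspace so the same instance can serve the next group. */
    public void reset() {
        setup();
    }

    public static void main(String[] args) {
        ArraySumSketch agg = new ArraySumSketch();
        agg.setup();
        agg.add(Arrays.asList("1", "2", "x")); // "x" is skipped
        agg.add(Arrays.asList("10"));
        System.out.println(agg.output()); // prints 13
    }
}
```

The point of the sketch is only that per-group state has to live in the workspace fields, since nothing else carries over from one add() call to the next.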

On Sun, Jul 5, 2015 at 1:50 PM, Jacques Nadeau <ja...@apache.org> wrote:

> It isn't obvious because you shouldn't do it.  Please file a JIRA to add
> real support for this type of output.
>
> Your current function would leak large amounts of memory that would
> ultimately crash the node.
>
> Realistically, there are very few internal Drill APIs that you should
> access via a UDF (injectables, holders, complexwriter, fieldreader and
> helpers).  A post 1.0 goal was to provide a UDF interface JAR to ensure
> people don't accidentally reach into Drill's internals.  (A later
> possibility is bytecode weaving to completely protect against it).
>
> J
>
> On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > That was impressively non-obvious.
> >
> >
> >
> > On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:
> >
> > > > I did get a new RepeatedBigIntHolder built and a BigIntVector added
> > > > to it. I'll try it in the UDF tomorrow and see if there is a difference
> > > > in the ways I found to get a BufferAllocator.
> > >
> > > .
> > > .
> > > .
> > > @Inject DrillBuf buffer;
> > > @Workspace RepeatedBigIntHolder yList;
> > > .
> > > .
> > > .
> > > @Override
> > > public void setup() {
> > > .
> > > .
> > > .
> > > > //org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> > > > org.apache.drill.exec.memory.BufferAllocator allocator =
> > > >     new org.apache.drill.exec.memory.TopLevelAllocator();
> > > > yList = new RepeatedBigIntHolder();
> > > > yList.vector = new org.apache.drill.exec.vector.BigIntVector(
> > > >     org.apache.drill.exec.record.MaterializedField.create(
> > > >         new org.apache.drill.common.expression.SchemaPath("bigints",
> > > >             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
> > > >         org.apache.drill.common.types.Types.optional(
> > > >             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> > > >     allocator);
> > > .
> > > .
> > > .
> > > }
> > >
> > >
> > >
> > > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com> wrote:
> > >
> > > > > I still have issues finding the correct way to create and use a
> > > > > RepeatedHolder, and Writers are a non-starter for Workspace values. I
> > > > > can make do with creating a concatenated string in a VarCharHolder for
> > > > > small data sets to get past this in the short term and finish testing
> > > > > the output values I expect, but I won't be able to do anything at scale
> > > > > until I figure out how to make a repeated list.
> > > >
> > > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com>
> wrote:
> > > >
> > > > >> Well... Converting from string to integers anyway... Too many 4th of
> > > > >> July hot dogs. Going into nitrate overload. :)
> > > >>
> > > >> I am pulling an array of string values from json data. The string
> > values
> > > >> are actually integers. I am converting to integers and summing each
> > > >> array entry to the final tally.
> > > >>
> > > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com>
> > wrote:
> > > >>
> > > >>> Ted,
> > > >>>
> > > >>> Yes, I started out just getting a basic count to work. I am trying
> to
> > > >>> keep the workflow as close to a basic user as possible. As such, I
> am
> > > >>> building and using the MapR Apache Drill sandbox to test.
> > > >>>
> > > >>>
> > > > >>>    1. Always look at the drillbits.log file to see if Drill had any
> > > > >>>    issues loading your UDF. That was where I learned that all workspace
> > > > >>>    values needed to be holders:
> > > > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> > > > >>>       function class
> > > > >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> > > > >>>       field xList. Aggregate function 'MyLinearRegression1' workspace
> > > > >>>       variable 'xList' is of type 'interface
> > > > >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> > > > >>>       Please change it to Holder type.
> > > > >>>    2. Error messages:
> > > > >>>       - If you get an error in this format it means that Drill cannot
> > > > >>>       find your function, so it probably didn't load it. Back to step 1:
> > > > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> > > > >>>          match found for function signature MyFunctionName(<ANY>)
> > > > >>>       - If you get an error in this format it means that the function
> > > > >>>       is there but Drill could not find a signature that matched the
> > > > >>>       param types or param numbers you were passing it. The exact
> > > > >>>       wording will change but "Missing function implementation" is
> > > > >>>       the key phrase to look for:
> > > > >>>          - Error: SYSTEM ERROR:
> > > > >>>          org.apache.drill.exec.exception.SchemaChangeException:
> > > > >>>          Failure while trying to materialize incoming schema.  Errors:
> > > > >>>          - Error in expression at index -1.  Error: Missing function
> > > > >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> > > > >>>          expression: --UNKNOWN EXPRESSION--
> > > > >>>    3. In your function definition for aggregate functions you need
> > > > >>>    to set null processing to internal and your isRandom to false.
> > > > >>>    Example below:
> > > > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> > > > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > > >>>       isBinaryCommutative = false, costCategory =
> > > > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > >>>
> > > > >>> Below is an example from the Apache Drill tutorial data sets contained
> > > > >>> in the MapR Apache Drill sandbox. I am pulling an array of string
> > > > >>> values from json data. The string values are actually integers. I am
> > > > >>> converting to string and summing each array entry to the final tally.
> > > > >>> This in no way represents what this data was for but it did become a
> > > > >>> handy way for me to peck out the "correct" way to build an aggregation
> > > > >>> UDF function.
> > > >>>
> > > >>> @FunctionTemplate(name = "MyArraySum", scope =
> > > >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > >>> isBinaryCommutative = false, costCategory =
> > > >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > >>> public static class MyArraySum implements DrillAggFunc {
> > > >>>
> > > >>> @Param RepeatedVarCharHolder listToSearch;
> > > >>> @Workspace NullableBigIntHolder count;
> > > >>> @Workspace NullableBigIntHolder sum;
> > > >>> @Workspace NullableVarCharHolder vc;
> > > >>> @Output BigIntHolder out;
> > > >>>
> > > >>> @Override
> > > >>> public void setup() {
> > > >>> count.value=0;
> > > >>> sum.value = 0;
> > > >>> }
> > > >>>
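> > > > >>> @Override
> > > > >>> public void add() {
> > > > >>> int val = 0;
> > > > >>> try {
> > > > >>> // start and end are offsets into the shared data vector, so index
> > > > >>> // from listToSearch.start rather than from 0.
> > > > >>> for(int i=listToSearch.start; i<listToSearch.end; i++){
> > > > >>> listToSearch.vector.getAccessor().get(i, vc);
> > > > >>> String inputStr =
> > > > >>> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(vc.start,
> > > > >>> vc.end, vc.buffer);
> > > > >>> val = Integer.parseInt(inputStr);
> > > > >>> sum.value = sum.value + val;
> > > > >>> }
> > > > >>> } catch (Exception e) {
> > > > >>> // A non-numeric entry abandons the rest of this row's list.
> > > > >>> val = 0;
> > > > >>> }
> > > > >>> count.value = count.value + 1;
> > > > >>> }
> > > > >>>
> > > > >>> // output() and reset() were not shown in the original post; these
> > > > >>> // minimal versions are an assumption about what was intended.
> > > > >>> @Override
> > > > >>> public void output() {
> > > > >>> out.value = sum.value;
> > > > >>> }
> > > > >>>
> > > > >>> @Override
> > > > >>> public void reset() {
> > > > >>> count.value = 0;
> > > > >>> sum.value = 0;
> > > > >>> }
> > > > >>> }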
> > > >>>
> > > >>> Example select statement:
> > > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> > > >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit
> > 5);
> > > >>>
> > > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <ted.dunning@gmail.com
> >
> > > >>> wrote:
> > > >>>
> > > >>>> Jim,
> > > >>>>
> > > >>>> I think that you may be having trouble with aggregators in
> general.
> > > >>>>
> > > >>>> Have you been able to build *any* aggregator of anything?  I
> > haven't.
> > > >>>>
> > > >>>> When I try to build an aggregator of int's or doubles, I get a
> very
> > > >>>> persistent problem with Drill even seeing my aggregates:
> > > >>>>
> > > >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> > > >>>> cp.`employee.json`;*
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:35 PM
> > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > >>>>
> > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
> > > match
> > > >>>> found for function signature sum_int(<ANY>)
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:35 PM
> org.apache.calcite.runtime.CalciteException
> > > >>>> <init>
> > > >>>>
> > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
> > line
> > > 1,
> > > >>>> column 8 to line 1, column 27: No match found for function
> signature
> > > >>>> sum_int(<ANY>)
> > > >>>>
> > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27:
> No
> > > >>>> match
> > > >>>> found for function signature sum_int(<ANY>)*
> > > >>>>
> > > >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on
> 10.0.1.2:31010
> > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > > >>>>
> > > >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int))
> > from
> > > >>>> cp.`employee.json`*;
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:45 PM
> > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > >>>>
> > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
> > > match
> > > >>>> found for function signature sum_int(<NUMERIC>)
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:45 PM
> org.apache.calcite.runtime.CalciteException
> > > >>>> <init>
> > > >>>>
> > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
> > line
> > > 1,
> > > >>>> column 8 to line 1, column 40: No match found for function
> signature
> > > >>>> sum_int(<NUMERIC>)
> > > >>>>
> > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40:
> No
> > > >>>> match
> > > >>>> found for function signature sum_int(<NUMERIC>)*
> > > >>>>
> > > >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on
> 10.0.1.2:31010
> > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > > >>>>
> > > >>>> 0: jdbc:drill:zk=local>
> > > >>>>
> > > >>>>
> > > >>>> It looks like there is some undocumented subtlety about how to
> > > register
> > > >>>> an
> > > >>>> aggregator.
> > > >>>>
> > > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com>
> > > wrote:
> > > >>>>
> > > >>>> > I'm working on the same thing. I want to aggregate a list of
> > values.
> > > >>>> It has
> > > >>>> > been a search and guess game for the most part. I'm still stuck
> in
> > > the
> > > >>>> > process of getting the values all into a list. The writers look
> > > >>>> interesting
> > > >>>> > but for aggregation functions  it looks like the input is the
> > param
> > > >>>> and
> > > >>>> > output objects can't hold the aggregations steps. The Workspace
> is
> > > >>>> where
> > > >>>> > that happens. If I try and use a Writer in a workspace it won't
> > load
> > > >>>> and
> > > >>>> > tells me to change it to Holders which was why I was using them
> to
> > > >>>> start
> > > >>>> > with. Maybe I'm missing the architecture of the agg function. It
> > > >>>> looked
> > > >>>> > like it was....
> > > >>>> >
> > > >>>> > @Param comes in -> initialize @Workspace vars in setup ->
> process
> > > data
> > > >>>> > through @Workspace vars in add -> finalize @Output in output.
> > > >>>> >
> > > >>>> > So I'm back to trying to figure out how to create a
> > > >>>> RepeatedBigIntHolder or
> > > >>>> > a RepeatedVarCharHolder...
> > > >>>> >
> > > >>>> >
> > > >>>> >
> > > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > >>>> wrote:
> > > >>>> >
> > > >>>> > > I am working on trying to build any kind of list-constructing
> > > >>>> > > aggregator and having absolute fits.
> > > >>>> > >
> > > >>>> > > To simplify life, I decided to just build a generic list builder
> > > >>>> > > that is a scalar function that returns a list containing its
> > > >>>> > > argument. Thus zoop(3) => [3], zoop('abc') => ['abc'] and
> > > >>>> > > zoop([1,2,3]) => [[1,2,3]].
> > > >>>> > >
> > > >>>> > > The ComplexWriter looks like the place to go. As usual, the
> > > >>>> > > complete lack of comments in most of Drill makes this very hard
> > > >>>> > > since I have to guess what works and what doesn't.
> > > >>>> > >
> > > >>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
> > > >>>> > > method.  I used this in zip and it works nicely to construct lists
> > > >>>> > > for output.  I note that the resulting ListWriter has a method
> > > >>>> > > copyReader(FieldReader var1) which looks really good.
> > > >>>> > >
> > > >>>> > > Unfortunately, the only implementation of copyReader() is in
> > > >>>> > > AbstractFieldWriter and it looks like this:
> > > >>>> > >
> > > >>>> > > public void copyReader(FieldReader reader) {
> > > >>>> > >     this.fail("Copy FieldReader");
> > > >>>> > > }
> > > >>>> > >
> > > >>>> > > I would like to formally say at this point "WTF"?
> > > >>>> > >
> > > >>>> > > In digging in further, I see other methods that look handy, like
> > > >>>> > >
> > > >>>> > > public void write(IntHolder holder) {
> > > >>>> > >     this.fail("Int");
> > > >>>> > > }
> > > >>>> > >
> > > >>>> > > And then in looking at implementations, it looks like there is a
> > > >>>> > > combinatorial explosion because every type seems to need a write
> > > >>>> > > method for every other type.
> > > >>>> > >
> > > >>>> > > What is the thought here?  How can I copy an arbitrary value into
> > > >>>> > > a list?
> > > >>>> > >
> > > >>>> > > My next thought was to build code that dispatches on type.  There
> > > >>>> > > is a method called getType() on the FieldReader.  Unfortunately,
> > > >>>> > > that drives into code generated by protoc and I see no way to
> > > >>>> > > dispatch on the type of an incoming value.
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > How is this supposed to work?
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
> > > baid.mehant@gmail.com>
> > > >>>> > wrote:
> > > >>>> > >
> > > >>>> > > > For a detailed example on using the ComplexWriter interface you
> > > >>>> > > > can take a look at the Mappify
> > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
> > > >>>> > > > (kvgen) function. The function itself is very simple; however, it
> > > >>>> > > > makes use of the utility methods in MappifyUtility
> > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
> > > >>>> > > > and MapUtility
> > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
> > > >>>> > > > which perform most of the work.
> > > >>>> > > >
> > > >>>> > > > Currently we don't have a generic infrastructure to handle errors
> > > >>>> > > > coming out of functions. However there is UserException, which
> > > >>>> > > > when raised will make sure that Drill does not gobble up the error
> > > >>>> > > > message in that exception. So you can probably throw a
> > > >>>> > > > UserException with the failing input in your function to make sure
> > > >>>> > > > it propagates to the user.
> > > >>>> > > >
> > > >>>> > > > Thanks
> > > >>>> > > > Mehant
> > > >>>> > > >
> > > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
> > > >>>> jacques@apache.org>
> > > >>>> > > wrote:
> > > >>>> > > >
> > > >>>> > > > > *Holders are for both input and output.  You can also use
> > > >>>> > > > > ComplexWriter for output and FieldReader for input if you want
> > > >>>> > > > > to write or read a complex value.
> > > >>>> > > > >
> > > >>>> > > > > I don't think we've provided a really clean way to construct a
> > > >>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
> > > >>>> > > > > reaching into a bunch of internal interfaces in Drill.  However,
> > > >>>> > > > > I would recommend using the ComplexWriter output pattern for
> > > >>>> > > > > now.  This will be a little less efficient but substantially
> > > >>>> > > > > less brittle.  I suggest you open up a jira for using a
> > > >>>> > > > > Repeated*Holder as an output.
> > > >>>> > > > >
> > > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
> > > >>>> ted.dunning@gmail.com>
> > > >>>> > > > wrote:
> > > >>>> > > > >
> > > >>>> > > > > > Holders are for input, I think.
> > > >>>> > > > > >
> > > >>>> > > > > > Try the different kinds of writers.
> > > >>>> > > > > >
> > > >>>> > > > > >
> > > >>>> > > > > >
> > > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
> > > >>>> jbates@maprtech.com>
> > > >>>> > > > wrote:
> > > >>>> > > > > >
> > > >>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was
> > > >>>> > > > > > > working on a custom aggregator function using DrillAggFunc.
> > > >>>> > > > > > > In this I can do simple things, but if I want to build a
> > > >>>> > > > > > > list of values and do something with it in the final output
> > > >>>> > > > > > > method I think I need to use RepeatedHolders in the
> > > >>>> > > > > > > @Workspace. To do that I need to create a new one in the
> > > >>>> > > > > > > setup method. I can't get one built. They all require a
> > > >>>> > > > > > > BufferAllocator to be passed in to build it. I have not
> > > >>>> > > > > > > found a way to get an allocator yet. Any suggestions?
> > > >>>> > > > > > >
> > > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> > > >>>> > ted.dunning@gmail.com
> > > >>>> > > >
> > > >>>> > > > > > wrote:
> > > >>>> > > > > > >
> > > >>>> > > > > > > > If you look at the zip function in
> > > >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions
> > > >>>> > > > > > > > you can have an example of building a structure.
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > The basic idea is that your output is denoted as
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >         @Output
> > > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > The pattern for building a list of lists of integers is
> > > >>>> > > > > > > > like this:
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >         writer.setValueCount(n);
> > > >>>> > > > > > > >         ...
> > > >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> > > >>>> > > > > > > >         outer.start(); // [ outer list
> > > >>>> > > > > > > >         ...
> > > >>>> > > > > > > >         // for each inner list {
> > > >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> > > >>>> > > > > > > >             inner.start(); // [ inner list
> > > >>>> > > > > > > >             // for each inner list element {
> > > >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> > > >>>> > > > > > > >             // }
> > > >>>> > > > > > > >             inner.end();   // ] inner list
> > > >>>> > > > > > > >         // }
> > > >>>> > > > > > > >         outer.end(); // ] outer list
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> > > >>>> > jbates@maprtech.com>
> > > >>>> > > > > > wrote:
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been
> > > >>>> > > > > > > > > trying to document and understand each of the options
> > > >>>> > > > > > > > > available in a Drill UDF: the different FunctionScopes,
> > > >>>> > > > > > > > > the ones that are allowed and the ones that are not; the
> > > >>>> > > > > > > > > impact of different cost categories; and the different
> > > >>>> > > > > > > > > steps needed to handle any of the supported data types
> > > >>>> > > > > > > > > and structures in Drill.
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > > Here are a few of my current road blocks. Any pointers
> > > >>>> > > > > > > > > would be greatly appreciated.
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > >    1. I've been trying to understand how to correctly use
> > > >>>> > > > > > > > >    RepeatedHolders of whatever type. For this discussion
> > > >>>> > > > > > > > >    let's start with a RepeatedBigIntHolder. I'm trying to
> > > >>>> > > > > > > > >    figure out the best way to create a new one. I have not
> > > >>>> > > > > > > > >    figured out where in the existing Drill code someone
> > > >>>> > > > > > > > >    does this. If I use a RepeatedBigIntHolder as a
> > > >>>> > > > > > > > >    Workspace object it is null to start with. I created a
> > > >>>> > > > > > > > >    new one in the startup section of the UDF but the
> > > >>>> > > > > > > > >    vector was null. I can find no reference for creating a
> > > >>>> > > > > > > > >    new BigIntVector. There is a way to create a
> > > >>>> > > > > > > > >    BigIntVector and I did find an example of creating a
> > > >>>> > > > > > > > >    new VarCharVector but I can't do that using the Drill
> > > >>>> > > > > > > > >    jar files from 1.0. The
> > > >>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos and the
> > > >>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos.MinorType
> > > >>>> > > > > > > > >    classes do not appear to be accessible from the Drill
> > > >>>> > > > > > > > >    jar files.
> > > >>>> > > > > > > > >    2. What is the best way to close out a UDF in the event
> > > >>>> > > > > > > > >    it generates an exception? Are there specific steps one
> > > >>>> > > > > > > > >    should follow to make a clean exit in a catch block
> > > >>>> > > > > > > > >    that are beneficial to Drill?
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > >
> > > >>>> > > > > > >
> > > >>>> > > > > >
> > > >>>> > > > >
> > > >>>> > > >
> > > >>>> > >
> > > >>>> >
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > >
> >
>
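
The aggregate lifecycle described in the quoted thread above (@Param comes in, @Workspace is initialized in setup, updated in add, and @Output is finalized in output) can be sketched without any Drill classes. This is only an illustration of the control flow for a list-building aggregate, not a Drill UDF: the class name ListAggSketch is hypothetical, a plain long[] stands in for the Repeated*Holder workspace, and reset() is included on the assumption that the same instance is reused for the next group.

```java
import java.util.Arrays;

/** Plain-Java stand-in for a list-building aggregate; no Drill classes are used. */
public class ListAggSketch {
    // Plays the role of the @Workspace state.
    private long[] values;
    private int size;

    /** setup(): initialize the workspace before the first row. */
    public void setup() {
        values = new long[4];
        size = 0;
    }

    /** add(): called once per input row with the @Param value. */
    public void add(long param) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // grow when full
        }
        values[size++] = param;
    }

    /** output(): finalize the @Output from the workspace. */
    public long[] output() {
        return Arrays.copyOf(values, size);
    }

    /** reset(): clear the workspace so the next group starts empty. */
    public void reset() {
        size = 0;
    }

    public static void main(String[] args) {
        ListAggSketch agg = new ListAggSketch();
        agg.setup();
        for (long v : new long[] {3, 1, 4, 1, 5}) {
            agg.add(v);
        }
        System.out.println(Arrays.toString(agg.output()));
    }
}
```

In a real Drill aggregate the workspace must be made of holders and Drill-managed buffers, which is exactly the part the thread found hard; the sketch only shows which step owns which piece of state.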

Re: Some questions on UDFs

Posted by Jacques Nadeau <ja...@apache.org>.
That's good news Jim.

These are interfaces that very few people have built against to date so any
suggestions for improvements, clarifications, etc would be greatly
appreciated.

Thanks
Jacques
On Jul 5, 2015 3:07 PM, "Jim Bates" <jb...@maprtech.com> wrote:

> Just to close out this thread....
>
> I got my final UDFs to work. I ended up with 2. One to create an array of
> values and the other to calculate a simple linear regression. This data set
> was a simple x = y slope
>
> SELECT MyLinearRegression2(xValues,yValues,CAST(22356 as BIGINT)) as
> xPerdict FROM (SELECT MyList(test_field1) as xValues, MyList(test_field2)
> as yValues  FROM (SELECT test_field1,test_field2 FROM
> `hive.default`.`my_hive_table` limit 10));
> +-----------+
> | xPerdict  |
> +-----------+
> | 22356.0   |
> +-----------+
>
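
For readers who want to check the number in the result above, here is a minimal plain-Java sketch of a simple least-squares fit evaluated at one point. It is not Jim's MyLinearRegression2 (that source is not in this thread); the class and method names are hypothetical and the ten x = y sample points are made up to mirror the "simple x = y slope" data set.

```java
/** Plain-Java sketch of simple linear regression; not the UDF from the thread. */
public class RegressionSketch {

    /** Fits y = slope * x + intercept by least squares and evaluates it at x0. */
    public static double predict(long[] xs, long[] ys, long x0) {
        if (xs.length != ys.length || xs.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 points");
        }
        double n = xs.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < xs.length; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += (double) xs[i] * ys[i];
            sumXX += (double) xs[i] * xs[i];
        }
        double denom = n * sumXX - sumX * sumX;
        if (denom == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = (n * sumXY - sumX * sumY) / denom;
        double intercept = (sumY - slope * sumX) / n;
        return slope * x0 + intercept;
    }

    public static void main(String[] args) {
        long[] xs = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] ys = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // y = x, as in the test data
        System.out.println(predict(xs, ys, 22356L));
    }
}
```

Because the fit only needs the running sums n, sum(x), sum(y), sum(xy) and sum(x*x), an aggregate of this kind could in principle keep scalar workspace values instead of building lists first; whether that suits a given UDF is a design choice the thread does not settle.
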
>
> On Sun, Jul 5, 2015 at 4:10 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > You're right.  You're off the beaten path. I think everyone here would
> > love to have more documentation and more comments. Of course, all of
> > these take time.
> >
> > If you have time to volunteer to help improve these things, that would be
> > great.
> >
> > With regards to the question about the jira, describe your use case and
> > what functionality you couldn't find or make work. The active developers
> > on the project can then do their best to help shape the Jira into better
> > docs, javadocs and/or new functionality as time allows.
> >
> > On Jul 5, 2015 1:37 PM, "Ted Dunning" <te...@gmail.com> wrote:
> >
> > > Uh... actually, I think that it isn't obvious because there is
> > > absolutely no documentation and there are no comments in the code.
> > >
> > > And what should the JIRA say?  We can't even tell what's missing, if
> > > anything, because we can't tell how it is supposed to work.
> > >
> > >
> > >
> > >
> > > On Sun, Jul 5, 2015 at 11:50 AM, Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > >
> > > > It isn't obvious because you shouldn't do it.  Please file a JIRA to
> > > > add real support for this type of output.
> > > >
> > > > Your current function would leak large amounts of memory that would
> > > > ultimately crash the node.
> > > >
> > > > Realistically, there are very few internal Drill APIs that you should
> > > > access via a UDF (injectables, holders, complexwriter, fieldreader and
> > > > helpers).  A post 1.0 goal was to provide a UDF interface JAR to ensure
> > > > people don't accidentally reach into Drill's internals.  (A later
> > > > possibility is bytecode weaving to completely protect against it).
> > > >
> > > > J
> > > >
> > > > On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > That was impressively non-obvious.
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com>
> > wrote:
> > > > >
> > > > > > I did get a new RepeatedBigIntHolder built and a BigIntVector added
> > > > > > to it. I'll try it in the UDF tomorrow and see if there is a
> > > > > > difference in the ways I found to get a BufferAllocator.
> > > > > >
> > > > > > .
> > > > > > .
> > > > > > .
> > > > > > @Inject DrillBuf buffer;
> > > > > > @Workspace RepeatedBigIntHolder yList;
> > > > > > .
> > > > > > .
> > > > > > .
> > > > > > @Override
> > > > > > public void setup() {
> > > > > > .
> > > > > > .
> > > > > > .
> > > > > > //org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> > > > > > org.apache.drill.exec.memory.BufferAllocator allocator =
> > > > > >     new org.apache.drill.exec.memory.TopLevelAllocator();
> > > > > > yList = new RepeatedBigIntHolder();
> > > > > > yList.vector = new org.apache.drill.exec.vector.BigIntVector(
> > > > > >     org.apache.drill.exec.record.MaterializedField.create(
> > > > > >         new org.apache.drill.common.expression.SchemaPath("bigints",
> > > > > >             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
> > > > > >         org.apache.drill.common.types.Types.optional(
> > > > > >             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> > > > > >     allocator);
> > > > > > .
> > > > > > .
> > > > > > .
> > > > > > }
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com>
> > > wrote:
> > > > > >
> > > > > > > I still have issues finding the correct way to create and use a
> > > > > > > RepeatedHolder, and Writers are a non-starter for Workspace
> > > > > > > values. I can make do with creating a concatenated string in a
> > > > > > > VarCharHolder for small data sets to get past this in the short
> > > > > > > term and finish testing the output values I expect, but I won't
> > > > > > > be able to do anything at scale until I figure out how to make a
> > > > > > > repeated list.
> > > > > > >
> > > > > > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jbates@maprtech.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > >> Well... Converting from string to integers anyway... Too many
> > > > > > > >> 4th of July hot dogs. Going into nitrate overload. :)
> > > > > > > >>
> > > > > > > >> I am pulling an array of string values from json data. The
> > > > > > > >> string values are actually integers. I am converting to
> > > > > > > >> integers and summing each array entry to the final tally.
> > > > > > >>
> > > > > > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <
> jbates@maprtech.com>
> > > > > wrote:
> > > > > > >>
> > > > > > >>> Ted,
> > > > > > >>>
> > > > > > > >>> Yes, I started out just getting a basic count to work. I am
> > > > > > > >>> trying to keep the workflow as close to a basic user's as
> > > > > > > >>> possible. As such, I am building and using the MapR Apache
> > > > > > > >>> Drill sandbox to test.
> > > > > > > >>>
> > > > > > > >>>    1. Always look at the drillbit.log file to see if Drill had
> > > > > > > >>>    any issues loading your UDF. That was where I learned that
> > > > > > > >>>    all workspace values needed to be holders:
> > > > > > > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure
> > > > > > > >>>       loading function class
> > > > > > > >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> > > > > > > >>>       field xList. Aggregate function 'MyLinearRegression1'
> > > > > > > >>>       workspace variable 'xList' is of type 'interface
> > > > > > > >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> > > > > > > >>>       Please change it to Holder type.
> > > > > > > >>>    2. Error messages:
> > > > > > > >>>       - If you get an error in this format, it means that Drill
> > > > > > > >>>       cannot find your function, so it probably didn't load it.
> > > > > > > >>>       Go back to step 1:
> > > > > > > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column
> > > > > > > >>>          44: No match found for function signature
> > > > > > > >>>          MyFunctionName(<ANY>)
> > > > > > > >>>       - If you get an error in this format, it means that the
> > > > > > > >>>       function is there but Drill could not find a signature
> > > > > > > >>>       that matched the param types or the number of params you
> > > > > > > >>>       were passing it. The exact wording will change, but
> > > > > > > >>>       "Missing function implementation" is the key phrase to
> > > > > > > >>>       look for:
> > > > > > > >>>          - Error: SYSTEM ERROR:
> > > > > > > >>>          org.apache.drill.exec.exception.SchemaChangeException:
> > > > > > > >>>          Failure while trying to materialize incoming schema.
> > > > > > > >>>          Errors:
> > > > > > > >>>          - Error in expression at index -1.  Error: Missing
> > > > > > > >>>          function implementation: [castBIGINT(VARCHAR-REPEATED)].
> > > > > > > >>>          Full expression: --UNKNOWN EXPRESSION--
> > > > > > > >>>    3. In your function definition for aggregate functions you
> > > > > > > >>>    need to set null processing to internal and your isRandom to
> > > > > > > >>>    false. Example below:
> > > > > > > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> > > > > > > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > > > > > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > > > > > >>>       isBinaryCommutative = false, costCategory =
> > > > > > > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > > > > >>>
> > > > > > > >>> Below is an example from the Apache Drill tutorial data sets
> > > > > > > >>> contained in the MapR Apache Drill sandbox. I am pulling an
> > > > > > > >>> array of string values from json data. The string values are
> > > > > > > >>> actually integers. I am converting to integers and summing each
> > > > > > > >>> array entry to the final tally. This in no way represents what
> > > > > > > >>> this data was for, but it did become a handy way for me to peck
> > > > > > > >>> out the "correct" way to build an aggregation UDF function.
> > > > > > >>>
> > > > > > > >>> @FunctionTemplate(name = "MyArraySum",
> > > > > > > >>>     scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
> > > > > > > >>>     nulls = FunctionTemplate.NullHandling.INTERNAL,
> > > > > > > >>>     isRandom = false,
> > > > > > > >>>     isBinaryCommutative = false,
> > > > > > > >>>     costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > > > > > >>> public static class MyArraySum implements DrillAggFunc {
> > > > > > > >>>
> > > > > > > >>>   @Param RepeatedVarCharHolder listToSearch;
> > > > > > > >>>   @Workspace NullableBigIntHolder count;
> > > > > > > >>>   @Workspace NullableBigIntHolder sum;
> > > > > > > >>>   @Workspace NullableVarCharHolder vc;
> > > > > > > >>>   @Output BigIntHolder out;
> > > > > > > >>>
> > > > > > > >>>   @Override
> > > > > > > >>>   public void setup() {
> > > > > > > >>>     // Start every aggregation from zero.
> > > > > > > >>>     count.value = 0;
> > > > > > > >>>     sum.value = 0;
> > > > > > > >>>   }
> > > > > > > >>>
> > > > > > > >>>   @Override
> > > > > > > >>>   public void add() {
> > > > > > > >>>     int val = 0;
> > > > > > > >>>     try {
> > > > > > > >>>       // The entries of the current array sit at [start, end) in the
> > > > > > > >>>       // vector, so index from start rather than from 0.
> > > > > > > >>>       for (int i = listToSearch.start; i < listToSearch.end; i++) {
> > > > > > > >>>         listToSearch.vector.getAccessor().get(i, vc);
> > > > > > > >>>         String inputStr = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
> > > > > > > >>>             .toStringFromUTF8(vc.start, vc.end, vc.buffer);
> > > > > > > >>>         val = Integer.parseInt(inputStr);
> > > > > > > >>>         sum.value = sum.value + val;
> > > > > > > >>>       }
> > > > > > > >>>     } catch (Exception e) {
> > > > > > > >>>       // A value that fails to parse ends the loop; the remaining
> > > > > > > >>>       // entries of this array are skipped.
> > > > > > > >>>       val = 0;
> > > > > > > >>>     }
> > > > > > > >>>     count.value = count.value + 1;
> > > > > > > >>>   }
> > > > > > > >>>   // output() and reset() are not shown in this message.
> > > > > > >>>
> > > > > > > >>> Example select statement:
> > > > > > > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id
> > > > > > > >>> as my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t
> > > > > > > >>> limit 5);
> > > > > > >>>
> > > > > > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <
> > > ted.dunning@gmail.com
> > > > >
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>>> Jim,
> > > > > > >>>>
> > > > > > > >>>> I think that you may be having trouble with aggregators in
> > > > > > > >>>> general.
> > > > > > > >>>>
> > > > > > > >>>> Have you been able to build *any* aggregator of anything?  I
> > > > > > > >>>> haven't.
> > > > > > > >>>>
> > > > > > > >>>> When I try to build an aggregator of ints or doubles, I get a
> > > > > > > >>>> very persistent problem with Drill even seeing my aggregates:
> > > > > > >>>>
> > > > > > > >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from cp.`employee.json`;*
> > > > > > > >>>>
> > > > > > > >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > > > > > >>>>
> > > > > > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match found for function signature sum_int(<ANY>)
> > > > > > > >>>>
> > > > > > > >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException <init>
> > > > > > > >>>>
> > > > > > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 27: No match found for function signature sum_int(<ANY>)
> > > > > > > >>>>
> > > > > > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match found for function signature sum_int(<ANY>)*
> > > > > > > >>>>
> > > > > > > >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010] (state=,code=0)*
> > > > > > > >>>>
> > > > > > > >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from cp.`employee.json`*;
> > > > > > > >>>>
> > > > > > > >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > > > > > >>>>
> > > > > > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match found for function signature sum_int(<NUMERIC>)
> > > > > > > >>>>
> > > > > > > >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException <init>
> > > > > > > >>>>
> > > > > > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 40: No match found for function signature sum_int(<NUMERIC>)
> > > > > > > >>>>
> > > > > > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match found for function signature sum_int(<NUMERIC>)*
> > > > > > > >>>>
> > > > > > > >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010] (state=,code=0)*
> > > > > > >>>>
> > > > > > >>>> 0: jdbc:drill:zk=local>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > > >>>> It looks like there is some undocumented subtlety about how to
> > > > > > > >>>> register an aggregator.
> > > > > > >>>>
> > > > > > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <
> > jbates@maprtech.com>
> > > > > > wrote:
> > > > > > >>>>
> > > > > > > >>>> > I'm working on the same thing. I want to aggregate a list of
> > > > > > > >>>> > values. It has been a search-and-guess game for the most
> > > > > > > >>>> > part. I'm still stuck in the process of getting the values
> > > > > > > >>>> > all into a list. The writers look interesting, but for
> > > > > > > >>>> > aggregation functions it looks like the input (the @Param)
> > > > > > > >>>> > and output objects can't hold the aggregation steps. The
> > > > > > > >>>> > Workspace is where that happens. If I try to use a Writer in
> > > > > > > >>>> > a workspace it won't load and tells me to change it to
> > > > > > > >>>> > Holders, which was why I was using them to start with. Maybe
> > > > > > > >>>> > I'm missing the architecture of the agg function. It looked
> > > > > > > >>>> > like it was....
> > > > > > > >>>> >
> > > > > > > >>>> > @Param comes in -> initialize @Workspace vars in setup ->
> > > > > > > >>>> > process data through @Workspace vars in add -> finalize
> > > > > > > >>>> > @Output in output.
> > > > > > > >>>> >
> > > > > > > >>>> > So I'm back to trying to figure out how to create a
> > > > > > > >>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
> > > > > > >>>> >
> > > > > > >>>> >
> > > > > > >>>> >
> > > > > > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <
> > > > > ted.dunning@gmail.com>
> > > > > > >>>> wrote:
> > > > > > >>>> >
> > > > > > > >>>> > > I am working on trying to build any kind of
> > > > > > > >>>> > > list-constructing aggregator and having absolute fits.
> > > > > > > >>>> > >
> > > > > > > >>>> > > To simplify life, I decided to just build a generic list
> > > > > > > >>>> > > builder that is a scalar function that returns a list
> > > > > > > >>>> > > containing its argument. Thus zoop(3) => [3], zoop('abc')
> > > > > > > >>>> > > => ['abc'] and zoop([1,2,3]) => [[1,2,3]].
> > > > > > >>>> > >
> > > > > > > >>>> > > The ComplexWriter looks like the place to go. As usual, the
> > > > > > > >>>> > > complete lack of comments in most of Drill makes this very
> > > > > > > >>>> > > hard since I have to guess what works and what doesn't.
> > > > > > > >>>> > >
> > > > > > > >>>> > > In my code, I note that ComplexWriter has a nice
> > > > > > > >>>> > > rootAsList() method.  I used this in zip and it works
> > > > > > > >>>> > > nicely to construct lists for output.  I note that the
> > > > > > > >>>> > > resulting ListWriter has a method copyReader(FieldReader
> > > > > > > >>>> > > var1) which looks really good.
> > > > > > > >>>> > >
> > > > > > > >>>> > > Unfortunately, the only implementation of copyReader() is
> > > > > > > >>>> > > in AbstractFieldWriter and it looks like this:
> > > > > > >>>> > >
> > > > > > >>>> > > public void copyReader(FieldReader reader) {
> > > > > > >>>> > >     this.fail("Copy FieldReader");
> > > > > > >>>> > > }
> > > > > > >>>> > >
> > > > > > >>>> > > I would like to formally say at this point "WTF"?
> > > > > > >>>> > >
> > > > > > > >>>> > > In digging in further, I see other methods that look handy,
> > > > > > > >>>> > > like
> > > > > > >>>> > >
> > > > > > >>>> > > public void write(IntHolder holder) {
> > > > > > >>>> > >     this.fail("Int");
> > > > > > >>>> > > }
> > > > > > >>>> > >
> > > > > > > >>>> > > And then in looking at implementations, it looks like there
> > > > > > > >>>> > > is a combinatorial explosion because every type seems to
> > > > > > > >>>> > > need a write method for every other type.
> > > > > > > >>>> > >
> > > > > > > >>>> > > What is the thought here?  How can I copy an arbitrary
> > > > > > > >>>> > > value into a list?
> > > > > > > >>>> > >
> > > > > > > >>>> > > My next thought was to build code that dispatches on type.
> > > > > > > >>>> > > There is a method called getType() on the FieldReader.
> > > > > > > >>>> > > Unfortunately, that drives into code generated by protoc
> > > > > > > >>>> > > and I see no way to dispatch on the type of an incoming
> > > > > > > >>>> > > value.
> > > > > > >>>> > >
> > > > > > >>>> > >
> > > > > > >>>> > > How is this supposed to work?
> > > > > > >>>> > >
> > > > > > >>>> > >
> > > > > > >>>> > >
> > > > > > >>>> > >
> > > > > > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
> > > > > > baid.mehant@gmail.com>
> > > > > > >>>> > wrote:
> > > > > > >>>> > >
> > > > > > >>>> > > > For a detailed example on using ComplexWriter
> interface
> > > you
> > > > > can
> > > > > > >>>> take a
> > > > > > >>>> > > look
> > > > > > >>>> > > > at the Mappify
> > > > > > >>>> > > > <
> > > > > > >>>> > > >
> > > > > > >>>> > >
> > > > > > >>>> >
> > > > > > >>>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
> > > > > > >>>> > > > >
> > > > > > >>>> > > > (kvgen) function. The function itself is very simple
> > > however
> > > > > it
> > > > > > >>>> makes
> > > > > > >>>> > use
> > > > > > >>>> > > > of the utility methods in MappifyUtility
> > > > > > >>>> > > > <
> > > > > > >>>> > > >
> > > > > > >>>> > >
> > > > > > >>>> >
> > > > > > >>>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
> > > > > > >>>> > > > >
> > > > > > >>>> > > > and MapUtility
> > > > > > >>>> > > > <
> > > > > > >>>> > > >
> > > > > > >>>> > >
> > > > > > >>>> >
> > > > > > >>>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
> > > > > > >>>> > > > >
> > > > > > >>>> > > > which perform most of the work.
> > > > > > >>>> > > >
> > > > > > >>>> > > > Currently we don't have a generic infrastructure to
> > handle
> > > > > > errors
> > > > > > >>>> > coming
> > > > > > >>>> > > > out of functions. However there is UserException,
> which
> > > when
> > > > > > >>>> raised
> > > > > > >>>> > will
> > > > > > >>>> > > > make sure that Drill does not gobble up the error
> > message
> > > in
> > > > > > that
> > > > > > >>>> > > > exception. So you can probably throw a UserException
> > with
> > > > the
> > > > > > >>>> failing
> > > > > > >>>> > > input
> > > > > > >>>> > > > in your function to make sure it propagates to the
> user.
> > > > > > >>>> > > >
> > > > > > >>>> > > > Thanks
> > > > > > >>>> > > > Mehant
> > > > > > >>>> > > >
> > > > > > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
> > > > > > >>>> jacques@apache.org>
> > > > > > >>>> > > wrote:
> > > > > > >>>> > > >
> > > > > > >>>> > > > > *Holders are for both input and output.  You can
> also
> > > use
> > > > > > >>>> > CompleWriter
> > > > > > >>>> > > > for
> > > > > > >>>> > > > > output and FieldReader for input if you want to
> write
> > or
> > > > > read
> > > > > > a
> > > > > > >>>> > complex
> > > > > > >>>> > > > > value.
> > > > > > >>>> > > > >
> > > > > > >>>> > > > > I don't think we've provided a really clean way to
> > > > > construct a
> > > > > > >>>> > > > > Repeated*Holder for output purposes.  You can
> probably
> > > do
> > > > it
> > > > > > by
> > > > > > >>>> > > reaching
> > > > > > >>>> > > > > into a bunch of internal interfaces in Drill.
> > However,
> > > I
> > > > > > would
> > > > > > >>>> > > recommend
> > > > > > >>>> > > > > using the ComplexWriter output pattern for now.
> This
> > > will
> > > > > be
> > > > > > a
> > > > > > >>>> > little
> > > > > > >>>> > > > less
> > > > > > >>>> > > > > efficient but substantially less brittle.  I suggest
> > you
> > > > > open
> > > > > > >>>> up a
> > > > > > >>>> > jira
> > > > > > >>>> > > > for
> > > > > > >>>> > > > > using a Repeated*Holder as an output.
> > > > > > >>>> > > > >
> > > > > > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
> > > > > > >>>> ted.dunning@gmail.com>
> > > > > > >>>> > > > wrote:
> > > > > > >>>> > > > >
> > > > > > >>>> > > > > > Holders are for input, I think.
> > > > > > >>>> > > > > >
> > > > > > >>>> > > > > > Try the different kinds of writers.
> > > > > > >>>> > > > > >
> > > > > > >>>> > > > > >
> > > > > > >>>> > > > > >
> > > > > > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
> > > > > > >>>> jbates@maprtech.com>
> > > > > > >>>> > > > wrote:
> > > > > > >>>> > > > > >
> > > > > > >>>> > > > > > > Using a repeatedholder as a @param I've got
> > > working. I
> > > > > was
> > > > > > >>>> > working
> > > > > > >>>> > > > on a
> > > > > > >>>> > > > > > > custom aggregator function using DrillAggFunc.
> In
> > > > this I
> > > > > > >>>> can do
> > > > > > >>>> > > > simple
> > > > > > >>>> > > > > > > things but If I want to build a list values and
> do
> > > > > > >>>> something with
> > > > > > >>>> > > it
> > > > > > >>>> > > > in
> > > > > > >>>> > > > > > the
> > > > > > >>>> > > > > > > final output method I think I need to use
> > > > > RepeatedHolders
> > > > > > >>>> in the
> > > > > > >>>> > > > > > > @Workspace. To do that I need to create a new
> one
> > in
> > > > the
> > > > > > >>>> setup
> > > > > > >>>> > > > method.
> > > > > > >>>> > > > > I
> > > > > > >>>> > > > > > > can't get one built. They all require a
> > > > BufferAllocator
> > > > > to
> > > > > > >>>> be
> > > > > > >>>> > > passed
> > > > > > >>>> > > > in
> > > > > > >>>> > > > > > to
> > > > > > >>>> > > > > > > build it. I have not found a way to get an
> > allocator
> > > > > yet.
> > > > > > >>>> Any
> > > > > > >>>> > > > > > suggestions?
> > > > > > >>>> > > > > > >
> > > > > > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> > > > > > >>>> > ted.dunning@gmail.com
> > > > > > >>>> > > >
> > > > > > >>>> > > > > > wrote:
> > > > > > >>>> > > > > > >
> > > > > > >>>> > > > > > > > If you look at the zip function in
> > > > > > >>>> > > > > > > >
> > > > https://github.com/mapr-demos/simple-drill-functions
> > > > > > you
> > > > > > >>>> can
> > > > > > >>>> > > have
> > > > > > >>>> > > > an
> > > > > > >>>> > > > > > > > example of building a structure.
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > The basic idea is that your output is denoted
> as
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > >         @Output
> > > > > > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > The pattern for building a list of lists of
> > > integers
> > > > > is
> > > > > > >>>> like
> > > > > > >>>> > > this:
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > >         writer.setValueCount(n);
> > > > > > >>>> > > > > > > >         ...
> > > > > > >>>> > > > > > > >         BaseWriter.ListWriter outer =
> > > > > > writer.rootAsList();
> > > > > > >>>> > > > > > > >         outer.start(); // [ outer list
> > > > > > >>>> > > > > > > >         ...
> > > > > > >>>> > > > > > > >         // for each inner list
> > > > > > >>>> > > > > > > >             BaseWriter.ListWriter inner =
> > > > > outer.list();
> > > > > > >>>> > > > > > > >             inner.start();
> > > > > > >>>> > > > > > > >             // for each inner list element
> > > > > > >>>> > > > > > > >
> > > > > >  inner.integer().writeInt(accessor.get(i));
> > > > > > >>>> > > > > > > >             }
> > > > > > >>>> > > > > > > >             inner.end();   // ] inner list
> > > > > > >>>> > > > > > > >         }
> > > > > > >>>> > > > > > > >         outer.end(); // ] outer list
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> > > > > > >>>> > jbates@maprtech.com>
> > > > > > >>>> > > > > > wrote:
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > > I have working aggregation and simple UDFs.
> > I've
> > > > > been
> > > > > > >>>> trying
> > > > > > >>>> > to
> > > > > > >>>> > > > > > > document
> > > > > > >>>> > > > > > > > > and understand each of the options available
> > in
> > > a
> > > > > > Drill
> > > > > > >>>> UDF.
> > > > > > >>>> > > > > > > > Understanding
> > > > > > >>>> > > > > > > > > the different FunctionScope's, the ones that
> > are
> > > > > > >>>> allowed, the
> > > > > > >>>> > > > ones
> > > > > > >>>> > > > > > that
> > > > > > >>>> > > > > > > > are
> > > > > > >>>> > > > > > > > > not. The impact of different cost
> categories.
> > > The
> > > > > > >>>> different
> > > > > > >>>> > > > steps
> > > > > > >>>> > > > > > > needed
> > > > > > >>>> > > > > > > > > to understand handling any of the supported
> > data
> > > > > types
> > > > > > >>>> and
> > > > > > >>>> > > > > > structures
> > > > > > >>>> > > > > > > in
> > > > > > >>>> > > > > > > > > drill.
> > > > > > >>>> > > > > > > > >
> > > > > > >>>> > > > > > > > > Here are a few of my current road blocks.
> Any
> > > > > pointers
> > > > > > >>>> would
> > > > > > >>>> > be
> > > > > > >>>> > > > > > greatly
> > > > > > >>>> > > > > > > > > appreciated.
> > > > > > >>>> > > > > > > > >
> > > > > > >>>> > > > > > > > >
> > > > > > >>>> > > > > > > > >    1. I've been trying to understand how to
> > > > > correctly
> > > > > > >>>> use
> > > > > > >>>> > > > > > > RepeatedHolders
> > > > > > >>>> > > > > > > > >    of whatever type. For this discussion
> lets
> > > > start
> > > > > > >>>> with a
> > > > > > >>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to
> figure
> > > out
> > > > > the
> > > > > > >>>> best
> > > > > > >>>> > way
> > > > > > >>>> > > to
> > > > > > >>>> > > > > > > create
> > > > > > >>>> > > > > > > > a
> > > > > > >>>> > > > > > > > > new
> > > > > > >>>> > > > > > > > >    one. I have not figured out where in the
> > > > existing
> > > > > > >>>> drill
> > > > > > >>>> > code
> > > > > > >>>> > > > > > someone
> > > > > > >>>> > > > > > > > > does
> > > > > > >>>> > > > > > > > >    this. If I use a  RepeatedBigIntHolder
> as a
> > > > > > Workspace
> > > > > > >>>> > object
> > > > > > >>>> > > > is
> > > > > > >>>> > > > > is
> > > > > > >>>> > > > > > > > null
> > > > > > >>>> > > > > > > > > to
> > > > > > >>>> > > > > > > > >    start with. I created a new one in the
> > > startup
> > > > > > >>>> section of
> > > > > > >>>> > > the
> > > > > > >>>> > > > > udf
> > > > > > >>>> > > > > > > but
> > > > > > >>>> > > > > > > > > the
> > > > > > >>>> > > > > > > > >    vector was null. I can find no reference
> in
> > > > > > creating
> > > > > > >>>> a new
> > > > > > >>>> > > > > > > > BigIntVector.
> > > > > > >>>> > > > > > > > >    There is a way to create a BigIntVector
> > and I
> > > > did
> > > > > > >>>> find an
> > > > > > >>>> > > > > example
> > > > > > >>>> > > > > > of
> > > > > > >>>> > > > > > > > >    creating a new VarCharVector but I can't
> do
> > > > that
> > > > > > >>>> using the
> > > > > > >>>> > > > drill
> > > > > > >>>> > > > > > jar
> > > > > > >>>> > > > > > > > > files
> > > > > > >>>> > > > > > > > >    from 1.0. The
> > > > > > >>>> org.apache.drill.common.types.TypeProtos and
> > > > > > >>>> > > > > > > > >    the
> > > > > > >>>> org.apache.drill.common.types.TypeProtos.MinorType
> > > > > > >>>> > > classes
> > > > > > >>>> > > > > do
> > > > > > >>>> > > > > > > not
> > > > > > >>>> > > > > > > > >    appear to be accessible from the drill
> jar
> > > > files.
> > > > > > >>>> > > > > > > > >    2. What is the best way to close out a
> UDF
> > in
> > > > the
> > > > > > >>>> event it
> > > > > > >>>> > > > > > generates
> > > > > > >>>> > > > > > > > an
> > > > > > >>>> > > > > > > > >    exception? Are there specific steps one
> > > should
> > > > > > >>>> follow to
> > > > > > >>>> > > make
> > > > > > >>>> > > > a
> > > > > > >>>> > > > > > > clean
> > > > > > >>>> > > > > > > > > exit
> > > > > > >>>> > > > > > > > >    in a catch block that are beneficial to
> > > Drill?
> > > > > > >>>> > > > > > > > >
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > >
> > > > > > >>>> > > > > >
> > > > > > >>>> > > > >
> > > > > > >>>> > > >
> > > > > > >>>> > >
> > > > > > >>>> >
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
Just to close out this thread....

I got my final UDFs to work. I ended up with two: one to create an array of
values and the other to calculate a simple linear regression. The test data
set was a simple x = y line, so the predicted value should equal the input:

SELECT MyLinearRegression2(xValues,yValues,CAST(22356 as BIGINT)) as
xPerdict FROM (SELECT MyList(test_field1) as xValues, MyList(test_field2)
as yValues  FROM (SELECT test_field1,test_field2 FROM
`hive.default`.`my_hive_table` limit 10));
+-----------+
| xPerdict  |
+-----------+
| 22356.0   |
+-----------+
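
For reference, the arithmetic behind a prediction like the one above can be
sketched in plain Java. This is only a minimal illustration of an ordinary
least-squares fit, not the source of the UDFs (which is not shown in this
thread); the class name LinearRegressionSketch and the method predict are
hypothetical, and it uses no Drill APIs.

```java
/** Ordinary least-squares fit of y = intercept + slope * x, then a prediction. */
public final class LinearRegressionSketch {

    private LinearRegressionSketch() {}

    /** Returns the fitted value at x for the line fitted to the pairs (xs[i], ys[i]). */
    public static double predict(long[] xs, long[] ys, long x) {
        if (xs.length != ys.length || xs.length < 2) {
            throw new IllegalArgumentException("need two or more paired values");
        }
        int n = xs.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += (double) xs[i] * ys[i];
            sumXX += (double) xs[i] * xs[i];
        }
        double denom = n * sumXX - sumX * sumX;
        if (denom == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = (n * sumXY - sumX * sumY) / denom;
        double intercept = (sumY - slope * sumX) / n;
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        // Hypothetical x = y sample, standing in for the 10 rows in the query.
        long[] xs = {1, 2, 3, 4, 5};
        long[] ys = {1, 2, 3, 4, 5};
        System.out.println(predict(xs, ys, 22356L)); // prints 22356.0
    }
}
```

With x = y input the fitted slope is 1 and the intercept is 0, so the
prediction for 22356 is 22356.0, which matches the query result.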


On Sun, Jul 5, 2015 at 4:10 PM, Jacques Nadeau <ja...@apache.org> wrote:

> You're right.  You're off the beaten path. I think everyone here would love
> to have more documentation and more comments. Of course, all of these take
> time.
>
> If you have time to volunteer to help improve these things, that would be
> great.
>
> With regards to the question about the jira, describe your use case and
> what functionality you couldn't find or make work. The active developers on
> the project can then do their best to help shape the Jira into better docs,
> javadocs and/or new functionality as time allows.
>
> On Jul 5, 2015 1:37 PM, "Ted Dunning" <te...@gmail.com> wrote:
>
> > Uh... actually, I think that it isn't obvious because there is absolutely
> > no documentation and there are no comments in the code.
> >
> > And what should the JIRA say?  We can't even tell what's missing, if
> > anything, because we can't tell how it is supposed to work.
> >
> >
> >
> >
> > On Sun, Jul 5, 2015 at 11:50 AM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> >
> > > It isn't obvious because you shouldn't do it.  Please file a JIRA to add
> > > real support for this type of output.
> > >
> > > Your current function would leak large amounts of memory that would
> > > ultimately crash the node.
> > >
> > > Realistically, there are very few internal Drill APIs that you should
> > > access via a UDF (injectables, holders, complexwriter, fieldreader and
> > > helpers).  A post 1.0 goal was to provide a UDF interface JAR to ensure
> > > people don't accidentally reach into Drill's internals.  (A later
> > > possibility is bytecode weaving to completely protect against it).
> > >
> > > J
> > >
> > > On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > That was impressively non-obvious.
> > > >
> > > >
> > > >
> > > > On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com>
> wrote:
> > > >
> > > > > > I did get a new RepeatedBigIntHolder built and added a BigIntVector
> > > > > > to it. I'll try it in the UDF tomorrow and see if there is a difference
> > > > > > in the ways I found to get a BufferAllocator.
> > > > >
> > > > > .
> > > > > .
> > > > > .
> > > > > @Inject DrillBuf buffer;
> > > > > @Workspace RepeatedBigIntHolder yList;
> > > > > .
> > > > > .
> > > > > .
> > > > > @Override
> > > > > public void setup() {
> > > > > .
> > > > > .
> > > > > .
> > > > > > // org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> > > > > > org.apache.drill.exec.memory.BufferAllocator allocator =
> > > > > >     new org.apache.drill.exec.memory.TopLevelAllocator();
> > > > > > yList = new RepeatedBigIntHolder();
> > > > > > yList.vector = new org.apache.drill.exec.vector.BigIntVector(
> > > > > >     org.apache.drill.exec.record.MaterializedField.create(
> > > > > >         new org.apache.drill.common.expression.SchemaPath(
> > > > > >             "bigints",
> > > > > >             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
> > > > > >         org.apache.drill.common.types.Types.optional(
> > > > > >             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> > > > > >     allocator);
> > > > > .
> > > > > .
> > > > > .
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com>
> > wrote:
> > > > >
> > > > > > > I still have issues finding the correct way to create and use a
> > > > > > > RepeatedHolder, and Writers are a non-starter for Workspace values. I
> > > > > > > can make do with creating a concatenated string in a VarCharHolder for
> > > > > > > small data sets to get past this in the short term and finish testing
> > > > > > > the output values I expect, but I won't be able to do anything at scale
> > > > > > > until I figure out how to make a repeated list.
> > > > > >
> > > > > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com>
> > > wrote:
> > > > > >
> > > > > > >> Well... Converting from string to integers anyway... Too many 4th of
> > > > > > >> July Hot Dogs. Going into nitrate overload. :)
> > > > > >>
> > > > > > >> I am pulling an array of string values from json data. The string
> > > > > > >> values are actually integers. I am converting to integers and summing
> > > > > > >> each array entry to the final tally.
> > > > > >>
> > > > > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com>
> > > > wrote:
> > > > > >>
> > > > > >>> Ted,
> > > > > >>>
> > > > > > >>> Yes, I started out just getting a basic count to work. I am trying to
> > > > > > >>> keep the workflow as close to a basic user as possible. As such, I am
> > > > > > >>> building and using the MapR Apache Drill sandbox to test.
> > > > > >>>
> > > > > >>>
> > > > > > >>>    1. Always look at the drillbits.log file to see if Drill had any
> > > > > > >>>    issues loading your UDF. That was where I learned that all
> > > > > > >>>    workspace values needed to be holders:
> > > > > > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> > > > > > >>>       function class
> > > > > > >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> > > > > > >>>       field xList. Aggregate function 'MyLinearRegression1' workspace
> > > > > > >>>       variable 'xList' is of type 'interface
> > > > > > >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> > > > > > >>>       Please change it to Holder type.
> > > > > > >>>    2. Error messages:
> > > > > > >>>       - If you get an error in this format it means that Drill cannot
> > > > > > >>>       find your function, so it probably didn't load it. Go back to
> > > > > > >>>       step 1:
> > > > > > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> > > > > > >>>          match found for function signature MyFunctionName(<ANY>)
> > > > > > >>>       - If you get an error in this format it means that the function
> > > > > > >>>       is there but Drill could not find a signature that matched the
> > > > > > >>>       param types or the number of params you were passing it. The
> > > > > > >>>       exact wording will change, but "Missing function implementation"
> > > > > > >>>       is the key phrase to look for:
> > > > > > >>>          - Error: SYSTEM ERROR:
> > > > > > >>>          org.apache.drill.exec.exception.SchemaChangeException: Failure
> > > > > > >>>          while trying to materialize incoming schema.  Errors:
> > > > > > >>>          - Error in expression at index -1.  Error: Missing function
> > > > > > >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> > > > > > >>>          expression: --UNKNOWN EXPRESSION--
> > > > > > >>>    3. In your function definition for aggregate functions you need
> > > > > > >>>    to set null processing to internal and your isRandom to false.
> > > > > > >>>    Example below:
> > > > > > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> > > > > > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > > > > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > > > > >>>       isBinaryCommutative = false, costCategory =
> > > > > > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > > > >>>
> > > > > > >>> Below is an example from the Apache Drill tutorial data sets contained
> > > > > > >>> in the MapR Apache Drill sandbox. I am pulling an array of string
> > > > > > >>> values from json data. The string values are actually integers. I am
> > > > > > >>> converting to string and summing each array entry to the final tally.
> > > > > > >>> This in no way represents what this data was for, but it did become a
> > > > > > >>> handy way for me to peck out the "correct" way to build an aggregation
> > > > > > >>> UDF function.
> > > > > >>>
> > > > > >>> @FunctionTemplate(name = "MyArraySum", scope =
> > > > > >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > > > >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > > > >>> isBinaryCommutative = false, costCategory =
> > > > > >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > > > >>> public static class MyArraySum implements DrillAggFunc {
> > > > > >>>
> > > > > >>> @Param RepeatedVarCharHolder listToSearch;
> > > > > >>> @Workspace NullableBigIntHolder count;
> > > > > >>> @Workspace NullableBigIntHolder sum;
> > > > > >>> @Workspace NullableVarCharHolder vc;
> > > > > >>> @Output BigIntHolder out;
> > > > > >>>
> > > > > >>> @Override
> > > > > >>> public void setup() {
> > > > > >>> count.value=0;
> > > > > >>> sum.value = 0;
> > > > > >>> }
> > > > > >>>
> > > > > >>> @Override
> > > > > >>> public void add() {
> > > > > >>> int c = listToSearch.end - listToSearch.start;
> > > > > >>> int val = 0;
> > > > > >>> try {
> > > > > >>> for(int i=0; i<c; i++){
> > > > > >>> listToSearch.vector.getAccessor().get(i, vc);
> > > > > > >>> String inputStr =
> > > > > > >>>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
> > > > > > >>>         vc.start, vc.end, vc.buffer);
> > > > > >>> val = Integer.parseInt(inputStr);
> > > > > >>> sum.value = sum.value + val;
> > > > > >>> }
> > > > > >>> } catch (Exception e) {
> > > > > >>> val = 0;
> > > > > >>> }
> > > > > >>> count.value = count.value + 1;
> > > > > >>> }
> > > > > >>>
> > > > > >>> Example select statement:
> > > > > > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> > > > > > >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
> > > > > >>>
> > > > > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <
> > ted.dunning@gmail.com
> > > >
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Jim,
> > > > > >>>>
> > > > > >>>> I think that you may be having trouble with aggregators in
> > > general.
> > > > > >>>>
> > > > > >>>> Have you been able to build *any* aggregator of anything?  I
> > > > haven't.
> > > > > >>>>
> > > > > >>>> When I try to build an aggregator of int's or doubles, I get a
> > > very
> > > > > >>>> persistent problem with Drill even seeing my aggregates:
> > > > > >>>>
> > > > > >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> > > > > >>>> cp.`employee.json`;*
> > > > > >>>>
> > > > > >>>> Jul 04, 2015 4:19:35 PM
> > > > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > > > >>>>
> > > > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException:
> > No
> > > > > match
> > > > > >>>> found for function signature sum_int(<ANY>)
> > > > > >>>>
> > > > > >>>> Jul 04, 2015 4:19:35 PM
> > > org.apache.calcite.runtime.CalciteException
> > > > > >>>> <init>
> > > > > >>>>
> > > > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException:
> From
> > > > line
> > > > > 1,
> > > > > >>>> column 8 to line 1, column 27: No match found for function
> > > signature
> > > > > >>>> sum_int(<ANY>)
> > > > > >>>>
> > > > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column
> 27:
> > > No
> > > > > >>>> match
> > > > > >>>> found for function signature sum_int(<ANY>)*
> > > > > >>>>
> > > > > >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on
> > > 10.0.1.2:31010
> > > > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > > > > >>>>
> > > > > >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as
> > int))
> > > > from
> > > > > >>>> cp.`employee.json`*;
> > > > > >>>>
> > > > > >>>> Jul 04, 2015 4:19:45 PM
> > > > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > > > >>>>
> > > > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException:
> > No
> > > > > match
> > > > > >>>> found for function signature sum_int(<NUMERIC>)
> > > > > >>>>
> > > > > >>>> Jul 04, 2015 4:19:45 PM
> > > org.apache.calcite.runtime.CalciteException
> > > > > >>>> <init>
> > > > > >>>>
> > > > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException:
> From
> > > > line
> > > > > 1,
> > > > > >>>> column 8 to line 1, column 40: No match found for function
> > > signature
> > > > > >>>> sum_int(<NUMERIC>)
> > > > > >>>>
> > > > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column
> 40:
> > > No
> > > > > >>>> match
> > > > > >>>> found for function signature sum_int(<NUMERIC>)*
> > > > > >>>>
> > > > > >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on
> > > 10.0.1.2:31010
> > > > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > > > > >>>>
> > > > > >>>> 0: jdbc:drill:zk=local>
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> It looks like there is some undocumented subtlety about how to
> > > > > > >>>> register an aggregator.
> > > > > >>>>
> > > > > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <
> jbates@maprtech.com>
> > > > > wrote:
> > > > > >>>>
> > > > > > >>>> > I'm working on the same thing. I want to aggregate a list of values.
> > > > > > >>>> > It has been a search and guess game for the most part. I'm still
> > > > > > >>>> > stuck in the process of getting the values all into a list. The
> > > > > > >>>> > writers look interesting but for aggregation functions it looks like
> > > > > > >>>> > the input is the param and output objects can't hold the aggregation
> > > > > > >>>> > steps. The Workspace is where that happens. If I try and use a Writer
> > > > > > >>>> > in a workspace it won't load and tells me to change it to Holders,
> > > > > > >>>> > which was why I was using them to start with. Maybe I'm missing the
> > > > > > >>>> > architecture of the agg function. It looked like it was....
> > > > > > >>>> >
> > > > > > >>>> > @Param comes in -> initialize @Workspace vars in setup -> process
> > > > > > >>>> > data through @Workspace vars in add -> finalize @Output in output.
> > > > > > >>>> >
> > > > > > >>>> > So I'm back to trying to figure out how to create a
> > > > > > >>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
> > > > > >>>> >
> > > > > >>>> >
> > > > > >>>> >
> > > > > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <
> > > > ted.dunning@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>> >
> > > > > >>>> > > I am working on trying to build any kind of list
> > constructing
> > > > > >>>> aggregator
> > > > > >>>> > > and having absolute fits.
> > > > > >>>> > >
> > > > > > >>>> > > To simplify life, I decided to just build a generic list builder
> > > > > > >>>> > > that is a scalar function that returns a list containing its
> > > > > > >>>> > > argument.  Thus zoop(3) => [3], zoop('abc') => ['abc'] and
> > > > > > >>>> > > zoop([1,2,3]) => [[1,2,3]].
> > > > > >>>> > >
> > > > > >>>> > > The ComplexWriter looks like the place to go. As usual,
> the
> > > > > >>>> complete lack
> > > > > >>>> > > of comments in most of Drill makes this very hard since I
> > have
> > > > to
> > > > > >>>> guess
> > > > > >>>> > > what works and what doesn't.
> > > > > >>>> > >
> > > > > >>>> > > In my code, I note that ComplexWriter has a nice
> > rootAsList()
> > > > > >>>> method.  I
> > > > > >>>> > > used this in zip and it works nicely to construct lists
> for
> > > > > >>>> output.  I
> > > > > >>>> > note
> > > > > >>>> > > that the resulting ListWriter has a method
> > > > copyReader(FieldReader
> > > > > >>>> var1)
> > > > > >>>> > > which looks really good.
> > > > > >>>> > >
> > > > > > >>>> > > Unfortunately, the only implementation of copyReader() is in
> > > > > > >>>> > > AbstractFieldWriter and it looks like this:
> > > > > >>>> > >
> > > > > >>>> > > public void copyReader(FieldReader reader) {
> > > > > >>>> > >     this.fail("Copy FieldReader");
> > > > > >>>> > > }
> > > > > >>>> > >
> > > > > >>>> > > I would like to formally say at this point "WTF"?
> > > > > >>>> > >
> > > > > >>>> > > In digging in further, I see other methods that look handy
> > > like
> > > > > >>>> > >
> > > > > >>>> > > public void write(IntHolder holder) {
> > > > > >>>> > >     this.fail("Int");
> > > > > >>>> > > }
> > > > > >>>> > >
> > > > > >>>> > > And then in looking at implementations, it looks like
> there
> > > is a
> > > > > >>>> > > combinatorial explosion because every type seems to need a
> > > write
> > > > > >>>> method
> > > > > >>>> > for
> > > > > >>>> > > every other type.
> > > > > >>>> > >
> > > > > >>>> > > What is the thought here?  How can I copy an arbitrary
> value
> > > > into
> > > > > a
> > > > > >>>> list?
> > > > > >>>> > >
> > > > > >>>> > > My next thought was to build code that dispatches on type.
> > > > There
> > > > > >>>> is a
> > > > > >>>> > > method called getType() on the FieldReader.
> Unfortunately,
> > > that
> > > > > >>>> drives
> > > > > >>>> > > into code generated by protoc and I see no way to dispatch
> > on
> > > > the
> > > > > >>>> type of
> > > > > >>>> > > an incoming value.
> > > > > >>>> > >
> > > > > >>>> > >
> > > > > >>>> > > How is this supposed to work?
> > > > > >>>> > >
> > > > > >>>> > >
> > > > > >>>> > >
> > > > > >>>> > >
> > > > > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
> > > > > baid.mehant@gmail.com>
> > > > > >>>> > wrote:
> > > > > >>>> > >
> > > > > > >>>> > > > For a detailed example on using ComplexWriter interface you can
> > > > > > >>>> > > > take a look at the Mappify
> > > > > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
> > > > > > >>>> > > > (kvgen) function. The function itself is very simple however it
> > > > > > >>>> > > > makes use of the utility methods in MappifyUtility
> > > > > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
> > > > > > >>>> > > > and MapUtility
> > > > > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
> > > > > > >>>> > > > which perform most of the work.
> > > > > > >>>> > > >
> > > > > > >>>> > > > Currently we don't have a generic infrastructure to handle errors
> > > > > > >>>> > > > coming out of functions. However there is UserException, which
> > > > > > >>>> > > > when raised will make sure that Drill does not gobble up the error
> > > > > > >>>> > > > message in that exception. So you can probably throw a
> > > > > > >>>> > > > UserException with the failing input in your function to make sure
> > > > > > >>>> > > > it propagates to the user.
> > > > > >>>> > > >
> > > > > >>>> > > > Thanks
> > > > > >>>> > > > Mehant
> > > > > >>>> > > >
> > > > > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
> > > > > >>>> jacques@apache.org>
> > > > > >>>> > > wrote:
> > > > > >>>> > > >
> > > > > > >>>> > > > > *Holders are for both input and output.  You can also use
> > > > > > >>>> > > > > ComplexWriter for output and FieldReader for input if you want to
> > > > > > >>>> > > > > write or read a complex value.
> > > > > > >>>> > > > >
> > > > > > >>>> > > > > I don't think we've provided a really clean way to construct a
> > > > > > >>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
> > > > > > >>>> > > > > reaching into a bunch of internal interfaces in Drill.  However, I
> > > > > > >>>> > > > > would recommend using the ComplexWriter output pattern for now.
> > > > > > >>>> > > > > This will be a little less efficient but substantially less
> > > > > > >>>> > > > > brittle.  I suggest you open up a jira for using a Repeated*Holder
> > > > > > >>>> > > > > as an output.
> > > > > >>>> > > > >
> > > > > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
> > > > > >>>> ted.dunning@gmail.com>
> > > > > >>>> > > > wrote:
> > > > > >>>> > > > >
> > > > > >>>> > > > > > Holders are for input, I think.
> > > > > >>>> > > > > >
> > > > > >>>> > > > > > Try the different kinds of writers.
> > > > > >>>> > > > > >
> > > > > >>>> > > > > >
> > > > > >>>> > > > > >
> > > > > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
> > > > > >>>> jbates@maprtech.com>
> > > > > >>>> > > > wrote:
> > > > > >>>> > > > > >
> > > > > > >>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was
> > > > > > >>>> > > > > > > working on a custom aggregator function using DrillAggFunc. In
> > > > > > >>>> > > > > > > this I can do simple things but if I want to build a list of
> > > > > > >>>> > > > > > > values and do something with it in the final output method I
> > > > > > >>>> > > > > > > think I need to use RepeatedHolders in the @Workspace. To do
> > > > > > >>>> > > > > > > that I need to create a new one in the setup method. I can't
> > > > > > >>>> > > > > > > get one built. They all require a BufferAllocator to be passed
> > > > > > >>>> > > > > > > in to build it. I have not found a way to get an allocator
> > > > > > >>>> > > > > > > yet. Any suggestions?
> > > > > >>>> > > > > > >
> > > > > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> > > > > >>>> > ted.dunning@gmail.com
> > > > > >>>> > > >
> > > > > >>>> > > > > > wrote:
> > > > > >>>> > > > > > >
> > > > > > >>>> > > > > > > > If you look at the zip function in
> > > > > > >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
> > > > > > >>>> > > > > > > > have an example of building a structure.
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > The basic idea is that your output is denoted as
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > >         @Output
> > > > > > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > The pattern for building a list of lists of integers is like
> > > > > > >>>> > > > > > > > this:
> > > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > >         writer.setValueCount(n);
> > > > > > >>>> > > > > > > >         ...
> > > > > > >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> > > > > > >>>> > > > > > > >         outer.start(); // [ outer list
> > > > > > >>>> > > > > > > >         ...
> > > > > > >>>> > > > > > > >         // for each inner list
> > > > > > >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> > > > > > >>>> > > > > > > >             inner.start();
> > > > > > >>>> > > > > > > >             // for each inner list element
> > > > > > >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> > > > > > >>>> > > > > > > >             }
> > > > > > >>>> > > > > > > >             inner.end();   // ] inner list
> > > > > > >>>> > > > > > > >         }
> > > > > > >>>> > > > > > > >         outer.end(); // ] outer list
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> > > > > >>>> > jbates@maprtech.com>
> > > > > >>>> > > > > > wrote:
> > > > > >>>> > > > > > > >
> > > > > > >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been trying
> > > > > > >>>> > > > > > > > > to document and understand each of the options available in a
> > > > > > >>>> > > > > > > > > Drill UDF. Understanding the different FunctionScope's, the
> > > > > > >>>> > > > > > > > > ones that are allowed, the ones that are not. The impact of
> > > > > > >>>> > > > > > > > > different cost categories. The different steps needed to
> > > > > > >>>> > > > > > > > > understand handling any of the supported data types and
> > > > > > >>>> > > > > > > > > structures in drill.
> > > > > > >>>> > > > > > > > >
> > > > > > >>>> > > > > > > > > Here are a few of my current road blocks. Any pointers would
> > > > > > >>>> > > > > > > > > be greatly appreciated.
> > > > > >>>> > > > > > > > >
> > > > > >>>> > > > > > > > >
> > > > > > >>>> > > > > > > > >    1. I've been trying to understand how to correctly use
> > > > > > >>>> > > > > > > > >    RepeatedHolders of whatever type. For this discussion let's
> > > > > > >>>> > > > > > > > >    start with a RepeatedBigIntHolder. I'm trying to figure out
> > > > > > >>>> > > > > > > > >    the best way to create a new one. I have not figured out
> > > > > > >>>> > > > > > > > >    where in the existing drill code someone does this. If I use
> > > > > > >>>> > > > > > > > >    a RepeatedBigIntHolder as a Workspace object it is null to
> > > > > > >>>> > > > > > > > >    start with. I created a new one in the startup section of
> > > > > > >>>> > > > > > > > >    the udf but the vector was null. I can find no reference for
> > > > > > >>>> > > > > > > > >    creating a new BigIntVector. There is a way to create a
> > > > > > >>>> > > > > > > > >    BigIntVector and I did find an example of creating a new
> > > > > > >>>> > > > > > > > >    VarCharVector but I can't do that using the drill jar files
> > > > > > >>>> > > > > > > > >    from 1.0. The org.apache.drill.common.types.TypeProtos and
> > > > > > >>>> > > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType
> > > > > > >>>> > > > > > > > >    classes do not appear to be accessible from the drill jar
> > > > > > >>>> > > > > > > > >    files.
> > > > > > >>>> > > > > > > > >    2. What is the best way to close out a UDF in the event it
> > > > > > >>>> > > > > > > > >    generates an exception? Are there specific steps one should
> > > > > > >>>> > > > > > > > >    follow to make a clean exit in a catch block that are
> > > > > > >>>> > > > > > > > >    beneficial to Drill?
> > > > > >>>> > > > > > > > >
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > >
> > > > > >>>> > > > > >
> > > > > >>>> > > > >
> > > > > >>>> > > >
> > > > > >>>> > >
> > > > > >>>> >
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Some questions on UDFs

Posted by Jacques Nadeau <ja...@apache.org>.
You're right.  You're off the beaten path. I think everyone here would love
to have more documentation and more comments. Of course, all of these take
time.

If you have time to volunteer to help improve these things, that would be
great.

With regard to the question about the JIRA, describe your use case and
what functionality you couldn't find or make work. The active developers on
the project can then do their best to help shape the JIRA into better docs,
javadocs and/or new functionality as time allows.
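
For anyone writing up that use case, the aggregate lifecycle discussed in
this thread (initialize workspace state in setup, fold each row in add, emit
in output, clear in reset) can be sketched in plain Java with no Drill
classes at all. This is only an illustration under stated assumptions:
RepeatedStringWindow and ArraySumSketch are hypothetical names, not Drill
APIs, and the window class merely imitates the way a repeated holder exposes
a start/end range over a shared vector.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a repeated holder: the values for one row are the
// window [start, end) of a flat vector that is shared by every row.
final class RepeatedStringWindow {
    final byte[][] vector; // UTF-8 encoded values for all rows, back to back
    final int start;       // index of this row's first value (inclusive)
    final int end;         // index one past this row's last value (exclusive)

    RepeatedStringWindow(byte[][] vector, int start, int end) {
        this.vector = vector;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical aggregate that follows setup() -> add() -> output() -> reset().
final class ArraySumSketch {
    private long sum;  // plays the role of a workspace value
    private long rows; // number of rows folded in so far

    void setup() {
        sum = 0;
        rows = 0;
    }

    // Called once per input row: walk the row's window of the shared vector.
    void add(RepeatedStringWindow in) {
        for (int i = in.start; i < in.end; i++) {
            String text = new String(in.vector[i], StandardCharsets.UTF_8);
            try {
                sum += Long.parseLong(text.trim());
            } catch (NumberFormatException e) {
                // Skip entries that are not integers instead of failing the row.
            }
        }
        rows++;
    }

    long output() { return sum; }

    long rowCount() { return rows; }

    void reset() { setup(); }

    // Helper for building test input: encode each string as UTF-8 bytes.
    static byte[][] utf8(String... values) {
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        return encoded;
    }
}
```

The point of the sketch is the indexing: each row reads from its own start
up to its own end in the shared vector, rather than from index 0, and all
running state lives in fields that setup and reset control.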

On Jul 5, 2015 1:37 PM, "Ted Dunning" <te...@gmail.com> wrote:

> Uh... actually, I think that it isn't obvious because there is absolutely
> no documentation and there are no comments in the code.
>
> And what should the JIRA say?  We can't even tell what's missing, if
> anything, because we can't tell how it is supposed to work.
>
>
>
>
> On Sun, Jul 5, 2015 at 11:50 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > It isn't obvious because you shouldn't do it.  Please file a JIRA to add
> > real support for this type of output.
> >
> > Your current function would leak large amounts of memory that would
> > ultimately crash the node.
> >
> > Realistically, there are very few internal Drill APIs that you should
> > access via a UDF (injectables, holders, complexwriter, fieldreader and
> > helpers).  A post 1.0 goal was to provide a UDF interface JAR to ensure
> > people don't accidentally reach into Drill's internals.  (A later
> > possibility is bytecode weaving to completely protect against it).
> >
> > J
> >
> > On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > That was impressively non-obvious.
> > >
> > >
> > >
> > > On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:
> > >
> > > > > I did get a new RepeatedBigIntHolder built and added a BigIntVector
> > > > > to it. I'll try it in the UDF tomorrow and see if there is a difference
> > > > > in the ways I found to get a BufferAllocator.
> > > >
> > > > .
> > > > .
> > > > .
> > > > @Inject DrillBuf buffer;
> > > > @Workspace RepeatedBigIntHolder yList;
> > > > .
> > > > .
> > > > .
> > > > @Override
> > > > public void setup() {
> > > > .
> > > > .
> > > > .
> > > > > //org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> > > > > org.apache.drill.exec.memory.BufferAllocator allocator =
> > > > >     new org.apache.drill.exec.memory.TopLevelAllocator();
> > > > > yList = new RepeatedBigIntHolder();
> > > > > yList.vector = new org.apache.drill.exec.vector.BigIntVector(
> > > > >     org.apache.drill.exec.record.MaterializedField.create(
> > > > >         new org.apache.drill.common.expression.SchemaPath("bigints",
> > > > >             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
> > > > >         org.apache.drill.common.types.Types.optional(
> > > > >             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> > > > >     allocator);
> > > > .
> > > > .
> > > > .
> > > > }
> > > >
> > > >
> > > >
> > > > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com>
> wrote:
> > > >
> > > > > > I still have issues finding the correct way to create and use a
> > > > > > RepeatedHolder, and Writers are a non-starter for Workspace values. I
> > > > > > can make do with creating a concatenated string in a VarCharHolder for
> > > > > > small data sets to get past this in the short term and finish testing
> > > > > > the output values I expect, but I won't be able to do any scale until I
> > > > > > figure out how to make a repeated list.
> > > > >
> > > > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com>
> > wrote:
> > > > >
> > > > > >> Well... Converting from strings to integers anyway... Too many 4th of
> > > > > >> July hot dogs. Going into nitrate overload. :)
> > > > > >>
> > > > > >> I am pulling an array of string values from json data. The string
> > > > > >> values are actually integers. I am converting to integers and summing
> > > > > >> each array entry to the final tally.
> > > > >>
> > > > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com>
> > > wrote:
> > > > >>
> > > > >>> Ted,
> > > > >>>
> > > > > >>> Yes, I started out just getting a basic count to work. I am trying to
> > > > > >>> keep the workflow as close to a basic user as possible. As such, I am
> > > > > >>> building and using the MapR Apache Drill sandbox to test.
> > > > >>>
> > > > >>>
> > > > > >>>    1. Always look at the drillbit.log file to see if Drill had any
> > > > > >>>    issues loading your UDF. That was where I learned that all
> > > > > >>>    workspace values needed to be holders:
> > > > > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> > > > > >>>       function class
> > > > > >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> > > > > >>>       field xList. Aggregate function 'MyLinearRegression1' workspace
> > > > > >>>       variable 'xList' is of type 'interface
> > > > > >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> > > > > >>>       Please change it to Holder type.
> > > > > >>>    2. Error messages:
> > > > > >>>       - If you get an error in this format it means that Drill cannot
> > > > > >>>       find your function, so it probably didn't load it. Back to
> > > > > >>>       step 1:
> > > > > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> > > > > >>>          match found for function signature MyFunctionName(<ANY>)
> > > > > >>>       - If you get an error in this format it means that the function
> > > > > >>>       is there but Drill could not find a signature that matched the
> > > > > >>>       param types or param numbers you were passing it. The exact
> > > > > >>>       wording will change but "Missing function implementation" is
> > > > > >>>       the key phrase to look for:
> > > > > >>>          - Error: SYSTEM ERROR:
> > > > > >>>          org.apache.drill.exec.exception.SchemaChangeException: Failure
> > > > > >>>          while trying to materialize incoming schema.  Errors:
> > > > > >>>          - Error in expression at index -1.  Error: Missing function
> > > > > >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> > > > > >>>          expression: --UNKNOWN EXPRESSION--
> > > > > >>>    3. In your function definition for aggregate functions you need
> > > > > >>>    to set null processing to internal and your isRandom to false.
> > > > > >>>    Example below:
> > > > > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> > > > > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > > > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > > > >>>       isBinaryCommutative = false, costCategory =
> > > > > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > > >>>
> > > > > >>> Below is an example from the Apache Drill tutorial data sets contained
> > > > > >>> in the MapR Apache Drill sandbox. I am pulling an array of string
> > > > > >>> values from json data. The string values are actually integers. I am
> > > > > >>> converting to string and summing each array entry to the final tally.
> > > > > >>> This in no way represents what this data was for but it did become a
> > > > > >>> handy way for me to peck out the "correct" way to build an aggregation
> > > > > >>> UDF function
> > > > >>>
> > > > > >>> @FunctionTemplate(name = "MyArraySum", scope =
> > > > > >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > > > >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > > > >>> isBinaryCommutative = false, costCategory =
> > > > > >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > > > >>> public static class MyArraySum implements DrillAggFunc {
> > > > > >>>
> > > > > >>> @Param RepeatedVarCharHolder listToSearch;
> > > > > >>> @Workspace NullableBigIntHolder count;
> > > > > >>> @Workspace NullableBigIntHolder sum;
> > > > > >>> @Workspace NullableVarCharHolder vc;
> > > > > >>> @Output BigIntHolder out;
> > > > > >>>
> > > > > >>> @Override
> > > > > >>> public void setup() {
> > > > > >>> count.value=0;
> > > > > >>> sum.value = 0;
> > > > > >>> }
> > > > > >>>
> > > > > >>> @Override
> > > > > >>> public void add() {
> > > > > >>> int val = 0;
> > > > > >>> try {
> > > > > >>> // start and end index into the shared vector, so read from start
> > > > > >>> for(int i=listToSearch.start; i<listToSearch.end; i++){
> > > > > >>> listToSearch.vector.getAccessor().get(i, vc);
> > > > > >>> String inputStr =
> > > > > >>> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
> > > > > >>> vc.start, vc.end, vc.buffer);
> > > > > >>> val = Integer.parseInt(inputStr);
> > > > > >>> sum.value = sum.value + val;
> > > > > >>> }
> > > > > >>> } catch (Exception e) {
> > > > > >>> // a value that fails to parse ends the loop for this array
> > > > > >>> val = 0;
> > > > > >>> }
> > > > > >>> count.value = count.value + 1;
> > > > > >>> }
> > > > > >>> // output() and reset() are not shown here
> > > > >>>
> > > > >>> Example select statement:
> > > > > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> > > > > >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
> > > > >>>
> > > > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <
> ted.dunning@gmail.com
> > >
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Jim,
> > > > >>>>
> > > > > >>>> I think that you may be having trouble with aggregators in general.
> > > > > >>>>
> > > > > >>>> Have you been able to build *any* aggregator of anything?  I haven't.
> > > > > >>>>
> > > > > >>>> When I try to build an aggregator of ints or doubles, I get a very
> > > > > >>>> persistent problem with Drill even seeing my aggregates:
> > > > >>>>
> > > > >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> > > > >>>> cp.`employee.json`;*
> > > > >>>>
> > > > >>>> Jul 04, 2015 4:19:35 PM
> > > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > > >>>>
> > > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException:
> No
> > > > match
> > > > >>>> found for function signature sum_int(<ANY>)
> > > > >>>>
> > > > >>>> Jul 04, 2015 4:19:35 PM
> > org.apache.calcite.runtime.CalciteException
> > > > >>>> <init>
> > > > >>>>
> > > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
> > > line
> > > > 1,
> > > > >>>> column 8 to line 1, column 27: No match found for function
> > signature
> > > > >>>> sum_int(<ANY>)
> > > > >>>>
> > > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27:
> > No
> > > > >>>> match
> > > > >>>> found for function signature sum_int(<ANY>)*
> > > > >>>>
> > > > >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on
> > 10.0.1.2:31010
> > > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > > > >>>>
> > > > >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as
> int))
> > > from
> > > > >>>> cp.`employee.json`*;
> > > > >>>>
> > > > >>>> Jul 04, 2015 4:19:45 PM
> > > > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > > >>>>
> > > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException:
> No
> > > > match
> > > > >>>> found for function signature sum_int(<NUMERIC>)
> > > > >>>>
> > > > >>>> Jul 04, 2015 4:19:45 PM
> > org.apache.calcite.runtime.CalciteException
> > > > >>>> <init>
> > > > >>>>
> > > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
> > > line
> > > > 1,
> > > > >>>> column 8 to line 1, column 40: No match found for function
> > signature
> > > > >>>> sum_int(<NUMERIC>)
> > > > >>>>
> > > > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40:
> > No
> > > > >>>> match
> > > > >>>> found for function signature sum_int(<NUMERIC>)*
> > > > >>>>
> > > > >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on
> > 10.0.1.2:31010
> > > > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > > > >>>>
> > > > >>>> 0: jdbc:drill:zk=local>
> > > > >>>>
> > > > >>>>
> > > > > >>>> It looks like there is some undocumented subtlety about how to
> > > > > >>>> register an aggregator.
> > > > >>>>
> > > > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com>
> > > > wrote:
> > > > >>>>
> > > > > >>>> > I'm working on the same thing. I want to aggregate a list of values.
> > > > > >>>> > It has been a search and guess game for the most part. I'm still
> > > > > >>>> > stuck in the process of getting the values all into a list. The
> > > > > >>>> > writers look interesting but for aggregation functions it looks like
> > > > > >>>> > the input is the param and output objects can't hold the aggregation
> > > > > >>>> > steps. The Workspace is where that happens. If I try to use a Writer
> > > > > >>>> > in a workspace it won't load and tells me to change it to Holders,
> > > > > >>>> > which was why I was using them to start with. Maybe I'm missing the
> > > > > >>>> > architecture of the agg function. It looked like it was....
> > > > > >>>> >
> > > > > >>>> > @Param comes in -> initialize @Workspace vars in setup -> process
> > > > > >>>> > data through @Workspace vars in add -> finalize @Output in output.
> > > > > >>>> >
> > > > > >>>> > So I'm back to trying to figure out how to create a
> > > > > >>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
> > > > >>>> >
> > > > >>>> >
> > > > >>>> >
> > > > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <
> > > ted.dunning@gmail.com>
> > > > >>>> wrote:
> > > > >>>> >
> > > > > >>>> > > I am working on trying to build any kind of list-constructing
> > > > > >>>> > > aggregator and having absolute fits.
> > > > > >>>> > >
> > > > > >>>> > > To simplify life, I decided to just build a generic list builder
> > > > > >>>> > > that is a scalar function that returns a list containing its
> > > > > >>>> > > argument. Thus zoop(3) => [3], zoop('abc') => ['abc'] and
> > > > > >>>> > > zoop([1,2,3]) => [[1,2,3]].
> > > > > >>>> > > The ComplexWriter looks like the place to go. As usual, the
> > > > > >>>> > > complete lack of comments in most of Drill makes this very hard
> > > > > >>>> > > since I have to guess what works and what doesn't.
> > > > > >>>> > >
> > > > > >>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
> > > > > >>>> > > method.  I used this in zip and it works nicely to construct lists
> > > > > >>>> > > for output.  I note that the resulting ListWriter has a method
> > > > > >>>> > > copyReader(FieldReader var1) which looks really good.
> > > > > >>>> > >
> > > > > >>>> > > Unfortunately, the only implementation of copyReader() is in
> > > > > >>>> > > AbstractFieldWriter and it looks like this:
> > > > >>>> > >
> > > > >>>> > > public void copyReader(FieldReader reader) {
> > > > >>>> > >     this.fail("Copy FieldReader");
> > > > >>>> > > }
> > > > >>>> > >
> > > > >>>> > > I would like to formally say at this point "WTF"?
> > > > >>>> > >
> > > > >>>> > > In digging in further, I see other methods that look handy
> > like
> > > > >>>> > >
> > > > >>>> > > public void write(IntHolder holder) {
> > > > >>>> > >     this.fail("Int");
> > > > >>>> > > }
> > > > >>>> > >
> > > > > >>>> > > And then in looking at implementations, it looks like there is a
> > > > > >>>> > > combinatorial explosion because every type seems to need a write
> > > > > >>>> > > method for every other type.
> > > > > >>>> > >
> > > > > >>>> > > What is the thought here?  How can I copy an arbitrary value into
> > > > > >>>> > > a list?
> > > > > >>>> > >
> > > > > >>>> > > My next thought was to build code that dispatches on type.  There
> > > > > >>>> > > is a method called getType() on the FieldReader.  Unfortunately,
> > > > > >>>> > > that drives into code generated by protoc and I see no way to
> > > > > >>>> > > dispatch on the type of an incoming value.
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > > How is this supposed to work?
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
> > > > baid.mehant@gmail.com>
> > > > >>>> > wrote:
> > > > >>>> > >
> > > > > >>>> > > > For a detailed example on using ComplexWriter interface you can
> > > > > >>>> > > > take a look at the Mappify
> > > > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
> > > > > >>>> > > > (kvgen) function. The function itself is very simple however it
> > > > > >>>> > > > makes use of the utility methods in MappifyUtility
> > > > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
> > > > > >>>> > > > and MapUtility
> > > > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
> > > > > >>>> > > > which perform most of the work.
> > > > > >>>> > > >
> > > > > >>>> > > > Currently we don't have a generic infrastructure to handle errors
> > > > > >>>> > > > coming out of functions. However there is UserException, which
> > > > > >>>> > > > when raised will make sure that Drill does not gobble up the error
> > > > > >>>> > > > message in that exception. So you can probably throw a
> > > > > >>>> > > > UserException with the failing input in your function to make sure
> > > > > >>>> > > > it propagates to the user.
> > > > >>>> > > >
> > > > >>>> > > > Thanks
> > > > >>>> > > > Mehant
> > > > >>>> > > >
> > > > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
> > > > >>>> jacques@apache.org>
> > > > >>>> > > wrote:
> > > > >>>> > > >
> > > > > >>>> > > > > *Holders are for both input and output.  You can also use
> > > > > >>>> > > > > ComplexWriter for output and FieldReader for input if you want to
> > > > > >>>> > > > > write or read a complex value.
> > > > > >>>> > > > >
> > > > > >>>> > > > > I don't think we've provided a really clean way to construct a
> > > > > >>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
> > > > > >>>> > > > > reaching into a bunch of internal interfaces in Drill.  However, I
> > > > > >>>> > > > > would recommend using the ComplexWriter output pattern for now.
> > > > > >>>> > > > > This will be a little less efficient but substantially less
> > > > > >>>> > > > > brittle.  I suggest you open up a jira for using a Repeated*Holder
> > > > > >>>> > > > > as an output.
> > > > >>>> > > > >
> > > > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
> > > > >>>> ted.dunning@gmail.com>
> > > > >>>> > > > wrote:
> > > > >>>> > > > >
> > > > >>>> > > > > > Holders are for input, I think.
> > > > >>>> > > > > >
> > > > >>>> > > > > > Try the different kinds of writers.
> > > > >>>> > > > > >
> > > > >>>> > > > > >
> > > > >>>> > > > > >
> > > > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
> > > > >>>> jbates@maprtech.com>
> > > > >>>> > > > wrote:
> > > > >>>> > > > > >
> > > > > >>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was
> > > > > >>>> > > > > > > working on a custom aggregator function using DrillAggFunc. In
> > > > > >>>> > > > > > > this I can do simple things but if I want to build a list of
> > > > > >>>> > > > > > > values and do something with it in the final output method I
> > > > > >>>> > > > > > > think I need to use RepeatedHolders in the @Workspace. To do
> > > > > >>>> > > > > > > that I need to create a new one in the setup method. I can't
> > > > > >>>> > > > > > > get one built. They all require a BufferAllocator to be passed
> > > > > >>>> > > > > > > in to build it. I have not found a way to get an allocator
> > > > > >>>> > > > > > > yet. Any suggestions?
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> > > > >>>> > ted.dunning@gmail.com
> > > > >>>> > > >
> > > > >>>> > > > > > wrote:
> > > > >>>> > > > > > >
> > > > > >>>> > > > > > > > If you look at the zip function in
> > > > > >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
> > > > > >>>> > > > > > > > have an example of building a structure.
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > > The basic idea is that your output is denoted as
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > >         @Output
> > > > > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > > The pattern for building a list of lists of integers is like
> > > > > >>>> > > > > > > > this:
> > > > > >>>> > > > > > > >
> > > > > >>>> > > > > > > >         writer.setValueCount(n);
> > > > > >>>> > > > > > > >         ...
> > > > > >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> > > > > >>>> > > > > > > >         outer.start(); // [ outer list
> > > > > >>>> > > > > > > >         ...
> > > > > >>>> > > > > > > >         // for each inner list
> > > > > >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> > > > > >>>> > > > > > > >             inner.start();
> > > > > >>>> > > > > > > >             // for each inner list element
> > > > > >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> > > > > >>>> > > > > > > >             }
> > > > > >>>> > > > > > > >             inner.end();   // ] inner list
> > > > > >>>> > > > > > > >         }
> > > > > >>>> > > > > > > >         outer.end(); // ] outer list
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <jbates@maprtech.com> wrote:
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been trying to document
> > > > >>>> > > > > > > > > and understand each of the options available in a Drill UDF. Understanding
> > > > >>>> > > > > > > > > the different FunctionScope's, the ones that are allowed, the ones that are
> > > > >>>> > > > > > > > > not. The impact of different cost categories. The different steps needed
> > > > >>>> > > > > > > > > to understand handling any of the supported data types and structures in
> > > > >>>> > > > > > > > > drill.
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > > > Here are a few of my current road blocks. Any pointers would be greatly
> > > > >>>> > > > > > > > > appreciated.
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > > >    1. I've been trying to understand how to correctly use RepeatedHolders
> > > > >>>> > > > > > > > >    of whatever type. For this discussion let's start with a
> > > > >>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the best way to create a new
> > > > >>>> > > > > > > > >    one. I have not figured out where in the existing drill code someone does
> > > > >>>> > > > > > > > >    this. If I use a RepeatedBigIntHolder as a Workspace object it is null to
> > > > >>>> > > > > > > > >    start with. I created a new one in the startup section of the udf but the
> > > > >>>> > > > > > > > >    vector was null. I can find no reference in creating a new BigIntVector.
> > > > >>>> > > > > > > > >    There is a way to create a BigIntVector and I did find an example of
> > > > >>>> > > > > > > > >    creating a new VarCharVector but I can't do that using the drill jar files
> > > > >>>> > > > > > > > >    from 1.0. The org.apache.drill.common.types.TypeProtos and
> > > > >>>> > > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType classes do not
> > > > >>>> > > > > > > > >    appear to be accessible from the drill jar files.
> > > > >>>> > > > > > > > >    2. What is the best way to close out a UDF in the event it generates an
> > > > >>>> > > > > > > > >    exception? Are there specific steps one should follow to make a clean exit
> > > > >>>> > > > > > > > >    in a catch block that are beneficial to Drill?
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > >
> > > > >>>> > > > > >
> > > > >>>> > > > >
> > > > >>>> > > >
> > > > >>>> > >
> > > > >>>> >
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Some questions on UDFs

Posted by Ted Dunning <te...@gmail.com>.
Uh... actually, I think that it isn't obvious because there is absolutely
no documentation and there are no comments in the code.

And what should the JIRA say?  We can't even tell what's missing, if
anything, because we can't tell how it is supposed to work.




On Sun, Jul 5, 2015 at 11:50 AM, Jacques Nadeau <ja...@apache.org> wrote:

> It isn't obvious because you shouldn't do it.  Please file a JIRA to add
> real support for this type of output.
>
> Your current function would leak large amounts of memory that would
> ultimately crash the node.
>
> Realistically, there are very few internal Drill APIs that you should
> access via a UDF (injectables, holders, ComplexWriter, FieldReader and
> helpers).  A post-1.0 goal was to provide a UDF interface JAR to ensure
> people don't accidentally reach into Drill's internals.  (A later
> possibility is bytecode weaving to completely protect against it).
>
> J
>
> On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > That was impressively non-obvious.
> >
> >
> >
> > On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:
> >
> > > I did get a new RepeatedBigIntHolder built and added a BigIntVector
> > > to it. I'll try it in the UDF tomorrow and see if there is a difference
> > > in the ways I found to get a BufferAllocator.
> > >
> > > .
> > > .
> > > .
> > > @Inject DrillBuf buffer;
> > > @Workspace RepeatedBigIntHolder yList;
> > > .
> > > .
> > > .
> > > @Override
> > > public void setup() {
> > > .
> > > .
> > > .
> > > //org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> > > org.apache.drill.exec.memory.BufferAllocator allocator =
> > >     new org.apache.drill.exec.memory.TopLevelAllocator();
> > > yList = new RepeatedBigIntHolder();
> > > yList.vector = new org.apache.drill.exec.vector.BigIntVector(
> > >     org.apache.drill.exec.record.MaterializedField.create(
> > >         new org.apache.drill.common.expression.SchemaPath("bigints",
> > >             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
> > >         org.apache.drill.common.types.Types.optional(
> > >             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> > >     allocator);
> > > .
> > > .
> > > .
> > > }
> > >
> > >
> > >
> > > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com> wrote:
> > >
> > > > I still have issues finding the correct way to create and use a
> > > > RepeatedHolder and Writers are a non starter for Workspace values. I can
> > > > make do with creating a concatenated string in a VarCharHolder for small
> > > > data sets to get past this in the short term and finish testing the output
> > > > values I expect but won't be able to do any scale till I figure out how to
> > > > make a repeated list.
> > > >
> > > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com> wrote:
> > > >
> > > >> Well... Converting from string to integers anyway... Too many 4th of July
> > > >> Hot Dogs. Going into nitrate overload. :)
> > > >>
> > > >> I am pulling an array of string values from json data. The string values
> > > >> are actually integers. I am converting to integers and summing each
> > > >> array entry to the final tally.
> > > >>
> > > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com> wrote:
> > > >>
> > > >>> Ted,
> > > >>>
> > > >>> Yes, I started out just getting a basic count to work. I am trying to
> > > >>> keep the workflow as close to a basic user as possible. As such, I am
> > > >>> building and using the MapR Apache Drill sandbox to test.
> > > >>>
> > > >>>
> > > >>>    1. Always look at the drillbit.log file to see if Drill had any
> > > >>>    issues loading your UDF. That was where I learned that all workspace
> > > >>>    values needed to be holders:
> > > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> > > >>>       function class
> > > >>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> > > >>>       field xList. Aggregate function 'MyLinearRegression1' workspace
> > > >>>       variable 'xList' is of type 'interface
> > > >>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> > > >>>       Please change it to Holder type.
> > > >>>    2. Error messages:
> > > >>>       - If you get an error in this format it means that Drill cannot
> > > >>>       find your function, so it probably didn't load it. Back to step 1:
> > > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> > > >>>          match found for function signature MyFunctionName(<ANY>)
> > > >>>       - If you get an error in this format it means that the function
> > > >>>       is there but Drill could not find a signature that matched the
> > > >>>       param types or param numbers you were passing it. The exact wording
> > > >>>       will change but "Missing function implementation" is the key phrase
> > > >>>       to look for:
> > > >>>          - Error: SYSTEM ERROR:
> > > >>>          org.apache.drill.exec.exception.SchemaChangeException: Failure
> > > >>>          while trying to materialize incoming schema.  Errors:
> > > >>>          - Error in expression at index -1.  Error: Missing function
> > > >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full expression:
> > > >>>          --UNKNOWN EXPRESSION--
> > > >>>    3. In your function definition for aggregate functions you need
> > > >>>    to set null processing to internal and your isRandom to false. Example
> > > >>>    below:
> > > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> > > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > >>>       isBinaryCommutative = false, costCategory =
> > > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > >>>
> > > >>> Below is an example from the Apache Drill tutorial data sets contained
> > > >>> in the MapR Apache Drill sandbox. I am pulling an array of string values
> > > >>> from json data. The string values are actually integers. I am converting to
> > > >>> string and summing each array entry to the final tally. This in no way
> > > >>> represents what this data was for but it did become a handy way for me to
> > > >>> peck out the "correct" way to build an aggregation UDF function:
> > > >>> @FunctionTemplate(name = "MyArraySum", scope =
> > > >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > > >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > > >>> isBinaryCommutative = false, costCategory =
> > > >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
> > > >>> public static class MyArraySum implements DrillAggFunc {
> > > >>>
> > > >>>   @Param RepeatedVarCharHolder listToSearch;
> > > >>>   @Workspace NullableBigIntHolder count;
> > > >>>   @Workspace NullableBigIntHolder sum;
> > > >>>   @Workspace NullableVarCharHolder vc;
> > > >>>   @Output BigIntHolder out;
> > > >>>
> > > >>>   @Override
> > > >>>   public void setup() {
> > > >>>     count.value = 0;
> > > >>>     sum.value = 0;
> > > >>>   }
> > > >>>
> > > >>>   @Override
> > > >>>   public void add() {
> > > >>>     // start/end delimit this row's entries in the underlying vector
> > > >>>     int c = listToSearch.end - listToSearch.start;
> > > >>>     int val = 0;
> > > >>>     try {
> > > >>>       for (int i = 0; i < c; i++) {
> > > >>>         // offset by start so rows after the first read their own entries
> > > >>>         listToSearch.vector.getAccessor().get(listToSearch.start + i, vc);
> > > >>>         String inputStr =
> > > >>>             org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
> > > >>>                 vc.start, vc.end, vc.buffer);
> > > >>>         val = Integer.parseInt(inputStr);
> > > >>>         sum.value = sum.value + val;
> > > >>>       }
> > > >>>     } catch (Exception e) {
> > > >>>       // a non-numeric entry ends the loop; the rest of this row is skipped
> > > >>>       val = 0;
> > > >>>     }
> > > >>>     count.value = count.value + 1;
> > > >>>   }
> > > >>>
> > > >>> Example select statement:
> > > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> > > >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
> > > >>>
> > > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > >>>
> > > >>>> Jim,
> > > >>>>
> > > >>>> I think that you may be having trouble with aggregators in general.
> > > >>>>
> > > >>>> Have you been able to build *any* aggregator of anything?  I haven't.
> > > >>>>
> > > >>>> When I try to build an aggregator of ints or doubles, I get a very
> > > >>>> persistent problem with Drill even seeing my aggregates:
> > > >>>>
> > > >>>> 0: jdbc:drill:zk=local> select sum_int(employee_id) from cp.`employee.json`;
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> > > >>>> found for function signature sum_int(<ANY>)
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException <init>
> > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> > > >>>> column 8 to line 1, column 27: No match found for function signature
> > > >>>> sum_int(<ANY>)
> > > >>>>
> > > >>>> Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
> > > >>>> found for function signature sum_int(<ANY>)
> > > >>>>
> > > >>>> [Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010] (state=,code=0)
> > > >>>>
> > > >>>> 0: jdbc:drill:zk=local> select sum_int(cast(employee_id as int)) from cp.`employee.json`;
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
> > > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> > > >>>> found for function signature sum_int(<NUMERIC>)
> > > >>>>
> > > >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException <init>
> > > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> > > >>>> column 8 to line 1, column 40: No match found for function signature
> > > >>>> sum_int(<NUMERIC>)
> > > >>>>
> > > >>>> Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
> > > >>>> found for function signature sum_int(<NUMERIC>)
> > > >>>>
> > > >>>> [Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010] (state=,code=0)
> > > >>>>
> > > >>>> 0: jdbc:drill:zk=local>
> > > >>>>
> > > >>>>
> > > >>>> It looks like there is some undocumented subtlety about how to register an
> > > >>>> aggregator.
> > > >>>>
> > > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com> wrote:
> > > >>>>
> > > >>>> > I'm working on the same thing. I want to aggregate a list of values. It has
> > > >>>> > been a search and guess game for the most part. I'm still stuck in the
> > > >>>> > process of getting the values all into a list. The writers look interesting
> > > >>>> > but for aggregation functions it looks like the input is the param and
> > > >>>> > output objects can't hold the aggregation steps. The Workspace is where
> > > >>>> > that happens. If I try and use a Writer in a workspace it won't load and
> > > >>>> > tells me to change it to Holders which was why I was using them to start
> > > >>>> > with. Maybe I'm missing the architecture of the agg function. It looked
> > > >>>> > like it was....
> > > >>>> >
> > > >>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
> > > >>>> > through @Workspace vars in add -> finalize @Output in output.
> > > >>>> >
> > > >>>> > So I'm back to trying to figure out how to create a RepeatedBigIntHolder or
> > > >>>> > a RepeatedVarCharHolder...
> > > >>>> >
> > > >>>> >
> > > >>>> >
> > > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > >>>> >
> > > >>>> > > I am working on trying to build any kind of list constructing aggregator
> > > >>>> > > and having absolute fits.
> > > >>>> > >
> > > >>>> > > To simplify life, I decided to just build a generic list builder that is a
> > > >>>> > > scalar function that returns a list containing its argument.  Thus zoop(3)
> > > >>>> > > => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) => [[1,2,3]].
> > > >>>> > >
> > > >>>> > > The ComplexWriter looks like the place to go. As usual, the complete lack
> > > >>>> > > of comments in most of Drill makes this very hard since I have to guess
> > > >>>> > > what works and what doesn't.
> > > >>>> > >
> > > >>>> > > In my code, I note that ComplexWriter has a nice rootAsList() method.  I
> > > >>>> > > used this in zip and it works nicely to construct lists for output.  I note
> > > >>>> > > that the resulting ListWriter has a method copyReader(FieldReader var1)
> > > >>>> > > which looks really good.
> > > >>>> > >
> > > >>>> > > Unfortunately, the only implementation of copyReader() is in
> > > >>>> > > AbstractFieldWriter and it looks like this:
> > > >>>> > >
> > > >>>> > > public void copyReader(FieldReader reader) {
> > > >>>> > >     this.fail("Copy FieldReader");
> > > >>>> > > }
> > > >>>> > >
> > > >>>> > > I would like to formally say at this point "WTF"?
> > > >>>> > >
> > > >>>> > > In digging in further, I see other methods that look handy like
> > > >>>> > >
> > > >>>> > > public void write(IntHolder holder) {
> > > >>>> > >     this.fail("Int");
> > > >>>> > > }
> > > >>>> > >
> > > >>>> > > And then in looking at implementations, it looks like there is a
> > > >>>> > > combinatorial explosion because every type seems to need a write method for
> > > >>>> > > every other type.
> > > >>>> > >
> > > >>>> > > What is the thought here?  How can I copy an arbitrary value into a list?
> > > >>>> > >
> > > >>>> > > My next thought was to build code that dispatches on type.  There is a
> > > >>>> > > method called getType() on the FieldReader.  Unfortunately, that drives
> > > >>>> > > into code generated by protoc and I see no way to dispatch on the type of
> > > >>>> > > an incoming value.
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > How is this supposed to work?
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <baid.mehant@gmail.com> wrote:
> > > >>>> > >
> > > >>>> > > > For a detailed example on using ComplexWriter interface you can take a look
> > > >>>> > > > at the Mappify
> > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
> > > >>>> > > > (kvgen) function. The function itself is very simple however it makes use
> > > >>>> > > > of the utility methods in MappifyUtility
> > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
> > > >>>> > > > and MapUtility
> > > >>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
> > > >>>> > > > which perform most of the work.
> > > >>>> > > >
> > > >>>> > > > Currently we don't have a generic infrastructure to handle errors coming
> > > >>>> > > > out of functions. However there is UserException, which when raised will
> > > >>>> > > > make sure that Drill does not gobble up the error message in that
> > > >>>> > > > exception. So you can probably throw a UserException with the failing input
> > > >>>> > > > in your function to make sure it propagates to the user.
> > > >>>> > > >
> > > >>>> > > > Thanks
> > > >>>> > > > Mehant
> > > >>>> > > >
> > > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <jacques@apache.org> wrote:
> > > >>>> > > >
> > > >>>> > > > > *Holders are for both input and output.  You can also use ComplexWriter for
> > > >>>> > > > > output and FieldReader for input if you want to write or read a complex
> > > >>>> > > > > value.
> > > >>>> > > > >
> > > >>>> > > > > I don't think we've provided a really clean way to construct a
> > > >>>> > > > > Repeated*Holder for output purposes.  You can probably do it by reaching
> > > >>>> > > > > into a bunch of internal interfaces in Drill.  However, I would recommend
> > > >>>> > > > > using the ComplexWriter output pattern for now.  This will be a little less
> > > >>>> > > > > efficient but substantially less brittle.  I suggest you open up a jira for
> > > >>>> > > > > using a Repeated*Holder as an output.
> > > >>>> > > > >
> > > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > >>>> > > > >
> > > >>>> > > > > > Holders are for input, I think.
> > > >>>> > > > > >
> > > >>>> > > > > > Try the different kinds of writers.
> > > >>>> > > > > >
> > > >>>> > > > > >
> > > >>>> > > > > >
> > > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jbates@maprtech.com> wrote:
> > > >>>> > > > > >
> > > >>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was working on a
> > > >>>> > > > > > > custom aggregator function using DrillAggFunc. In this I can do simple
> > > >>>> > > > > > > things but if I want to build a list of values and do something with it in
> > > >>>> > > > > > > the final output method I think I need to use RepeatedHolders in the
> > > >>>> > > > > > > @Workspace. To do that I need to create a new one in the setup method. I
> > > >>>> > > > > > > can't get one built. They all require a BufferAllocator to be passed in to
> > > >>>> > > > > > > build it. I have not found a way to get an allocator yet. Any suggestions?
> > > >>>> > > > > > >
> > > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > >>>> > > > > > >
> > > >>>> > > > > > > > If you look at the zip function in
> > > >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can have an
> > > >>>> > > > > > > > example of building a structure.
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > The basic idea is that your output is denoted as
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >         @Output
> > > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > The pattern for building a list of lists of integers is like this:
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >         writer.setValueCount(n);
> > > >>>> > > > > > > >         ...
> > > >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> > > >>>> > > > > > > >         outer.start(); // [ outer list
> > > >>>> > > > > > > >         ...
> > > >>>> > > > > > > >         // for each inner list
> > > >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> > > >>>> > > > > > > >             inner.start();
> > > >>>> > > > > > > >             // for each inner list element
> > > >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> > > >>>> > > > > > > >             }
> > > >>>> > > > > > > >             inner.end();   // ] inner list
> > > >>>> > > > > > > >         }
> > > >>>> > > > > > > >         outer.end(); // ] outer list
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <jbates@maprtech.com> wrote:
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been trying to document
> > > >>>> > > > > > > > > and understand each of the options available in a Drill UDF. Understanding
> > > >>>> > > > > > > > > the different FunctionScope's, the ones that are allowed, the ones that are
> > > >>>> > > > > > > > > not. The impact of different cost categories. The different steps needed
> > > >>>> > > > > > > > > to understand handling any of the supported data types and structures in
> > > >>>> > > > > > > > > drill.
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > > Here are a few of my current road blocks. Any pointers would be greatly
> > > >>>> > > > > > > > > appreciated.
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > >    1. I've been trying to understand how to correctly use RepeatedHolders
> > > >>>> > > > > > > > >    of whatever type. For this discussion let's start with a
> > > >>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the best way to create a new
> > > >>>> > > > > > > > >    one. I have not figured out where in the existing drill code someone does
> > > >>>> > > > > > > > >    this. If I use a RepeatedBigIntHolder as a Workspace object it is null to
> > > >>>> > > > > > > > >    start with. I created a new one in the startup section of the udf but the
> > > >>>> > > > > > > > >    vector was null. I can find no reference in creating a new BigIntVector.
> > > >>>> > > > > > > > >    There is a way to create a BigIntVector and I did find an example of
> > > >>>> > > > > > > > >    creating a new VarCharVector but I can't do that using the drill jar files
> > > >>>> > > > > > > > >    from 1.0. The org.apache.drill.common.types.TypeProtos and
> > > >>>> > > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType classes do not
> > > >>>> > > > > > > > >    appear to be accessible from the drill jar files.
> > > >>>> > > > > > > > >    2. What is the best way to close out a UDF in the event it generates an
> > > >>>> > > > > > > > >    exception? Are there specific steps one should follow to make a clean exit
> > > >>>> > > > > > > > >    in a catch block that are beneficial to Drill?
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > >
> > > >>>> > > > > > >
> > > >>>> > > > > >
> > > >>>> > > > >
> > > >>>> > > >
> > > >>>> > >
> > > >>>> >
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > >
> >
>

Re: Some questions on UDFs

Posted by Jacques Nadeau <ja...@apache.org>.
It isn't obvious because you shouldn't do it.  Please file a JIRA to add
real support for this type of output.

Your current function would leak large amounts of memory that would
ultimately crash the node.

Realistically, there are very few internal Drill APIs that you should
access via a UDF (injectables, holders, ComplexWriter, FieldReader and
helpers).  A post-1.0 goal was to provide a UDF interface JAR to ensure
people don't accidentally reach into Drill's internals.  (A later
possibility is bytecode weaving to completely protect against it).
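
As a rough illustration of staying within plain scalar holders, here is a
sketch that is not taken from this thread and uses no Drill classes at all.
It assumes the list was only being collected to feed a least-squares fit
(the MyLinearRegression1 name in the quoted log hints at that, but that is a
guess). Under that assumption the aggregate can be restated over five running
sums, so no vector or BufferAllocator is needed in the workspace. DoubleBox,
Fit and RunningRegressionSketch are hypothetical names; DoubleBox merely
stands in for a scalar holder such as Float8Holder, and the four methods
mirror the setup/add/output/reset steps of an aggregate function.

```java
public class RunningRegressionSketch {

    /** Hypothetical stand-in for a scalar holder: one mutable value. */
    static final class DoubleBox {
        double value;
    }

    /** Plays the role of the @Output fields. */
    static final class Fit {
        final double slope;
        final double intercept;

        Fit(double slope, double intercept) {
            this.slope = slope;
            this.intercept = intercept;
        }
    }

    // Fixed-size state, analogous to @Workspace holders: no lists, no vectors.
    private final DoubleBox n = new DoubleBox();
    private final DoubleBox sumX = new DoubleBox();
    private final DoubleBox sumY = new DoubleBox();
    private final DoubleBox sumXY = new DoubleBox();
    private final DoubleBox sumXX = new DoubleBox();

    /** Called once before the first row of a group. */
    void setup() {
        reset();
    }

    /** Called once per input row; only updates the running sums. */
    void add(double x, double y) {
        n.value += 1;
        sumX.value += x;
        sumY.value += y;
        sumXY.value += x * y;
        sumXX.value += x * x;
    }

    /** Called at the end of a group; derives slope and intercept from the sums. */
    Fit output() {
        double denom = n.value * sumXX.value - sumX.value * sumX.value;
        if (denom == 0.0) {
            // Fewer than two distinct x values: the line is not determined.
            return new Fit(Double.NaN, Double.NaN);
        }
        double slope = (n.value * sumXY.value - sumX.value * sumY.value) / denom;
        double intercept = (sumY.value - slope * sumX.value) / n.value;
        return new Fit(slope, intercept);
    }

    /** Clears the state so the same instance can serve the next group. */
    void reset() {
        n.value = 0;
        sumX.value = 0;
        sumY.value = 0;
        sumXY.value = 0;
        sumXX.value = 0;
    }

    /** Drives the four steps over an array of {x, y} pairs. */
    static Fit fit(double[][] points) {
        RunningRegressionSketch agg = new RunningRegressionSketch();
        agg.setup();
        for (double[] p : points) {
            agg.add(p[0], p[1]);
        }
        Fit result = agg.output();
        agg.reset();
        return result;
    }

    public static void main(String[] args) {
        // Three points on the line y = 2x + 1.
        Fit f = fit(new double[][] {{1, 3}, {2, 5}, {3, 7}});
        System.out.println("slope=" + f.slope + " intercept=" + f.intercept);
    }
}
```

This only helps when the final computation can be expressed over running
totals. An aggregate that truly needs every input value at output time is
the case the JIRA would have to cover.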

J

On Sun, Jul 5, 2015 at 11:36 AM, Ted Dunning <te...@gmail.com> wrote:

> That was impressively non-obvious.
>
>
>
> On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:
>
> > I did get a new RepeatedBigIntHolder built and added a BigIntVector
> > to it. I'll try it in the UDF tomorrow and see if there is a difference
> > in the ways I found to get a BufferAllocator.
> >
> > .
> > .
> > .
> > @Inject DrillBuf buffer;
> > @Workspace RepeatedBigIntHolder yList;
> > .
> > .
> > .
> > @Override
> > public void setup() {
> > .
> > .
> > .
> > //org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
> > org.apache.drill.exec.memory.BufferAllocator allocator =
> >     new org.apache.drill.exec.memory.TopLevelAllocator();
> > yList = new RepeatedBigIntHolder();
> > yList.vector = new org.apache.drill.exec.vector.BigIntVector(
> >     org.apache.drill.exec.record.MaterializedField.create(
> >         new org.apache.drill.common.expression.SchemaPath("bigints",
> >             org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
> >         org.apache.drill.common.types.Types.optional(
> >             org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> >     allocator);
> > .
> > .
> > .
> > }
> >
> >
> >
> > On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com> wrote:
> >
> > > I still have issues finding the correct way to create and use a
> > > RepeatedHolder and Writers are a non starter for Workspace values. I can
> > > make do with creating a concatenated string in a VarCharHolder for small
> > > data sets to get past this in the short term and finish testing the output
> > > values I expect but won't be able to do any scale till I figure out how to
> > > make a repeated list.
> > >
> > > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com> wrote:
> > >
> > >> Well... Converting from string to integers anyway... Too many 4th of July
> > >> Hot Dogs. Going into nitrate overload. :)
> > >>
> > >> I am pulling an array of string values from json data. The string values
> > >> are actually integers. I am converting to integers and summing each
> > >> array entry to the final tally.
> > >>
> > >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com> wrote:
> > >>
> > >>> Ted,
> > >>>
> > >>> Yes, I started out just getting a basic count to work. I am trying to
> > >>> keep the workflow as close to a basic user as possible. As such, I am
> > >>> building and using the MapR Apache Drill sandbox to test.
> > >>>
> > >>>
> > >>>    1. Always look at the drillbit.log file to see if drill had any
> > >>>    issues loading your UDF. That was where I learned that all
> > workspace values
> > >>>    needed to be holders
> > >>>       -
> > >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> > >>>       function class
> > >>>
> >  com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
> field
> > >>>       xList. Aggregate function 'MyLinearRegression1' workspace
> > variable 'xList'
> > >>>       is of type 'interface
> > >>>
> >  org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> > >>>       Please change it to Holder type.
> > >>>    2. Error messages:
> > >>>       - If you get an error in this format it means that Drill can
> not
> > >>>       find your function so it probably didn't load it. back to step
> 1:
> > >>>          -
> > >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44:
> No
> > >>>          match found for function signature MyFunctionName(<ANY>)
> > >>>       - If you get an error in this format it means that the function
> > >>>       is there but Drill could not find a signature that matched the
> > param types
> > >>>       or param numbers you were passing it. The exact wording will
> > change but
> > >>>       the Missing function implementation is the key phrase to look
> > for:
> > >>>          -
> > >>>          - Error: SYSTEM ERROR:
> > >>>          org.apache.drill.exec.exception.SchemaChangeException:
> > Failure while trying
> > >>>          to materialize incoming schema.  Errors:
> > >>>          - Error in expression at index -1.  Error: Missing function
> > >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> > expression: --UNKNOWN
> > >>>          EXPRESSION--
> > >>>       3. In your function definition for aggregate functions you need
> > >>>    to set null processing to internal and your isRandom to false.
> > Example
> > >>>    below:
> > >>>       -
> > >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> > >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > >>>       isBinaryCommutative = false, costCategory =
> > >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> > >>>
> > >>> Below is an example from the Apache Drill tutorial data sets
> contained
> > >>> in the MapR Apache Drill sandbox. I am pulling an array of string
> > values
> > >>> from json data. The string values are actually integers. I am
> > converting to
> > >>> string and summing each array entry to the final tally. This in no
> way
> > >>> represents what this data was for but it did become a handy way for
> me
> > to
> > >>> peck out the "correct" way to build an aggregation UDF function
> > >>>
> > >>> @FunctionTemplate(name = "MyArraySum", scope =
> > >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> > >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> > >>> isBinaryCommutative = false, costCategory =
> > >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
> > >>> public static class MyArraySum implements DrillAggFunc {
> > >>>
> > >>> @Param RepeatedVarCharHolder listToSearch;
> > >>> @Workspace NullableBigIntHolder count;
> > >>> @Workspace NullableBigIntHolder sum;
> > >>> @Workspace NullableVarCharHolder vc;
> > >>> @Output BigIntHolder out;
> > >>>
> > >>> @Override
> > >>> public void setup() {
> > >>> count.value=0;
> > >>> sum.value = 0;
> > >>> }
> > >>>
> > >>> @Override
> > >>> public void add() {
> > >>> int c = listToSearch.end - listToSearch.start;
> > >>> int val = 0;
> > >>> try {
> > >>> for(int i=0; i<c; i++){
> > >>> listToSearch.vector.getAccessor().get(listToSearch.start + i, vc);
> > >>> String inputStr =
> > >>>
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(vc.start,
> > >>> vc.end, vc.buffer);
> > >>> val = Integer.parseInt(inputStr);
> > >>> sum.value = sum.value + val;
> > >>> }
> > >>> } catch (Exception e) {
> > >>> val = 0;
> > >>> }
> > >>> count.value = count.value + 1;
> > >>> }
> > >>>
> > >>> Example select statement:
> > >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> > >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit
> 5);
> > >>>
> > >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <te...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Jim,
> > >>>>
> > >>>> I think that you may be having trouble with aggregators in general.
> > >>>>
> > >>>> Have you been able to build *any* aggregator of anything?  I
> haven't.
> > >>>>
> > >>>> When I try to build an aggregator of int's or doubles, I get a very
> > >>>> persistent problem with Drill even seeing my aggregates:
> > >>>>
> > >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> > >>>> cp.`employee.json`;*
> > >>>>
> > >>>> Jul 04, 2015 4:19:35 PM
> > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > >>>>
> > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
> > match
> > >>>> found for function signature sum_int(<ANY>)
> > >>>>
> > >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
> > >>>> <init>
> > >>>>
> > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
> line
> > 1,
> > >>>> column 8 to line 1, column 27: No match found for function signature
> > >>>> sum_int(<ANY>)
> > >>>>
> > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No
> > >>>> match
> > >>>> found for function signature sum_int(<ANY>)*
> > >>>>
> > >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010
> > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > >>>>
> > >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int))
> from
> > >>>> cp.`employee.json`*;
> > >>>>
> > >>>> Jul 04, 2015 4:19:45 PM
> > >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> > >>>>
> > >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
> > match
> > >>>> found for function signature sum_int(<NUMERIC>)
> > >>>>
> > >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
> > >>>> <init>
> > >>>>
> > >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From
> line
> > 1,
> > >>>> column 8 to line 1, column 40: No match found for function signature
> > >>>> sum_int(<NUMERIC>)
> > >>>>
> > >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No
> > >>>> match
> > >>>> found for function signature sum_int(<NUMERIC>)*
> > >>>>
> > >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010
> > >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> > >>>>
> > >>>> 0: jdbc:drill:zk=local>
> > >>>>
> > >>>>
> > >>>> It looks like there is some undocumented subtlety about how to
> > register
> > >>>> an
> > >>>> aggregator.
> > >>>>
> > >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com>
> > wrote:
> > >>>>
> > >>>> > I'm working on the same thing. I want to aggregate a list of
> values.
> > >>>> It has
> > >>>> > been a search and guess game for the most part. I'm still stuck in
> > the
> > >>>> > process of getting the values all into a list. The writers look
> > >>>> interesting
> > >>>> > but for aggregation functions  it looks like the input is the
> param
> > >>>> and
> > >>>> > output objects can't hold the aggregations steps. The Workspace is
> > >>>> where
> > >>>> > that happens. If I try and use a Writer in a workspace it won't
> load
> > >>>> and
> > >>>> > tells me to change it to Holders which was why I was using them to
> > >>>> start
> > >>>> > with. Maybe I'm missing the architecture of the agg function. It
> > >>>> looked
> > >>>> > like it was....
> > >>>> >
> > >>>> > @Param comes in -> initialize @Workspace vars in setup -> process
> > data
> > >>>> > through @Workspace vars in add -> finalize @Output in output.
> > >>>> >
> > >>>> > So I'm back to trying to figure out how to create a
> > >>>> RepeatedBigIntHolder or
> > >>>> > a RepeatedVarCharHolder...
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <
> ted.dunning@gmail.com>
> > >>>> wrote:
> > >>>> >
> > >>>> > > I am working on trying to build any kind of list constructing
> > >>>> aggregator
> > >>>> > > and having absolute fits.
> > >>>> > >
> > >>>> > > To simplify life, I decided to just build a generic list builder
> > >>>> that is
> > >>>> > a
> > >>>> > > scalar function that returns a list containing its argument.
> Thus
> > >>>> > zoop(3)
> > >>>> > > => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) => [[1,2,3]].
> > >>>> > >
> > >>>> > > The ComplexWriter looks like the place to go. As usual, the
> > >>>> complete lack
> > >>>> > > of comments in most of Drill makes this very hard since I have
> to
> > >>>> guess
> > >>>> > > what works and what doesn't.
> > >>>> > >
> > >>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
> > >>>> method.  I
> > >>>> > > used this in zip and it works nicely to construct lists for
> > >>>> output.  I
> > >>>> > note
> > >>>> > > that the resulting ListWriter has a method
> copyReader(FieldReader
> > >>>> var1)
> > >>>> > > which looks really good.
> > >>>> > >
> > >>>> > > Unfortunately, the only implementation of copyReader() is in
> > >>>> > > AbstractFieldWriter and it looks like this:
> > >>>> > >
> > >>>> > > public void copyReader(FieldReader reader) {
> > >>>> > >     this.fail("Copy FieldReader");
> > >>>> > > }
> > >>>> > >
> > >>>> > > I would like to formally say at this point "WTF"?
> > >>>> > >
> > >>>> > > In digging in further, I see other methods that look handy like
> > >>>> > >
> > >>>> > > public void write(IntHolder holder) {
> > >>>> > >     this.fail("Int");
> > >>>> > > }
> > >>>> > >
> > >>>> > > And then in looking at implementations, it looks like there is a
> > >>>> > > combinatorial explosion because every type seems to need a write
> > >>>> method
> > >>>> > for
> > >>>> > > every other type.
> > >>>> > >
> > >>>> > > What is the thought here?  How can I copy an arbitrary value
> into
> > a
> > >>>> list?
> > >>>> > >
> > >>>> > > My next thought was to build code that dispatches on type.
> There
> > >>>> is a
> > >>>> > > method called getType() on the FieldReader.  Unfortunately, that
> > >>>> drives
> > >>>> > > into code generated by protoc and I see no way to dispatch on
> the
> > >>>> type of
> > >>>> > > an incoming value.
> > >>>> > >
> > >>>> > >
> > >>>> > > How is this supposed to work?
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
> > baid.mehant@gmail.com>
> > >>>> > wrote:
> > >>>> > >
> > >>>> > > > For a detailed example on using ComplexWriter interface you
> can
> > >>>> take a
> > >>>> > > look
> > >>>> > > > at the Mappify
> > >>>> > > > <
> > >>>> > > >
> > >>>> > >
> > >>>> >
> > >>>>
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
> > >>>> > > > >
> > >>>> > > > (kvgen) function. The function itself is very simple however
> it
> > >>>> makes
> > >>>> > use
> > >>>> > > > of the utility methods in MappifyUtility
> > >>>> > > > <
> > >>>> > > >
> > >>>> > >
> > >>>> >
> > >>>>
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
> > >>>> > > > >
> > >>>> > > > and MapUtility
> > >>>> > > > <
> > >>>> > > >
> > >>>> > >
> > >>>> >
> > >>>>
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
> > >>>> > > > >
> > >>>> > > > which perform most of the work.
> > >>>> > > >
> > >>>> > > > Currently we don't have a generic infrastructure to handle
> > errors
> > >>>> > coming
> > >>>> > > > out of functions. However there is UserException, which when
> > >>>> raised
> > >>>> > will
> > >>>> > > > make sure that Drill does not gobble up the error message in
> > that
> > >>>> > > > exception. So you can probably throw a UserException with the
> > >>>> failing
> > >>>> > > input
> > >>>> > > > in your function to make sure it propagates to the user.
> > >>>> > > >
> > >>>> > > > Thanks
> > >>>> > > > Mehant
> > >>>> > > >
> > >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
> > >>>> jacques@apache.org>
> > >>>> > > wrote:
> > >>>> > > >
> > >>>> > > > > *Holders are for both input and output.  You can also use
> > >>>> > ComplexWriter
> > >>>> > > > for
> > >>>> > > > > output and FieldReader for input if you want to write or
> read
> > a
> > >>>> > complex
> > >>>> > > > > value.
> > >>>> > > > >
> > >>>> > > > > I don't think we've provided a really clean way to
> construct a
> > >>>> > > > > Repeated*Holder for output purposes.  You can probably do it
> > by
> > >>>> > > reaching
> > >>>> > > > > into a bunch of internal interfaces in Drill.  However, I
> > would
> > >>>> > > recommend
> > >>>> > > > > using the ComplexWriter output pattern for now.  This will
> be
> > a
> > >>>> > little
> > >>>> > > > less
> > >>>> > > > > efficient but substantially less brittle.  I suggest you
> open
> > >>>> up a
> > >>>> > jira
> > >>>> > > > for
> > >>>> > > > > using a Repeated*Holder as an output.
> > >>>> > > > >
> > >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
> > >>>> ted.dunning@gmail.com>
> > >>>> > > > wrote:
> > >>>> > > > >
> > >>>> > > > > > Holders are for input, I think.
> > >>>> > > > > >
> > >>>> > > > > > Try the different kinds of writers.
> > >>>> > > > > >
> > >>>> > > > > >
> > >>>> > > > > >
> > >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
> > >>>> jbates@maprtech.com>
> > >>>> > > > wrote:
> > >>>> > > > > >
> > >>>> > > > > > > Using a repeatedholder as a @param I've got working. I
> was
> > >>>> > working
> > >>>> > > > on a
> > >>>> > > > > > > custom aggregator function using DrillAggFunc. In this I
> > >>>> can do
> > >>>> > > > simple
> > >>>> > > > > > > things but if I want to build a list of values and do
> > >>>> something with
> > >>>> > > it
> > >>>> > > > in
> > >>>> > > > > > the
> > >>>> > > > > > > final output method I think I need to use
> RepeatedHolders
> > >>>> in the
> > >>>> > > > > > > @Workspace. To do that I need to create a new one in the
> > >>>> setup
> > >>>> > > > method.
> > >>>> > > > > I
> > >>>> > > > > > > can't get one built. They all require a BufferAllocator
> to
> > >>>> be
> > >>>> > > passed
> > >>>> > > > in
> > >>>> > > > > > to
> > >>>> > > > > > > build it. I have not found a way to get an allocator
> yet.
> > >>>> Any
> > >>>> > > > > > suggestions?
> > >>>> > > > > > >
> > >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> > >>>> > ted.dunning@gmail.com
> > >>>> > > >
> > >>>> > > > > > wrote:
> > >>>> > > > > > >
> > >>>> > > > > > > > If you look at the zip function in
> > >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions
> > you
> > >>>> can
> > >>>> > > have
> > >>>> > > > an
> > >>>> > > > > > > > example of building a structure.
> > >>>> > > > > > > >
> > >>>> > > > > > > > The basic idea is that your output is denoted as
> > >>>> > > > > > > >
> > >>>> > > > > > > >         @Output
> > >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> > >>>> > > > > > > >
> > >>>> > > > > > > > The pattern for building a list of lists of integers
> is
> > >>>> like
> > >>>> > > this:
> > >>>> > > > > > > >
> > >>>> > > > > > > >         writer.setValueCount(n);
> > >>>> > > > > > > >         ...
> > >>>> > > > > > > >         BaseWriter.ListWriter outer =
> > writer.rootAsList();
> > >>>> > > > > > > >         outer.start(); // [ outer list
> > >>>> > > > > > > >         ...
> > >>>> > > > > > > >         // for each inner list
> > >>>> > > > > > > >             BaseWriter.ListWriter inner =
> outer.list();
> > >>>> > > > > > > >             inner.start();
> > >>>> > > > > > > >             // for each inner list element
> > >>>> > > > > > > >
> >  inner.integer().writeInt(accessor.get(i));
> > >>>> > > > > > > >             }
> > >>>> > > > > > > >             inner.end();   // ] inner list
> > >>>> > > > > > > >         }
> > >>>> > > > > > > >         outer.end(); // ] outer list
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> > >>>> > jbates@maprtech.com>
> > >>>> > > > > > wrote:
> > >>>> > > > > > > >
> > >>>> > > > > > > > > I have working aggregation and simple UDFs. I've
> been
> > >>>> trying
> > >>>> > to
> > >>>> > > > > > > document
> > >>>> > > > > > > > > and understand each of the options available in a
> > Drill
> > >>>> UDF.
> > >>>> > > > > > > > Understanding
> > >>>> > > > > > > > > the different FunctionScope's, the ones that are
> > >>>> allowed, the
> > >>>> > > > ones
> > >>>> > > > > > that
> > >>>> > > > > > > > are
> > >>>> > > > > > > > > not. The impact of different cost categories. The
> > >>>> different
> > >>>> > > > steps
> > >>>> > > > > > > needed
> > >>>> > > > > > > > > to understand handling any of the supported data
> types
> > >>>> and
> > >>>> > > > > > structures
> > >>>> > > > > > > in
> > >>>> > > > > > > > > drill.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Here are a few of my current road blocks. Any
> pointers
> > >>>> would
> > >>>> > be
> > >>>> > > > > > greatly
> > >>>> > > > > > > > > appreciated.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >    1. I've been trying to understand how to
> correctly
> > >>>> use
> > >>>> > > > > > > RepeatedHolders
> > >>>> > > > > > > > >    of whatever type. For this discussion lets start
> > >>>> with a
> > >>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out
> the
> > >>>> best
> > >>>> > way
> > >>>> > > to
> > >>>> > > > > > > create
> > >>>> > > > > > > > a
> > >>>> > > > > > > > > new
> > >>>> > > > > > > > >    one. I have not figured out where in the existing
> > >>>> drill
> > >>>> > code
> > >>>> > > > > > someone
> > >>>> > > > > > > > > does
> > >>>> > > > > > > > >    this. If I use a  RepeatedBigIntHolder as a
> > Workspace
> > >>>> > object
> > >>>> > > > it
> > >>>> > > > > is
> > >>>> > > > > > > > null
> > >>>> > > > > > > > > to
> > >>>> > > > > > > > >    start with. I created a new one in the startup
> > >>>> section of
> > >>>> > > the
> > >>>> > > > > udf
> > >>>> > > > > > > but
> > >>>> > > > > > > > > the
> > >>>> > > > > > > > >    vector was null. I can find no reference in
> > creating
> > >>>> a new
> > >>>> > > > > > > > BigIntVector.
> > >>>> > > > > > > > >    There is a way to create a BigIntVector and I did
> > >>>> find an
> > >>>> > > > > example
> > >>>> > > > > > of
> > >>>> > > > > > > > >    creating a new VarCharVector but I can't do that
> > >>>> using the
> > >>>> > > > drill
> > >>>> > > > > > jar
> > >>>> > > > > > > > > files
> > >>>> > > > > > > > >    from 1.0. The
> > >>>> org.apache.drill.common.types.TypeProtos and
> > >>>> > > > > > > > >    the
> > >>>> org.apache.drill.common.types.TypeProtos.MinorType
> > >>>> > > classes
> > >>>> > > > > do
> > >>>> > > > > > > not
> > >>>> > > > > > > > >    appear to be accessible from the drill jar files.
> > >>>> > > > > > > > >    2. What is the best way to close out a UDF in the
> > >>>> event it
> > >>>> > > > > > generates
> > >>>> > > > > > > > an
> > >>>> > > > > > > > >    exception? Are there specific steps one should
> > >>>> follow to
> > >>>> > > make
> > >>>> > > > a
> > >>>> > > > > > > clean
> > >>>> > > > > > > > > exit
> > >>>> > > > > > > > >    in a catch block that are beneficial to Drill?
> > >>>> > > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > >
> > >>>> > > > >
> > >>>> > > >
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
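
The aggregate lifecycle described in the quoted thread (@Param comes in,
@Workspace is initialized in setup, add runs once per row, output publishes
the result) can be sketched in plain Java without any Drill classes. This is
only an illustration under stated assumptions: ArraySumSketch and its methods
are hypothetical names, a List<String> stands in for a RepeatedVarCharHolder,
and nothing here is Drill's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an aggregate UDF; not a Drill class.
public class ArraySumSketch {
    // "Workspace" state: survives across add() calls for one group.
    private long sum;
    private long count;

    // setup(): initialize the workspace before any row is seen.
    public void setup() {
        sum = 0;
        count = 0;
    }

    // add(): called once per input row; each row is a list of
    // string-encoded integers.
    public void add(List<String> row) {
        for (String s : row) {
            try {
                sum += Long.parseLong(s.trim());
            } catch (NumberFormatException e) {
                // Skip only the element that failed to parse.
            }
        }
        count++;
    }

    // output(): publish the aggregated value.
    public long output() {
        return sum;
    }

    // Number of rows passed to add() since the last setup()/reset().
    public long rowsSeen() {
        return count;
    }

    // reset(): clear the workspace so the instance can serve the next group.
    public void reset() {
        setup();
    }

    // Drives the lifecycle the way an engine might: setup, add per row, output.
    public static long sumAll(List<List<String>> rows) {
        ArraySumSketch agg = new ArraySumSketch();
        agg.setup();
        for (List<String> row : rows) {
            agg.add(row);
        }
        return agg.output();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("1", "2"),
            Arrays.asList("3", "x"));
        System.out.println(sumAll(rows));
    }
}
```

One design choice differs from the MyArraySum code quoted above: the catch
sits inside the loop, so a single unparsable element is skipped rather than
ending the processing of the rest of that row.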

Re: Some questions on UDFs

Posted by Ted Dunning <te...@gmail.com>.
That was impressively non-obvious.



On Sat, Jul 4, 2015 at 6:40 PM, Jim Bates <jb...@maprtech.com> wrote:

> I did get a new RepeatedBigIntHolder built and added a BigIntVector added
> to it. I'll try it in the UDF tomorrow and see if there is a difference in
> the ways I found to get a BufferAllocator.
>
> .
> .
> .
> @Inject DrillBuf buffer;
> @Workspace RepeatedBigIntHolder yList;
> .
> .
> .
> @Override
> public void setup() {
> .
> .
> .
> //org.apache.drill.exec.memory.BufferAllocator allocator =
> buffer.getAllocator();
> org.apache.drill.exec.memory.BufferAllocator allocator =  new
> org.apache.drill.exec.memory.TopLevelAllocator();
> yList = new RepeatedBigIntHolder();
> yList.vector = new
>
> org.apache.drill.exec.vector.BigIntVector(org.apache.drill.exec.record.MaterializedField.create(new
>
> org.apache.drill.common.expression.SchemaPath("bigints",org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
>
> org.apache.drill.common.types.Types.optional(org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
> allocator);
> .
> .
> .
> }
>
>
>
> On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com> wrote:
>
> > I still have issues finding the correct way to create and use a
> > RepeatedHolder and Writers are a non starter for Workspace values. I can
> > make do with creating a concatenated string in a VarCharHolder for small
> > data sets to get past this in the short term and finish testing the
> output
> > values I expect but won't be able to do any scale till I figure out how
> to
> > make a repeated list.
> >
> > On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com> wrote:
> >
> >> Well... Converting from string to integers anyway... Too many 4th of July
> >> Hot Dogs. going into nitrate overload. :)
> >>
> >> I am pulling an array of string values from json data. The string values
> >> are actually integers. I am converting to integers and summing each
> >> array entry to the final tally.
> >>
> >> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com> wrote:
> >>
> >>> Ted,
> >>>
> >>> Yes, I started out just getting a basic count to work. I am trying to
> >>> keep the workflow as close to a basic user as possible. As such, I am
> >>> building and using the MapR Apache Drill sandbox to test.
> >>>
> >>>
> >>>    1. Always look at the drillbit.log file to see if drill had any
> >>>    issues loading your UDF. That was where I learned that all
> workspace values
> >>>    needed to be holders
> >>>       -
> >>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
> >>>       function class
> >>>
>  com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1, field
> >>>       xList. Aggregate function 'MyLinearRegression1' workspace
> variable 'xList'
> >>>       is of type 'interface
> >>>
>  org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
> >>>       Please change it to Holder type.
> >>>    2. Error messages:
> >>>       - If you get an error in this format it means that Drill can not
> >>>       find your function so it probably didn't load it. back to step 1:
> >>>          -
> >>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
> >>>          match found for function signature MyFunctionName(<ANY>)
> >>>       - If you get an error in this format it means that the function
> >>>       is there but Drill could not find a signature that matched the
> param types
> >>>       or param numbers you were passing it. The exact wording will
> change but
> >>>       the Missing function implementation is the key phrase to look
> for:
> >>>          -
> >>>          - Error: SYSTEM ERROR:
> >>>          org.apache.drill.exec.exception.SchemaChangeException:
> Failure while trying
> >>>          to materialize incoming schema.  Errors:
> >>>          - Error in expression at index -1.  Error: Missing function
> >>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
> expression: --UNKNOWN
> >>>          EXPRESSION--
> >>>       3. In your function definition for aggregate functions you need
> >>>    to set null processing to internal and your isRandom to false.
> Example
> >>>    below:
> >>>       -
> >>>       - @FunctionTemplate(name = "MyFunctionName", scope =
> >>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> >>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> >>>       isBinaryCommutative = false, costCategory =
> >>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
> >>>
> >>> Below is an example from the Apache Drill tutorial data sets contained
> >>> in the MapR Apache Drill sandbox. I am pulling an array of string
> values
> >>> from json data. The string values are actually integers. I am
> converting to
> >>> string and summing each array entry to the final tally. This in no way
> >>> represents what this data was for but it did become a handy way for me
> to
> >>> peck out the "correct" way to build an aggregation UDF function
> >>>
> >>> @FunctionTemplate(name = "MyArraySum", scope =
> >>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
> >>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
> >>> isBinaryCommutative = false, costCategory =
> >>> FunctionTemplate.FunctionCostCategory.COMPLEX)
> >>> public static class MyArraySum implements DrillAggFunc {
> >>>
> >>> @Param RepeatedVarCharHolder listToSearch;
> >>> @Workspace NullableBigIntHolder count;
> >>> @Workspace NullableBigIntHolder sum;
> >>> @Workspace NullableVarCharHolder vc;
> >>> @Output BigIntHolder out;
> >>>
> >>> @Override
> >>> public void setup() {
> >>> count.value=0;
> >>> sum.value = 0;
> >>> }
> >>>
> >>> @Override
> >>> public void add() {
> >>> int c = listToSearch.end - listToSearch.start;
> >>> int val = 0;
> >>> try {
> >>> for(int i=0; i<c; i++){
> >>> listToSearch.vector.getAccessor().get(listToSearch.start + i, vc);
> >>> String inputStr =
> >>>
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(vc.start,
> >>> vc.end, vc.buffer);
> >>> val = Integer.parseInt(inputStr);
> >>> sum.value = sum.value + val;
> >>> }
> >>> } catch (Exception e) {
> >>> val = 0;
> >>> }
> >>> count.value = count.value + 1;
> >>> }
> >>>
> >>> Example select statement:
> >>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
> >>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
> >>>
> >>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>
> >>>> Jim,
> >>>>
> >>>> I think that you may be having trouble with aggregators in general.
> >>>>
> >>>> Have you been able to build *any* aggregator of anything?  I haven't.
> >>>>
> >>>> When I try to build an aggregator of int's or doubles, I get a very
> >>>> persistent problem with Drill even seeing my aggregates:
> >>>>
> >>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> >>>> cp.`employee.json`;*
> >>>>
> >>>> Jul 04, 2015 4:19:35 PM
> >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
> match
> >>>> found for function signature sum_int(<ANY>)
> >>>>
> >>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
> >>>> <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line
> 1,
> >>>> column 8 to line 1, column 27: No match found for function signature
> >>>> sum_int(<ANY>)
> >>>>
> >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No
> >>>> match
> >>>> found for function signature sum_int(<ANY>)*
> >>>>
> >>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010
> >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> >>>>
> >>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
> >>>> cp.`employee.json`*;
> >>>>
> >>>> Jul 04, 2015 4:19:45 PM
> >>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No
> match
> >>>> found for function signature sum_int(<NUMERIC>)
> >>>>
> >>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
> >>>> <init>
> >>>>
> >>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line
> 1,
> >>>> column 8 to line 1, column 40: No match found for function signature
> >>>> sum_int(<NUMERIC>)
> >>>>
> >>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No
> >>>> match
> >>>> found for function signature sum_int(<NUMERIC>)*
> >>>>
> >>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010
> >>>> <http://10.0.1.2:31010>] (state=,code=0)*
> >>>>
> >>>> 0: jdbc:drill:zk=local>
> >>>>
> >>>>
> >>>> It looks like there is some undocumented subtlety about how to
> register
> >>>> an
> >>>> aggregator.
> >>>>
> >>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com>
> wrote:
> >>>>
> >>>> > I'm working on the same thing. I want to aggregate a list of values.
> >>>> It has
> >>>> > been a search and guess game for the most part. I'm still stuck in
> the
> >>>> > process of getting the values all into a list. The writers look
> >>>> interesting
> >>>> > but for aggregation functions  it looks like the input is the param
> >>>> and
> >>>> > output objects can't hold the aggregations steps. The Workspace is
> >>>> where
> >>>> > that happens. If I try and use a Writer in a workspace it won't load
> >>>> and
> >>>> > tells me to change it to Holders which was why I was using them to
> >>>> start
> >>>> > with. Maybe I'm missing the architecture of the agg function. It
> >>>> looked
> >>>> > like it was....
> >>>> >
> >>>> > @Param comes in -> initialize @Workspace vars in setup -> process
> data
> >>>> > through @Workspace vars in add -> finalize @Output in output.
> >>>> >
> >>>> > So I'm back to trying to figure out how to create a
> >>>> RepeatedBigIntHolder or
> >>>> > a RepeatedVarCharHolder...
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <te...@gmail.com>
> >>>> wrote:
> >>>> >
> >>>> > > I am working on trying to build any kind of list constructing
> >>>> aggregator
> >>>> > > and having absolute fits.
> >>>> > >
> >>>> > > To simplify life, I decided to just build a generic list builder
> >>>> that is
> >>>> > a
> >>>> > > scalar function that returns a list containing its argument.  Thus
> >>>> > zoop(3)
> >>>> > > => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) => [[1,2,3]].
> >>>> > >
> >>>> > > The ComplexWriter looks like the place to go. As usual, the
> >>>> complete lack
> >>>> > > of comments in most of Drill makes this very hard since I have to
> >>>> guess
> >>>> > > what works and what doesn't.
> >>>> > >
> >>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
> >>>> method.  I
> >>>> > > used this in zip and it works nicely to construct lists for
> >>>> output.  I
> >>>> > note
> >>>> > > that the resulting ListWriter has a method copyReader(FieldReader
> >>>> var1)
> >>>> > > which looks really good.
> >>>> > >
> >>>> > > Unfortunately, the only implementation of copyReader() is in
> >>>> > > AbstractFieldWriter and it looks like this:
> >>>> > >
> >>>> > > public void copyReader(FieldReader reader) {
> >>>> > >     this.fail("Copy FieldReader");
> >>>> > > }
> >>>> > >
> >>>> > > I would like to formally say at this point "WTF"?
> >>>> > >
> >>>> > > In digging in further, I see other methods that look handy like
> >>>> > >
> >>>> > > public void write(IntHolder holder) {
> >>>> > >     this.fail("Int");
> >>>> > > }
> >>>> > >
> >>>> > > And then in looking at implementations, it looks like there is a
> >>>> > > combinatorial explosion because every type seems to need a write
> >>>> method
> >>>> > for
> >>>> > > every other type.
> >>>> > >
> >>>> > > What is the thought here?  How can I copy an arbitrary value into
> a
> >>>> list?
> >>>> > >
> >>>> > > My next thought was to build code that dispatches on type.  There
> >>>> is a
> >>>> > > method called getType() on the FieldReader.  Unfortunately, that
> >>>> drives
> >>>> > > into code generated by protoc and I see no way to dispatch on the
> >>>> type of
> >>>> > > an incoming value.
> >>>> > >
> >>>> > >
> >>>> > > How is this supposed to work?
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <
> baid.mehant@gmail.com>
> >>>> > wrote:
> >>>> > >
> >>>> > > > For a detailed example on using ComplexWriter interface you can
> >>>> take a
> >>>> > > look
> >>>> > > > at the Mappify
> >>>> > > > <
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
> >>>> > > > >
> >>>> > > > (kvgen) function. The function itself is very simple however it
> >>>> makes
> >>>> > use
> >>>> > > > of the utility methods in MappifyUtility
> >>>> > > > <
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
> >>>> > > > >
> >>>> > > > and MapUtility
> >>>> > > > <
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
> >>>> > > > >
> >>>> > > > which perform most of the work.
> >>>> > > >
> >>>> > > > Currently we don't have a generic infrastructure to handle
> errors
> >>>> > coming
> >>>> > > > out of functions. However there is UserException, which when
> >>>> raised
> >>>> > will
> >>>> > > > make sure that Drill does not gobble up the error message in
> that
> >>>> > > > exception. So you can probably throw a UserException with the
> >>>> failing
> >>>> > > input
> >>>> > > > in your function to make sure it propagates to the user.
> >>>> > > >
> >>>> > > > Thanks
> >>>> > > > Mehant
> >>>> > > >
> >>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <
> >>>> jacques@apache.org>
> >>>> > > wrote:
> >>>> > > >
> >>>> > > > > *Holders are for both input and output.  You can also use
> >>>> > ComplexWriter
> >>>> > > > for
> >>>> > > > > output and FieldReader for input if you want to write or read
> a
> >>>> > complex
> >>>> > > > > value.
> >>>> > > > >
> >>>> > > > > I don't think we've provided a really clean way to construct a
> >>>> > > > > Repeated*Holder for output purposes.  You can probably do it
> by
> >>>> > > reaching
> >>>> > > > > into a bunch of internal interfaces in Drill.  However, I
> would
> >>>> > > recommend
> >>>> > > > > using the ComplexWriter output pattern for now.  This will be
> a
> >>>> > little
> >>>> > > > less
> >>>> > > > > efficient but substantially less brittle.  I suggest you open
> >>>> up a
> >>>> > jira
> >>>> > > > for
> >>>> > > > > using a Repeated*Holder as an output.
> >>>> > > > >
> >>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
> >>>> ted.dunning@gmail.com>
> >>>> > > > wrote:
> >>>> > > > >
> >>>> > > > > > Holders are for input, I think.
> >>>> > > > > >
> >>>> > > > > > Try the different kinds of writers.
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
> >>>> jbates@maprtech.com>
> >>>> > > > wrote:
> >>>> > > > > >
> >>>> > > > > > > Using a repeatedholder as a @param I've got working. I was
> >>>> > working
> >>>> > > > on a
> >>>> > > > > > > custom aggregator function using DrillAggFunc. In this I
> >>>> can do
> >>>> > > > simple
> >>>> > > > > > > things, but if I want to build a list of values and do
> >>>> something with
> >>>> > > it
> >>>> > > > in
> >>>> > > > > > the
> >>>> > > > > > > final output method I think I need to use RepeatedHolders
> >>>> in the
> >>>> > > > > > > @Workspace. To do that I need to create a new one in the
> >>>> setup
> >>>> > > > method.
> >>>> > > > > I
> >>>> > > > > > > can't get one built. They all require a BufferAllocator to
> >>>> be
> >>>> > > passed
> >>>> > > > in
> >>>> > > > > > to
> >>>> > > > > > > build it. I have not found a way to get an allocator yet.
> >>>> Any
> >>>> > > > > > suggestions?
> >>>> > > > > > >
> >>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> >>>> > ted.dunning@gmail.com
> >>>> > > >
> >>>> > > > > > wrote:
> >>>> > > > > > >
> >>>> > > > > > > > If you look at the zip function in
> >>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions
> you
> >>>> can
> >>>> > > have
> >>>> > > > an
> >>>> > > > > > > > example of building a structure.
> >>>> > > > > > > >
> >>>> > > > > > > > The basic idea is that your output is denoted as
> >>>> > > > > > > >
> >>>> > > > > > > >         @Output
> >>>> > > > > > > >         BaseWriter.ComplexWriter writer;
> >>>> > > > > > > >
> >>>> > > > > > > > The pattern for building a list of lists of integers is
> >>>> like
> >>>> > > this:
> >>>> > > > > > > >
> >>>> > > > > > > >         writer.setValueCount(n);
> >>>> > > > > > > >         ...
> >>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> >>>> > > > > > > >         outer.start(); // [ outer list
> >>>> > > > > > > >         ...
> >>>> > > > > > > >         // for each inner list {
> >>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> >>>> > > > > > > >             inner.start(); // [ inner list
> >>>> > > > > > > >             // for each inner list element {
> >>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> >>>> > > > > > > >             }
> >>>> > > > > > > >             inner.end();   // ] inner list
> >>>> > > > > > > >         }
> >>>> > > > > > > >         outer.end(); // ] outer list
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> >>>> > jbates@maprtech.com>
> >>>> > > > > > wrote:
> >>>> > > > > > > >
> >>>> > > > > > > > > I have working aggregation and simple UDFs. I've been
> >>>> trying
> >>>> > to
> >>>> > > > > > > document
> >>>> > > > > > > > > and understand each of the options available in a
> Drill
> >>>> UDF.
> >>>> > > > > > > > Understanding
> >>>> > > > > > > > > the different FunctionScope's, the ones that are
> >>>> allowed, the
> >>>> > > > ones
> >>>> > > > > > that
> >>>> > > > > > > > are
> >>>> > > > > > > > > not. The impact of different cost categories. The
> >>>> different
> >>>> > > > steps
> >>>> > > > > > > needed
> >>>> > > > > > > > > to understand handling any of the supported data types
> >>>> and
> >>>> > > > > > structures
> >>>> > > > > > > in
> >>>> > > > > > > > > drill.
> >>>> > > > > > > > >
> >>>> > > > > > > > > Here are a few of my current road blocks. Any pointers
> >>>> would
> >>>> > be
> >>>> > > > > > greatly
> >>>> > > > > > > > > appreciated.
> >>>> > > > > > > > >
> >>>> > > > > > > > >
> >>>> > > > > > > > >    1. I've been trying to understand how to correctly
> >>>> use
> >>>> > > > > > > RepeatedHolders
> >>>> > > > > > > > >    of whatever type. For this discussion let's start
> >>>> with a
> >>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the
> >>>> best
> >>>> > way
> >>>> > > to
> >>>> > > > > > > create
> >>>> > > > > > > > a
> >>>> > > > > > > > > new
> >>>> > > > > > > > >    one. I have not figured out where in the existing
> >>>> drill
> >>>> > code
> >>>> > > > > > someone
> >>>> > > > > > > > > does
> >>>> > > > > > > > >    this. If I use a  RepeatedBigIntHolder as a
> Workspace
> >>>> > object
> >>>> > > > it
> >>>> > > > > is
> >>>> > > > > > > > null
> >>>> > > > > > > > > to
> >>>> > > > > > > > >    start with. I created a new one in the startup
> >>>> section of
> >>>> > > the
> >>>> > > > > udf
> >>>> > > > > > > but
> >>>> > > > > > > > > the
> >>>> > > > > > > > >    vector was null. I can find no reference in
> creating
> >>>> a new
> >>>> > > > > > > > BigIntVector.
> >>>> > > > > > > > >    There is a way to create a BigIntVector and I did
> >>>> find an
> >>>> > > > > example
> >>>> > > > > > of
> >>>> > > > > > > > >    creating a new VarCharVector but I can't do that
> >>>> using the
> >>>> > > > drill
> >>>> > > > > > jar
> >>>> > > > > > > > > files
> >>>> > > > > > > > >    from 1.0. The
> >>>> org.apache.drill.common.types.TypeProtos and
> >>>> > > > > > > > >    the
> >>>> org.apache.drill.common.types.TypeProtos.MinorType
> >>>> > > classes
> >>>> > > > > do
> >>>> > > > > > > not
> >>>> > > > > > > > >    appear to be accessible from the drill jar files.
> >>>> > > > > > > > >    2. What is the best way to close out a UDF in the
> >>>> event it
> >>>> > > > > > generates
> >>>> > > > > > > > an
> >>>> > > > > > > > >    exception? Are there specific steps one should
> >>>> follow to
> >>>> > > make
> >>>> > > > a
> >>>> > > > > > > clean
> >>>> > > > > > > > > exit
> >>>> > > > > > > > >    in a catch block that are beneficial to Drill?
> >>>> > > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > >
> >>>> > > > > >
> >>>> > > > >
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
I did get a new RepeatedBigIntHolder built and added a BigIntVector to it.
I'll try it in the UDF tomorrow and see whether the two ways I found to get
a BufferAllocator behave differently.

.
.
.
@Inject DrillBuf buffer;
@Workspace RepeatedBigIntHolder yList;
.
.
.
@Override
public void setup() {
.
.
.
// Alternative: take the allocator from the injected buffer.
// org.apache.drill.exec.memory.BufferAllocator allocator = buffer.getAllocator();
org.apache.drill.exec.memory.BufferAllocator allocator =
    new org.apache.drill.exec.memory.TopLevelAllocator();
yList = new RepeatedBigIntHolder();
yList.vector = new org.apache.drill.exec.vector.BigIntVector(
    org.apache.drill.exec.record.MaterializedField.create(
        new org.apache.drill.common.expression.SchemaPath(
            "bigints",
            org.apache.drill.common.expression.ExpressionPosition.UNKNOWN),
        org.apache.drill.common.types.Types.optional(
            org.apache.drill.common.types.TypeProtos.MinorType.BIGINT)),
    allocator);
.
.
.
}
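
For readers trying to picture what the holder is supposed to describe, here
is a minimal plain-Java sketch. It is not Drill code and uses no Drill
classes. It assumes, based on the start, end and vector fields used elsewhere
in this thread, that a repeated holder is a window [start, end) over a value
vector shared by all rows. SliceHolder, RepeatedSliceSketch and their methods
are hypothetical names made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a repeated holder: a window [start, end) over shared values. */
final class SliceHolder {
    int start;      // index of the first value belonging to the current row
    int end;        // index one past the last value of the current row
    long[] vector;  // values of all rows, stored back to back
}

final class RepeatedSliceSketch {
    /** Packs all rows into one array; offsetsOut[r] .. offsetsOut[r + 1] delimits row r. */
    static long[] flatten(List<long[]> rows, int[] offsetsOut) {
        int total = 0;
        for (long[] row : rows) {
            total += row.length;
        }
        long[] values = new long[total];
        int pos = 0;
        offsetsOut[0] = 0;
        for (int r = 0; r < rows.size(); r++) {
            for (long v : rows.get(r)) {
                values[pos++] = v;
            }
            offsetsOut[r + 1] = pos;
        }
        return values;
    }

    /** Builds the holder a function would see for the given row. */
    static SliceHolder holderForRow(long[] values, int[] offsets, int row) {
        SliceHolder h = new SliceHolder();
        h.start = offsets[row];
        h.end = offsets[row + 1];
        h.vector = values;
        return h;
    }

    /** Sums one row; note the loop runs from start to end, not from 0. */
    static long sum(SliceHolder h) {
        long total = 0;
        for (int i = h.start; i < h.end; i++) {
            total += h.vector[i];
        }
        return total;
    }

    public static void main(String[] args) {
        List<long[]> rows = Arrays.asList(new long[] {1, 2}, new long[] {}, new long[] {10, 20, 30});
        int[] offsets = new int[rows.size() + 1];
        long[] values = flatten(rows, offsets);
        for (int r = 0; r < rows.size(); r++) {
            System.out.println("row " + r + " sum = " + sum(holderForRow(values, offsets, r)));
        }
    }
}
```

If that picture is right, building a holder by hand means managing both the
value vector and the per-row offsets, which may be why it is awkward to do
from inside a UDF.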



On Sat, Jul 4, 2015 at 7:39 PM, Jim Bates <jb...@maprtech.com> wrote:

> I still have issues finding the correct way to create and use a
> RepeatedHolder and Writers are a non starter for Workspace values. I can
> make do with creating a concatenated string in a VarCharHolder for small
> data sets to get past this in the short term and finish testing the output
> values I expect but won't be able to do any scale till I figure out how to
> make a repeated list.
>
> On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com> wrote:
>
>> Well... Converting from string to integers anyway... Too many 4th of July
>> hot dogs. Going into nitrate overload. :)
>>
>> I am pulling an array of string values from json data. The string values
>> are actually integers. I am converting to integers and summing each
>> array entry to the final tally.
>>
>> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com> wrote:
>>
>>> Ted,
>>>
>>> Yes, I started out just getting a basic count to work. I am trying to
>>> keep the workflow as close to a basic user as possible. As such, I am
>>> building and using the MapR Apache Drill sandbox to test.
>>>
>>>
>>>    1. Always look at the drillbit.log file to see if Drill had any
>>>    issues loading your UDF. That was where I learned that all workspace
>>>    values need to be holders:
>>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
>>>       function class
>>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1, field
>>>       xList. Aggregate function 'MyLinearRegression1' workspace variable 'xList'
>>>       is of type 'interface
>>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
>>>       Please change it to Holder type.
>>>    2. Error messages:
>>>       - If you get an error in this format, it means that Drill cannot
>>>       find your function, so it probably didn't load it. Go back to step 1:
>>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
>>>          match found for function signature MyFunctionName(<ANY>)
>>>       - If you get an error in this format, it means that the function
>>>       is there but Drill could not find a signature that matched the
>>>       parameter types or the number of parameters you passed. The exact
>>>       wording will change, but "Missing function implementation" is the
>>>       key phrase to look for:
>>>          - Error: SYSTEM ERROR:
>>>          org.apache.drill.exec.exception.SchemaChangeException: Failure while trying
>>>          to materialize incoming schema.  Errors:
>>>          - Error in expression at index -1.  Error: Missing function
>>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full expression: --UNKNOWN
>>>          EXPRESSION--
>>>    3. In your function definition for aggregate functions, you need to
>>>    set null processing to internal and isRandom to false. Example below:
>>>       - @FunctionTemplate(name = "MyFunctionName", scope =
>>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>>>       isBinaryCommutative = false, costCategory =
>>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
>>>
>>> Below is an example from the Apache Drill tutorial data sets contained
>>> in the MapR Apache Drill sandbox. I am pulling an array of string values
>>> from json data. The string values are actually integers. I am converting to
>>> string and summing each array entry to the final tally. This in no way
>>> represents what this data was for but it did become a handy way for me to
>>> peck out the "correct" way to build an aggregation UDF function
>>>
>>> @FunctionTemplate(name = "MyArraySum", scope =
>>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>>> isBinaryCommutative = false, costCategory =
>>> FunctionTemplate.FunctionCostCategory.COMPLEX)
>>> public static class MyArraySum implements DrillAggFunc {
>>>
>>> @Param RepeatedVarCharHolder listToSearch;
>>> @Workspace NullableBigIntHolder count;
>>> @Workspace NullableBigIntHolder sum;
>>> @Workspace NullableVarCharHolder vc;
>>> @Output BigIntHolder out;
>>>
>>> @Override
>>> public void setup() {
>>> count.value=0;
>>> sum.value = 0;
>>> }
>>>
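>>> @Override
>>> public void add() {
>>> int val = 0;
>>> try {
>>> // start and end are offsets into the shared vector for the current row
>>> for (int i = listToSearch.start; i < listToSearch.end; i++) {
>>> listToSearch.vector.getAccessor().get(i, vc);
>>> String inputStr =
>>> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
>>> vc.start, vc.end, vc.buffer);
>>> val = Integer.parseInt(inputStr);
>>> sum.value = sum.value + val;
>>> }
>>> } catch (Exception e) {
>>> // a value that does not parse ends the tally for this row
>>> val = 0;
>>> }
>>> count.value = count.value + 1;
>>> }
>>>
>>> // Assumed completion: DrillAggFunc also requires output() and reset().
>>> // This sketch reports the running sum as the result.
>>> @Override
>>> public void output() {
>>> out.value = sum.value;
>>> }
>>>
>>> @Override
>>> public void reset() {
>>> count.value = 0;
>>> sum.value = 0;
>>> }
>>> }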
>>>
>>> Example select statement:
>>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
>>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
>>>
>>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>>> Jim,
>>>>
>>>> I think that you may be having trouble with aggregators in general.
>>>>
>>>> Have you been able to build *any* aggregator of anything?  I haven't.
>>>>
>>>> When I try to build an aggregator of int's or doubles, I get a very
>>>> persistent problem with Drill even seeing my aggregates:
>>>>
>>>> 0: jdbc:drill:zk=local> select sum_int(employee_id) from
>>>> cp.`employee.json`;
>>>>
>>>> Jul 04, 2015 4:19:35 PM
>>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>>
>>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>>> found for function signature sum_int(<ANY>)
>>>>
>>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
>>>> <init>
>>>>
>>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>>> column 8 to line 1, column 27: No match found for function signature
>>>> sum_int(<ANY>)
>>>>
>>>> Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
>>>> found for function signature sum_int(<ANY>)
>>>>
>>>> [Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010]
>>>> (state=,code=0)
>>>>
>>>> 0: jdbc:drill:zk=local> select sum_int(cast(employee_id as int)) from
>>>> cp.`employee.json`;
>>>>
>>>> Jul 04, 2015 4:19:45 PM
>>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>>
>>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>>> found for function signature sum_int(<NUMERIC>)
>>>>
>>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
>>>> <init>
>>>>
>>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>>> column 8 to line 1, column 40: No match found for function signature
>>>> sum_int(<NUMERIC>)
>>>>
>>>> Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
>>>> found for function signature sum_int(<NUMERIC>)
>>>>
>>>> [Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010]
>>>> (state=,code=0)
>>>>
>>>> 0: jdbc:drill:zk=local>
>>>>
>>>>
>>>> It looks like there is some undocumented subtlety about how to register
>>>> an aggregator.
>>>>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
I still have issues finding the correct way to create and use a
RepeatedHolder, and Writers are a non-starter for Workspace values. In the
short term I can make do with building a concatenated string in a
VarCharHolder for small data sets, which lets me finish testing the output
values I expect. I won't be able to work at any scale until I figure out how
to make a repeated list.
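
The stopgap can be sketched in plain Java as follows. This is only an
illustration of the idea, not Drill code: inside a real UDF the text would
live in a VarCharHolder backed by a buffer rather than in a StringBuilder,
and DelimitedAccumulatorSketch, its methods and the comma delimiter are all
hypothetical choices.

```java
import java.util.Arrays;

/** Hypothetical sketch: collect values as one delimited string, decode them at the end. */
final class DelimitedAccumulatorSketch {
    private static final char DELIMITER = ',';
    private final StringBuilder text = new StringBuilder();

    /** Corresponds to the add step: append one more value to the running string. */
    void add(long value) {
        if (text.length() > 0) {
            text.append(DELIMITER);
        }
        text.append(value);
    }

    /** The accumulated state as it would be stored between calls. */
    String encoded() {
        return text.toString();
    }

    /** Corresponds to the output step: turn the string back into the list of values. */
    static long[] decode(String encoded) {
        if (encoded.isEmpty()) {
            return new long[0];
        }
        String[] parts = encoded.split(String.valueOf(DELIMITER));
        long[] values = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Long.parseLong(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        DelimitedAccumulatorSketch acc = new DelimitedAccumulatorSketch();
        for (long v : new long[] {3, -1, 42}) {
            acc.add(v);
        }
        System.out.println(acc.encoded());
        System.out.println(Arrays.toString(decode(acc.encoded())));
    }
}
```

The cost is visible in the sketch: every value is stored as text and the
whole string has to be parsed again at output time, which is tolerable for
small groups but not at scale.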

On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <jb...@maprtech.com> wrote:

> Well... Converting from string to integers anyway... Too many 4th of July
> hot dogs. Going into nitrate overload. :)
>
> I am pulling an array of string values from json data. The string values
> are actually integers. I am converting to integers and summing each array
> entry to the final tally.
>
> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <jb...@maprtech.com> wrote:
>
>> Ted,
>>
>> Yes, I started out just getting a basic count to work. I am trying to
>> keep the workflow as close to a basic user as possible. As such, I am
>> building and using the MapR Apache Drill sandbox to test.
>>
>>
>>    1. Always look at the drillbit.log file to see if Drill had any
>>    issues loading your UDF. That was where I learned that all workspace
>>    values need to be holders:
>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
>>       function class
>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1, field
>>       xList. Aggregate function 'MyLinearRegression1' workspace variable 'xList'
>>       is of type 'interface
>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
>>       Please change it to Holder type.
>>    2. Error messages:
>>       - If you get an error in this format, it means that Drill cannot
>>       find your function, so it probably didn't load it. Go back to step 1:
>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
>>          match found for function signature MyFunctionName(<ANY>)
>>       - If you get an error in this format, it means that the function
>>       is there but Drill could not find a signature that matched the
>>       parameter types or the number of parameters you passed. The exact
>>       wording will change, but "Missing function implementation" is the
>>       key phrase to look for:
>>          - Error: SYSTEM ERROR:
>>          org.apache.drill.exec.exception.SchemaChangeException: Failure while trying
>>          to materialize incoming schema.  Errors:
>>          - Error in expression at index -1.  Error: Missing function
>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full expression: --UNKNOWN
>>          EXPRESSION--
>>    3. In your function definition for aggregate functions, you need to
>>    set null processing to internal and isRandom to false. Example below:
>>       - @FunctionTemplate(name = "MyFunctionName", scope =
>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>>       isBinaryCommutative = false, costCategory =
>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
>>
>> Below is an example from the Apache Drill tutorial data sets contained in
>> the MapR Apache Drill sandbox. I am pulling an array if string values from
>> json data. The string values are actually integers. I am converting to
>> string and summing each array entry to the final tally. This in no way
>> represents what this data was for but it did become a handy way for me to
>> peck out the "correct" way to build an aggregation UDF function
>>
>> @FunctionTemplate(name = "MyArraySum", scope =
>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>> isBinaryCommutative = false, costCategory =
>> FunctionTemplate.FunctionCostCategory.COMPLEX)
>> public static class MyArraySum implements DrillAggFunc {
>>
>> @Param RepeatedVarCharHolder listToSearch;
>> @Workspace NullableBigIntHolder count;
>> @Workspace NullableBigIntHolder sum;
>> @Workspace NullableVarCharHolder vc;
>> @Output BigIntHolder out;
>>
>> @Override
>> public void setup() {
>> count.value=0;
>> sum.value = 0;
>> }
>>
>> @Override
>> public void add() {
>> int c = listToSearch.end - listToSearch.start;
>> int val = 0;
>> try {
>> for(int i=0; i<c; i++){
>> listToSearch.vector.getAccessor().get(i, vc);
>> String inputStr =
>> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(vc.start,
>> vc.end, vc.buffer);
>> val = Integer.parseInt(inputStr);
>> sum.value = sum.value + val;
>> }
>> } catch (Exception e) {
>> val = 0;
>> }
>> count.value = count.value + 1;
>> }
>>
>> Example select statement:
>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
>>
>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>> Jim,
>>>
>>> I think that you may be having trouble with aggregators in general.
>>>
>>> Have you been able to build *any* aggregator of anything?  I haven't.
>>>
>>> When I try to build an aggregator of int's or doubles, I get a very
>>> persistent problem with Drill even seeing my aggregates:
>>>
>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
>>> cp.`employee.json`;*
>>>
>>> Jul 04, 2015 4:19:35 PM
>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>
>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>> found for function signature sum_int(<ANY>)
>>>
>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
>>> <init>
>>>
>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>> column 8 to line 1, column 27: No match found for function signature
>>> sum_int(<ANY>)
>>>
>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
>>> found for function signature sum_int(<ANY>)*
>>>
>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010
>>> <http://10.0.1.2:31010>] (state=,code=0)*
>>>
>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
>>> cp.`employee.json`*;
>>>
>>> Jul 04, 2015 4:19:45 PM
>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>
>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>> found for function signature sum_int(<NUMERIC>)
>>>
>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
>>> <init>
>>>
>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>> column 8 to line 1, column 40: No match found for function signature
>>> sum_int(<NUMERIC>)
>>>
>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
>>> found for function signature sum_int(<NUMERIC>)*
>>>
>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010
>>> <http://10.0.1.2:31010>] (state=,code=0)*
>>>
>>> 0: jdbc:drill:zk=local>
>>>
>>>
>>> It looks like there is some undocumented subtlety about how to register
>>> an
>>> aggregator.
>>>
>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com> wrote:
>>>
>>> > I'm working on the same thing. I want to aggregate a list of values.
>>> It has
>>> > been a search and guess game for the most part. I'm still stuck in the
>>> > process of getting the values all into a list. The writers look
>>> interesting
>>> > but for aggregation functions  it looks like the input is the param and
>>> > output objects can't hold the aggregations steps. The Workspace is
>>> where
>>> > that happens. If I try and use a Writer in a workspace it won't load
>>> and
>>> > tells me to change it to Holders which was why I was using them to
>>> start
>>> > with. Maybe I'm missing the architecture of the agg function. It looked
>>> > like it was....
>>> >
>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
>>> > through @Workspace vars in add -> finalize @Output in output.
>>> >
>>> > So I'm back to trying to figure out how to create a
>>> RepeatedBigIntHolder or
>>> > a RepeatedVarCharHolder...
>>> >
>>> >
>>> >
>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>> >
>>> > > I am working on trying to build any kind of list constructing
>>> aggregator
>>> > > and having absolute fits.
>>> > >
>>> > > To simplify life, I decided to just build a generic list builder
>>> that is
>>> > a
>>> > > scalar function that returns a list containing its argument.  Thus
>>> > zoop(3)
>>> > > => [3], zoop('abc') => 'abc' and zoop([1,2,3]) => [[1,2,3]].
>>> > >
>>> > > The ComplexWriter looks like the place to go. As usual, the complete
>>> lack
>>> > > of comments in most of Drill makes this very hard since I have to
>>> guess
>>> > > what works and what doesn't.
>>> > >
>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
>>> method.  I
>>> > > used this in zip and it works nicely to construct lists for output.
>>> I
>>> > note
>>> > > that the resulting ListWriter has a method copyReader(FieldReader
>>> var1)
>>> > > which looks really good.
>>> > >
>>> > > Unfortunately, the only implementation of copyReader() is in
>>> > > AbstractFieldWriter and it looks this:
>>> > >
>>> > > public void copyReader(FieldReader reader) {
>>> > >     this.fail("Copy FieldReader");
>>> > > }
>>> > >
>>> > > I would like to formally say at this point "WTF"?
>>> > >
>>> > > In digging in further, I see other methods that look handy like
>>> > >
>>> > > public void write(IntHolder holder) {
>>> > >     this.fail("Int");
>>> > > }
>>> > >
>>> > > And then in looking at implementations, it looks like there is a
>>> > > combinatorial explosion because every type seems to need a write
>>> method
>>> > for
>>> > > every other type.
>>> > >
>>> > > What is the thought here?  How can I copy an arbitrary value into a
>>> list?
>>> > >
>>> > > My next thought was to build code that dispatches on type.  There is
>>> a
>>> > > method called getType() on the FieldReader.  Unfortunately, that
>>> drives
>>> > > into code generated by protoc and I see no way to dispatch on the
>>> type of
>>> > > an incoming value.
>>> > >
>>> > >
>>> > > How is this supposed to work?
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <ba...@gmail.com>
>>> > wrote:
>>> > >
>>> > > > For a detailed example on using ComplexWriter interface you can
>>> take a
>>> > > look
>>> > > > at the Mappify
>>> > > > <
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
>>> > > > >
>>> > > > (kvgen) function. The function itself is very simple however it
>>> makes
>>> > use
>>> > > > of the utility methods in MappifyUtility
>>> > > > <
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
>>> > > > >
>>> > > > and MapUtility
>>> > > > <
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
>>> > > > >
>>> > > > which perform most of the work.
>>> > > >
>>> > > > Currently we don't have a generic infrastructure to handle errors
>>> > coming
>>> > > > out of functions. However there is UserException, which when raised
>>> > will
>>> > > > make sure that Drill does not gobble up the error message in that
>>> > > > exception. So you can probably throw a UserException with the
>>> failing
>>> > > input
>>> > > > in your function to make sure it propagates to the user.
>>> > > >
>>> > > > Thanks
>>> > > > Mehant
>>> > > >
>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <jacques@apache.org
>>> >
>>> > > wrote:
>>> > > >
>>> > > > > *Holders are for both input and output.  You can also use
>>> > CompleWriter
>>> > > > for
>>> > > > > output and FieldReader for input if you want to write or read a
>>> > complex
>>> > > > > value.
>>> > > > >
>>> > > > > I don't think we've provided a really clean way to construct a
>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
>>> > > reaching
>>> > > > > into a bunch of internal interfaces in Drill.  However, I would
>>> > > recommend
>>> > > > > using the ComplexWriter output pattern for now.  This will be a
>>> > little
>>> > > > less
>>> > > > > efficient but substantially less brittle.  I suggest you open up
>>> a
>>> > jira
>>> > > > for
>>> > > > > using a Repeated*Holder as an output.
>>> > > > >
>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <
>>> ted.dunning@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > > > Holders are for input, I think.
>>> > > > > >
>>> > > > > > Try the different kinds of writers.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <
>>> jbates@maprtech.com>
>>> > > > wrote:
>>> > > > > >
>>> > > > > > > Using a repeatedholder as a @param I've got working. I was
>>> > working
>>> > > > on a
>>> > > > > > > custom aggregator function using DrillAggFunc. In this I can
>>> do
>>> > > > simple
>>> > > > > > > things but If I want to build a list values and do something
>>> with
>>> > > it
>>> > > > in
>>> > > > > > the
>>> > > > > > > final output method I think I need to use RepeatedHolders in
>>> the
>>> > > > > > > @Workspace. To do that I need to create a new one in the
>>> setup
>>> > > > method.
>>> > > > > I
>>> > > > > > > can't get one built. They all require a BufferAllocator to be
>>> > > passed
>>> > > > in
>>> > > > > > to
>>> > > > > > > build it. I have not found a way to get an allocator yet. Any
>>> > > > > > suggestions?
>>> > > > > > >
>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
>>> > ted.dunning@gmail.com
>>> > > >
>>> > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > If you look at the zip function in
>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you
>>> can
>>> > > have
>>> > > > an
>>> > > > > > > > example of building a structure.
>>> > > > > > > >
>>> > > > > > > > The basic idea is that your output is denoted as
>>> > > > > > > >
>>> > > > > > > >         @Output
>>> > > > > > > >         BaseWriter.ComplexWriter writer;
>>> > > > > > > >
>>> > > > > > > > The pattern for building a list of lists of integers is
>>> like
>>> > > this:
>>> > > > > > > >
>>> > > > > > > >         writer.setValueCount(n);
>>> > > > > > > >         ...
>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
>>> > > > > > > >         outer.start(); // [ outer list
>>> > > > > > > >         ...
>>> > > > > > > >         // for each inner list
>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
>>> > > > > > > >             inner.start();
>>> > > > > > > >             // for each inner list element
>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
>>> > > > > > > >             }
>>> > > > > > > >             inner.end();   // ] inner list
>>> > > > > > > >         }
>>> > > > > > > >         outer.end(); // ] outer list
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
>>> > jbates@maprtech.com>
>>> > > > > > wrote:
>>> > > > > > > >
>>> > > > > > > > > I have working aggregation and simple UDFs. I've been
>>> trying
>>> > to
>>> > > > > > > document
>>> > > > > > > > > and understand each of the options available in a Drill
>>> UDF.
>>> > > > > > > > Understanding
>>> > > > > > > > > the different FunctionScope's, the ones that are
>>> allowed, the
>>> > > > ones
>>> > > > > > that
>>> > > > > > > > are
>>> > > > > > > > > not. The impact of different cost categories. The
>>> different
>>> > > > steps
>>> > > > > > > needed
>>> > > > > > > > > to understand handling any of the supported data types
>>> and
>>> > > > > > structures
>>> > > > > > > in
>>> > > > > > > > > drill.
>>> > > > > > > > >
>>> > > > > > > > > Here are a few of my current road blocks. Any pointers
>>> would
>>> > be
>>> > > > > > greatly
>>> > > > > > > > > appreciated.
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >    1. I've been trying to understand how to correctly use
>>> > > > > > > RepeatedHolders
>>> > > > > > > > >    of whatever type. For this discussion lets start with
>>> a
>>> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the
>>> best
>>> > way
>>> > > to
>>> > > > > > > create
>>> > > > > > > > a
>>> > > > > > > > > new
>>> > > > > > > > >    one. I have not figured out where in the existing
>>> drill
>>> > code
>>> > > > > > someone
>>> > > > > > > > > does
>>> > > > > > > > >    this. If I use a  RepeatedBigIntHolder as a Workspace
>>> > object
>>> > > > is
>>> > > > > is
>>> > > > > > > > null
>>> > > > > > > > > to
>>> > > > > > > > >    start with. I created a new one in the startup
>>> section of
>>> > > the
>>> > > > > udf
>>> > > > > > > but
>>> > > > > > > > > the
>>> > > > > > > > >    vector was null. I can find no reference in creating
>>> a new
>>> > > > > > > > BigIntVector.
>>> > > > > > > > >    There is a way to create a BigIntVector and I did
>>> find an
>>> > > > > example
>>> > > > > > of
>>> > > > > > > > >    creating a new VarCharVector but I can't do that
>>> using the
>>> > > > drill
>>> > > > > > jar
>>> > > > > > > > > files
>>> > > > > > > > >    from 1.0. The
>>> org.apache.drill.common.types.TypeProtos and
>>> > > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType
>>> > > classes
>>> > > > > do
>>> > > > > > > not
>>> > > > > > > > >    appear to be accessible from the drill jar files.
>>> > > > > > > > >    2. What is the best way to close out a UDF in the
>>> event it
>>> > > > > > generates
>>> > > > > > > > an
>>> > > > > > > > >    exception? Are there specific steps one should follow
>>> to
>>> > > make
>>> > > > a
>>> > > > > > > clean
>>> > > > > > > > > exit
>>> > > > > > > > >    in a catch block that are beneficial to Drill?
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
Well... converting from strings to integers, anyway. Too many 4th of July
hot dogs; going into nitrate overload. :)

I am pulling an array of string values from JSON data. The string values
are actually integers. I am converting them to integers and adding each
array entry to the final tally.
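
As a rough illustration of the conversion described above, here is a minimal
sketch in plain Java, outside of Drill. The class name ArraySumSketch, the
method sumEntries, and the sample values are made up for illustration; a real
UDF would read its entries from the RepeatedVarCharHolder rather than from a
List. Skipping entries that fail to parse is an assumption, not something the
thread settles.

```java
import java.util.Arrays;
import java.util.List;

public class ArraySumSketch {

    /**
     * Parses each entry as an integer and adds it to a running total.
     * Null entries and entries that do not parse are skipped (an assumption
     * made for this sketch).
     */
    static long sumEntries(List<String> entries) {
        long sum = 0;
        for (String entry : entries) {
            if (entry == null) {
                continue;
            }
            try {
                sum += Long.parseLong(entry.trim());
            } catch (NumberFormatException e) {
                // Not an integer: leave the tally unchanged and move on.
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical array of string values that are really integers.
        List<String> entries = Arrays.asList("174", "2", "not-a-number");
        System.out.println(sumEntries(entries)); // prints 176
    }
}
```

Catching the parse failure per entry keeps one bad value from discarding the
rest of the array, which a single try block around the whole loop would do.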


Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
Ted,

Yes, I started out just getting a basic count to work. I am trying to keep
the workflow as close to a basic user's as possible. As such, I am building
and using the MapR Apache Drill sandbox to test.


   1. Always look at the drillbit.log file to see if Drill had any issues
      loading your UDF. That was where I learned that all workspace values
      need to be holders:

        WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
        function class
        com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
        field xList. Aggregate function 'MyLinearRegression1' workspace
        variable 'xList' is of type 'interface
        org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
        Please change it to Holder type.

   2. Error messages:
      - If you get an error in this format, Drill cannot find your
        function, so it probably didn't load it. Go back to step 1:

          PARSE ERROR: From line 1, column 8 to line 1, column 44: No
          match found for function signature MyFunctionName(<ANY>)

      - If you get an error in this format, the function is there but
        Drill could not find a signature that matched the param types or
        the number of params you were passing it. The exact wording will
        change, but "Missing function implementation" is the key phrase
        to look for:

          Error: SYSTEM ERROR:
          org.apache.drill.exec.exception.SchemaChangeException: Failure
          while trying to materialize incoming schema.  Errors:
          Error in expression at index -1.  Error: Missing function
          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full
          expression: --UNKNOWN EXPRESSION--

   3. In your function definition for aggregate functions you need to set
      null processing to internal and isRandom to false. Example below:

        @FunctionTemplate(name = "MyFunctionName",
            scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
            nulls = FunctionTemplate.NullHandling.INTERNAL,
            isRandom = false,
            isBinaryCommutative = false,
            costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)

Below is an example built on the Apache Drill tutorial data sets contained
in the MapR Apache Drill sandbox. I am pulling an array of string values
from JSON data. The string values are actually integers, so I convert each
string to an integer and add it to the running total. This in no way
represents what the data was meant for, but it was a handy way for me to
peck out the "correct" way to build an aggregation UDF:

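@FunctionTemplate(name = "MyArraySum",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL,
    isRandom = false,
    isBinaryCommutative = false,
    costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
public static class MyArraySum implements DrillAggFunc {

  @Param RepeatedVarCharHolder listToSearch;
  @Workspace NullableBigIntHolder count;   // number of rows seen
  @Workspace NullableBigIntHolder sum;     // running total
  @Workspace NullableVarCharHolder vc;     // scratch holder for one element
  @Output BigIntHolder out;

  @Override
  public void setup() {
    count.value = 0;
    sum.value = 0;
  }

  @Override
  public void add() {
    try {
      // start/end are offsets into the shared vector, so read the
      // elements start..end-1 rather than 0..(end-start)-1.
      for (int i = listToSearch.start; i < listToSearch.end; i++) {
        listToSearch.vector.getAccessor().get(i, vc);
        String inputStr =
            org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
                .toStringFromUTF8(vc.start, vc.end, vc.buffer);
        sum.value = sum.value + Integer.parseInt(inputStr);
      }
    } catch (Exception e) {
      // An element that does not parse ends this row early; the values
      // already added for the row stay in the total.
    }
    count.value = count.value + 1;
  }

  // Assumed: DrillAggFunc also requires output() and reset(); these are
  // minimal bodies for this example.
  @Override
  public void output() {
    out.value = sum.value;
  }

  @Override
  public void reset() {
    count.value = 0;
    sum.value = 0;
  }
}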

Example select statement:
SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as my_arrays
FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
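
To make the start/end indexing in add() easier to see, here is a minimal
plain-Java sketch with no Drill dependencies. RepeatedSlice and
ArraySumSketch are hypothetical stand-ins I made up for illustration, not
Drill classes; the only assumption carried over is that a repeated holder
exposes one shared vector plus a [start, end) range for the current row.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for a Drill Repeated*Holder: one shared value
// vector plus the [start, end) slice that belongs to the current row.
final class RepeatedSlice {
    final List<byte[]> vector; // elements of all rows, stored back to back
    final int start;           // index of this row's first element
    final int end;             // one past this row's last element

    RepeatedSlice(List<byte[]> vector, int start, int end) {
        this.vector = vector;
        this.start = start;
        this.end = end;
    }
}

final class ArraySumSketch {
    // Sums the integer strings in one row, skipping entries that do not parse.
    static long sumRow(RepeatedSlice row) {
        long sum = 0;
        // Iterate over absolute indices into the shared vector.
        for (int i = row.start; i < row.end; i++) {
            String s = new String(row.vector.get(i), StandardCharsets.UTF_8);
            try {
                sum += Long.parseLong(s.trim());
            } catch (NumberFormatException e) {
                // Ignore a bad element and keep going with the rest of the row.
            }
        }
        return sum;
    }
}
```

Unlike the UDF above, this sketch catches the parse failure per element, so
one bad value does not drop the rest of the row. Whether that is the right
behavior depends on the data.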

On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <te...@gmail.com> wrote:

> Jim,
>
> I think that you may be having trouble with aggregators in general.
>
> Have you been able to build *any* aggregator of anything?  I haven't.
>
> When I try to build an aggregator of int's or doubles, I get a very
> persistent problem with Drill even seeing my aggregates:
>
> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
> cp.`employee.json`;*
>
> Jul 04, 2015 4:19:35 PM
> org.apache.calcite.sql.validate.SqlValidatorException <init>
>
> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> found for function signature sum_int(<ANY>)
>
> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException <init>
>
> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> column 8 to line 1, column 27: No match found for function signature
> sum_int(<ANY>)
>
> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
> found for function signature sum_int(<ANY>)*
>
> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010
> <http://10.0.1.2:31010>] (state=,code=0)*
>
> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
> cp.`employee.json`*;
>
> Jul 04, 2015 4:19:45 PM
> org.apache.calcite.sql.validate.SqlValidatorException <init>
>
> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
> found for function signature sum_int(<NUMERIC>)
>
> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException <init>
>
> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> column 8 to line 1, column 40: No match found for function signature
> sum_int(<NUMERIC>)
>
> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
> found for function signature sum_int(<NUMERIC>)*
>
> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010
> <http://10.0.1.2:31010>] (state=,code=0)*
>
> 0: jdbc:drill:zk=local>
>
>
> It looks like there is some undocumented subtlety about how to register an
> aggregator.
>
> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com> wrote:
>
> > I'm working on the same thing. I want to aggregate a list of values. It
> has
> > been a search and guess game for the most part. I'm still stuck in the
> > process of getting the values all into a list. The writers look
> interesting
> > but for aggregation functions  it looks like the input is the param and
> > output objects can't hold the aggregations steps. The Workspace is where
> > that happens. If I try and use a Writer in a workspace it won't load and
> > tells me to change it to Holders which was why I was using them to start
> > with. Maybe I'm missing the architecture of the agg function. It looked
> > like it was....
> >
> > @Param comes in -> initialize @Workspace vars in setup -> process data
> > through @Workspace vars in add -> finalize @Output in output.
> >
> > So I'm back to trying to figure out how to create a RepeatedBigIntHolder
> or
> > a RepeatedVarCharHolder...
> >
> >
> >
> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> > > I am working on trying to build any kind of list constructing
> aggregator
> > > and having absolute fits.
> > >
> > > To simplify life, I decided to just build a generic list builder that
> is
> > a
> > > scalar function that returns a list containing its argument.  Thus
> > zoop(3)
> > > => [3], zoop('abc') => 'abc' and zoop([1,2,3]) => [[1,2,3]].
> > >
> > > The ComplexWriter looks like the place to go. As usual, the complete
> lack
> > > of comments in most of Drill makes this very hard since I have to guess
> > > what works and what doesn't.
> > >
> > > In my code, I note that ComplexWriter has a nice rootAsList() method.
> I
> > > used this in zip and it works nicely to construct lists for output.  I
> > note
> > > that the resulting ListWriter has a method copyReader(FieldReader var1)
> > > which looks really good.
> > >
> > > Unfortunately, the only implementation of copyReader() is in
> > > AbstractFieldWriter and it looks this:
> > >
> > > public void copyReader(FieldReader reader) {
> > >     this.fail("Copy FieldReader");
> > > }
> > >
> > > I would like to formally say at this point "WTF"?
> > >
> > > In digging in further, I see other methods that look handy like
> > >
> > > public void write(IntHolder holder) {
> > >     this.fail("Int");
> > > }
> > >
> > > And then in looking at implementations, it looks like there is a
> > > combinatorial explosion because every type seems to need a write method
> > for
> > > every other type.
> > >
> > > What is the thought here?  How can I copy an arbitrary value into a
> list?
> > >
> > > My next thought was to build code that dispatches on type.  There is a
> > > method called getType() on the FieldReader.  Unfortunately, that drives
> > > into code generated by protoc and I see no way to dispatch on the type
> of
> > > an incoming value.
> > >
> > >
> > > How is this supposed to work?
> > >
> > >
> > >
> > >
> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <ba...@gmail.com>
> > wrote:
> > >
> > > > For a detailed example on using ComplexWriter interface you can take
> a
> > > look
> > > > at the Mappify
> > > > <
> > > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
> > > > >
> > > > (kvgen) function. The function itself is very simple however it makes
> > use
> > > > of the utility methods in MappifyUtility
> > > > <
> > > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
> > > > >
> > > > and MapUtility
> > > > <
> > > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
> > > > >
> > > > which perform most of the work.
> > > >
> > > > Currently we don't have a generic infrastructure to handle errors
> > coming
> > > > out of functions. However there is UserException, which when raised
> > will
> > > > make sure that Drill does not gobble up the error message in that
> > > > exception. So you can probably throw a UserException with the failing
> > > input
> > > > in your function to make sure it propagates to the user.
> > > >
> > > > Thanks
> > > > Mehant
> > > >
> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > > >
> > > > > *Holders are for both input and output.  You can also use
> > CompleWriter
> > > > for
> > > > > output and FieldReader for input if you want to write or read a
> > complex
> > > > > value.
> > > > >
> > > > > I don't think we've provided a really clean way to construct a
> > > > > Repeated*Holder for output purposes.  You can probably do it by
> > > reaching
> > > > > into a bunch of internal interfaces in Drill.  However, I would
> > > recommend
> > > > > using the ComplexWriter output pattern for now.  This will be a
> > little
> > > > less
> > > > > efficient but substantially less brittle.  I suggest you open up a
> > jira
> > > > for
> > > > > using a Repeated*Holder as an output.
> > > > >
> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Holders are for input, I think.
> > > > > >
> > > > > > Try the different kinds of writers.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jb...@maprtech.com>
> > > > wrote:
> > > > > >
> > > > > > > Using a repeatedholder as a @param I've got working. I was
> > working
> > > > on a
> > > > > > > custom aggregator function using DrillAggFunc. In this I can do
> > > > simple
> > > > > > > things but If I want to build a list values and do something
> with
> > > it
> > > > in
> > > > > > the
> > > > > > > final output method I think I need to use RepeatedHolders in
> the
> > > > > > > @Workspace. To do that I need to create a new one in the setup
> > > > method.
> > > > > I
> > > > > > > can't get one built. They all require a BufferAllocator to be
> > > passed
> > > > in
> > > > > > to
> > > > > > > build it. I have not found a way to get an allocator yet. Any
> > > > > > suggestions?
> > > > > > >
> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> > ted.dunning@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > If you look at the zip function in
> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
> > > have
> > > > an
> > > > > > > > example of building a structure.
> > > > > > > >
> > > > > > > > The basic idea is that your output is denoted as
> > > > > > > >
> > > > > > > >         @Output
> > > > > > > >         BaseWriter.ComplexWriter writer;
> > > > > > > >
> > > > > > > > The pattern for building a list of lists of integers is like
> > > this:
> > > > > > > >
> > > > > > > >         writer.setValueCount(n);
> > > > > > > >         ...
> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> > > > > > > >         outer.start(); // [ outer list
> > > > > > > >         ...
> > > > > > > >         // for each inner list
> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
> > > > > > > >             inner.start();
> > > > > > > >             // for each inner list element
> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
> > > > > > > >             }
> > > > > > > >             inner.end();   // ] inner list
> > > > > > > >         }
> > > > > > > >         outer.end(); // ] outer list
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> > jbates@maprtech.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > I have working aggregation and simple UDFs. I've been
> trying
> > to
> > > > > > > document
> > > > > > > > > and understand each of the options available in a Drill
> UDF.
> > > > > > > > Understanding
> > > > > > > > > the different FunctionScope's, the ones that are allowed,
> the
> > > > ones
> > > > > > that
> > > > > > > > are
> > > > > > > > > not. The impact of different cost categories. The different
> > > > steps
> > > > > > > needed
> > > > > > > > > to understand handling any of the supported data types  and
> > > > > > structures
> > > > > > > in
> > > > > > > > > drill.
> > > > > > > > >
> > > > > > > > > Here are a few of my current road blocks. Any pointers
> would
> > be
> > > > > > greatly
> > > > > > > > > appreciated.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    1. I've been trying to understand how to correctly use
> > > > > > > RepeatedHolders
> > > > > > > > >    of whatever type. For this discussion lets start with a
> > > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the best
> > way
> > > to
> > > > > > > create
> > > > > > > > a
> > > > > > > > > new
> > > > > > > > >    one. I have not figured out where in the existing drill
> > code
> > > > > > someone
> > > > > > > > > does
> > > > > > > > >    this. If I use a  RepeatedBigIntHolder as a Workspace
> > object
> > > > is
> > > > > is
> > > > > > > > null
> > > > > > > > > to
> > > > > > > > >    start with. I created a new one in the startup section
> of
> > > the
> > > > > udf
> > > > > > > but
> > > > > > > > > the
> > > > > > > > >    vector was null. I can find no reference in creating a
> new
> > > > > > > > BigIntVector.
> > > > > > > > >    There is a way to create a BigIntVector and I did find
> an
> > > > > example
> > > > > > of
> > > > > > > > >    creating a new VarCharVector but I can't do that using
> the
> > > > drill
> > > > > > jar
> > > > > > > > > files
> > > > > > > > >    from 1.0. The org.apache.drill.common.types.TypeProtos
> and
> > > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType
> > > classes
> > > > > do
> > > > > > > not
> > > > > > > > >    appear to be accessible from the drill jar files.
> > > > > > > > >    2. What is the best way to close out a UDF in the event
> it
> > > > > > generates
> > > > > > > > an
> > > > > > > > >    exception? Are there specific steps one should follow to
> > > make
> > > > a
> > > > > > > clean
> > > > > > > > > exit
> > > > > > > > >    in a catch block that are beneficial to Drill?
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Some questions on UDFs

Posted by Ted Dunning <te...@gmail.com>.
Jim,

I think that you may be having trouble with aggregators in general.

Have you been able to build *any* aggregator of anything?  I haven't.

When I try to build an aggregator of ints or doubles, Drill persistently
fails to even see my aggregates:

0: jdbc:drill:zk=local> select sum_int(employee_id) from
cp.`employee.json`;

Jul 04, 2015 4:19:35 PM
org.apache.calcite.sql.validate.SqlValidatorException <init>
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
found for function signature sum_int(<ANY>)

Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException <init>
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
column 8 to line 1, column 27: No match found for function signature
sum_int(<ANY>)

Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
found for function signature sum_int(<ANY>)

[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010]
(state=,code=0)

0: jdbc:drill:zk=local> select sum_int(cast(employee_id as int)) from
cp.`employee.json`;

Jul 04, 2015 4:19:45 PM
org.apache.calcite.sql.validate.SqlValidatorException <init>
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
found for function signature sum_int(<NUMERIC>)

Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException <init>
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
column 8 to line 1, column 40: No match found for function signature
sum_int(<NUMERIC>)

Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
found for function signature sum_int(<NUMERIC>)

[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010]
(state=,code=0)

0: jdbc:drill:zk=local>


It looks like there is some undocumented subtlety about how to register an
aggregator.

On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <jb...@maprtech.com> wrote:

> I'm working on the same thing. I want to aggregate a list of values. It has
> been a search and guess game for the most part. I'm still stuck in the
> process of getting the values all into a list. The writers look interesting
> but for aggregation functions  it looks like the input is the param and
> output objects can't hold the aggregations steps. The Workspace is where
> that happens. If I try and use a Writer in a workspace it won't load and
> tells me to change it to Holders which was why I was using them to start
> with. Maybe I'm missing the architecture of the agg function. It looked
> like it was....
>
> @Param comes in -> initialize @Workspace vars in setup -> process data
> through @Workspace vars in add -> finalize @Output in output.
>
> So I'm back to trying to figure out how to create a RepeatedBigIntHolder or
> a RepeatedVarCharHolder...
>
>
>
> On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > I am working on trying to build any kind of list constructing aggregator
> > and having absolute fits.
> >
> > To simplify life, I decided to just build a generic list builder that is
> a
> > scalar function that returns a list containing its argument.  Thus
> zoop(3)
> > => [3], zoop('abc') => 'abc' and zoop([1,2,3]) => [[1,2,3]].
> >
> > The ComplexWriter looks like the place to go. As usual, the complete lack
> > of comments in most of Drill makes this very hard since I have to guess
> > what works and what doesn't.
> >
> > In my code, I note that ComplexWriter has a nice rootAsList() method.  I
> > used this in zip and it works nicely to construct lists for output.  I
> note
> > that the resulting ListWriter has a method copyReader(FieldReader var1)
> > which looks really good.
> >
> > Unfortunately, the only implementation of copyReader() is in
> > AbstractFieldWriter and it looks this:
> >
> > public void copyReader(FieldReader reader) {
> >     this.fail("Copy FieldReader");
> > }
> >
> > I would like to formally say at this point "WTF"?
> >
> > In digging in further, I see other methods that look handy like
> >
> > public void write(IntHolder holder) {
> >     this.fail("Int");
> > }
> >
> > And then in looking at implementations, it looks like there is a
> > combinatorial explosion because every type seems to need a write method
> for
> > every other type.
> >
> > What is the thought here?  How can I copy an arbitrary value into a list?
> >
> > My next thought was to build code that dispatches on type.  There is a
> > method called getType() on the FieldReader.  Unfortunately, that drives
> > into code generated by protoc and I see no way to dispatch on the type of
> > an incoming value.
> >
> >
> > How is this supposed to work?
> >
> >
> >
> >
> > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <ba...@gmail.com>
> wrote:
> >
> > > For a detailed example on using ComplexWriter interface you can take a
> > look
> > > at the Mappify
> > > <
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
> > > >
> > > (kvgen) function. The function itself is very simple however it makes
> use
> > > of the utility methods in MappifyUtility
> > > <
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
> > > >
> > > and MapUtility
> > > <
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
> > > >
> > > which perform most of the work.
> > >
> > > Currently we don't have a generic infrastructure to handle errors
> coming
> > > out of functions. However there is UserException, which when raised
> will
> > > make sure that Drill does not gobble up the error message in that
> > > exception. So you can probably throw a UserException with the failing
> > input
> > > in your function to make sure it propagates to the user.
> > >
> > > Thanks
> > > Mehant
> > >
> > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> > >
> > > > *Holders are for both input and output.  You can also use
> CompleWriter
> > > for
> > > > output and FieldReader for input if you want to write or read a
> complex
> > > > value.
> > > >
> > > > I don't think we've provided a really clean way to construct a
> > > > Repeated*Holder for output purposes.  You can probably do it by
> > reaching
> > > > into a bunch of internal interfaces in Drill.  However, I would
> > recommend
> > > > using the ComplexWriter output pattern for now.  This will be a
> little
> > > less
> > > > efficient but substantially less brittle.  I suggest you open up a
> jira
> > > for
> > > > using a Repeated*Holder as an output.
> > > >
> > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >
> > > > > Holders are for input, I think.
> > > > >
> > > > > Try the different kinds of writers.
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jb...@maprtech.com>
> > > wrote:
> > > > >
> > > > > > Using a repeatedholder as a @param I've got working. I was
> working
> > > on a
> > > > > > custom aggregator function using DrillAggFunc. In this I can do
> > > simple
> > > > > > things but If I want to build a list values and do something with
> > it
> > > in
> > > > > the
> > > > > > final output method I think I need to use RepeatedHolders in the
> > > > > > @Workspace. To do that I need to create a new one in the setup
> > > method.
> > > > I
> > > > > > can't get one built. They all require a BufferAllocator to be
> > passed
> > > in
> > > > > to
> > > > > > build it. I have not found a way to get an allocator yet. Any
> > > > > suggestions?
> > > > > >
> > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <
> ted.dunning@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > If you look at the zip function in
> > > > > > > https://github.com/mapr-demos/simple-drill-functions you can
> > have
> > > an
> > > > > > > example of building a structure.
> > > > > > >
> > > > > > > The basic idea is that your output is denoted as
> > > > > > >
> > > > > > >         @Output
> > > > > > >         BaseWriter.ComplexWriter writer;
> > > > > > >
> > > > > > > The pattern for building a list of lists of integers is like
> > this:
> > > > > > >
> > > > > > >         writer.setValueCount(n);
> > > > > > >         ...
> > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
> > > > > > >         outer.start(); // [ outer list
> > > > > > >         ...
> > > > > > >         // for each inner list
> > > > > > >             BaseWriter.ListWriter inner = outer.list();
> > > > > > >             inner.start();
> > > > > > >             // for each inner list element
> > > > > > >                 inner.integer().writeInt(accessor.get(i));
> > > > > > >             }
> > > > > > >             inner.end();   // ] inner list
> > > > > > >         }
> > > > > > >         outer.end(); // ] outer list
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <
> jbates@maprtech.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > I have working aggregation and simple UDFs. I've been trying
> to
> > > > > > document
> > > > > > > > and understand each of the options available in a Drill UDF.
> > > > > > > Understanding
> > > > > > > > the different FunctionScope's, the ones that are allowed, the
> > > ones
> > > > > that
> > > > > > > are
> > > > > > > > not. The impact of different cost categories. The different
> > > steps
> > > > > > needed
> > > > > > > > to understand handling any of the supported data types  and
> > > > > structures
> > > > > > in
> > > > > > > > drill.
> > > > > > > >
> > > > > > > > Here are a few of my current road blocks. Any pointers would
> be
> > > > > greatly
> > > > > > > > appreciated.
> > > > > > > >
> > > > > > > >
> > > > > > > >    1. I've been trying to understand how to correctly use
> > > > > > RepeatedHolders
> > > > > > > >    of whatever type. For this discussion lets start with a
> > > > > > > >    RepeatedBigIntHolder. I'm trying to figure out the best
> way
> > to
> > > > > > create
> > > > > > > a
> > > > > > > > new
> > > > > > > >    one. I have not figured out where in the existing drill
> code
> > > > > someone
> > > > > > > > does
> > > > > > > >    this. If I use a  RepeatedBigIntHolder as a Workspace
> object
> > > is
> > > > is
> > > > > > > null
> > > > > > > > to
> > > > > > > >    start with. I created a new one in the startup section of
> > the
> > > > udf
> > > > > > but
> > > > > > > > the
> > > > > > > >    vector was null. I can find no reference in creating a new
> > > > > > > BigIntVector.
> > > > > > > >    There is a way to create a BigIntVector and I did find an
> > > > example
> > > > > of
> > > > > > > >    creating a new VarCharVector but I can't do that using the
> > > drill
> > > > > jar
> > > > > > > > files
> > > > > > > >    from 1.0. The org.apache.drill.common.types.TypeProtos and
> > > > > > > >    the org.apache.drill.common.types.TypeProtos.MinorType
> > classes
> > > > do
> > > > > > not
> > > > > > > >    appear to be accessible from the drill jar files.
> > > > > > > >    2. What is the best way to close out a UDF in the event it
> > > > > generates
> > > > > > > an
> > > > > > > >    exception? Are there specific steps one should follow to
> > make
> > > a
> > > > > > clean
> > > > > > > > exit
> > > > > > > >    in a catch block that are beneficial to Drill?
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
I'm working on the same thing. I want to aggregate a list of values. It has
been a search and guess game for the most part. I'm still stuck in the
process of getting the values all into a list. The writers look interesting,
but for aggregation functions it looks like the @Param input and the @Output
objects can't hold the intermediate aggregation steps. The Workspace is where
that happens. If I try to use a Writer in a workspace it won't load and
tells me to change it to Holders, which was why I was using them to start
with. Maybe I'm missing the architecture of the agg function. It looked
like it was:

@Param comes in -> initialize @Workspace vars in setup -> process data
through @Workspace vars in add -> finalize @Output in output.

So I'm back to trying to figure out how to create a RepeatedBigIntHolder or
a RepeatedVarCharHolder...
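
To spell out the flow I am assuming, here is a plain-Java sketch with no
Drill dependencies. AggSketch, ListCollector and AggDriver are names I made
up for illustration; AggSketch only mirrors the call order of the four
phases and is not Drill's DrillAggFunc interface. The ArrayList workspace is
exactly what Drill will not accept, which is the part I am stuck on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror of the four aggregate-function phases; only the
// call order is modeled here, not any real Drill interface.
interface AggSketch<I, O> {
    void setup();        // initialize workspace state before any input
    void add(I input);   // fold one input value into the workspace
    O output();          // produce the result for the finished group
    void reset();        // clear workspace state before the next group
}

// Collects every value of a group into a list held in the workspace.
final class ListCollector implements AggSketch<Long, List<Long>> {
    private List<Long> workspace;

    public void setup() { workspace = new ArrayList<>(); }
    public void add(Long v) { workspace.add(v); }
    public List<Long> output() { return new ArrayList<>(workspace); }
    public void reset() { workspace.clear(); }
}

final class AggDriver {
    // Drives an aggregate over pre-grouped input: setup once, then
    // add* -> output -> reset for each group.
    static <I, O> List<O> run(AggSketch<I, O> agg, List<List<I>> groups) {
        List<O> results = new ArrayList<>();
        agg.setup();
        for (List<I> group : groups) {
            for (I value : group) {
                agg.add(value);
            }
            results.add(agg.output());
            agg.reset();
        }
        return results;
    }
}
```

In this sketch output() copies the workspace before reset() clears it, so
the result of one group is not wiped when the next group starts.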



On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <te...@gmail.com> wrote:

> I am working on trying to build any kind of list constructing aggregator
> and having absolute fits.
>
> To simplify life, I decided to just build a generic list builder that is a
> scalar function that returns a list containing its argument.  Thus zoop(3)
> => [3], zoop('abc') => 'abc' and zoop([1,2,3]) => [[1,2,3]].
>
> The ComplexWriter looks like the place to go. As usual, the complete lack
> of comments in most of Drill makes this very hard since I have to guess
> what works and what doesn't.
>
> In my code, I note that ComplexWriter has a nice rootAsList() method.  I
> used this in zip and it works nicely to construct lists for output.  I note
> that the resulting ListWriter has a method copyReader(FieldReader var1)
> which looks really good.
>
> Unfortunately, the only implementation of copyReader() is in
> AbstractFieldWriter and it looks this:
>
> public void copyReader(FieldReader reader) {
>     this.fail("Copy FieldReader");
> }
>
> I would like to formally say at this point "WTF"?
>
> In digging in further, I see other methods that look handy like
>
> public void write(IntHolder holder) {
>     this.fail("Int");
> }
>
> And then in looking at implementations, it looks like there is a
> combinatorial explosion because every type seems to need a write method for
> every other type.
>
> What is the thought here?  How can I copy an arbitrary value into a list?
>
> My next thought was to build code that dispatches on type.  There is a
> method called getType() on the FieldReader.  Unfortunately, that drives
> into code generated by protoc and I see no way to dispatch on the type of
> an incoming value.
>
>
> How is this supposed to work?

Re: Some questions on UDFs

Posted by Ted Dunning <te...@gmail.com>.
I am working on trying to build any kind of list constructing aggregator
and having absolute fits.

To simplify life, I decided to just build a generic list builder: a
scalar function that returns a list containing its argument.  Thus zoop(3)
=> [3], zoop('abc') => ['abc'] and zoop([1,2,3]) => [[1,2,3]].

The ComplexWriter looks like the place to go. As usual, the complete lack
of comments in most of Drill makes this very hard since I have to guess
what works and what doesn't.

In my code, I note that ComplexWriter has a nice rootAsList() method.  I
used this in zip and it works nicely to construct lists for output.  I note
that the resulting ListWriter has a method copyReader(FieldReader var1)
which looks really good.

Unfortunately, the only implementation of copyReader() is in
AbstractFieldWriter and it looks like this:

public void copyReader(FieldReader reader) {
    this.fail("Copy FieldReader");
}

I would like to formally say at this point "WTF"?

In digging in further, I see other methods that look handy like

public void write(IntHolder holder) {
    this.fail("Int");
}

And then in looking at implementations, it looks like there is a
combinatorial explosion because every type seems to need a write method for
every other type.

What is the thought here?  How can I copy an arbitrary value into a list?

My next thought was to build code that dispatches on type.  There is a
method called getType() on the FieldReader.  Unfortunately, that drives
into code generated by protoc and I see no way to dispatch on the type of
an incoming value.


How is this supposed to work?
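
For readers following the thread, the "dispatch on type" idea can be sketched
without any Drill classes. Everything below is a stand-in: SketchMinorType,
SketchValue and ZoopSketch are hypothetical names invented for illustration,
not Drill's FieldReader or its protoc-generated type classes. The sketch only
assumes that a value exposes an enum-like type tag that a switch can branch
on, and it sidesteps the one-write-method-per-type explosion by handling each
type in a single switch.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a generated type enum (hypothetical, not Drill's class).
enum SketchMinorType { INT, BIGINT, VARCHAR, LIST }

// Stand-in for a typed value reader (hypothetical, not Drill's FieldReader).
final class SketchValue {
    final SketchMinorType type;
    final Object payload;

    SketchValue(SketchMinorType type, Object payload) {
        this.type = type;
        this.payload = payload;
    }
}

final class ZoopSketch {
    // Copies one value of any supported type into the output list,
    // choosing the branch from the value's type tag.
    static void copyInto(List<Object> out, SketchValue v) {
        switch (v.type) {
            case INT:
                out.add((Integer) v.payload);
                break;
            case BIGINT:
                out.add((Long) v.payload);
                break;
            case VARCHAR:
                out.add((String) v.payload);
                break;
            case LIST:
                // Nested lists recurse element by element.
                List<Object> inner = new ArrayList<>();
                for (Object element : (List<?>) v.payload) {
                    copyInto(inner, (SketchValue) element);
                }
                out.add(inner);
                break;
            default:
                throw new IllegalArgumentException("Unsupported type: " + v.type);
        }
    }

    // zoop(x) => [x]
    static List<Object> zoop(SketchValue v) {
        List<Object> out = new ArrayList<>();
        copyInto(out, v);
        return out;
    }
}
```

Whether the same shape carries over to a real FieldReader depends on the type
returned by getType() being usable in a switch, which this sketch assumes
rather than demonstrates.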




On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <ba...@gmail.com> wrote:

> For a detailed example on using ComplexWriter interface you can take a look
> at the Mappify
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java
> >
> (kvgen) function. The function itself is very simple however it makes use
> of the utility methods in MappifyUtility
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java
> >
> and MapUtility
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java
> >
> which perform most of the work.
>
> Currently we don't have a generic infrastructure to handle errors coming
> out of functions. However there is UserException, which when raised will
> make sure that Drill does not gobble up the error message in that
> exception. So you can probably throw a UserException with the failing input
> in your function to make sure it propagates to the user.
>
> Thanks
> Mehant

Re: Some questions on UDFs

Posted by mehant baid <ba...@gmail.com>.
For a detailed example of using the ComplexWriter interface, you can take a
look at the Mappify
<https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
(kvgen) function. The function itself is very simple; however, it makes use
of the utility methods in MappifyUtility
<https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
and MapUtility
<https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
which perform most of the work.

Currently we don't have a generic infrastructure to handle errors coming
out of functions. However, there is UserException, which, when raised, will
make sure that Drill does not gobble up the error message in that
exception. So you can probably throw a UserException that includes the
failing input in your function to make sure it propagates to the user.
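
A minimal sketch of that advice, kept free of Drill classes so it runs on its
own: SketchUserError is a hypothetical stand-in for UserException, and
ParseLongSketch is an invented example function. The only point illustrated
is catching the low-level failure and rethrowing with the failing input in
the message.

```java
// Hypothetical stand-in for an exception type whose message the engine
// passes through to the user (the role the thread assigns to UserException).
final class SketchUserError extends RuntimeException {
    SketchUserError(String message, Throwable cause) {
        super(message, cause);
    }
}

final class ParseLongSketch {
    // Body of a hypothetical scalar function that parses text to a long.
    static long eval(String input) {
        try {
            return Long.parseLong(input.trim());
        } catch (NumberFormatException e) {
            // Include the offending input so the user can find the bad row.
            throw new SketchUserError(
                "parse_long: cannot parse input '" + input + "'", e);
        }
    }
}
```

Keeping the original exception as the cause preserves the stack trace for
whoever debugs the function, while the message stays readable for the user.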

Thanks
Mehant

On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <ja...@apache.org> wrote:

> *Holders are for both input and output.  You can also use ComplexWriter for
> output and FieldReader for input if you want to write or read a complex
> value.
>
> I don't think we've provided a really clean way to construct a
> Repeated*Holder for output purposes.  You can probably do it by reaching
> into a bunch of internal interfaces in Drill.  However, I would recommend
> using the ComplexWriter output pattern for now.  This will be a little less
> efficient but substantially less brittle.  I suggest you open up a jira for
> using a Repeated*Holder as an output.

Re: Some questions on UDFs

Posted by Jacques Nadeau <ja...@apache.org>.
*Holders are for both input and output.  You can also use ComplexWriter for
output and FieldReader for input if you want to write or read a complex
value.

I don't think we've provided a really clean way to construct a
Repeated*Holder for output purposes.  You can probably do it by reaching
into a bunch of internal interfaces in Drill.  However, I would recommend
using the ComplexWriter output pattern for now.  This will be a little less
efficient but substantially less brittle.  I suggest you open up a jira for
using a Repeated*Holder as an output.

On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <te...@gmail.com> wrote:

> Holders are for input, I think.
>
> Try the different kinds of writers.

Re: Some questions on UDFs

Posted by Ted Dunning <te...@gmail.com>.
Holders are for input, I think.

Try the different kinds of writers.



On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <jb...@maprtech.com> wrote:

> Using a repeatedholder as a @param I've got working. I was working on a
> custom aggregator function using DrillAggFunc. In this I can do simple
> things but if I want to build a list of values and do something with it in the
> final output method I think I need to use RepeatedHolders in the
> @Workspace. To do that I need to create a new one in the setup method. I
> can't get one built. They all require a BufferAllocator to be passed in to
> build it. I have not found a way to get an allocator yet. Any suggestions?

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
I've got a RepeatedHolder working as a @Param. I was working on a
custom aggregator function using DrillAggFunc. In this I can do simple
things, but if I want to build a list of values and do something with it in
the final output method, I think I need to use RepeatedHolders in the
@Workspace. To do that I need to create a new one in the setup method, and I
can't get one built. They all require a BufferAllocator to be passed in to
build them. I have not found a way to get an allocator yet. Any suggestions?

On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <te...@gmail.com> wrote:

> If you look at the zip function in
> https://github.com/mapr-demos/simple-drill-functions you can have an
> example of building a structure.
>
> The basic idea is that your output is denoted as
>
>         @Output
>         BaseWriter.ComplexWriter writer;
>
> The pattern for building a list of lists of integers is like this:
>
>         writer.setValueCount(n);
>         ...
>         BaseWriter.ListWriter outer = writer.rootAsList();
>         outer.start(); // [ outer list
>         ...
>         for (...) { // for each inner list
>             BaseWriter.ListWriter inner = outer.list();
>             inner.start(); // [ inner list
>             for (...) { // for each inner list element
>                 inner.integer().writeInt(accessor.get(i));
>             }
>             inner.end();   // ] inner list
>         }
>         outer.end(); // ] outer list

Re: Some questions on UDFs

Posted by Ted Dunning <te...@gmail.com>.
If you look at the zip function in
https://github.com/mapr-demos/simple-drill-functions you can find an
example of building a structure.

The basic idea is that your output is denoted as

        @Output
        BaseWriter.ComplexWriter writer;

The pattern for building a list of lists of integers is like this:

        writer.setValueCount(n);
        ...
        BaseWriter.ListWriter outer = writer.rootAsList();
        outer.start(); // [ outer list
        ...
        for (...) { // for each inner list
            BaseWriter.ListWriter inner = outer.list();
            inner.start(); // [ inner list
            for (...) { // for each inner list element
                inner.integer().writeInt(accessor.get(i));
            }
            inner.end();   // ] inner list
        }
        outer.end(); // ] outer list
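
The nesting of start()/end() calls above can be tried out in isolation with a
small stand-in. SketchListWriter below is hypothetical: it mimics only the
start/list/end call shape and collects values into plain Java lists, and it
uses a direct writeInt(...) where the pattern above goes through
integer().writeInt(...). It is a sketch of the control flow, not of Drill's
writer implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics only the start/list/end call shape of
// the ListWriter used above; it collects values into plain Java lists.
final class SketchListWriter {
    private final List<Object> values = new ArrayList<>();
    private boolean open = false;

    void start() { open = true; }

    void end() { open = false; }

    // Opens a child list as the next element of this list.
    SketchListWriter list() {
        requireOpen();
        SketchListWriter child = new SketchListWriter();
        values.add(child.values);
        return child;
    }

    void writeInt(int v) {
        requireOpen();
        values.add(v);
    }

    List<Object> result() { return values; }

    private void requireOpen() {
        if (!open) {
            throw new IllegalStateException("write attempted outside start()/end()");
        }
    }
}

final class NestedListSketch {
    // Writes rows such as {{1, 2}, {3}} as the nested list [[1, 2], [3]].
    static List<Object> write(int[][] rows) {
        SketchListWriter outer = new SketchListWriter();
        outer.start();                      // [ outer list
        for (int[] row : rows) {            // for each inner list
            SketchListWriter inner = outer.list();
            inner.start();                  // [ inner list
            for (int value : row) {         // for each inner list element
                inner.writeInt(value);
            }
            inner.end();                    // ] inner list
        }
        outer.end();                        // ] outer list
        return outer.result();
    }
}
```

The open/closed check is an addition of the sketch: it makes a write outside
a start()/end() pair fail loudly, which is the kind of mistake the bracket
comments in the pattern are there to prevent.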



On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <jb...@maprtech.com> wrote:

> I have working aggregation and simple UDFs. I've been trying to document
> and understand each of the options available in a Drill UDF. Understanding
> the different FunctionScope's, the ones that are allowed, the ones that are
> not. The impact of different cost categories. The different  steps needed
> to understand handling any of the supported data types  and structures in
> drill.
>
> Here are a few of my current road blocks. Any pointers would be greatly
> appreciated.
>
>
>    1. I've been trying to understand how to correctly use RepeatedHolders
>    of whatever type. For this discussion let's start with a
>    RepeatedBigIntHolder. I'm trying to figure out the best way to create a
>    new one. I have not figured out where in the existing drill code someone
>    does this. If I use a RepeatedBigIntHolder as a Workspace object it is
>    null to start with. I created a new one in the startup section of the udf
>    but the vector was null. I can find no reference in creating a new
>    BigIntVector. There is a way to create a BigIntVector and I did find an
>    example of creating a new VarCharVector but I can't do that using the
>    drill jar files from 1.0. The org.apache.drill.common.types.TypeProtos and
>    the org.apache.drill.common.types.TypeProtos.MinorType classes do not
>    appear to be accessible from the drill jar files.
>    2. What is the best way to close out a UDF in the event it generates an
>    exception? Are there specific steps one should follow to make a clean
>    exit in a catch block that are beneficial to Drill?
>

Re: Some questions on UDFs

Posted by Jim Bates <jb...@maprtech.com>.
Found the TypeProtos in the drill-protocol jar.
