You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Ted Dunning <te...@gmail.com> on 2019/08/12 14:35:06 UTC

complex data structure aggregators?

What is the current state of building aggregators that have complex state
via UDFs?

Is it possible to define multi-level aggregators in a UDF?

Can the output of a UDF be a byte array?


(these are three different questions)

Re: complex data structure aggregators?

Posted by Ted Dunning <te...@gmail.com>.
Can UDFs accumulate a fixed length binary value?



On Mon, Aug 12, 2019 at 11:23 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Ted,
>
> Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), can't be done in Drill.
>
> Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
>
> UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
>
> Thanks,
> - Paul
>
>
>
>     On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>  Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> a binary result?
>
> Also, as a nit, standard deviation doesn't require buffering all the data.
> It just requires that you have three accumulators, one for count, one for
> mean and one for mean squared deviation.  There is a slightly tricky
> algorithm called Welford's algorithm
> <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> which
> allows good numerical stability while computing this on-line.
>
> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > Hi Ted,
> >
> > Last I checked (when we wrote the book chapter on the subject), aggregate
> > state are limited to scalars and Drill-defined types. There is no support
> > to spill aggregate state, so that state will be lost if spilling is
> > required to handle large aggregate batches. The current solution works
> for
> > simple cases such as totals and averages.
> >
> > Aggregate UDFs share no state, so it is not possible for one function to
> > use state accumulated by another. If, for example, you want sum, average
> > and standard deviation, you'll have to accumulate the total three times,
> > average twice, and so on. Note that the std dev function will require
> > buffering all data in one's own array (without any spilling or other
> > support), to allow computing the (X-bar - X)^2 part of the calculation.
> >
> > A UDF can emit a byte array (have to check it this is true of aggregate
> > UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> > VarChar.
> >
> > All this is from memory and so is only approximately accurate. YMMV.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> > ted.dunning@gmail.com> wrote:
> >
> >  What is the current state of building aggregators that have complex
> state
> > via UDFs?
> >
> > Is it possible to define multi-level aggregators in a UDF?
> >
> > Can the output of a UDF be a byte array?
> >
> >
> > (these are three different questions)
> >
>

Re: complex data structure aggregators?

Posted by Ted Dunning <te...@gmail.com>.
Charles,

That might work. The t-digest will give us a median estimate.



On Mon, Aug 12, 2019 at 4:33 PM Charles Givre <cg...@gmail.com> wrote:

> HI Ted,
> You might want to take a look at this repo:
> https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
> <
> https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
> >
> This was an experiment to see if I could write a function to calculate a
> median.  I found a streaming algorithm to do so, but it required the use of
> two stacks.  This was more of a "can I do this" type challenge than a "will
> this really work well" but I did get it to work.  In any event, the way I
> did it was to use the @Workspace and use an ObjectHolder.  Maybe this will
> help you out.
> -- C
>
>
> > On Aug 12, 2019, at 6:03 PM, Paul Rogers <pa...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Ted,
> >
> > You are now at the point that you'll have to experiment. Drill provides
> an annotation for aggregate state:  @Workspace. The value must be declared
> as a "holder". You'll have to check if VarBinaryHolder is allowed, and, if
> so, how you allocate memory and remember the offset into the array. (My
> guess is that this may not work.)
> > @Workspace does allow you to specify a holder for a Java object, but
> such objects won't be spilled to disk when, say, the hash aggregate spills.
> This means your aggregate will work fine at small scale, then mysteriously
> fail once moved into production. Fun.
> >
> > Unless aggregate UDFs are special, they can return a VarChar or
> VarBinary result. The book explains how to do this for VarChar, some poking
> around in the Drill source should identify how to do so for VarBinary.
> (There are crufty details about allocating space, copying over data, etc.)
> >
> > FWIW: There is a pile of information on UDF internals on my GitHub Wiki.
> [1] Aggregate UDFS are covered in [2]. Once we learn the answers to your
> specific questions, we can add the info to the Wiki.
> >
> > Thanks,
> > - Paul
> >
> > [1]
> https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
> >
> >
> > [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
> >
> >
> >
> >
> >
> >
> >    On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
> >
> > I am trying to figure out how to build an approximate percentile
> estimator.
> >
> > I have a fancy data structure that will do this. It can live in bounded
> > memory with no allocation. I can add numbers to the digest easily enough.
> > And the required results can be extracted from the structure.
> >
> > What I would need to know:
> >
> > - how to use a fixed array of bytes as the state of an aggregating UDF
> >
> > - how to pass in an argument to an aggregator OR (better) how to use the
> > binary result of an aggregator in another function.
> >
> > On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cg...@gmail.com> wrote:
> >
> >> Ted,
> >> Can we ask what it is you are trying to build a UDF for?
> >> --C
> >>
> >>> On Aug 12, 2019, at 2:23 PM, Paul Rogers <pa...@yahoo.com.INVALID>
> >> wrote:
> >>>
> >>> Hi Ted,
> >>>
> >>> Thanks for the link; I suspected there was some trick for stddev. The
> >> point still stands that, if the algorithm requires multiple passes over
> the
> >> data (ML, say), can't be done in Drill.
> >>>
> >>> Each UDF must return exactly one value. It can return a map if you want
> >> multiple values (though someone would have to check that projection
> works
> >> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> >> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> >> accumulate a VarChar or VarBinary because Drill cannot insert values
> into
> >> an existing variable-length vector.
> >>>
> >>> UDFs need your knack for finding a workaround to get your job done;
> they
> >> have pretty strong limitations on the surface.
> >>>
> >>> Thanks,
> >>> - Paul
> >>>
> >>>
> >>>
> >>>     On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> >> ted.dunning@gmail.com> wrote:
> >>>
> >>> Is it possible for a UDF to produce multiple scalar results? Can it
> >> produce
> >>> a binary result?
> >>>
> >>> Also, as a nit, standard deviation doesn't require buffering all the
> >> data.
> >>> It just requires that you have three accumulators, one for count, one
> for
> >>> mean and one for mean squared deviation.  There is a slightly tricky
> >>> algorithm called Welford's algorithm
> >>> <
> >>
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >>>
> >>> which
> >>> allows good numerical stability while computing this on-line.
> >>>
> >>> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <par0328@yahoo.com.invalid
> >
> >>> wrote:
> >>>
> >>>> Hi Ted,
> >>>>
> >>>> Last I checked (when we wrote the book chapter on the subject),
> >> aggregate
> >>>> state are limited to scalars and Drill-defined types. There is no
> >> support
> >>>> to spill aggregate state, so that state will be lost if spilling is
> >>>> required to handle large aggregate batches. The current solution works
> >> for
> >>>> simple cases such as totals and averages.
> >>>>
> >>>> Aggregate UDFs share no state, so it is not possible for one function
> to
> >>>> use state accumulated by another. If, for example, you want sum,
> average
> >>>> and standard deviation, you'll have to accumulate the total three
> times,
> >>>> average twice, and so on. Note that the std dev function will require
> >>>> buffering all data in one's own array (without any spilling or other
> >>>> support), to allow computing the (X-bar - X)^2 part of the
> calculation.
> >>>>
> >>>> A UDF can emit a byte array (have to check it this is true of
> aggregate
> >>>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit
> a
> >>>> VarChar.
> >>>>
> >>>> All this is from memory and so is only approximately accurate. YMMV.
> >>>>
> >>>> Thanks,
> >>>> - Paul
> >>>>
> >>>>
> >>>>
> >>>>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> >>>> ted.dunning@gmail.com> wrote:
> >>>>
> >>>>   What is the current state of building aggregators that have complex
> >> state
> >>>> via UDFs?
> >>>>
> >>>> Is it possible to define multi-level aggregators in a UDF?
> >>>>
> >>>> Can the output of a UDF be a byte array?
> >>>>
> >>>>
> >>>> (these are three different questions)
> >>>>
> >>
> >>
>
>

Re: complex data structure aggregators?

Posted by Charles Givre <cg...@gmail.com>.
HI Ted, 
You might want to take a look at this repo: https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java <https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java>
This was an experiment to see if I could write a function to calculate a median.  I found a streaming algorithm to do so, but it required the use of two stacks.  This was more of a "can I do this" type challenge than a "will this really work well" but I did get it to work.  In any event, the way I did it was to use the @Workspace and use an ObjectHolder.  Maybe this will help you out.
-- C


> On Aug 12, 2019, at 6:03 PM, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> Hi Ted,
> 
> You are now at the point that you'll have to experiment. Drill provides an annotation for aggregate state:  @Workspace. The value must be declared as a "holder". You'll have to check if VarBinaryHolder is allowed, and, if so, how you allocate memory and remember the offset into the array. (My guess is that this may not work.)
> @Workspace does allow you to specify a holder for a Java object, but such objects won't be spilled to disk when, say, the hash aggregate spills. This means your aggregate will work fine at small scale, then mysteriously fail once moved into production. Fun.
> 
> Unless aggregate UDFs are special, they can return a VarChar or VarBinary result. The book explains how to do this for VarChar, some poking around in the Drill source should identify how to do so for VarBinary. (There are crufty details about allocating space, copying over data, etc.)
> 
> FWIW: There is a pile of information on UDF internals on my GitHub Wiki. [1] Aggregate UDFS are covered in [2]. Once we learn the answers to your specific questions, we can add the info to the Wiki.
> 
> Thanks,
> - Paul
> 
> [1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
> 
> 
> [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
> 
> 
> 
> 
> 
> 
>    On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning <te...@gmail.com> wrote:  
> 
> I am trying to figure out how to build an approximate percentile estimator.
> 
> I have a fancy data structure that will do this. It can live in bounded
> memory with no allocation. I can add numbers to the digest easily enough.
> And the required results can be extracted from the structure.
> 
> What I would need to know:
> 
> - how to use a fixed array of bytes as the state of an aggregating UDF
> 
> - how to pass in an argument to an aggregator OR (better) how to use the
> binary result of an aggregator in another function.
> 
> On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cg...@gmail.com> wrote:
> 
>> Ted,
>> Can we ask what it is you are trying to build a UDF for?
>> --C
>> 
>>> On Aug 12, 2019, at 2:23 PM, Paul Rogers <pa...@yahoo.com.INVALID>
>> wrote:
>>> 
>>> Hi Ted,
>>> 
>>> Thanks for the link; I suspected there was some trick for stddev. The
>> point still stands that, if the algorithm requires multiple passes over the
>> data (ML, say), can't be done in Drill.
>>> 
>>> Each UDF must return exactly one value. It can return a map if you want
>> multiple values (though someone would have to check that projection works
>> to convert these to scalar top-level values). AFAIK, a UDF can produce a
>> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
>> accumulate a VarChar or VarBinary because Drill cannot insert values into
>> an existing variable-length vector.
>>> 
>>> UDFs need your knack for finding a workaround to get your job done; they
>> have pretty strong limitations on the surface.
>>> 
>>> Thanks,
>>> - Paul
>>> 
>>> 
>>> 
>>>     On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
>> ted.dunning@gmail.com> wrote:
>>> 
>>> Is it possible for a UDF to produce multiple scalar results? Can it
>> produce
>>> a binary result?
>>> 
>>> Also, as a nit, standard deviation doesn't require buffering all the
>> data.
>>> It just requires that you have three accumulators, one for count, one for
>>> mean and one for mean squared deviation.  There is a slightly tricky
>>> algorithm called Welford's algorithm
>>> <
>> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
>>> 
>>> which
>>> allows good numerical stability while computing this on-line.
>>> 
>>> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
>>> wrote:
>>> 
>>>> Hi Ted,
>>>> 
>>>> Last I checked (when we wrote the book chapter on the subject),
>> aggregate
>>>> state are limited to scalars and Drill-defined types. There is no
>> support
>>>> to spill aggregate state, so that state will be lost if spilling is
>>>> required to handle large aggregate batches. The current solution works
>> for
>>>> simple cases such as totals and averages.
>>>> 
>>>> Aggregate UDFs share no state, so it is not possible for one function to
>>>> use state accumulated by another. If, for example, you want sum, average
>>>> and standard deviation, you'll have to accumulate the total three times,
>>>> average twice, and so on. Note that the std dev function will require
>>>> buffering all data in one's own array (without any spilling or other
>>>> support), to allow computing the (X-bar - X)^2 part of the calculation.
>>>> 
>>>> A UDF can emit a byte array (have to check it this is true of aggregate
>>>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
>>>> VarChar.
>>>> 
>>>> All this is from memory and so is only approximately accurate. YMMV.
>>>> 
>>>> Thanks,
>>>> - Paul
>>>> 
>>>> 
>>>> 
>>>>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
>>>> ted.dunning@gmail.com> wrote:
>>>> 
>>>>   What is the current state of building aggregators that have complex
>> state
>>>> via UDFs?
>>>> 
>>>> Is it possible to define multi-level aggregators in a UDF?
>>>> 
>>>> Can the output of a UDF be a byte array?
>>>> 
>>>> 
>>>> (these are three different questions)
>>>> 
>> 
>> 


Re: complex data structure aggregators?

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Ted,

You are now at the point that you'll have to experiment. Drill provides an annotation for aggregate state:  @Workspace. The value must be declared as a "holder". You'll have to check if VarBinaryHolder is allowed, and, if so, how you allocate memory and remember the offset into the array. (My guess is that this may not work.)
@Workspace does allow you to specify a holder for a Java object, but such objects won't be spilled to disk when, say, the hash aggregate spills. This means your aggregate will work fine at small scale, then mysteriously fail once moved into production. Fun.

Unless aggregate UDFs are special, they can return a VarChar or VarBinary result. The book explains how to do this for VarChar, some poking around in the Drill source should identify how to do so for VarBinary. (There are crufty details about allocating space, copying over data, etc.)

FWIW: There is a pile of information on UDF internals on my GitHub Wiki. [1] Aggregate UDFS are covered in [2]. Once we learn the answers to your specific questions, we can add the info to the Wiki.

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information


[2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs




 

    On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning <te...@gmail.com> wrote:  
 
 I am trying to figure out how to build an approximate percentile estimator.

I have a fancy data structure that will do this. It can live in bounded
memory with no allocation. I can add numbers to the digest easily enough.
And the required results can be extracted from the structure.

What I would need to know:

- how to use a fixed array of bytes as the state of an aggregating UDF

- how to pass in an argument to an aggregator OR (better) how to use the
binary result of an aggregator in another function.

On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cg...@gmail.com> wrote:

> Ted,
> Can we ask what it is you are trying to build a UDF for?
> --C
>
> > On Aug 12, 2019, at 2:23 PM, Paul Rogers <pa...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Ted,
> >
> > Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), can't be done in Drill.
> >
> > Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
> >
> > UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
> >
> > Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> > a binary result?
> >
> > Also, as a nit, standard deviation doesn't require buffering all the
> data.
> > It just requires that you have three accumulators, one for count, one for
> > mean and one for mean squared deviation.  There is a slightly tricky
> > algorithm called Welford's algorithm
> > <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> > which
> > allows good numerical stability while computing this on-line.
> >
> > On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Ted,
> >>
> >> Last I checked (when we wrote the book chapter on the subject),
> aggregate
> >> state are limited to scalars and Drill-defined types. There is no
> support
> >> to spill aggregate state, so that state will be lost if spilling is
> >> required to handle large aggregate batches. The current solution works
> for
> >> simple cases such as totals and averages.
> >>
> >> Aggregate UDFs share no state, so it is not possible for one function to
> >> use state accumulated by another. If, for example, you want sum, average
> >> and standard deviation, you'll have to accumulate the total three times,
> >> average twice, and so on. Note that the std dev function will require
> >> buffering all data in one's own array (without any spilling or other
> >> support), to allow computing the (X-bar - X)^2 part of the calculation.
> >>
> >> A UDF can emit a byte array (have to check it this is true of aggregate
> >> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> >> VarChar.
> >>
> >> All this is from memory and so is only approximately accurate. YMMV.
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>    On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> >> ted.dunning@gmail.com> wrote:
> >>
> >>  What is the current state of building aggregators that have complex
> state
> >> via UDFs?
> >>
> >> Is it possible to define multi-level aggregators in a UDF?
> >>
> >> Can the output of a UDF be a byte array?
> >>
> >>
> >> (these are three different questions)
> >>
>
>
  

Re: complex data structure aggregators?

Posted by Ted Dunning <te...@gmail.com>.
I am trying to figure out how to build an approximate percentile estimator.

I have a fancy data structure that will do this. It can live in bounded
memory with no allocation. I can add numbers to the digest easily enough.
And the required results can be extracted from the structure.

What I would need to know:

- how to use a fixed array of bytes as the state of an aggregating UDF

- how to pass in an argument to an aggregator OR (better) how to use the
binary result of an aggregator in another function.

On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cg...@gmail.com> wrote:

> Ted,
> Can we ask what it is you are trying to build a UDF for?
> --C
>
> > On Aug 12, 2019, at 2:23 PM, Paul Rogers <pa...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Ted,
> >
> > Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), can't be done in Drill.
> >
> > Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
> >
> > UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
> >
> > Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> > a binary result?
> >
> > Also, as a nit, standard deviation doesn't require buffering all the
> data.
> > It just requires that you have three accumulators, one for count, one for
> > mean and one for mean squared deviation.  There is a slightly tricky
> > algorithm called Welford's algorithm
> > <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> > which
> > allows good numerical stability while computing this on-line.
> >
> > On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Ted,
> >>
> >> Last I checked (when we wrote the book chapter on the subject),
> aggregate
> >> state are limited to scalars and Drill-defined types. There is no
> support
> >> to spill aggregate state, so that state will be lost if spilling is
> >> required to handle large aggregate batches. The current solution works
> for
> >> simple cases such as totals and averages.
> >>
> >> Aggregate UDFs share no state, so it is not possible for one function to
> >> use state accumulated by another. If, for example, you want sum, average
> >> and standard deviation, you'll have to accumulate the total three times,
> >> average twice, and so on. Note that the std dev function will require
> >> buffering all data in one's own array (without any spilling or other
> >> support), to allow computing the (X-bar - X)^2 part of the calculation.
> >>
> >> A UDF can emit a byte array (have to check it this is true of aggregate
> >> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> >> VarChar.
> >>
> >> All this is from memory and so is only approximately accurate. YMMV.
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> >> ted.dunning@gmail.com> wrote:
> >>
> >>   What is the current state of building aggregators that have complex
> state
> >> via UDFs?
> >>
> >> Is it possible to define multi-level aggregators in a UDF?
> >>
> >> Can the output of a UDF be a byte array?
> >>
> >>
> >> (these are three different questions)
> >>
>
>

Re: complex data structure aggregators?

Posted by Charles Givre <cg...@gmail.com>.
Ted, 
Can we ask what it is you are trying to build a UDF for?
--C

> On Aug 12, 2019, at 2:23 PM, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> Hi Ted,
> 
> Thanks for the link; I suspected there was some trick for stddev. The point still stands that, if the algorithm requires multiple passes over the data (ML, say), can't be done in Drill.
> 
> Each UDF must return exactly one value. It can return a map if you want multiple values (though someone would have to check that projection works to convert these to scalar top-level values). AFAIK, a UDF can produce a binary buffer as output (type VarBinary). But, an aggregate UDF cannot accumulate a VarChar or VarBinary because Drill cannot insert values into an existing variable-length vector.
> 
> UDFs need your knack for finding a workaround to get your job done; they have pretty strong limitations on the surface.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <te...@gmail.com> wrote:  
> 
> Is it possible for a UDF to produce multiple scalar results? Can it produce
> a binary result?
> 
> Also, as a nit, standard deviation doesn't require buffering all the data.
> It just requires that you have three accumulators, one for count, one for
> mean and one for mean squared deviation.  There is a slightly tricky
> algorithm called Welford's algorithm
> <https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
> which
> allows good numerical stability while computing this on-line.
> 
> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
> 
>> Hi Ted,
>> 
>> Last I checked (when we wrote the book chapter on the subject), aggregate
>> state are limited to scalars and Drill-defined types. There is no support
>> to spill aggregate state, so that state will be lost if spilling is
>> required to handle large aggregate batches. The current solution works for
>> simple cases such as totals and averages.
>> 
>> Aggregate UDFs share no state, so it is not possible for one function to
>> use state accumulated by another. If, for example, you want sum, average
>> and standard deviation, you'll have to accumulate the total three times,
>> average twice, and so on. Note that the std dev function will require
>> buffering all data in one's own array (without any spilling or other
>> support), to allow computing the (X-bar - X)^2 part of the calculation.
>> 
>> A UDF can emit a byte array (have to check it this is true of aggregate
>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
>> VarChar.
>> 
>> All this is from memory and so is only approximately accurate. YMMV.
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
>> ted.dunning@gmail.com> wrote:
>> 
>>   What is the current state of building aggregators that have complex state
>> via UDFs?
>> 
>> Is it possible to define multi-level aggregators in a UDF?
>> 
>> Can the output of a UDF be a byte array?
>> 
>> 
>> (these are three different questions)
>> 


Re: complex data structure aggregators?

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Ted,

Thanks for the link; I suspected there was some trick for stddev. The point still stands that, if the algorithm requires multiple passes over the data (ML, say), can't be done in Drill.

Each UDF must return exactly one value. It can return a map if you want multiple values (though someone would have to check that projection works to convert these to scalar top-level values). AFAIK, a UDF can produce a binary buffer as output (type VarBinary). But, an aggregate UDF cannot accumulate a VarChar or VarBinary because Drill cannot insert values into an existing variable-length vector.

UDFs need your knack for finding a workaround to get your job done; they have pretty strong limitations on the surface.

Thanks,
- Paul

 

    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <te...@gmail.com> wrote:  
 
 Is it possible for a UDF to produce multiple scalar results? Can it produce
a binary result?

Also, as a nit, standard deviation doesn't require buffering all the data.
It just requires that you have three accumulators, one for count, one for
mean and one for mean squared deviation.  There is a slightly tricky
algorithm called Welford's algorithm
<https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
which
allows good numerical stability while computing this on-line.

On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Ted,
>
> Last I checked (when we wrote the book chapter on the subject), aggregate
> state are limited to scalars and Drill-defined types. There is no support
> to spill aggregate state, so that state will be lost if spilling is
> required to handle large aggregate batches. The current solution works for
> simple cases such as totals and averages.
>
> Aggregate UDFs share no state, so it is not possible for one function to
> use state accumulated by another. If, for example, you want sum, average
> and standard deviation, you'll have to accumulate the total three times,
> average twice, and so on. Note that the std dev function will require
> buffering all data in one's own array (without any spilling or other
> support), to allow computing the (X-bar - X)^2 part of the calculation.
>
> A UDF can emit a byte array (have to check it this is true of aggregate
> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> VarChar.
>
> All this is from memory and so is only approximately accurate. YMMV.
>
> Thanks,
> - Paul
>
>
>
>    On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>  What is the current state of building aggregators that have complex state
> via UDFs?
>
> Is it possible to define multi-level aggregators in a UDF?
>
> Can the output of a UDF be a byte array?
>
>
> (these are three different questions)
>
  

Re: complex data structure aggregators?

Posted by Ted Dunning <te...@gmail.com>.
Is it possible for a UDF to produce multiple scalar results? Can it produce
a binary result?

Also, as a nit, standard deviation doesn't require buffering all the data.
It just requires that you have three accumulators, one for count, one for
mean and one for mean squared deviation.  There is a slightly tricky
algorithm called Welford's algorithm
<https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
which
allows good numerical stability while computing this on-line.

On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Ted,
>
> Last I checked (when we wrote the book chapter on the subject), aggregate
> state are limited to scalars and Drill-defined types. There is no support
> to spill aggregate state, so that state will be lost if spilling is
> required to handle large aggregate batches. The current solution works for
> simple cases such as totals and averages.
>
> Aggregate UDFs share no state, so it is not possible for one function to
> use state accumulated by another. If, for example, you want sum, average
> and standard deviation, you'll have to accumulate the total three times,
> average twice, and so on. Note that the std dev function will require
> buffering all data in one's own array (without any spilling or other
> support), to allow computing the (X-bar - X)^2 part of the calculation.
>
> A UDF can emit a byte array (have to check it this is true of aggregate
> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> VarChar.
>
> All this is from memory and so is only approximately accurate. YMMV.
>
> Thanks,
> - Paul
>
>
>
>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>  What is the current state of building aggregators that have complex state
> via UDFs?
>
> Is it possible to define multi-level aggregators in a UDF?
>
> Can the output of a UDF be a byte array?
>
>
> (these are three different questions)
>

Re: complex data structure aggregators?

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Ted,

Last I checked (when we wrote the book chapter on the subject), aggregate state are limited to scalars and Drill-defined types. There is no support to spill aggregate state, so that state will be lost if spilling is required to handle large aggregate batches. The current solution works for simple cases such as totals and averages.

Aggregate UDFs share no state, so it is not possible for one function to use state accumulated by another. If, for example, you want sum, average and standard deviation, you'll have to accumulate the total three times, average twice, and so on. Note that the std dev function will require buffering all data in one's own array (without any spilling or other support), to allow computing the (X-bar - X)^2 part of the calculation.

A UDF can emit a byte array (have to check it this is true of aggregate UDFs). A VarChar is simply a special kind of array, and UDFs can emit a VarChar.

All this is from memory and so is only approximately accurate. YMMV.

Thanks,
- Paul

 

    On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <te...@gmail.com> wrote:  
 
 What is the current state of building aggregators that have complex state
via UDFs?

Is it possible to define multi-level aggregators in a UDF?

Can the output of a UDF be a byte array?


(these are three different questions)