Posted to dev@asterixdb.apache.org by Riyafa Abdul Hameed <ri...@apache.org> on 2017/07/16 02:31:57 UTC

Creating aggregate functions

Dear all,

I am trying to create aggregate functions, and I see that there is more than
one function descriptor for a single function.
For example, the function array_count(collection) has the following
descriptors:


   - ScalarCountAggregateDescriptor
   - SerializableCountAggregateDescriptor
   - CountAggregateDescriptor

I am not sure I understand the difference between these. Could you
please provide an example or point me to a documentation entry so I can learn
how to properly implement aggregate functions?

The function I am trying to implement is ST_Extent.
<https://postgis.net/docs/manual-1.4/ST_Extent.html>

Thank you.

Yours sincerely,

Riyafa

Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@cse.mrt.ac.lk>.

Hi,

Is the creation of aggregate functions in AsterixDB based on some
programming model like MapReduce? If so, can you please suggest links for
learning it so that I can understand this better? I still do not get the
overall picture of how aggregate functions are created (perhaps
because creating normal functions is pretty straightforward as far as I
am concerned).

I started on the implementation here [1] and am stuck. I will try
again and update this commit.

[1]
https://github.com/riyafa/asterixdb/commit/dc437ddcc0ac175b20120047facca337e431fa92




-- 
Riyafa Abdul Hameed
Undergraduate, University of Moratuwa

Email: riyafa.12@cse.mrt.ac.lk
Website: https://riyafa.wordpress.com/ <http://riyafa.wordpress.com/>
<http://facebook.com/riyafa.ahf>  <http://lk.linkedin.com/in/riyafa>
<http://twitter.com/Riyafa1>

Re: Creating aggregate functions

Posted by Yingyi Bu <bu...@gmail.com>.

Sorry, a typo:

AVG:  that's the logical function in the logical plan.

On Sun, Jul 23, 2017 at 10:29 AM, Yingyi Bu <bu...@gmail.com> wrote:

> >> I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
>
> AVG:  that's the local function in the local plan.
> LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG:   think about distributed
> computation of average.  LOCAL_AVG aggregates the sum/count at the local
> data source, INTERMEDIATE_AVG aggregates the sum/count over partially
> aggregated sums/counts, and GLOBAL_AVG computes the final average value
> from intermediate sums/counts.
>
> Best,
> Yingyi
>
>
> On Sat, Jul 22, 2017 at 9:43 PM, Riyafa Abdul Hameed <
> riyafa.12@cse.mrt.ac.lk> wrote:
>
>> Hi,
>>
>> Thanks for the explanation.
>> But there are so many things I still don't understand. One of them is that
>> the avg function itself has several FunctionIdentifiers. What do they
>> all mean?
>>
>> I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
>>
>> What do they all mean?
>> Please help
>>
>> On 19 July 2017 at 21:56, Yingyi Bu <bu...@gmail.com> wrote:
>>
>> > Hi Riyafa,
>> >
>> >    >> ScalarCountAggregateDescriptor
>> >   It's used for counting a scalar array that appears inside a tuple.
>> >   For example:
>> >   SELECT u.id, array_count(u.friends)
>> >   FROM users u;
>> >
>> >    >> SerializableCountAggregateDescriptor
>> >    Serialized aggregation descriptor implementations are only used in
>> > hash-based group-by.
>> >    For example:
>> >    SELECT u.city, count(*)
>> >    FROM users u
>> >    /*+ hash */
>> >    GROUP BY u.city;
>> >
>> >   If your aggregation function doesn't have a fixed-byte-sized state,
>> > you don't need to worry about that or implement that.
>> >
>> >    >> CountAggregateDescriptor
>> >    This is used in group-by or global aggregate:
>> >    For example:
>> >    SELECT u.city, count(*)
>> >    FROM users u
>> >    GROUP BY u.city;
>> >
>> >    SELECT count(*) FROM users;
>> >
>> >
>> > Best,
>> > Yingyi
>> >
>> >
>> > On Wed, Jul 19, 2017 at 7:55 AM, Riyafa Abdul Hameed <riyafa@apache.org>
>> > wrote:
>> >
>> > > Hi again,
>> > >
>> > > Any suggestions on this? Or is there anyone I can reach out to who is
>> > > not on this list or not active on the list?
>> > >
>> > > Thank you.
>> > >
>> > > On 17 July 2017 at 17:18, Riyafa Abdul Hameed <ri...@apache.org> wrote:
>> > >
>> > > > Hi again,
>> > > >
>> > > > I think I understand how to write the descriptors in the packages
>> > > > org.apache.asterix.runtime.aggregates.std and
>> > > > org.apache.asterix.runtime.aggregates.scalar.
>> > > > But I am not sure I understand how to write the descriptor in the
>> > > > package org.apache.asterix.runtime.aggregates.serializable.std,
>> > > > because it requires setting a state in the init function that doesn't
>> > > > seem to have a pattern in the other descriptors.
>> > > > Also, I don't understand the reasons for implementing each of these
>> > > > descriptors for the aggregate functions.
>> > > >
>> > > > On 17 July 2017 at 16:56, Riyafa Abdul Hameed <riyafa.12@cse.mrt.ac.lk>
>> > > > wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> I meant any explanation on the implementation of aggregate functions
>> > > >> in AsterixDB would be highly appreciated.
>> > > >>
>> > > >> Thank you.
>> > > >> Yours sincerely,
>> > > >> Riyafa
>
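
As a rough illustration of the "fixed-byte-sized state" mentioned in the
quoted explanation of SerializableCountAggregateDescriptor, here is a minimal
Python sketch (Python rather than AsterixDB's Java, only for brevity). It
assumes that a hash-based group-by keeps one fixed-size slot of bytes per
group in a shared buffer, which is this sketch's reading and not something the
thread spells out. The names STATE, init, step and finish are hypothetical and
are not AsterixDB APIs.

```python
import struct

# Hypothetical fixed-size state: one big-endian 64-bit counter (always 8 bytes).
STATE = struct.Struct(">q")


def init(buf: bytearray, offset: int) -> None:
    """Write the initial state (count = 0) into the group's slot."""
    STATE.pack_into(buf, offset, 0)


def step(buf: bytearray, offset: int) -> None:
    """Read the state from the slot, update it, and write it back in place."""
    (count,) = STATE.unpack_from(buf, offset)
    STATE.pack_into(buf, offset, count + 1)


def finish(buf: bytearray, offset: int) -> int:
    """Produce the final result from the serialized state."""
    return STATE.unpack_from(buf, offset)[0]


# A toy hash-based group-by: each group owns a fixed-size slot in one buffer.
cities = ["Irvine", "Riverside", "Irvine", "Irvine"]
slots = {}          # group key -> offset of its slot in buf
buf = bytearray()
for city in cities:
    if city not in slots:
        slots[city] = len(buf)
        buf.extend(bytes(STATE.size))   # reserve a new zero-filled slot
        init(buf, slots[city])
    step(buf, slots[city])

counts = {city: finish(buf, offset) for city, offset in slots.items()}
```

Because every slot has the same size, the state can be updated in place
without moving other groups' data. A state whose size grows with the input
would not fit this pattern.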

Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@cse.mrt.ac.lk>.

Hi,

So if I want to check or debug how each of the different descriptors
for a function is used, I need a clustered setup. Am I right? I am not sure how
to debug in a clustered environment. The goal is to check how the array_avg()
function works and see if I understand it better.

Further, I see that the array_avg function seems to be translated into the
sql-avg function. I am not sure where that happens.

Sorry about all the silly questions.


Re: Creating aggregate functions

Posted by Preston Carman <pr...@apache.org>.

When dealing with aggregates and query plans, I find it helpful to
think about how the aggregate will work in a distributed environment.
The AsterixDB compiler makes optimizations based on how the data is
partitioned. If the data is unpartitioned, then a single aggregate
operator and function can calculate the result. If the data is
partitioned, then a single aggregate operator would need all the data
to be sent to a single node for processing, which is not very efficient.
Instead, the aggregate process can be split into two steps. AsterixDB
optimizes the query by running a process on each partition locally and
then sending an intermediate result to a single node to create the final
aggregate result.

COUNT
In the case of count, the local process is COUNT, but the global
aggregate process is SUM. We do not want to count the responses, but to
sum the local count values.
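
As a rough illustration of the COUNT case, here is a minimal Python sketch
(Python rather than AsterixDB's Java, only for brevity). The function names
local_count and global_count and the sample partitions are made up for this
example and are not AsterixDB APIs.

```python
from typing import Iterable, List


def local_count(partition: Iterable[object]) -> int:
    """Local step: count the tuples held by one partition."""
    return sum(1 for _ in partition)


def global_count(local_counts: Iterable[int]) -> int:
    """Global step: SUM the per-partition counts (do not count them)."""
    return sum(local_counts)


# Three partitions holding 3, 1 and 0 tuples.
partitions: List[List[str]] = [["a", "b", "c"], ["d"], []]
partials = [local_count(p) for p in partitions]
total = global_count(partials)
```

Applying COUNT again in the global step would return the number of
partitions that responded rather than the number of tuples.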

AVG
In the count case, we use a completely separate aggregate function for
the global step. Now consider AVG: to compute the average you need to
know the count and the sum. In this case the local functions compute both
the count and the sum. These values are then passed to a global aggregate
function, which uses the local results to calculate the final
average.
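
A minimal Python sketch of the AVG case, under the same caveats (Python only
for brevity, hypothetical function names, not AsterixDB code). The partial
state is a (sum, count) pair. The merge step in the middle plays the role
that this thread attributes to INTERMEDIATE_AVG.

```python
from typing import Iterable, Optional, Tuple

Partial = Tuple[float, int]  # (sum, count)


def local_avg(values: Iterable[float]) -> Partial:
    """Local step: compute the sum and count of one partition's values."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count


def intermediate_avg(partials: Iterable[Partial]) -> Partial:
    """Intermediate step: merge partial (sum, count) pairs into one pair."""
    total, count = 0.0, 0
    for s, c in partials:
        total += s
        count += c
    return total, count


def global_avg(partials: Iterable[Partial]) -> Optional[float]:
    """Global step: turn the merged sum and count into the final average."""
    total, count = intermediate_avg(partials)
    return total / count if count else None


partitions = [[1.0, 2.0, 3.0], [10.0]]
partials = [local_avg(p) for p in partitions]
average = global_avg(partials)
```

Carrying the sum and count separately matters: averaging the two local
averages here (2.0 and 10.0) would give 6.0, while the average over all four
values is 4.0.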

Take a look at the query plans for a COUNT and AVG query. The
optimized query plan will show you the two aggregate operators.

As you look at the code, AVG would probably be more informative about
the full aggregation workflow.
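
The same local/global split could plausibly be applied to ST_Extent, the
function asked about at the start of this thread. The Python sketch below is
only an illustration under stated assumptions: inputs are plain (x, y) points
rather than real geometries, the bounding box is a (min_x, min_y, max_x, max_y)
tuple, and local_extent, global_extent and merge are hypothetical names.

```python
from typing import Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def merge(a: Optional[Box], b: Optional[Box]) -> Optional[Box]:
    """Combine two bounding boxes; None stands for 'no input seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def local_extent(points: Iterable[Tuple[float, float]]) -> Optional[Box]:
    """Local step: bounding box of the points in one partition."""
    box: Optional[Box] = None
    for x, y in points:
        box = merge(box, (x, y, x, y))
    return box


def global_extent(boxes: Iterable[Optional[Box]]) -> Optional[Box]:
    """Global step: bounding box of the per-partition bounding boxes."""
    result: Optional[Box] = None
    for b in boxes:
        result = merge(result, b)
    return result


partitions = [[(0.0, 0.0), (2.0, 5.0)], [], [(-1.0, 3.0)]]
local_boxes = [local_extent(p) for p in partitions]
extent = global_extent(local_boxes)
```

Unlike AVG, the partial state here has the same shape as the final result,
so the local, intermediate and global steps can all reuse the same merge
logic.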



Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@cse.mrt.ac.lk>.

On 23 July 2017 at 22:59, Yingyi Bu <bu...@gmail.com> wrote:

> >> I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
>
> AVG:  that's the local function in the local plan.
> LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG:   think about distributed
> computation of average.  LOCAL_AVG aggregates the sum/count at the local
> data source, INTERMEDIATE_AVG aggregates the sum/count over partially
> aggregated sums/counts, and GLOBAL_AVG computes the final average value
> from intermediate sums/counts.
>

How do we decide if we need these descriptors? COUNT seems to have only
one descriptor.


>
> Best,
> Yingyi
>
>
> On Sat, Jul 22, 2017 at 9:43 PM, Riyafa Abdul Hameed <
> riyafa.12@cse.mrt.ac.lk> wrote:
>
> > Hi,
> >
> > Thanks for the explanation.
> > But there are so many things I still don't understand. One of them is for
> > the avg function itself there are several FuntionIdentifiers. What do
> they
> > all mean?
> >
> > I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
> >
> > What do they all mean?
> > Please help
> >
> > On 19 July 2017 at 21:56, Yingyi Bu <bu...@gmail.com> wrote:
> >
> > > Hi Riyafa,
> > >
> > >    >> ScalarCountAggregateDescriptor
> > >   It's used for counting a scalar array that appears inside a tuple.
> > >   For example:
> > >   SELECT u.id, array_count(u.friends)
> > >   FROM users u;
> > >
> > >    >> SerializableCountAggregateDescriptor
> > >    Serialized aggregation descriptor implementations are only used in
> > > hash-based group-by.
> > >    For example:
> > >    SELECT u.city, count(*)
> > >    FROM users u
> > >    /*+ hash */
> > >    GROUP BY u.city;
> > >
> > >   If your aggregation function doesn't have a fixed-byte-sized state,
> you
> > > don't need to worry about that or implement that.
> > >
> > >    >> CountAggregateDescriptor
> > >    This is used in group-by or global aggregate:
> > >    For example:
> > >    SELECT u.city, count(*)
> > >    FROM users u
> > >    GROUP BY u.city;
> > >
> > >    SELECT count(*) FROM users;
> > >
> > >
> > > Best,
> > > Yingyi



-- 
Riyafa Abdul Hameed
Undergraduate, University of Moratuwa

Email: riyafa.12@cse.mrt.ac.lk
Website: https://riyafa.wordpress.com/ <http://riyafa.wordpress.com/>
<http://facebook.com/riyafa.ahf>  <http://lk.linkedin.com/in/riyafa>
<http://twitter.com/Riyafa1>

Re: Creating aggregate functions

Posted by Yingyi Bu <bu...@gmail.com>.

>> I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.

AVG:  that's the logical function in the logical plan.
LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG:   think about distributed
computation of average.  LOCAL_AVG aggregates the sum/count at the local
data source, INTERMEDIATE_AVG aggregates the sum/count over partially
aggregated sums/counts, and GLOBAL_AVG computes the final average value
from intermediate sums/counts.
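
The three-step split described above can be pictured with a small, self-contained Java sketch. It is not AsterixDB code: the class DistributedAvgSketch, the SumCount pair and the three method names are hypothetical and only mirror the roles of LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG. Returning NaN for empty input is an assumption of the sketch as well.

```java
import java.util.Arrays;
import java.util.List;

final class DistributedAvgSketch {

    /** Partial aggregate state passed between steps: a running sum and count. */
    static final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** LOCAL_AVG role: fold the raw values of one partition into a (sum, count) pair. */
    static SumCount localAvg(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return new SumCount(sum, values.size());
    }

    /** INTERMEDIATE_AVG role: merge partial pairs into another partial pair. */
    static SumCount intermediateAvg(List<SumCount> partials) {
        double sum = 0;
        long count = 0;
        for (SumCount p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new SumCount(sum, count);
    }

    /** GLOBAL_AVG role: merge the remaining partial pairs and emit the final average. */
    static double globalAvg(List<SumCount> partials) {
        SumCount total = intermediateAvg(partials);
        // NaN for empty input is a choice made for this sketch only.
        return total.count == 0 ? Double.NaN : total.sum / total.count;
    }

    public static void main(String[] args) {
        SumCount p1 = localAvg(Arrays.asList(1.0, 2.0, 3.0)); // partition 1
        SumCount p2 = localAvg(Arrays.asList(4.0, 5.0));      // partition 2
        SumCount merged = intermediateAvg(Arrays.asList(p1, p2));
        System.out.println(globalAvg(Arrays.asList(merged)));
    }
}
```

The point of the sketch is that only the last step divides: averaging the per-partition averages (2.0 and 4.5) would give the wrong answer, which is presumably why the partial steps carry sums and counts.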

Best,
Yingyi


On Sat, Jul 22, 2017 at 9:43 PM, Riyafa Abdul Hameed <
riyafa.12@cse.mrt.ac.lk> wrote:

> Hi,
>
> Thanks for the explanation.
> But there are still many things I don't understand. One of them is that
> the avg function itself has several FunctionIdentifiers:
>
> AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
>
> What do they all mean?
> Please help.
>

Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@cse.mrt.ac.lk>.

Hi,

Thanks for the explanation.
But there are still many things I don't understand. One of them is that
the avg function itself has several FunctionIdentifiers:

AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.

What do they all mean?
Please help.

On 19 July 2017 at 21:56, Yingyi Bu <bu...@gmail.com> wrote:

> Hi Riyafa,
>
>    >> ScalarCountAggregateDescriptor
>   It's used for counting a scalar array that appears inside a tuple.
>   For example:
>   SELECT u.id, array_count(u.friends)
>   FROM users u;
>
>    >> SerializableCountAggregateDescriptor
>    Serialized aggregation descriptor implementations are only used in
> hash-based group-by.
>    For example:
>    SELECT u.city, count(*)
>    FROM users u
>    /*+ hash */
>    GROUP BY u.city;
>
>   If your aggregation function doesn't have a fixed-byte-sized state, you
> don't need to worry about that or implement that.
>
>    >> CountAggregateDescriptor
>    This is used in group-by or global aggregate:
>    For example:
>    SELECT u.city, count(*)
>    FROM users u
>    GROUP BY u.city;
>
>    SELECT count(*) FROM users;
>
>
> Best,
> Yingyi




Re: Creating aggregate functions

Posted by Yingyi Bu <bu...@gmail.com>.

Hi Riyafa,

   >> ScalarCountAggregateDescriptor
  It's used for counting a scalar array that appears inside a tuple.
  For example:
  SELECT u.id, array_count(u.friends)
  FROM users u;

   >> SerializableCountAggregateDescriptor
   Serialized aggregation descriptor implementations are only used in
hash-based group-by.
   For example:
   SELECT u.city, count(*)
   FROM users u
   /*+ hash */
   GROUP BY u.city;

  If your aggregation function doesn't have a fixed-byte-sized state, you
don't need to worry about or implement the serializable variant.
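
To make "fixed-byte-sized state" concrete, here is a minimal Java sketch of an aggregate whose running state lives in a caller-owned byte array instead of in object fields. It uses a bounding box over points, loosely in the spirit of ST_Extent, because that state is always exactly four doubles. Everything in it is an assumption for illustration: ExtentStateSketch and the init/step/finish signatures are hypothetical and are not the AsterixDB interfaces, and real geometries are reduced to (x, y) points.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ExtentStateSketch {

    /** Fixed-size state: minX, minY, maxX, maxY stored as four doubles (32 bytes). */
    static final int STATE_SIZE = 4 * Double.BYTES;

    /** init: write an empty box (inverted infinities) at the given offset. */
    static void init(byte[] buffer, int offset) {
        ByteBuffer state = ByteBuffer.wrap(buffer);
        state.putDouble(offset, Double.POSITIVE_INFINITY);      // minX
        state.putDouble(offset + 8, Double.POSITIVE_INFINITY);  // minY
        state.putDouble(offset + 16, Double.NEGATIVE_INFINITY); // maxX
        state.putDouble(offset + 24, Double.NEGATIVE_INFINITY); // maxY
    }

    /** step: grow the stored box in place so that it covers the point (x, y). */
    static void step(byte[] buffer, int offset, double x, double y) {
        ByteBuffer state = ByteBuffer.wrap(buffer);
        state.putDouble(offset, Math.min(state.getDouble(offset), x));
        state.putDouble(offset + 8, Math.min(state.getDouble(offset + 8), y));
        state.putDouble(offset + 16, Math.max(state.getDouble(offset + 16), x));
        state.putDouble(offset + 24, Math.max(state.getDouble(offset + 24), y));
    }

    /** finish: read the box back as {minX, minY, maxX, maxY}. */
    static double[] finish(byte[] buffer, int offset) {
        ByteBuffer state = ByteBuffer.wrap(buffer);
        return new double[] {
            state.getDouble(offset), state.getDouble(offset + 8),
            state.getDouble(offset + 16), state.getDouble(offset + 24)
        };
    }

    public static void main(String[] args) {
        // One buffer holds the state of two groups side by side.
        byte[] buffer = new byte[2 * STATE_SIZE];
        init(buffer, 0);
        init(buffer, STATE_SIZE);
        step(buffer, 0, 1.0, 5.0);
        step(buffer, 0, -2.0, 3.0);
        step(buffer, STATE_SIZE, 10.0, 10.0);
        System.out.println(Arrays.toString(finish(buffer, 0)));
        System.out.println(Arrays.toString(finish(buffer, STATE_SIZE)));
    }
}
```

Because every group's state has the same known length, many groups can share one buffer at fixed offsets, which is one plausible reason a hash-based group-by would want this form. An aggregate whose state can grow (for example, one that collects values into a list) has no such fixed slot size.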

   >> CountAggregateDescriptor
   This is used in group-by or global aggregate:
   For example:
   SELECT u.city, count(*)
   FROM users u
   GROUP BY u.city;

   SELECT count(*) FROM users;
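
As a rough picture of how the scalar and the aggregate usages differ, the following Java sketch runs one init/step/finish counter in both ways. It is a simplified model, not the AsterixDB implementation: CountUsageSketch, CountAggregator, User, arrayCount and countByCity are hypothetical names, and the in-memory map only stands in for whatever the group-by operator does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CountUsageSketch {

    /** Hypothetical init/step/finish aggregator; not an AsterixDB interface. */
    static final class CountAggregator {
        private long count;

        void init() { count = 0; }
        void step(Object item) { count++; }
        long finish() { return count; }
    }

    /** Hypothetical tuple standing in for a record of the users dataset. */
    static final class User {
        final int id;
        final String city;
        final List<Integer> friends;

        User(int id, String city, List<Integer> friends) {
            this.id = id;
            this.city = city;
            this.friends = friends;
        }
    }

    /** Scalar use (array_count(u.friends)): aggregate the items of one collection value. */
    static long arrayCount(List<?> collection) {
        CountAggregator agg = new CountAggregator();
        agg.init();
        for (Object item : collection) {
            agg.step(item);
        }
        return agg.finish();
    }

    /** Aggregate use (count(*) ... GROUP BY u.city): one aggregator per group, stepped once per tuple. */
    static Map<String, Long> countByCity(List<User> users) {
        Map<String, CountAggregator> groups = new LinkedHashMap<>();
        for (User u : users) {
            CountAggregator agg = groups.get(u.city);
            if (agg == null) {
                agg = new CountAggregator();
                agg.init();
                groups.put(u.city, agg);
            }
            agg.step(u);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, CountAggregator> e : groups.entrySet()) {
            result.put(e.getKey(), e.getValue().finish());
        }
        return result;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
            new User(1, "Irvine", Arrays.asList(2, 3, 4)),
            new User(2, "Irvine", Arrays.asList(1)),
            new User(3, "Colombo", Arrays.<Integer>asList()));
        for (User u : users) {
            System.out.println(u.id + " -> " + arrayCount(u.friends));
        }
        System.out.println(countByCity(users));
    }
}
```

In the scalar case the input is a single value inside one tuple and the result is produced per tuple; in the aggregate case the input is a stream of tuples and the result is produced per group (or once, for a global aggregate).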


Best,
Yingyi


On Wed, Jul 19, 2017 at 7:55 AM, Riyafa Abdul Hameed <ri...@apache.org>
wrote:

> Hi again,
>
> Any suggestions on this? Or is there anyone I can reach out to who is not
> on this list or not active on it?
>
> Thank you.
>

Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@apache.org>.

Hi again,

Any suggestions on this? Or is there anyone I can reach out to who is not
on this list or not active on it?

Thank you.

On 17 July 2017 at 17:18, Riyafa Abdul Hameed <ri...@apache.org> wrote:

> Hi again,
>
> I think I can understand how to write the descriptor in the packages:
> org.apache.asterix.runtime.aggregates.std and  org.apache.asterix.runtime.aggregates.scalar.
> But I am not sure I understand how to write the descriptor in the package:
> org.apache.asterix.runtime.aggregates.serializable.std  because it
> requires setting a state in the init function that doesn't seem to have a
> pattern in the other descriptors.
> Also I don't seem to understand the reasons for implementing each of these
> descriptors for the aggregate functions.
>

Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@apache.org>.

Hi again,

I think I understand how to write the descriptors in the packages
org.apache.asterix.runtime.aggregates.std and
org.apache.asterix.runtime.aggregates.scalar. But I am not sure I
understand how to write the descriptor in the package
org.apache.asterix.runtime.aggregates.serializable.std, because it requires
setting a state in the init function, which doesn't seem to follow any
pattern found in the other descriptors.
I also don't understand the reasons for implementing each of these
descriptors for the aggregate functions.

On 17 July 2017 at 16:56, Riyafa Abdul Hameed <ri...@cse.mrt.ac.lk>
wrote:

> Hi all,
>
> I meant any explanation on the implementation of aggregate functions in
> AsterixDB would be highly appreciated.
>
> Thank you.
> Yours sincerely,
> Riyafa
>

Re: Creating aggregate functions

Posted by Riyafa Abdul Hameed <ri...@cse.mrt.ac.lk>.

Hi all,

I meant that any explanation of the implementation of aggregate functions
in AsterixDB would be highly appreciated.

Thank you.
Yours sincerely,
Riyafa

On 16 July 2017 at 08:01, Riyafa Abdul Hameed <ri...@apache.org> wrote:

> Dear all,
>
> I am trying to create aggregate functions and I see there is more than
> one function descriptor for a single function.
> For example the function array_count(collection) has the following
> descriptors:
>
>
>    - ScalarCountAggregateDescriptor
>    - SerializableCountAggregateDescriptor
>    - CountAggregateDescriptor
>
> I am not sure I understand the difference between each of these. Can you
> please provide an example or point me to a documentation entry to learn
> how to properly implement aggregate functions?
>
> The function I am trying to implement is ST_Extent.
> <https://postgis.net/docs/manual-1.4/ST_Extent.html>
>
> Thank you.
>
> Yours sincerely,
>
> Riyafa
>


