You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Fan Liya <li...@gmail.com> on 2020/10/13 07:54:29 UTC

DISCUSS: the concept of duplicate insensitive aggregate functions

Hi all,

I would like to introduce the idea of duplicate insensitive aggregate
functions.

For such functions, the aggregation results remain the same even after
deduplication.

For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
aggregation results of MIN are the same regardless of whether we perform
data deduplication first. That is,

MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})

So MIN is a *deduplicate insensitive function*.

On the other hand, function SUM is not duplicate insensitive, because

 SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})

The concept of deduplicate insensitiveness can help us in many optimization
scenarios.

For example, the curent implementation of AggregateMergeRule rules out any
aggregate calls for which the isDistict() method returns true. However, for
duplicate insensitive functions, the rule should be applicable.

Could you please give your valuable feedback?

Best,
Liya Fan

Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Posted by Fan Liya <li...@gmail.com>.

Hi Julian,

Please see my comments inline:

> I suspect that any duplicate-insensitive function is very easily
> splittable. Especially ones that are their own rollup (which is true
> of min, max, any_value, single_value).

I agree. However, the opposite is not true. For example, COUNT can be
split into a COUNT and a SUM, but COUNT is not duplicate-insensitive.

> If getDistinctOptionality() returns IGNORED, does the author of the
> function need to write any further code to ensure that calls can be
> split?

Yes. We need to override the T unwrap(Class<T> clazz) method, and
make it return an instance of SqlSplittableAggFunction.

> Lastly, I'll note that other systems (e.g. Algebird [1]) allow you to
> assign aggregate functions to algebraic structures - for example, sum
> and hyperLogLog are both monoids [2] and can therefore be 'rolled up'.
> We could do the same in Calcite. We could perhaps allow aggregate
> functions to declare themselves as belonging to a particular algebraic
> structure, and we could exploit the properties of those structures to
> perform optimizations.

Thanks for the information. It is super cool!
I would be glad to see algebraic structures introduced to Calcite, and
strongly believe that they will help us in more optimization scenarios.

Best,
Liya Fan

On Wed, Oct 14, 2020 at 1:19 PM Julian Hyde <jh...@apache.org> wrote:

> I suspect that any duplicate-insensitive function is very easily
> splittable. Especially ones that are their own rollup (which is true
> of min, max, any_value, single_value).
>
> If getDistinctOptionality() returns IGNORED, does the author of the
> function need to write any further code to ensure that calls can be
> split?
>
> Lastly, I'll note that other systems (e.g. Algebird [1]) allow you to
> assign aggregate functions to algebraic structures - for example, sum
> and hyperLogLog are both monoids [2] and can therefore be 'rolled up'.
> We could do the same in Calcite. We could perhaps allow aggregate
> functions to declare themselves as belonging to a particular algebraic
> structure, and we could exploit the properties of those structures to
> perform optimizations.
>
> Julian
>
> [1] https://twitter.github.io/algebird/
>
> [2] https://en.wikipedia.org/wiki/Monoid
>
> On Tue, Oct 13, 2020 at 7:26 PM Fan Liya <li...@gmail.com> wrote:
> >
> > Hi Julian,
> >
> > Thanks again for your feedback.
> >
> > Since they are duplicate-insensitive, they should also be splittable
> > (SqlSplittableAggFunction), just like min/max, etc.
> > What do you think?
> >
> > I want to fire a JIRA accordingly, so that more optimizations can be
> > applied.
> > Any feedback is appreciated.
> >
> > Best,
> > Liya Fan
> >
> >
> >
> > On Wed, Oct 14, 2020 at 2:59 AM Julian Hyde <jh...@gmail.com>
> wrote:
> >
> > > I agree. ANY_VALUE and SINGLE_VALUE are duplicate-insensitive.
> > >
> > > > On Oct 13, 2020, at 2:17 AM, Fan Liya <li...@gmail.com> wrote:
> > > >
> > > > Hi Julian,
> > > >
> > > > Thanks a lot for your feedback.
> > > > I think SqlAggFunction.getDistinctOptionality() is exactly what I
> > > > am looking for.
> > > >
> > > > BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category
> of
> > > > duplicate insensitive functions.
> > > > What do you think?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > >
> > > >
> > > > On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <jh...@gmail.com>
> > > wrote:
> > > >
> > > >> We already have this concept. See
> > > SqlAggFunction.getDistinctOptionality(),
> > > >> added in https://issues.apache.org/jira/browse/CALCITE-3159 <
> > > >> https://issues.apache.org/jira/browse/CALCITE-3159>.
> > > >>
> > > >> Julian
> > > >>
> > > >>
> > > >>> On Oct 13, 2020, at 12:54 AM, Fan Liya <li...@gmail.com>
> wrote:
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I would like to introduce the idea of duplicate insensitive
> aggregate
> > > >>> functions.
> > > >>>
> > > >>> For such functions, the aggregation results remain the same even
> after
> > > >>> deduplication.
> > > >>>
> > > >>> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> > > >>> aggregation results of MIN are the same regardless of whether we
> > > perform
> > > >>> data deduplication first. That is,
> > > >>>
> > > >>> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> > > >>>
> > > >>> So MIN is a *deduplicate insensitive function*.
> > > >>>
> > > >>> On the other hand, function SUM is not duplicate insensitive,
> because
> > > >>>
> > > >>> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> > > >>>
> > > >>> The concept of deduplicate insensitiveness can help us in many
> > > >> optimization
> > > >>> scenarios.
> > > >>>
> > > >>> For example, the curent implementation of AggregateMergeRule rules
> out
> > > >> any
> > > >>> aggregate calls for which the isDistict() method returns true.
> However,
> > > >> for
> > > >>> duplicate insensitive functions, the rule should be applicable.
> > > >>>
> > > >>> Could you please give your valuable feedback?
> > > >>>
> > > >>> Best,
> > > >>> Liya Fan
> > > >>
> > > >>
> > >
> > >
>

Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Posted by Julian Hyde <jh...@apache.org>.

I suspect that any duplicate-insensitive function is very easily
splittable. Especially ones that are their own rollup (which is true
of min, max, any_value, single_value).

If getDistinctOptionality() returns IGNORED, does the author of the
function need to write any further code to ensure that calls can be
split?

Lastly, I'll note that other systems (e.g. Algebird [1]) allow you to
assign aggregate functions to algebraic structures - for example, sum
and hyperLogLog are both monoids [2] and can therefore be 'rolled up'.
We could do the same in Calcite. We could perhaps allow aggregate
functions to declare themselves as belonging to a particular algebraic
structure, and we could exploit the properties of those structures to
perform optimizations.

Julian

[1] https://twitter.github.io/algebird/

[2] https://en.wikipedia.org/wiki/Monoid

On Tue, Oct 13, 2020 at 7:26 PM Fan Liya <li...@gmail.com> wrote:
>
> Hi Julian,
>
> Thanks again for your feedback.
>
> Since they are duplicate-insensitive, they should also be splittable
> (SqlSplittableAggFunction), just like min/max, etc.
> What do you think?
>
> I want to fire a JIRA accordingly, so that more optimizations can be
> applied.
> Any feedback is appreciated.
>
> Best,
> Liya Fan
>
>
>
> On Wed, Oct 14, 2020 at 2:59 AM Julian Hyde <jh...@gmail.com> wrote:
>
> > I agree. ANY_VALUE and SINGLE_VALUE are duplicate-insensitive.
> >
> > > On Oct 13, 2020, at 2:17 AM, Fan Liya <li...@gmail.com> wrote:
> > >
> > > Hi Julian,
> > >
> > > Thanks a lot for your feedback.
> > > I think SqlAggFunction.getDistinctOptionality() is exactly what I
> > > am looking for.
> > >
> > > BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category of
> > > duplicate insensitive functions.
> > > What do you think?
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > >
> > > On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <jh...@gmail.com>
> > wrote:
> > >
> > >> We already have this concept. See
> > SqlAggFunction.getDistinctOptionality(),
> > >> added in https://issues.apache.org/jira/browse/CALCITE-3159 <
> > >> https://issues.apache.org/jira/browse/CALCITE-3159>.
> > >>
> > >> Julian
> > >>
> > >>
> > >>> On Oct 13, 2020, at 12:54 AM, Fan Liya <li...@gmail.com> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I would like to introduce the idea of duplicate insensitive aggregate
> > >>> functions.
> > >>>
> > >>> For such functions, the aggregation results remain the same even after
> > >>> deduplication.
> > >>>
> > >>> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> > >>> aggregation results of MIN are the same regardless of whether we
> > perform
> > >>> data deduplication first. That is,
> > >>>
> > >>> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> > >>>
> > >>> So MIN is a *deduplicate insensitive function*.
> > >>>
> > >>> On the other hand, function SUM is not duplicate insensitive, because
> > >>>
> > >>> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> > >>>
> > >>> The concept of deduplicate insensitiveness can help us in many
> > >> optimization
> > >>> scenarios.
> > >>>
> > >>> For example, the curent implementation of AggregateMergeRule rules out
> > >> any
> > >>> aggregate calls for which the isDistict() method returns true. However,
> > >> for
> > >>> duplicate insensitive functions, the rule should be applicable.
> > >>>
> > >>> Could you please give your valuable feedback?
> > >>>
> > >>> Best,
> > >>> Liya Fan
> > >>
> > >>
> >
> >

Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Posted by Fan Liya <li...@gmail.com>.

Hi Julian,

Thanks again for your feedback.

Since they are duplicate-insensitive, they should also be splittable
(SqlSplittableAggFunction), just like min/max, etc.
What do you think?

I want to fire a JIRA accordingly, so that more optimizations can be
applied.
Any feedback is appreciated.

Best,
Liya Fan



On Wed, Oct 14, 2020 at 2:59 AM Julian Hyde <jh...@gmail.com> wrote:

> I agree. ANY_VALUE and SINGLE_VALUE are duplicate-insensitive.
>
> > On Oct 13, 2020, at 2:17 AM, Fan Liya <li...@gmail.com> wrote:
> >
> > Hi Julian,
> >
> > Thanks a lot for your feedback.
> > I think SqlAggFunction.getDistinctOptionality() is exactly what I
> > am looking for.
> >
> > BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category of
> > duplicate insensitive functions.
> > What do you think?
> >
> > Best,
> > Liya Fan
> >
> >
> >
> > On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <jh...@gmail.com>
> wrote:
> >
> >> We already have this concept. See
> SqlAggFunction.getDistinctOptionality(),
> >> added in https://issues.apache.org/jira/browse/CALCITE-3159 <
> >> https://issues.apache.org/jira/browse/CALCITE-3159>.
> >>
> >> Julian
> >>
> >>
> >>> On Oct 13, 2020, at 12:54 AM, Fan Liya <li...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I would like to introduce the idea of duplicate insensitive aggregate
> >>> functions.
> >>>
> >>> For such functions, the aggregation results remain the same even after
> >>> deduplication.
> >>>
> >>> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> >>> aggregation results of MIN are the same regardless of whether we
> perform
> >>> data deduplication first. That is,
> >>>
> >>> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> >>>
> >>> So MIN is a *deduplicate insensitive function*.
> >>>
> >>> On the other hand, function SUM is not duplicate insensitive, because
> >>>
> >>> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> >>>
> >>> The concept of deduplicate insensitiveness can help us in many
> >> optimization
> >>> scenarios.
> >>>
> >>> For example, the curent implementation of AggregateMergeRule rules out
> >> any
> >>> aggregate calls for which the isDistict() method returns true. However,
> >> for
> >>> duplicate insensitive functions, the rule should be applicable.
> >>>
> >>> Could you please give your valuable feedback?
> >>>
> >>> Best,
> >>> Liya Fan
> >>
> >>
>
>

Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Posted by Julian Hyde <jh...@gmail.com>.

I agree. ANY_VALUE and SINGLE_VALUE are duplicate-insensitive.

> On Oct 13, 2020, at 2:17 AM, Fan Liya <li...@gmail.com> wrote:
> 
> Hi Julian,
> 
> Thanks a lot for your feedback.
> I think SqlAggFunction.getDistinctOptionality() is exactly what I
> am looking for.
> 
> BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category of
> duplicate insensitive functions.
> What do you think?
> 
> Best,
> Liya Fan
> 
> 
> 
> On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <jh...@gmail.com> wrote:
> 
>> We already have this concept. See SqlAggFunction.getDistinctOptionality(),
>> added in https://issues.apache.org/jira/browse/CALCITE-3159 <
>> https://issues.apache.org/jira/browse/CALCITE-3159>.
>> 
>> Julian
>> 
>> 
>>> On Oct 13, 2020, at 12:54 AM, Fan Liya <li...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> I would like to introduce the idea of duplicate insensitive aggregate
>>> functions.
>>> 
>>> For such functions, the aggregation results remain the same even after
>>> deduplication.
>>> 
>>> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
>>> aggregation results of MIN are the same regardless of whether we perform
>>> data deduplication first. That is,
>>> 
>>> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
>>> 
>>> So MIN is a *deduplicate insensitive function*.
>>> 
>>> On the other hand, function SUM is not duplicate insensitive, because
>>> 
>>> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
>>> 
>>> The concept of deduplicate insensitiveness can help us in many
>> optimization
>>> scenarios.
>>> 
>>> For example, the curent implementation of AggregateMergeRule rules out
>> any
>>> aggregate calls for which the isDistict() method returns true. However,
>> for
>>> duplicate insensitive functions, the rule should be applicable.
>>> 
>>> Could you please give your valuable feedback?
>>> 
>>> Best,
>>> Liya Fan
>> 
>>

Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Posted by Fan Liya <li...@gmail.com>.

Hi Julian,

Thanks a lot for your feedback.
I think SqlAggFunction.getDistinctOptionality() is exactly what I
am looking for.

BTW, I think ANY_VALUE and SINGLE_VALUE also belong to the category of
duplicate insensitive functions.
What do you think?

Best,
Liya Fan



On Tue, Oct 13, 2020 at 4:55 PM Julian Hyde <jh...@gmail.com> wrote:

> We already have this concept. See SqlAggFunction.getDistinctOptionality(),
> added in https://issues.apache.org/jira/browse/CALCITE-3159 <
> https://issues.apache.org/jira/browse/CALCITE-3159>.
>
> Julian
>
>
> > On Oct 13, 2020, at 12:54 AM, Fan Liya <li...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I would like to introduce the idea of duplicate insensitive aggregate
> > functions.
> >
> > For such functions, the aggregation results remain the same even after
> > deduplication.
> >
> > For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> > aggregation results of MIN are the same regardless of whether we perform
> > data deduplication first. That is,
> >
> > MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> >
> > So MIN is a *deduplicate insensitive function*.
> >
> > On the other hand, function SUM is not duplicate insensitive, because
> >
> > SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> >
> > The concept of deduplicate insensitiveness can help us in many
> optimization
> > scenarios.
> >
> > For example, the curent implementation of AggregateMergeRule rules out
> any
> > aggregate calls for which the isDistict() method returns true. However,
> for
> > duplicate insensitive functions, the rule should be applicable.
> >
> > Could you please give your valuable feedback?
> >
> > Best,
> > Liya Fan
>
>

Re: DISCUSS: the concept of duplicate insensitive aggregate functions

Posted by Julian Hyde <jh...@gmail.com>.

We already have this concept. See SqlAggFunction.getDistinctOptionality(), added in https://issues.apache.org/jira/browse/CALCITE-3159 <https://issues.apache.org/jira/browse/CALCITE-3159>.

Julian


> On Oct 13, 2020, at 12:54 AM, Fan Liya <li...@gmail.com> wrote:
> 
> Hi all,
> 
> I would like to introduce the idea of duplicate insensitive aggregate
> functions.
> 
> For such functions, the aggregation results remain the same even after
> deduplication.
> 
> For example, given a sequence of data {1, 1, 2, 2, 3, 5, 5}, the
> aggregation results of MIN are the same regardless of whether we perform
> data deduplication first. That is,
> 
> MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})
> 
> So MIN is a *deduplicate insensitive function*.
> 
> On the other hand, function SUM is not duplicate insensitive, because
> 
> SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})
> 
> The concept of deduplicate insensitiveness can help us in many optimization
> scenarios.
> 
> For example, the curent implementation of AggregateMergeRule rules out any
> aggregate calls for which the isDistict() method returns true. However, for
> duplicate insensitive functions, the rule should be applicable.
> 
> Could you please give your valuable feedback?
> 
> Best,
> Liya Fan