Posted to dev@drill.apache.org by Julian Hyde <ju...@gmail.com> on 2013/01/28 19:30:46 UTC

Moving & regular aggregates

I think it's a mistake to use the same operator for regular and moving aggregates. (Moving aggregates are also known as running aggregates. There are sub-types called sliding and paged.)

A regular aggregate would be "Compute the total sales for each product each month".

A moving aggregate would be "For each sales order, compute the sales of that product in that region over the past 20 days".

Consider their output. Regular aggregates output the grouping keys and the aggregates. They can't output the input rows because they have been aggregated into a single group. Moving aggregates output the original rows PLUS any aggregates they compute.
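
A minimal sketch of the difference in output shape, written in Python for illustration. The sample rows and the function names regular_aggregate and moving_aggregate are assumptions made for this example, not part of any Drill plan syntax; the moving variant also assumes rows arrive already ordered.

```python
from collections import defaultdict

# Hypothetical sales rows: (order_id, product, month, amount).
orders = [
    (1, "widget", "2013-01", 10.0),
    (2, "widget", "2013-01", 5.0),
    (3, "gadget", "2013-01", 7.0),
    (4, "widget", "2013-02", 2.0),
]

def regular_aggregate(rows):
    """One output row per group: the grouping keys plus the aggregate."""
    totals = defaultdict(float)
    for _order_id, product, month, amount in rows:
        totals[(product, month)] += amount
    return [(product, month, total) for (product, month), total in totals.items()]

def moving_aggregate(rows):
    """One output row per input row: the original row plus a running
    total of the amounts seen so far for the same product."""
    running = defaultdict(float)
    out = []
    for row in rows:
        product, amount = row[1], row[3]
        running[product] += amount
        out.append(row + (running[product],))
    return out

collapsed = regular_aggregate(orders)   # 3 rows, input rows are gone
extended = moving_aggregate(orders)     # 4 rows, each input row survives
```

The regular aggregate returns fewer rows than it reads, while the moving aggregate returns exactly as many rows as it reads, each widened by one column.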

Consider how they are specified. Regular aggregates are specified by a set of grouping keys, and a set of aggregate functions. Moving aggregates are specified by the grouping keys (called partition keys in the SQL standard, for what it's worth) but also specifications of ordering (for rank etc.) and window length (10 rows, or 2 hours).
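
The two argument patterns could be sketched as follows. The class and field names (GroupBySpec, MovingAggregateSpec, window_rows, window_range) are hypothetical and chosen only to mirror the wording above; they are not taken from the Drill syntax document.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# An aggregate call is modelled as (function name, input column).
AggregateCall = Tuple[str, str]

@dataclass(frozen=True)
class GroupBySpec:
    """Regular aggregate: grouping keys and aggregate functions only."""
    keys: Tuple[str, ...]
    aggregates: Tuple[AggregateCall, ...]

@dataclass(frozen=True)
class MovingAggregateSpec:
    """Moving aggregate: partition keys, an ordering, and a window length."""
    partition_keys: Tuple[str, ...]
    order_by: Tuple[str, ...]
    aggregates: Tuple[AggregateCall, ...]
    window_rows: Optional[int] = None    # e.g. 10 rows
    window_range: Optional[str] = None   # e.g. "2 hours"

    def __post_init__(self):
        # This sketch requires exactly one kind of window length.
        if (self.window_rows is None) == (self.window_range is None):
            raise ValueError("specify exactly one of window_rows or window_range")

monthly_sales = GroupBySpec(("product", "month"), (("sum", "amount"),))
recent_sales = MovingAggregateSpec(
    partition_keys=("product", "region"),
    order_by=("order_date",),
    aggregates=(("sum", "amount"),),
    window_range="20 days",
)
```

The extra fields on the second class are the point: they have no meaning for a regular aggregate.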

Consider their internals. Regular aggregates typically use a hashmap. Moving aggregates typically use a hashmap but also make heavy use of sorted lists.
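
As a rough illustration of the "hashmap plus sorted lists" structure, here is a sketch using Python's bisect module. The function name moving_sum, the (key, day, amount) row layout, and the choice of a half-open window (day - days, day] are assumptions for this example; a real operator would likely maintain the window incrementally rather than re-summing a slice.

```python
import bisect
from collections import defaultdict

def moving_sum(rows, days):
    """rows: iterable of (key, day, amount) in any order.

    Returns (key, day, amount, window_sum) tuples, where window_sum adds up
    the amounts for the same key whose day lies in (day - days, day].
    """
    # Hashmap from partition key to a list kept sorted by day.
    partitions = defaultdict(list)
    for key, day, amount in rows:
        bisect.insort(partitions[key], (day, amount))

    out = []
    for key, entries in partitions.items():
        day_values = [d for d, _ in entries]
        for day, amount in entries:
            # Binary search for the window edges inside the sorted list.
            lo = bisect.bisect_right(day_values, day - days)
            hi = bisect.bisect_right(day_values, day)
            window_sum = sum(a for _, a in entries[lo:hi])
            out.append((key, day, amount, window_sum))
    return out

sample = [("w", 1, 10.0), ("w", 30, 1.0), ("w", 15, 5.0), ("g", 15, 7.0)]
result = moving_sum(sample, days=20)
```

A regular aggregate needs only the dictionary; the sorted list per key exists solely to locate window boundaries.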

Given this, I would separate the aggregate operator into a GroupBy operator and a MovingAggregate operator. (The MovingAggregate operator might have sub-types such as sliding and paged, as I mentioned above.)

By the way, for both types of aggregates, it makes sense to have an empty set of grouping keys. So, Ted is right on the money with https://issues.apache.org/jira/browse/DRILL-22.
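
To illustrate the empty-key case, a small sketch (the helper group_sum and the sample rows are hypothetical): with no grouping columns every row maps to the same empty key, so the whole input forms a single group.

```python
from collections import defaultdict

def group_sum(rows, key_columns, value_column):
    """Sum value_column per distinct combination of key_columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[c] for c in key_columns)  # () when there are no keys
        totals[key] += row[value_column]
    return dict(totals)

rows = [
    {"product": "widget", "amount": 10.0},
    {"product": "gadget", "amount": 7.0},
    {"product": "widget", "amount": 5.0},
]

per_product = group_sum(rows, ("product",), "amount")
grand_total = group_sum(rows, (), "amount")
```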

Julian

Re: Moving & regular aggregates

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Jan 28, 2013 at 6:03 PM, Jacques Nadeau <ja...@gmail.com> wrote:

> ...
> How do people feel about CollapseAggregate and RunningAggregate?  I'm
> inclined to stay away from the GroupBy name since a traditional SQL group
> by is really Group followed by CollapseAggregate.  I've also been
> considering using the MS SQL Server naming of "Segment" instead of "Group".
>  Anyone have any opinions on that?
>

Segment has worked for me as I was reading the syntax document.  It has an
intuitive meaning and little semantic collision with other concepts.

> > OK.  We need a new name.  Nominations are open.
> >
>

> CollapseAggregate is my vote.
>

Fine by me.

> >
> > And there is a third kind which is the running aggregate, but those are
> > plausibly a windowed aggregate with infinite extent backwards.
> >
>
> One option of the windowing operator is a full backward window.
>

Yes.

The running aggregate can handle all of these cases.

> > You raise an interesting point here.  The current argument structure is
> > deficient.  We currently have before and after.  I think that should be
> > restated to start and end indexes with negative indexes to indicate
> > preceding records.  Your point here implies that we should also have a
> > starting expression and an ending expression.
>

> I'm not sure we're deficient.  The before and after are based within the
> segment key.
>

I really think that we are deficient.  There is a much simpler
representation that I outlined in my comments to the syntax document.
Basically, all of the kinds of windows that are based on indexes can easily
be handled with one structure that is based on index limits.  Much simpler
than what is currently specified.
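
One way to read "index limits" is sketched below. The function windowed_sum and its start/end parameters are assumptions for illustration, not the structure from the syntax document: offsets are relative to the current row, negative values point at preceding rows, and None stands for an unbounded side. The input is assumed to be a single, already ordered segment.

```python
def windowed_sum(values, start, end):
    """Sum values[i + start .. i + end] for each row i, clamped to the segment.

    start / end are row offsets relative to the current row; negative means
    preceding, positive means following, None means unbounded on that side.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is None else min(n, max(0, i + start))
        hi = n if end is None else max(0, min(n, i + end + 1))
        out.append(sum(values[lo:hi]))  # empty slice sums to 0
    return out

segment_values = [1, 2, 3, 4]
sliding = windowed_sum(segment_values, -1, 0)        # previous row and current row
running = windowed_sum(segment_values, None, 0)      # full backward window
centered = windowed_sum(segment_values, -1, 1)       # one row on either side
whole = windowed_sum(segment_values, None, None)     # entire segment
```

Under this reading the running aggregate is simply the case where the start limit is unbounded and the end limit is 0.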

> > Should I take a stab at a revised specification for aggregation?  I
> dislike
> > the groupBy name for an aggregation, but could be convinced by a show of
> > hands.
> >
>
> I can take a shot at this.
>

Great.

Re: Moving & regular aggregates

Posted by Jacques Nadeau <ja...@gmail.com>.

I've been trying to clean up the syntax doc some.  I saw Ted added a bunch
of comments.  I'll go through and update the operators so that we have
discrete aggregation operators.

How do people feel about CollapseAggregate and RunningAggregate?  I'm
inclined to stay away from the GroupBy name since a traditional SQL group
by is really Group followed by CollapseAggregate.  I've also been
considering using the MS SQL Server naming of "Segment" instead of "Group".
 Anyone have any opinions on that?
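
A sketch of the "Group followed by CollapseAggregate" decomposition, assuming hypothetical Python functions named after the proposed operators. It is not meant to describe how SQL Server's Segment operator or any Drill operator is implemented; it only shows that one segmenting step can feed either kind of aggregate.

```python
from itertools import groupby
from operator import itemgetter

def segment(rows, key):
    """Sort rows by key and yield (key_value, rows_in_segment) pairs."""
    for key_value, group in groupby(sorted(rows, key=key), key=key):
        yield key_value, list(group)

def collapse_aggregate(segments, value, fn=sum):
    """One output row per segment: the key and the aggregated value."""
    return [(key_value, fn(value(r) for r in rows)) for key_value, rows in segments]

def running_aggregate(segments, value):
    """One output row per input row: the row plus a running total that
    restarts at each segment boundary."""
    out = []
    for _key_value, rows in segments:
        total = 0
        for r in rows:
            total += value(r)
            out.append((r, total))
    return out

sales = [("widget", 10), ("gadget", 7), ("widget", 5)]
by_product = itemgetter(0)
amount = itemgetter(1)

collapsed = collapse_aggregate(segment(sales, by_product), amount)
running = running_aggregate(segment(sales, by_product), amount)
```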

Other comments below...


On Mon, Jan 28, 2013 at 11:41 AM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Jan 28, 2013 at 10:30 AM, Julian Hyde <ju...@gmail.com>
> wrote:
>
> > I think it's a mistake to use the same operator for regular and moving
> > aggregates. (Moving aggregates are also known as running aggregates.
> There
> > are sub-types called sliding and paged.)
> >
>
> I think that we need a distinction.  Different operator is fine.  Flag is
> fine.  I tend toward different operator, not because of the different kind
> of output but rather because of the different argument pattern.  Same
> answer, different rationale.
>
>
> > A regular aggregate would be "Compute the total sales for each product
> > each month".
> >
>
> OK.  We call this aggregate now.  The argument is a segment reference in
> the current logical plan.
>
>
> > A moving aggregate would be "For each sales order, compute the sales of
> > that product in that region over the past 20 days".
> >
>
> OK.  We need a new name.  Nominations are open.
>
> >> CollapseAggregate is my vote.


>
> > Consider their output. Regular aggregates output the grouping keys and
> the
> > aggregates. They can't output the input rows because they have been
> > aggregated into a single group. Moving aggregates output the original
> rows
> > PLUS any aggregates they compute.
> >
>

>> Agreed.  Let's split this out.


>
> And there is a third kind which is the running aggregate, but those are
> plausibly a windowed aggregate with infinite extent backwards.
>

>> One option of the windowing operator is a full backward window.

>
> Consider how they are specified. Regular aggregates are specified by a set
> > of grouping keys, and a set of aggregate functions.  Moving aggregates
> are
> > specified by the grouping keys (called partition keys in the SQL
> standard,
> > for what it's worth) but also specifications of ordering (for rank etc.)
> > and window length (10 rows, or 2 hours).
> >
>
>
I've been using "segment key" in place of "partition key", primarily because
"partition" often means something else to a lot of people...


> You raise an interesting point here.  The current argument structure is
> deficient.  We currently have before and after.  I think that should be
> restated to start and end indexes with negative indexes to indicate
> preceding records.  Your point here implies that we should also have a
> starting expression and an ending expression.
>
>
>> I'm not sure we're deficient.  The before and after are based within the
segment key.

>
> > Given this, I would separate the aggregate operator into a GroupBy
> > operator and a MovingAggregate operator. (The MovingAggregate operator
> > might have sub-types such as sliding and paged, as I mentioned above.)
>
>
> Should I take a stab at a revised specification for aggregation?  I dislike
> the groupBy name for an aggregation, but could be convinced by a show of
> hands.
>

>> I can take a shot at this.

Re: Moving & regular aggregates

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Jan 28, 2013 at 10:30 AM, Julian Hyde <ju...@gmail.com> wrote:

> I think it's a mistake to use the same operator for regular and moving
> aggregates. (Moving aggregates are also known as running aggregates. There
> are sub-types called sliding and paged.)
>

I think that we need a distinction.  Different operator is fine.  Flag is
fine.  I tend toward different operator, not because of the different kind
of output but rather because of the different argument pattern.  Same
answer, different rationale.

> A regular aggregate would be "Compute the total sales for each product
> each month".
>

OK.  We call this aggregate now.  The argument is a segment reference in
the current logical plan.

> A moving aggregate would be "For each sales order, compute the sales of
> that product in that region over the past 20 days".
>

OK.  We need a new name.  Nominations are open.

> Consider their output. Regular aggregates output the grouping keys and the
> aggregates. They can't output the input rows because they have been
> aggregated into a single group. Moving aggregates output the original rows
> PLUS any aggregates they compute.
>

And there is a third kind which is the running aggregate, but those are
plausibly a windowed aggregate with infinite extent backwards.

> Consider how they are specified. Regular aggregates are specified by a set
> of grouping keys, and a set of aggregate functions.  Moving aggregates are
> specified by the grouping keys (called partition keys in the SQL standard,
> for what it's worth) but also specifications of ordering (for rank etc.)
> and window length (10 rows, or 2 hours).
>

You raise an interesting point here.  The current argument structure is
deficient.  We currently have before and after.  I think that should be
restated to start and end indexes with negative indexes to indicate
preceding records.  Your point here implies that we should also have a
starting expression and an ending expression.
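
A starting and ending expression could look roughly like the sketch below, where each bound is computed from the current row's ordering value instead of from a row offset. The function expression_window_sum and the inclusive-bounds convention are assumptions for this example, and the quadratic scan is only for clarity.

```python
def expression_window_sum(rows, start_expr, end_expr):
    """rows: list of (order_value, amount) for one segment.

    For each row with ordering value v, sum the amounts of all rows whose
    ordering value lies in the inclusive range [start_expr(v), end_expr(v)].
    """
    out = []
    for v, _amount in rows:
        lo, hi = start_expr(v), end_expr(v)
        out.append(sum(a for u, a in rows if lo <= u <= hi))
    return out

# (day number, amount) for a single product and region.
daily = [(1, 10.0), (15, 5.0), (30, 1.0)]

# "Over the past 20 days": the current day and the 19 days before it.
past_20_days = expression_window_sum(
    daily,
    start_expr=lambda day: day - 19,
    end_expr=lambda day: day,
)
```

Index limits cannot express this window, because the number of rows that fall within 20 days differs from row to row.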

> Given this, I would separate the aggregate operator into a GroupBy
> operator and a MovingAggregate operator. (The MovingAggregate operator
> might have sub-types such as sliding and paged, as I mentioned above.)

Should I take a stab at a revised specification for aggregation?  I dislike
the groupBy name for an aggregation, but could be convinced by a show of
hands.