You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by Prasanth J <bu...@gmail.com> on 2012/06/21 22:28:58 UTC

Re: CUBE/ROLLUP/GROUPING SETS syntax

Hello all 

I initially implemented ROLLUP as a separate operation with the following syntax

a = ROLLUP inp BY (x,y);

which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like

GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advices to use it sparingly. 

Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax? 

Thanks
-- Prasanth

On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:

> As far as the underlying implementation, if they all use the same
> optimizations that you use in cube, then it can be LOCube. If they have
> their own optimizations etc (or could), it may be worth them having their
> own Logical operators (which might just be LOCube with flags for the time
> being) that allows us more flexibilty. But I suppose that's between you,
> eclipse, and your GSOC mentor.
> 
> 2012/5/30 Prasanth J <bu...@gmail.com>
> 
>> Thanks Alan and Jon for expressing your views.
>> 
>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>> to perform CUBE operation. So Jon's syntax seems more meaningful and concise
>> 
>> rel = CUBE rel BY (dims);
>> rel = ROLLUP rel BY (dims);
>> rel = GROUPING_SET rel BY (dims);
>> 
>> 2 reasons why I do not prefer using SQL syntax is
>> 1) I do not want to break into existing Group operator implementation :)
>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>> For ex:
>> 
>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>> ROLLUP(dim7,dim8,dim9);
>> 
>> whereas same thing can be expressed like
>> 
>> rel = ROLLUP rel BY dim0,
>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>> 
>> Thanks Alan for pointing out the way for independently managing the
>> operators in parser and logical/physical plan. So for all these operators
>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>> differentiate between these three operations.
>> 
>> But, yes we are proliferating operators in this case.
>> 
>> Thanks
>> -- Prasanth
>> 
>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>> 
>>> 
>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>> 
>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>> in
>>>> the grammar can correspond to whatever logical and physical operators
>> you
>>>> want.
>>>> 
>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>> already pretty astonishing. I don't know why we would bend over
>> backwards
>>>> to make the syntax so similar in this case when even getting to the
>> point
>>>> of doing a CUBE means understanding an object model that is pretty
>>>> different from SQL.
>>>> 
>>>> On that note,
>>>> 
>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>> 
>>>> seems really confusing. I'd much rather overload the group operating
>> than
>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>> latin is simple enough that I don't think having a rollup, group_set,
>> etc
>>>> operator will be so confusing, because they're already going to be
>> typing
>>>> that stuff in the conext of
>>>> 
>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>> more, confusing syntax for the sake of creating parallels with a
>> language
>>>> we now share very little with.
>>> 
>>> Fair points.
>>> 
>>>> 
>>>> But I won't beat it any further... if people prefer a different syntax,
>>>> that's fine. Just excited to have the features in Pig!
>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>> Jon's).
>>> 
>>> Alan.
>>> 
>>>> Jon
>>>> 
>>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
>>>> 
>>>>> Some thoughts on this:
>>>>> 
>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>> 
>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>> logical plan versus physical plan.  The choices you make in the
>> grammar are
>>>>> totally independent of the other two.  That is, you could choose the
>> syntax:
>>>>> 
>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>> 
>>>>> and still have a separate POCube operator.  When the parser sees GROUP
>> BY
>>>>> CUBE it will generate an LOCube operator for the logical plan rather
>> than
>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube
>> and
>>>>> POGroup vs. POCube.
>>>>> 
>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>> possible
>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>> close
>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>>>> exactly SQL, as it would end up looking like:
>>>>> 
>>>>> rel = GROUP rel BY CUBE (cols)
>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>> 
>>>>> The argument I see for sticking with Prasanth's approach is that GROUP
>> is
>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>> proposing
>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do
>> such
>>>>> a thing.  This makes CUBE really a separate operation.  But if we go
>> this
>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
>> rel BY
>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>> 
>>>>> Alan.
>>>>> 
>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>> 
>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>> 
>>>>>> The reason why I came with a clause rather than a separate operator
>> was
>>>>> to avoid adding additional operators to the grammar.
>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>> operator, so only LOCube and POCube operators will be added
>> additionally.
>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
>> the
>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>> 
>>>>>> I will give a quick recap of cube related operations and multiple
>> syntax
>>>>> options for achieving the same. I am also adding partial cubing and
>> rollup
>>>>> in this discussion.
>>>>>> 
>>>>>> 1) CUBE
>>>>>> 
>>>>>> Current syntax:
>>>>>> alias = CUBE rel BY (a, b);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b)
>>>>>> (a)
>>>>>> (b)
>>>>>> ()
>>>>>> 
>>>>>> 2) Partial CUBE
>>>>>> 
>>>>>> Proposed syntax:
>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b, c)
>>>>>> (a, b)
>>>>>> (a, c)
>>>>>> (a)
>>>>>> 
>>>>>> 3) ROLLUP
>>>>>> 
>>>>>> Proposed syntax 1:
>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>> 
>>>>>> Proposed syntax 2:
>>>>>> alias = CUBE rel BY (a::b);
>>>>>> 
>>>>>> Proposed syntax 3:
>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b)
>>>>>> (a)
>>>>>> ()
>>>>>> 
>>>>>> 4) Partial ROLLUP
>>>>>> 
>>>>>> Proposed syntax 1:
>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>> 
>>>>>> Proposed syntax 2:
>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>> 
>>>>>> Proposed syntax 3:
>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b, c)
>>>>>> (a, b)
>>>>>> (a)
>>>>>> 
>>>>>> 5) GROUPING SETS
>>>>>> 
>>>>>> Proposed syntax 1:
>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>> 
>>>>>> Proposed syntax 2:
>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>> 
>>>>>> Proposed syntax 3:
>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a)
>>>>>> (b, c)
>>>>>> (c)
>>>>>> 
>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>> before I start hacking the grammar file.
>>>>>> 
>>>>>> Thanks
>>>>>> -- Prasanth
>>>>>> 
>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>> 
>>>>>>> Hey Prashanth, happy hacking.
>>>>>>> 
>>>>>>> My opinion:
>>>>>>> 
>>>>>>> CUBE:
>>>>>>> 
>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>> 
>>>>>>> 
>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>> 
>>>>>>> 
>>>>>>> ROLLUP:
>>>>>>> 
>>>>>>> 
>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>> 
>>>>>>> 
>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>> with
>>>>> CUBE. IE
>>>>>>> 
>>>>>>> 
>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>> 
>>>>>>> 
>>>>>>> GROUPING SETS:
>>>>>>> 
>>>>>>> 
>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>> 
>>>>>>> 
>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>> maybe
>>>>>>> following the
>>>>>>> same pattern you could do something like:
>>>>>>> 
>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>> 
>>>>>>> As far as having, is there an optimization that can be done with a
>>>>> HAVING
>>>>>>> clause that can't be done based on the logical plan that comes
>>>>> afterwards?
>>>>>>> That seems odd to me. Since you have to materialize the result
>> anyway,
>>>>>>> can't the having clause just be a FILTER that comes after the cube? I
>>>>> don't
>>>>>>> know why we need a special syntax.
>>>>>>> 
>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>> Jon
>>>>>>> 
>>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
>>>>>>> 
>>>>>>>> Hello everyone
>>>>>>>> 
>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>> can
>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>> syntax
>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>> 
>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> -- Prasanth
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Posted by Jonathan Coveney <jc...@gmail.com>.

IMHO, I don't know that it's worth overloading the operator to support
that. If that's what it is, then they can just do what I said. I would
rather the focus be on getting ROLLUP, CUBE, etc implemented and efficient.

Curious what others think.

2012/6/21 Prasanth J <bu...@gmail.com>

> Yeah you are right.
>
> Thanks
> -- Prasanth
>
> On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote:
>
> > Just to make sure I understand this correctly, is
> >
> > out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> >
> > equivalent to:
> >
> > out1 = CUBE rel BY (a,b,c);
> > out2 = ROLLUP rel BY (c,d);
> > out3 = CUBY rel BY (e,f);
> >
> > out = CROSS out1, out2, out3;
> >
> > ?
> >
> > 2012/6/21 Prasanth J <bu...@gmail.com>
> >
> >> Hello all
> >>
> >> I initially implemented ROLLUP as a separate operation with the
> following
> >> syntax
> >>
> >> a = ROLLUP inp BY (x,y);
> >>
> >> which does the same thing as CUBE (inserting foreach + group-by in
> logical
> >> plan) except that it uses RollupDimensions UDF. But the issue with this
> >> approach is that we cannot mix CUBE and ROLLUP operations together in
> the
> >> same syntax which is a typical case. SQL/Oracle supports using CUBE and
> >> ROLLUP together like
> >>
> >> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> >>
> >> so I modified the pig grammar to support the similar usage. So now we
> can
> >> use a syntax similar to SQL
> >>
> >> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> >>
> >> In this approach, the logical plan should introduce cartesian product
> >> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
> >> generating the final output. But I read from the documentation (
> >> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS
> operator
> >> is an expensive operator and advices to use it sparingly.
> >>
> >> Is there any other way to achieve the cartesian product in a less
> >> expensive way? Also, does anyone have thoughts about this new syntax?
> >>
> >> Thanks
> >> -- Prasanth
> >>
> >> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
> >>
> >>> As far as the underlying implementation, if they all use the same
> >>> optimizations that you use in cube, then it can be LOCube. If they have
> >>> their own optimizations etc (or could), it may be worth them having
> their
> >>> own Logical operators (which might just be LOCube with flags for the
> time
> >>> being) that allows us more flexibilty. But I suppose that's between
> you,
> >>> eclipse, and your GSOC mentor.
> >>>
> >>> 2012/5/30 Prasanth J <bu...@gmail.com>
> >>>
> >>>> Thanks Alan and Jon for expressing your views.
> >>>>
> >>>> I agree with Jon's point, if the syntax contains CUBE then user
> expects
> >> it
> >>>> to perform CUBE operation. So Jon's syntax seems more meaningful and
> >> concise
> >>>>
> >>>> rel = CUBE rel BY (dims);
> >>>> rel = ROLLUP rel BY (dims);
> >>>> rel = GROUPING_SET rel BY (dims);
> >>>>
> >>>> 2 reasons why I do not prefer using SQL syntax is
> >>>> 1) I do not want to break into existing Group operator implementation
> :)
> >>>> 2) The syntax gets longer in case of partial hierarchical
> cubing/rollups
> >>>> For ex:
> >>>>
> >>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3),
> >> ROLLUP(dim4,dim5,dim6),
> >>>> ROLLUP(dim7,dim8,dim9);
> >>>>
> >>>> whereas same thing can be expressed like
> >>>>
> >>>> rel = ROLLUP rel BY dim0,
> >>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
> >>>>
> >>>> Thanks Alan for pointing out the way for independently managing the
> >>>> operators in parser and logical/physical plan. So for all these
> >> operators
> >>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags
> to
> >>>> differentiate between these three operations.
> >>>>
> >>>> But, yes we are proliferating operators in this case.
> >>>>
> >>>> Thanks
> >>>> -- Prasanth
> >>>>
> >>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
> >>>>
> >>>>>
> >>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
> >>>>>
> >>>>>> I was going to say the same thing Alan said w.r.t. operators:
> >> operators
> >>>> in
> >>>>>> the grammar can correspond to whatever logical and physical
> operators
> >>>> you
> >>>>>> want.
> >>>>>>
> >>>>>> As far as the principle of least astonishment compared to SQL... Pig
> >> is
> >>>>>> already pretty astonishing. I don't know why we would bend over
> >>>> backwards
> >>>>>> to make the syntax so similar in this case when even getting to the
> >>>> point
> >>>>>> of doing a CUBE means understanding an object model that is pretty
> >>>>>> different from SQL.
> >>>>>>
> >>>>>> On that note,
> >>>>>>
> >>>>>> rel = CUBE rel BY GROUPING SETS(cols);
> >>>>>>
> >>>>>> seems really confusing. I'd much rather overload the group operating
> >>>> than
> >>>>>> the cube operator. If I see "cube," I expect a cube. If you start
> >> doing
> >>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup.
> >> Pig
> >>>>>> latin is simple enough that I don't think having a rollup,
> group_set,
> >>>> etc
> >>>>>> operator will be so confusing, because they're already going to be
> >>>> typing
> >>>>>> that stuff in the conext of
> >>>>>>
> >>>>>> group rel by rollup(cols); and so on. I don't see how it's worth
> >> adding
> >>>>>> more, confusing syntax for the sake of creating parallels with a
> >>>> language
> >>>>>> we now share very little with.
> >>>>>
> >>>>> Fair points.
> >>>>>
> >>>>>>
> >>>>>> But I won't beat it any further... if people prefer a different
> >> syntax,
> >>>>>> that's fine. Just excited to have the features in Pig!
> >>>>> +1, I can live with any of the 3 syntax choices (near SQL, original,
> >> and
> >>>> Jon's).
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>>> Jon
> >>>>>>
> >>>>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
> >>>>>>
> >>>>>>> Some thoughts on this:
> >>>>>>>
> >>>>>>> 1) +1 to what Dmitriy said on HAVING
> >>>>>>>
> >>>>>>> 2) We need to be clear about separating operators in the grammar
> >> versus
> >>>>>>> logical plan versus physical plan.  The choices you make in the
> >>>> grammar are
> >>>>>>> totally independent of the other two.  That is, you could choose
> the
> >>>> syntax:
> >>>>>>>
> >>>>>>> rel = GROUP rel BY CUBE (a, b, c)
> >>>>>>>
> >>>>>>> and still have a separate POCube operator.  When the parser sees
> >> GROUP
> >>>> BY
> >>>>>>> CUBE it will generate an LOCube operator for the logical plan
> rather
> >>>> than
> >>>>>>> an LOGroup operator.  You can still have a separate POCube physical
> >>>>>>> operator.  Separate optimizations can be applied to LOGroup vs.
> >> LOCube
> >>>> and
> >>>>>>> POGroup vs. POCube.
> >>>>>>>
> >>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
> >>>> possible
> >>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
> >>>> close
> >>>>>>> to SQL is it conforms to the law of least astonishment.  It
> wouldn't
> >> be
> >>>>>>> exactly SQL, as it would end up looking like:
> >>>>>>>
> >>>>>>> rel = GROUP rel BY CUBE (cols)
> >>>>>>> rel = GROUP rel BY ROLLUP (cols)
> >>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
> >>>>>>>
> >>>>>>> The argument I see for sticking with Prasanth's approach is that
> >> GROUP
> >>>> is
> >>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
> >>>> proposing
> >>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to
> >> do
> >>>> such
> >>>>>>> a thing.  This makes CUBE really a separate operation.  But if we
> go
> >>>> this
> >>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and
> CUBE
> >>>> rel BY
> >>>>>>> GROUPING SETS.  Let's not proliferate operators.
> >>>>>>>
> >>>>>>> Alan.
> >>>>>>>
> >>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
> >>>>>>>
> >>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
> >>>>>>>>
> >>>>>>>> The reason why I came with a clause rather than a separate
> operator
> >>>> was
> >>>>>>> to avoid adding additional operators to the grammar.
> >>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical
> operators
> >>>>>>> adding to the complexity. I am planning to keep everything under
> cube
> >>>>>>> operator, so only LOCube and POCube operators will be added
> >>>> additionally.
> >>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause
> is
> >>>> the
> >>>>>>> same as FILTER so we do not need a separate HAVING clause.
> >>>>>>>>
> >>>>>>>> I will give a quick recap of cube related operations and multiple
> >>>> syntax
> >>>>>>> options for achieving the same. I am also adding partial cubing and
> >>>> rollup
> >>>>>>> in this discussion.
> >>>>>>>>
> >>>>>>>> 1) CUBE
> >>>>>>>>
> >>>>>>>> Current syntax:
> >>>>>>>> alias = CUBE rel BY (a, b);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b)
> >>>>>>>> (a)
> >>>>>>>> (b)
> >>>>>>>> ()
> >>>>>>>>
> >>>>>>>> 2) Partial CUBE
> >>>>>>>>
> >>>>>>>> Proposed syntax:
> >>>>>>>> alias = CUBE rel BY a, (b, c);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b, c)
> >>>>>>>> (a, b)
> >>>>>>>> (a, c)
> >>>>>>>> (a)
> >>>>>>>>
> >>>>>>>> 3) ROLLUP
> >>>>>>>>
> >>>>>>>> Proposed syntax 1:
> >>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
> >>>>>>>>
> >>>>>>>> Proposed syntax 2:
> >>>>>>>> alias = CUBE rel BY (a::b);
> >>>>>>>>
> >>>>>>>> Proposed syntax 3:
> >>>>>>>> alias = ROLLUP rel BY (a, b);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b)
> >>>>>>>> (a)
> >>>>>>>> ()
> >>>>>>>>
> >>>>>>>> 4) Partial ROLLUP
> >>>>>>>>
> >>>>>>>> Proposed syntax 1:
> >>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
> >>>>>>>>
> >>>>>>>> Proposed syntax 2:
> >>>>>>>> alias = CUBE rel BY (a, b::c);
> >>>>>>>>
> >>>>>>>> Proposed syntax 3:
> >>>>>>>> alias = ROLLUP rel BY a, (b, c);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b, c)
> >>>>>>>> (a, b)
> >>>>>>>> (a)
> >>>>>>>>
> >>>>>>>> 5) GROUPING SETS
> >>>>>>>>
> >>>>>>>> Proposed syntax 1:
> >>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
> >>>>>>>>
> >>>>>>>> Proposed syntax 2:
> >>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
> >>>>>>>>
> >>>>>>>> Proposed syntax 3:
> >>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a)
> >>>>>>>> (b, c)
> >>>>>>>> (c)
> >>>>>>>>
> >>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a
> consensus
> >>>>>>> before I start hacking the grammar file.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> -- Prasanth
> >>>>>>>>
> >>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
> >>>>>>>>
> >>>>>>>>> Hey Prashanth, happy hacking.
> >>>>>>>>>
> >>>>>>>>> My opinion:
> >>>>>>>>>
> >>>>>>>>> CUBE:
> >>>>>>>>>
> >>>>>>>>> alias = CUBE rel BY (a,b,c);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I like that syntax. It's unambiguous what is going on.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ROLLUP:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we
> did
> >>>> with
> >>>>>>> CUBE. IE
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> alias = ROLLUP rel BY (a,b,c);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> GROUPING SETS:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
> >>>> maybe
> >>>>>>>>> following the
> >>>>>>>>> same pattern you could do something like:
> >>>>>>>>>
> >>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
> >>>>>>>>>
> >>>>>>>>> As far as having, is there an optimization that can be done with
> a
> >>>>>>> HAVING
> >>>>>>>>> clause that can't be done based on the logical plan that comes
> >>>>>>> afterwards?
> >>>>>>>>> That seems odd to me. Since you have to materialize the result
> >>>> anyway,
> >>>>>>>>> can't the having clause just be a FILTER that comes after the
> >> cube? I
> >>>>>>> don't
> >>>>>>>>> know why we need a special syntax.
> >>>>>>>>>
> >>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
> >>>>>>>>> Jon
> >>>>>>>>>
> >>>>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
> >>>>>>>>>
> >>>>>>>>>> Hello everyone
> >>>>>>>>>>
> >>>>>>>>>> I am looking for feedback from the community about the syntax
> for
> >>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
> >>>>>>>>>> I am moving the discussion from JIRA to dev-list so that
> everyone
> >>>> can
> >>>>>>>>>> share their opinion for operator syntax. Please have a look at
> the
> >>>>>>> syntax
> >>>>>>>>>> proposal at the link below and let me know your opinion
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> -- Prasanth
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Posted by Prasanth J <bu...@gmail.com>.

Yeah you are right.  

Thanks
-- Prasanth

On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote:

> Just to make sure I understand this correctly, is
> 
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> 
> equivalent to:
> 
> out1 = CUBE rel BY (a,b,c);
> out2 = ROLLUP rel BY (c,d);
> out3 = CUBY rel BY (e,f);
> 
> out = CROSS out1, out2, out3;
> 
> ?
> 
> 2012/6/21 Prasanth J <bu...@gmail.com>
> 
>> Hello all
>> 
>> I initially implemented ROLLUP as a separate operation with the following
>> syntax
>> 
>> a = ROLLUP inp BY (x,y);
>> 
>> which does the same thing as CUBE (inserting foreach + group-by in logical
>> plan) except that it uses RollupDimensions UDF. But the issue with this
>> approach is that we cannot mix CUBE and ROLLUP operations together in the
>> same syntax which is a typical case. SQL/Oracle supports using CUBE and
>> ROLLUP together like
>> 
>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>> 
>> so I modified the pig grammar to support the similar usage. So now we can
>> use a syntax similar to SQL
>> 
>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>> 
>> In this approach, the logical plan should introduce cartesian product
>> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
>> generating the final output. But I read from the documentation (
>> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator
>> is an expensive operator and advices to use it sparingly.
>> 
>> Is there any other way to achieve the cartesian product in a less
>> expensive way? Also, does anyone have thoughts about this new syntax?
>> 
>> Thanks
>> -- Prasanth
>> 
>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>> 
>>> As far as the underlying implementation, if they all use the same
>>> optimizations that you use in cube, then it can be LOCube. If they have
>>> their own optimizations etc (or could), it may be worth them having their
>>> own Logical operators (which might just be LOCube with flags for the time
>>> being) that allows us more flexibilty. But I suppose that's between you,
>>> eclipse, and your GSOC mentor.
>>> 
>>> 2012/5/30 Prasanth J <bu...@gmail.com>
>>> 
>>>> Thanks Alan and Jon for expressing your views.
>>>> 
>>>> I agree with Jon's point, if the syntax contains CUBE then user expects
>> it
>>>> to perform CUBE operation. So Jon's syntax seems more meaningful and
>> concise
>>>> 
>>>> rel = CUBE rel BY (dims);
>>>> rel = ROLLUP rel BY (dims);
>>>> rel = GROUPING_SET rel BY (dims);
>>>> 
>>>> 2 reasons why I do not prefer using SQL syntax is
>>>> 1) I do not want to break into existing Group operator implementation :)
>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>> For ex:
>>>> 
>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3),
>> ROLLUP(dim4,dim5,dim6),
>>>> ROLLUP(dim7,dim8,dim9);
>>>> 
>>>> whereas same thing can be expressed like
>>>> 
>>>> rel = ROLLUP rel BY dim0,
>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>> 
>>>> Thanks Alan for pointing out the way for independently managing the
>>>> operators in parser and logical/physical plan. So for all these
>> operators
>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>> differentiate between these three operations.
>>>> 
>>>> But, yes we are proliferating operators in this case.
>>>> 
>>>> Thanks
>>>> -- Prasanth
>>>> 
>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>> 
>>>>> 
>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>> 
>>>>>> I was going to say the same thing Alan said w.r.t. operators:
>> operators
>>>> in
>>>>>> the grammar can correspond to whatever logical and physical operators
>>>> you
>>>>>> want.
>>>>>> 
>>>>>> As far as the principle of least astonishment compared to SQL... Pig
>> is
>>>>>> already pretty astonishing. I don't know why we would bend over
>>>> backwards
>>>>>> to make the syntax so similar in this case when even getting to the
>>>> point
>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>> different from SQL.
>>>>>> 
>>>>>> On that note,
>>>>>> 
>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>> 
>>>>>> seems really confusing. I'd much rather overload the group operating
>>>> than
>>>>>> the cube operator. If I see "cube," I expect a cube. If you start
>> doing
>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup.
>> Pig
>>>>>> latin is simple enough that I don't think having a rollup, group_set,
>>>> etc
>>>>>> operator will be so confusing, because they're already going to be
>>>> typing
>>>>>> that stuff in the conext of
>>>>>> 
>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth
>> adding
>>>>>> more, confusing syntax for the sake of creating parallels with a
>>>> language
>>>>>> we now share very little with.
>>>>> 
>>>>> Fair points.
>>>>> 
>>>>>> 
>>>>>> But I won't beat it any further... if people prefer a different
>> syntax,
>>>>>> that's fine. Just excited to have the features in Pig!
>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original,
>> and
>>>> Jon's).
>>>>> 
>>>>> Alan.
>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
>>>>>> 
>>>>>>> Some thoughts on this:
>>>>>>> 
>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>> 
>>>>>>> 2) We need to be clear about separating operators in the grammar
>> versus
>>>>>>> logical plan versus physical plan.  The choices you make in the
>>>> grammar are
>>>>>>> totally independent of the other two.  That is, you could choose the
>>>> syntax:
>>>>>>> 
>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>> 
>>>>>>> and still have a separate POCube operator.  When the parser sees
>> GROUP
>>>> BY
>>>>>>> CUBE it will generate an LOCube operator for the logical plan rather
>>>> than
>>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>>> operator.  Separate optimizations can be applied to LOGroup vs.
>> LOCube
>>>> and
>>>>>>> POGroup vs. POCube.
>>>>>>> 
>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>>>> possible
>>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>>>> close
>>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't
>> be
>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>> 
>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>> 
>>>>>>> The argument I see for sticking with Prasanth's approach is that
>> GROUP
>>>> is
>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>>>> proposing
>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to
>> do
>>>> such
>>>>>>> a thing.  This makes CUBE really a separate operation.  But if we go
>>>> this
>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
>>>> rel BY
>>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>>> 
>>>>>>> Alan.
>>>>>>> 
>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>> 
>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>> 
>>>>>>>> The reason why I came with a clause rather than a separate operator
>>>> was
>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>> operator, so only LOCube and POCube operators will be added
>>>> additionally.
>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
>>>> the
>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>> 
>>>>>>>> I will give a quick recap of cube related operations and multiple
>>>> syntax
>>>>>>> options for achieving the same. I am also adding partial cubing and
>>>> rollup
>>>>>>> in this discussion.
>>>>>>>> 
>>>>>>>> 1) CUBE
>>>>>>>> 
>>>>>>>> Current syntax:
>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> (b)
>>>>>>>> ()
>>>>>>>> 
>>>>>>>> 2) Partial CUBE
>>>>>>>> 
>>>>>>>> Proposed syntax:
>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a, c)
>>>>>>>> (a)
>>>>>>>> 
>>>>>>>> 3) ROLLUP
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> ()
>>>>>>>> 
>>>>>>>> 4) Partial ROLLUP
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> 
>>>>>>>> 5) GROUPING SETS
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a)
>>>>>>>> (b, c)
>>>>>>>> (c)
>>>>>>>> 
>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>> before I start hacking the grammar file.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> -- Prasanth
>>>>>>>> 
>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>> 
>>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>>> 
>>>>>>>>> My opinion:
>>>>>>>>> 
>>>>>>>>> CUBE:
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ROLLUP:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>>>> with
>>>>>>> CUBE. IE
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> GROUPING SETS:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>>>> maybe
>>>>>>>>> following the
>>>>>>>>> same pattern you could do something like:
>>>>>>>>> 
>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>> 
>>>>>>>>> As far as having, is there an optimization that can be done with a
>>>>>>> HAVING
>>>>>>>>> clause that can't be done based on the logical plan that comes
>>>>>>> afterwards?
>>>>>>>>> That seems odd to me. Since you have to materialize the result
>>>> anyway,
>>>>>>>>> can't the having clause just be a FILTER that comes after the
>> cube? I
>>>>>>> don't
>>>>>>>>> know why we need a special syntax.
>>>>>>>>> 
>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>>> Jon
>>>>>>>>> 
>>>>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
>>>>>>>>> 
>>>>>>>>>> Hello everyone
>>>>>>>>>> 
>>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>>>> can
>>>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>>>> syntax
>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> -- Prasanth
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Posted by Jonathan Coveney <jc...@gmail.com>.

Just to make sure I understand this correctly, is

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

equivalent to:

out1 = CUBE rel BY (a,b,c);
out2 = ROLLUP rel BY (c,d);
out3 = CUBY rel BY (e,f);

out = CROSS out1, out2, out3;

?

2012/6/21 Prasanth J <bu...@gmail.com>

> Hello all
>
> I initially implemented ROLLUP as a separate operation with the following
> syntax
>
> a = ROLLUP inp BY (x,y);
>
> which does the same thing as CUBE (inserting foreach + group-by in logical
> plan) except that it uses RollupDimensions UDF. But the issue with this
> approach is that we cannot mix CUBE and ROLLUP operations together in the
> same syntax which is a typical case. SQL/Oracle supports using CUBE and
> ROLLUP together like
>
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> so I modified the pig grammar to support the similar usage. So now we can
> use a syntax similar to SQL
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> In this approach, the logical plan should introduce cartesian product
> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
> generating the final output. But I read from the documentation (
> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator
> is an expensive operator and advices to use it sparingly.
>
> Is there any other way to achieve the cartesian product in a less
> expensive way? Also, does anyone have thoughts about this new syntax?
>
> Thanks
> -- Prasanth
>
> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>
> > As far as the underlying implementation, if they all use the same
> > optimizations that you use in cube, then it can be LOCube. If they have
> > their own optimizations etc (or could), it may be worth them having their
> > own Logical operators (which might just be LOCube with flags for the time
> > being) that allows us more flexibilty. But I suppose that's between you,
> > eclipse, and your GSOC mentor.
> >
> > 2012/5/30 Prasanth J <bu...@gmail.com>
> >
> >> Thanks Alan and Jon for expressing your views.
> >>
> >> I agree with Jon's point, if the syntax contains CUBE then user expects
> it
> >> to perform CUBE operation. So Jon's syntax seems more meaningful and
> concise
> >>
> >> rel = CUBE rel BY (dims);
> >> rel = ROLLUP rel BY (dims);
> >> rel = GROUPING_SET rel BY (dims);
> >>
> >> 2 reasons why I do not prefer using SQL syntax is
> >> 1) I do not want to break into existing Group operator implementation :)
> >> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
> >> For ex:
> >>
> >> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3),
> ROLLUP(dim4,dim5,dim6),
> >> ROLLUP(dim7,dim8,dim9);
> >>
> >> whereas same thing can be expressed like
> >>
> >> rel = ROLLUP rel BY dim0,
> >> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
> >>
> >> Thanks Alan for pointing out the way for independently managing the
> >> operators in parser and logical/physical plan. So for all these
> operators
> >> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
> >> differentiate between these three operations.
> >>
> >> But, yes we are proliferating operators in this case.
> >>
> >> Thanks
> >> -- Prasanth
> >>
> >> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
> >>
> >>>
> >>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
> >>>
> >>>> I was going to say the same thing Alan said w.r.t. operators:
> operators
> >> in
> >>>> the grammar can correspond to whatever logical and physical operators
> >> you
> >>>> want.
> >>>>
> >>>> As far as the principle of least astonishment compared to SQL... Pig
> is
> >>>> already pretty astonishing. I don't know why we would bend over
> >> backwards
> >>>> to make the syntax so similar in this case when even getting to the
> >> point
> >>>> of doing a CUBE means understanding an object model that is pretty
> >>>> different from SQL.
> >>>>
> >>>> On that note,
> >>>>
> >>>> rel = CUBE rel BY GROUPING SETS(cols);
> >>>>
> >>>> seems really confusing. I'd much rather overload the group operating
> >> than
> >>>> the cube operator. If I see "cube," I expect a cube. If you start
> doing
> >>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup.
> Pig
> >>>> latin is simple enough that I don't think having a rollup, group_set,
> >> etc
> >>>> operator will be so confusing, because they're already going to be
> >> typing
> >>>> that stuff in the conext of
> >>>>
> >>>> group rel by rollup(cols); and so on. I don't see how it's worth
> adding
> >>>> more, confusing syntax for the sake of creating parallels with a
> >> language
> >>>> we now share very little with.
> >>>
> >>> Fair points.
> >>>
> >>>>
> >>>> But I won't beat it any further... if people prefer a different
> syntax,
> >>>> that's fine. Just excited to have the features in Pig!
> >>> +1, I can live with any of the 3 syntax choices (near SQL, original,
> and
> >> Jon's).
> >>>
> >>> Alan.
> >>>
> >>>> Jon
> >>>>
> >>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
> >>>>
> >>>>> Some thoughts on this:
> >>>>>
> >>>>> 1) +1 to what Dmitriy said on HAVING
> >>>>>
> >>>>> 2) We need to be clear about separating operators in the grammar
> versus
> >>>>> logical plan versus physical plan.  The choices you make in the
> >> grammar are
> >>>>> totally independent of the other two.  That is, you could choose the
> >> syntax:
> >>>>>
> >>>>> rel = GROUP rel BY CUBE (a, b, c)
> >>>>>
> >>>>> and still have a separate POCube operator.  When the parser sees
> GROUP
> >> BY
> >>>>> CUBE it will generate an LOCube operator for the logical plan rather
> >> than
> >>>>> an LOGroup operator.  You can still have a separate POCube physical
> >>>>> operator.  Separate optimizations can be applied to LOGroup vs.
> LOCube
> >> and
> >>>>> POGroup vs. POCube.
> >>>>>
> >>>>> 3) On syntax I can see arguments for keeping as close to SQL as
> >> possible
> >>>>> and for the syntax proposed by Prasanth.  The argument for sticking
> >> close
> >>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't
> be
> >>>>> exactly SQL, as it would end up looking like:
> >>>>>
> >>>>> rel = GROUP rel BY CUBE (cols)
> >>>>> rel = GROUP rel BY ROLLUP (cols)
> >>>>> rel = GROUP rel BY GROUPING SETS(cols);
> >>>>>
> >>>>> The argument I see for sticking with Prasanth's approach is that
> GROUP
> >> is
> >>>>> really short for COGROUP in Pig Latin, and I don't think we're
> >> proposing
> >>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to
> do
> >> such
> >>>>> a thing.  This makes CUBE really a separate operation.  But if we go
> >> this
> >>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
> >> rel BY
> >>>>> GROUPING SETS.  Let's not proliferate operators.
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
> >>>>>
> >>>>>> Thanks Jonathan for looking into it and for your suggestions.
> >>>>>>
> >>>>>> The reason why I came with a clause rather than a separate operator
> >> was
> >>>>> to avoid adding additional operators to the grammar.
> >>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
> >>>>> adding to the complexity. I am planning to keep everything under cube
> >>>>> operator, so only LOCube and POCube operators will be added
> >> additionally.
> >>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
> >> the
> >>>>> same as FILTER so we do not need a separate HAVING clause.
> >>>>>>
> >>>>>> I will give a quick recap of cube related operations and multiple
> >> syntax
> >>>>> options for achieving the same. I am also adding partial cubing and
> >> rollup
> >>>>> in this discussion.
> >>>>>>
> >>>>>> 1) CUBE
> >>>>>>
> >>>>>> Current syntax:
> >>>>>> alias = CUBE rel BY (a, b);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b)
> >>>>>> (a)
> >>>>>> (b)
> >>>>>> ()
> >>>>>>
> >>>>>> 2) Partial CUBE
> >>>>>>
> >>>>>> Proposed syntax:
> >>>>>> alias = CUBE rel BY a, (b, c);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b, c)
> >>>>>> (a, b)
> >>>>>> (a, c)
> >>>>>> (a)
> >>>>>>
> >>>>>> 3) ROLLUP
> >>>>>>
> >>>>>> Proposed syntax 1:
> >>>>>> alias = CUBE rel BY ROLLUP(a, b);
> >>>>>>
> >>>>>> Proposed syntax 2:
> >>>>>> alias = CUBE rel BY (a::b);
> >>>>>>
> >>>>>> Proposed syntax 3:
> >>>>>> alias = ROLLUP rel BY (a, b);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b)
> >>>>>> (a)
> >>>>>> ()
> >>>>>>
> >>>>>> 4) Partial ROLLUP
> >>>>>>
> >>>>>> Proposed syntax 1:
> >>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
> >>>>>>
> >>>>>> Proposed syntax 2:
> >>>>>> alias = CUBE rel BY (a, b::c);
> >>>>>>
> >>>>>> Proposed syntax 3:
> >>>>>> alias = ROLLUP rel BY a, (b, c);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b, c)
> >>>>>> (a, b)
> >>>>>> (a)
> >>>>>>
> >>>>>> 5) GROUPING SETS
> >>>>>>
> >>>>>> Proposed syntax 1:
> >>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
> >>>>>>
> >>>>>> Proposed syntax 2:
> >>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
> >>>>>>
> >>>>>> Proposed syntax 3:
> >>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a)
> >>>>>> (b, c)
> >>>>>> (c)
> >>>>>>
> >>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
> >>>>> before I start hacking the grammar file.
> >>>>>>
> >>>>>> Thanks
> >>>>>> -- Prasanth
> >>>>>>
> >>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
> >>>>>>
> >>>>>>> Hey Prashanth, happy hacking.
> >>>>>>>
> >>>>>>> My opinion:
> >>>>>>>
> >>>>>>> CUBE:
> >>>>>>>
> >>>>>>> alias = CUBE rel BY (a,b,c);
> >>>>>>>
> >>>>>>>
> >>>>>>> I like that syntax. It's unambiguous what is going on.
> >>>>>>>
> >>>>>>>
> >>>>>>> ROLLUP:
> >>>>>>>
> >>>>>>>
> >>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
> >>>>>>>
> >>>>>>>
> >>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
> >> with
> >>>>> CUBE. IE
> >>>>>>>
> >>>>>>>
> >>>>>>> alias = ROLLUP rel BY (a,b,c);
> >>>>>>>
> >>>>>>>
> >>>>>>> GROUPING SETS:
> >>>>>>>
> >>>>>>>
> >>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
> >>>>>>>
> >>>>>>>
> >>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
> >> maybe
> >>>>>>> following the
> >>>>>>> same pattern you could do something like:
> >>>>>>>
> >>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
> >>>>>>>
> >>>>>>> As far as having, is there an optimization that can be done with a
> >>>>> HAVING
> >>>>>>> clause that can't be done based on the logical plan that comes
> >>>>> afterwards?
> >>>>>>> That seems odd to me. Since you have to materialize the result
> >> anyway,
> >>>>>>> can't the having clause just be a FILTER that comes after the
> cube? I
> >>>>> don't
> >>>>>>> know why we need a special syntax.
> >>>>>>>
> >>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
> >>>>>>> Jon
> >>>>>>>
> >>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
> >>>>>>>
> >>>>>>>> Hello everyone
> >>>>>>>>
> >>>>>>>> I am looking for feedback from the community about the syntax for
> >>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
> >>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
> >> can
> >>>>>>>> share their opinion for operator syntax. Please have a look at the
> >>>>> syntax
> >>>>>>>> proposal at the link below and let me know your opinion
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> -- Prasanth
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

One happens on the mapper.

On Thu, Jun 21, 2012 at 2:52 PM, Prasanth J <bu...@gmail.com> wrote:
> Thanks Alan.
> Your suggestion looks correct.
>
> I think with this I can achieve what I wanted in the same syntax
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> Just curious to know.
> How is this different from CROSS? and why is CROSS expensive when compared to flatten?
>
> Thanks
> -- Prasanth
>
> On Jun 21, 2012, at 5:11 PM, Alan Gates wrote:
>
>> I think I'm missing something here.  The result of the "out =" line is three bags, correct?  If that's the case, the cross product you want is achieved by doing:
>>
>> result = foreach out generate flatten($0), flatten($1), flatten($2)
>>
>> This is not the same as CROSS, which would be expensive.
>>
>> Alan.
>>
>> On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:
>>
>>> Hello all
>>>
>>> I initially implemented ROLLUP as a separate operation with the following syntax
>>>
>>> a = ROLLUP inp BY (x,y);
>>>
>>> which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like
>>>
>>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL
>>>
>>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advices to use it sparingly.
>>>
>>> Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax?
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>>>
>>>> As far as the underlying implementation, if they all use the same
>>>> optimizations that you use in cube, then it can be LOCube. If they have
>>>> their own optimizations etc (or could), it may be worth them having their
>>>> own Logical operators (which might just be LOCube with flags for the time
>>>> being) that allows us more flexibilty. But I suppose that's between you,
>>>> eclipse, and your GSOC mentor.
>>>>
>>>> 2012/5/30 Prasanth J <bu...@gmail.com>
>>>>
>>>>> Thanks Alan and Jon for expressing your views.
>>>>>
>>>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>>>> to perform CUBE operation. So Jon's syntax seems more meaningful and concise
>>>>>
>>>>> rel = CUBE rel BY (dims);
>>>>> rel = ROLLUP rel BY (dims);
>>>>> rel = GROUPING_SET rel BY (dims);
>>>>>
>>>>> 2 reasons why I do not prefer using SQL syntax is
>>>>> 1) I do not want to break into existing Group operator implementation :)
>>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>>> For ex:
>>>>>
>>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>>>> ROLLUP(dim7,dim8,dim9);
>>>>>
>>>>> whereas same thing can be expressed like
>>>>>
>>>>> rel = ROLLUP rel BY dim0,
>>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>>>
>>>>> Thanks Alan for pointing out the way for independently managing the
>>>>> operators in parser and logical/physical plan. So for all these operators
>>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>>> differentiate between these three operations.
>>>>>
>>>>> But, yes we are proliferating operators in this case.
>>>>>
>>>>> Thanks
>>>>> -- Prasanth
>>>>>
>>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>>>
>>>>>>
>>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>>>
>>>>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>>>>> in
>>>>>>> the grammar can correspond to whatever logical and physical operators
>>>>> you
>>>>>>> want.
>>>>>>>
>>>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>>>> already pretty astonishing. I don't know why we would bend over
>>>>> backwards
>>>>>>> to make the syntax so similar in this case when even getting to the
>>>>> point
>>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>>> different from SQL.
>>>>>>>
>>>>>>> On that note,
>>>>>>>
>>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>>>
>>>>>>> seems really confusing. I'd much rather overload the group operating
>>>>> than
>>>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>>>> latin is simple enough that I don't think having a rollup, group_set,
>>>>> etc
>>>>>>> operator will be so confusing, because they're already going to be
>>>>> typing
>>>>>>> that stuff in the conext of
>>>>>>>
>>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>>>> more, confusing syntax for the sake of creating parallels with a
>>>>> language
>>>>>>> we now share very little with.
>>>>>>
>>>>>> Fair points.
>>>>>>
>>>>>>>
>>>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>>>> that's fine. Just excited to have the features in Pig!
>>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>>> Jon's).
>>>>>>
>>>>>> Alan.
>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
>>>>>>>
>>>>>>>> Some thoughts on this:
>>>>>>>>
>>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>>>
>>>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>>>> logical plan versus physical plan.  The choices you make in the
>>>>> grammar are
>>>>>>>> totally independent of the other two.  That is, you could choose the
>>>>> syntax:
>>>>>>>>
>>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>>>
>>>>>>>> and still have a separate POCube operator.  When the parser sees GROUP
>>>>> BY
>>>>>>>> CUBE it will generate an LOCube operator for the logical plan rather
>>>>> than
>>>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube
>>>>> and
>>>>>>>> POGroup vs. POCube.
>>>>>>>>
>>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>>>>> possible
>>>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>>>>> close
>>>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>>>
>>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>>>
>>>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP
>>>>> is
>>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>>>>> proposing
>>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do
>>>>> such
>>>>>>>> a thing.  This makes CUBE really a separate operation.  But if we go
>>>>> this
>>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
>>>>> rel BY
>>>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>>>>
>>>>>>>> Alan.
>>>>>>>>
>>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>>>
>>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>>>
>>>>>>>>> The reason why I came with a clause rather than a separate operator
>>>>> was
>>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>>> operator, so only LOCube and POCube operators will be added
>>>>> additionally.
>>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
>>>>> the
>>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>>>
>>>>>>>>> I will give a quick recap of cube related operations and multiple
>>>>> syntax
>>>>>>>> options for achieving the same. I am also adding partial cubing and
>>>>> rollup
>>>>>>>> in this discussion.
>>>>>>>>>
>>>>>>>>> 1) CUBE
>>>>>>>>>
>>>>>>>>> Current syntax:
>>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>> (b)
>>>>>>>>> ()
>>>>>>>>>
>>>>>>>>> 2) Partial CUBE
>>>>>>>>>
>>>>>>>>> Proposed syntax:
>>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b, c)
>>>>>>>>> (a, b)
>>>>>>>>> (a, c)
>>>>>>>>> (a)
>>>>>>>>>
>>>>>>>>> 3) ROLLUP
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>> ()
>>>>>>>>>
>>>>>>>>> 4) Partial ROLLUP
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b, c)
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>>
>>>>>>>>> 5) GROUPING SETS
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a)
>>>>>>>>> (b, c)
>>>>>>>>> (c)
>>>>>>>>>
>>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>>> before I start hacking the grammar file.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -- Prasanth
>>>>>>>>>
>>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>>>
>>>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>>>>
>>>>>>>>>> My opinion:
>>>>>>>>>>
>>>>>>>>>> CUBE:
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ROLLUP:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>>>>> with
>>>>>>>> CUBE. IE
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> GROUPING SETS:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>>>>> maybe
>>>>>>>>>> following the
>>>>>>>>>> same pattern you could do something like:
>>>>>>>>>>
>>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>>>
>>>>>>>>>> As far as having, is there an optimization that can be done with a
>>>>>>>> HAVING
>>>>>>>>>> clause that can't be done based on the logical plan that comes
>>>>>>>> afterwards?
>>>>>>>>>> That seems odd to me. Since you have to materialize the result
>>>>> anyway,
>>>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I
>>>>>>>> don't
>>>>>>>>>> know why we need a special syntax.
>>>>>>>>>>
>>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> Hello everyone
>>>>>>>>>>>
>>>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>>>>> can
>>>>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>>>>> syntax
>>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -- Prasanth
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Posted by Prasanth J <bu...@gmail.com>.

Thanks Alan.
Your suggestion looks correct.

I think with this I can achieve what I wanted in the same syntax 
out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); 

Just curious to know. 
How is this different from CROSS? and why is CROSS expensive when compared to flatten?

Thanks
-- Prasanth

On Jun 21, 2012, at 5:11 PM, Alan Gates wrote:

> I think I'm missing something here.  The result of the "out =" line is three bags, correct?  If that's the case, the cross product you want is achieved by doing:
> 
> result = foreach out generate flatten($0), flatten($1), flatten($2)
> 
> This is not the same as CROSS, which would be expensive.
> 
> Alan.
> 
> On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:
> 
>> Hello all 
>> 
>> I initially implemented ROLLUP as a separate operation with the following syntax
>> 
>> a = ROLLUP inp BY (x,y);
>> 
>> which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like
>> 
>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>> 
>> so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL
>> 
>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>> 
>> In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advices to use it sparingly. 
>> 
>> Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax? 
>> 
>> Thanks
>> -- Prasanth
>> 
>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>> 
>>> As far as the underlying implementation, if they all use the same
>>> optimizations that you use in cube, then it can be LOCube. If they have
>>> their own optimizations etc (or could), it may be worth them having their
>>> own Logical operators (which might just be LOCube with flags for the time
>>> being) that allows us more flexibilty. But I suppose that's between you,
>>> eclipse, and your GSOC mentor.
>>> 
>>> 2012/5/30 Prasanth J <bu...@gmail.com>
>>> 
>>>> Thanks Alan and Jon for expressing your views.
>>>> 
>>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>>> to perform CUBE operation. So Jon's syntax seems more meaningful and concise
>>>> 
>>>> rel = CUBE rel BY (dims);
>>>> rel = ROLLUP rel BY (dims);
>>>> rel = GROUPING_SET rel BY (dims);
>>>> 
>>>> 2 reasons why I do not prefer using SQL syntax is
>>>> 1) I do not want to break into existing Group operator implementation :)
>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>> For ex:
>>>> 
>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>>> ROLLUP(dim7,dim8,dim9);
>>>> 
>>>> whereas same thing can be expressed like
>>>> 
>>>> rel = ROLLUP rel BY dim0,
>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>> 
>>>> Thanks Alan for pointing out the way for independently managing the
>>>> operators in parser and logical/physical plan. So for all these operators
>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>> differentiate between these three operations.
>>>> 
>>>> But, yes we are proliferating operators in this case.
>>>> 
>>>> Thanks
>>>> -- Prasanth
>>>> 
>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>> 
>>>>> 
>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>> 
>>>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>>>> in
>>>>>> the grammar can correspond to whatever logical and physical operators
>>>> you
>>>>>> want.
>>>>>> 
>>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>>> already pretty astonishing. I don't know why we would bend over
>>>> backwards
>>>>>> to make the syntax so similar in this case when even getting to the
>>>> point
>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>> different from SQL.
>>>>>> 
>>>>>> On that note,
>>>>>> 
>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>> 
>>>>>> seems really confusing. I'd much rather overload the group operating
>>>> than
>>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>>> latin is simple enough that I don't think having a rollup, group_set,
>>>> etc
>>>>>> operator will be so confusing, because they're already going to be
>>>> typing
>>>>>> that stuff in the conext of
>>>>>> 
>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>>> more, confusing syntax for the sake of creating parallels with a
>>>> language
>>>>>> we now share very little with.
>>>>> 
>>>>> Fair points.
>>>>> 
>>>>>> 
>>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>>> that's fine. Just excited to have the features in Pig!
>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>> Jon's).
>>>>> 
>>>>> Alan.
>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
>>>>>> 
>>>>>>> Some thoughts on this:
>>>>>>> 
>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>> 
>>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>>> logical plan versus physical plan.  The choices you make in the
>>>> grammar are
>>>>>>> totally independent of the other two.  That is, you could choose the
>>>> syntax:
>>>>>>> 
>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>> 
>>>>>>> and still have a separate POCube operator.  When the parser sees GROUP
>>>> BY
>>>>>>> CUBE it will generate an LOCube operator for the logical plan rather
>>>> than
>>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube
>>>> and
>>>>>>> POGroup vs. POCube.
>>>>>>> 
>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>>>> possible
>>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>>>> close
>>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>> 
>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>> 
>>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP
>>>> is
>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>>>> proposing
>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do
>>>> such
>>>>>>> a thing.  This makes CUBE really a separate operation.  But if we go
>>>> this
>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
>>>> rel BY
>>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>>> 
>>>>>>> Alan.
>>>>>>> 
>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>> 
>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>> 
>>>>>>>> The reason why I came with a clause rather than a separate operator
>>>> was
>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>> operator, so only LOCube and POCube operators will be added
>>>> additionally.
>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
>>>> the
>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>> 
>>>>>>>> I will give a quick recap of cube related operations and multiple
>>>> syntax
>>>>>>> options for achieving the same. I am also adding partial cubing and
>>>> rollup
>>>>>>> in this discussion.
>>>>>>>> 
>>>>>>>> 1) CUBE
>>>>>>>> 
>>>>>>>> Current syntax:
>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> (b)
>>>>>>>> ()
>>>>>>>> 
>>>>>>>> 2) Partial CUBE
>>>>>>>> 
>>>>>>>> Proposed syntax:
>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a, c)
>>>>>>>> (a)
>>>>>>>> 
>>>>>>>> 3) ROLLUP
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> ()
>>>>>>>> 
>>>>>>>> 4) Partial ROLLUP
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> 
>>>>>>>> 5) GROUPING SETS
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a)
>>>>>>>> (b, c)
>>>>>>>> (c)
>>>>>>>> 
>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>> before I start hacking the grammar file.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> -- Prasanth
>>>>>>>> 
>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>> 
>>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>>> 
>>>>>>>>> My opinion:
>>>>>>>>> 
>>>>>>>>> CUBE:
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ROLLUP:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>>>> with
>>>>>>> CUBE. IE
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> GROUPING SETS:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>>>> maybe
>>>>>>>>> following the
>>>>>>>>> same pattern you could do something like:
>>>>>>>>> 
>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>> 
>>>>>>>>> As far as having, is there an optimization that can be done with a
>>>>>>> HAVING
>>>>>>>>> clause that can't be done based on the logical plan that comes
>>>>>>> afterwards?
>>>>>>>>> That seems odd to me. Since you have to materialize the result
>>>> anyway,
>>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I
>>>>>>> don't
>>>>>>>>> know why we need a special syntax.
>>>>>>>>> 
>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>>> Jon
>>>>>>>>> 
>>>>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
>>>>>>>>> 
>>>>>>>>>> Hello everyone
>>>>>>>>>> 
>>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>>>> can
>>>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>>>> syntax
>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> -- Prasanth
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Posted by Alan Gates <ga...@hortonworks.com>.

I think I'm missing something here.  The result of the "out =" line is three bags, correct?  If that's the case, the cross product you want is achieved by doing:

result = foreach out generate flatten($0), flatten($1), flatten($2)

This is not the same as CROSS, which would be expensive.

Alan.

On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:

> Hello all 
> 
> I initially implemented ROLLUP as a separate operation with the following syntax
> 
> a = ROLLUP inp BY (x,y);
> 
> which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like
> 
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> 
> so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL
> 
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> 
> In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advices to use it sparingly. 
> 
> Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax? 
> 
> Thanks
> -- Prasanth
> 
> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
> 
>> As far as the underlying implementation, if they all use the same
>> optimizations that you use in cube, then it can be LOCube. If they have
>> their own optimizations etc (or could), it may be worth them having their
>> own Logical operators (which might just be LOCube with flags for the time
>> being) that allows us more flexibilty. But I suppose that's between you,
>> eclipse, and your GSOC mentor.
>> 
>> 2012/5/30 Prasanth J <bu...@gmail.com>
>> 
>>> Thanks Alan and Jon for expressing your views.
>>> 
>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>> to perform CUBE operation. So Jon's syntax seems more meaningful and concise
>>> 
>>> rel = CUBE rel BY (dims);
>>> rel = ROLLUP rel BY (dims);
>>> rel = GROUPING_SET rel BY (dims);
>>> 
>>> 2 reasons why I do not prefer using SQL syntax is
>>> 1) I do not want to break into existing Group operator implementation :)
>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>> For ex:
>>> 
>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>> ROLLUP(dim7,dim8,dim9);
>>> 
>>> whereas same thing can be expressed like
>>> 
>>> rel = ROLLUP rel BY dim0,
>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>> 
>>> Thanks Alan for pointing out the way for independently managing the
>>> operators in parser and logical/physical plan. So for all these operators
>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>> differentiate between these three operations.
>>> 
>>> But, yes we are proliferating operators in this case.
>>> 
>>> Thanks
>>> -- Prasanth
>>> 
>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>> 
>>>> 
>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>> 
>>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>>> in
>>>>> the grammar can correspond to whatever logical and physical operators
>>> you
>>>>> want.
>>>>> 
>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>> already pretty astonishing. I don't know why we would bend over
>>> backwards
>>>>> to make the syntax so similar in this case when even getting to the
>>> point
>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>> different from SQL.
>>>>> 
>>>>> On that note,
>>>>> 
>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>> 
>>>>> seems really confusing. I'd much rather overload the group operating
>>> than
>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>> latin is simple enough that I don't think having a rollup, group_set,
>>> etc
>>>>> operator will be so confusing, because they're already going to be
>>> typing
>>>>> that stuff in the conext of
>>>>> 
>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>> more, confusing syntax for the sake of creating parallels with a
>>> language
>>>>> we now share very little with.
>>>> 
>>>> Fair points.
>>>> 
>>>>> 
>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>> that's fine. Just excited to have the features in Pig!
>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>> Jon's).
>>>> 
>>>> Alan.
>>>> 
>>>>> Jon
>>>>> 
>>>>> 2012/5/30 Alan Gates <ga...@hortonworks.com>
>>>>> 
>>>>>> Some thoughts on this:
>>>>>> 
>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>> 
>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>> logical plan versus physical plan.  The choices you make in the
>>> grammar are
>>>>>> totally independent of the other two.  That is, you could choose the
>>> syntax:
>>>>>> 
>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>> 
>>>>>> and still have a separate POCube operator.  When the parser sees GROUP
>>> BY
>>>>>> CUBE it will generate an LOCube operator for the logical plan rather
>>> than
>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube
>>> and
>>>>>> POGroup vs. POCube.
>>>>>> 
>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>>> possible
>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>>> close
>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>>>>> exactly SQL, as it would end up looking like:
>>>>>> 
>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>> 
>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP
>>> is
>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>>> proposing
>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do
>>> such
>>>>>> a thing.  This makes CUBE really a separate operation.  But if we go
>>> this
>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
>>> rel BY
>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>> 
>>>>>> Alan.
>>>>>> 
>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>> 
>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>> 
>>>>>>> The reason why I came with a clause rather than a separate operator
>>> was
>>>>>> to avoid adding additional operators to the grammar.
>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>> operator, so only LOCube and POCube operators will be added
>>> additionally.
>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
>>> the
>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>> 
>>>>>>> I will give a quick recap of cube related operations and multiple
>>> syntax
>>>>>> options for achieving the same. I am also adding partial cubing and
>>> rollup
>>>>>> in this discussion.
>>>>>>> 
>>>>>>> 1) CUBE
>>>>>>> 
>>>>>>> Current syntax:
>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> (b)
>>>>>>> ()
>>>>>>> 
>>>>>>> 2) Partial CUBE
>>>>>>> 
>>>>>>> Proposed syntax:
>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b, c)
>>>>>>> (a, b)
>>>>>>> (a, c)
>>>>>>> (a)
>>>>>>> 
>>>>>>> 3) ROLLUP
>>>>>>> 
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>> 
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>> 
>>>>>>> Proposed syntax 3:
>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> ()
>>>>>>> 
>>>>>>> 4) Partial ROLLUP
>>>>>>> 
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>> 
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>> 
>>>>>>> Proposed syntax 3:
>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b, c)
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> 
>>>>>>> 5) GROUPING SETS
>>>>>>> 
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>> 
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>> 
>>>>>>> Proposed syntax 3:
>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a)
>>>>>>> (b, c)
>>>>>>> (c)
>>>>>>> 
>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>> before I start hacking the grammar file.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> -- Prasanth
>>>>>>> 
>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>> 
>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>> 
>>>>>>>> My opinion:
>>>>>>>> 
>>>>>>>> CUBE:
>>>>>>>> 
>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ROLLUP:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>>> with
>>>>>> CUBE. IE
>>>>>>>> 
>>>>>>>> 
>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> GROUPING SETS:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>>> maybe
>>>>>>>> following the
>>>>>>>> same pattern you could do something like:
>>>>>>>> 
>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>> 
>>>>>>>> As far as having, is there an optimization that can be done with a
>>>>>> HAVING
>>>>>>>> clause that can't be done based on the logical plan that comes
>>>>>> afterwards?
>>>>>>>> That seems odd to me. Since you have to materialize the result
>>> anyway,
>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I
>>>>>> don't
>>>>>>>> know why we need a special syntax.
>>>>>>>> 
>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>> Jon
>>>>>>>> 
>>>>>>>> 2012/5/27 Prasanth J <bu...@gmail.com>
>>>>>>>> 
>>>>>>>>> Hello everyone
>>>>>>>>> 
>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>>> can
>>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>>> syntax
>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> -- Prasanth
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>