You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2011/07/15 17:12:00 UTC

[jira] [Created] (PIG-2167) CUBE operation in Pig

CUBE operation in Pig
---------------------

                 Key: PIG-2167
                 URL: https://issues.apache.org/jira/browse/PIG-2167
             Project: Pig
          Issue Type: New Feature
            Reporter: Dmitriy V. Ryaboy
             Fix For: 0.10


Computing aggregates over a cube of several dimensions is a common operation in data warehousing.

The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".

A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk

We can start with the naive solution which only works for algebraic measures, and work up from there.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252251#comment-13252251 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Submitting patch(version 2) with updates based on code review comments by Dmitriy. 
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265665#comment-13265665 ] 

Prasanth J commented on PIG-2167:
---------------------------------

I am not able to attach the patch(created using git diff) to review board. So attaching it here. 
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282179#comment-13282179 ] 

Russell Jurney commented on PIG-2167:
-------------------------------------

Thanks prasanth, but which JIRA do you mean? PIG-2170 hasn't been updated since august?
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Prasanth J
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2167:
----------------------------

    Description: 
Computing aggregates over a cube of several dimensions is a common operation in data warehousing.

The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".

A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk

We can start with the naive solution which only works for algebraic measures, and work up from there.

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

  was:
Computing aggregates over a cube of several dimensions is a common operation in data warehousing.

The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".

A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk

We can start with the naive solution which only works for algebraic measures, and work up from there.

    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265658#comment-13265658 ] 

jiraposter@reviews.apache.org commented on PIG-2167:
----------------------------------------------------



bq.  On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java, line 74
bq.  > <https://reviews.apache.org/r/4670/diff/1-2/?file=100633#file100633line74>
bq.  >
bq.  >     I see, you cut this out in LogicalPlanGenerator so you assume this won't get called before the plan is generated. That'll probably work for now. We should definitely fix this in the next stage.

Yeah. Will fix that in the next stage (implementing physical operator).


bq.  On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java, line 561
bq.  > <https://reviews.apache.org/r/4670/diff/1-2/?file=100639#file100639line561>
bq.  >
bq.  >     you can merge this try/catch with the one above.

Done.


bq.  On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java, line 58
bq.  > <https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line58>
bq.  >
bq.  >     try Util.createFile(String[] data)
bq.  >     
bq.  >     Even better, you can now use the new MockLoader from PIG-2650 (save a bit more time and complexity on the tests..)

Updated. Now all test cases use MockLoader.


bq.  On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java, line 302
bq.  > <https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line302>
bq.  >
bq.  >     return here

Updated.


bq.  On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java, line 318
bq.  > <https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line318>
bq.  >
bq.  >     Assert.fail("Expected to throw an exception when duplicate dimensions are detected!");
bq.  >     
bq.  >     (otherwise your test passes even if this doesn't work!)

Updated. 


bq.  On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java, line 74
bq.  > <https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line74>
bq.  >
bq.  >     ExecMode.LOCAL iirc

PigServer constructor accepts string ('local') or ExecType (ExecType.LOCAL). If its a string it internally converts to corresponding ExecType. 


- Prasanth_J


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/#review7373
-----------------------------------------------------------


On 2012-04-30 04:05:03, Prasanth_J wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4670/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-30 04:05:03)
bq.  
bq.  
bq.  Review request for pig and Dmitriy Ryaboy.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This is a review board for https://issues.apache.org/jira/browse/PIG-2167
bq.  
bq.  
bq.  This addresses bug PIG-2167.
bq.      https://issues.apache.org/jira/browse/PIG-2167
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/4670/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Unit tests: All passed
bq.  
bq.  Pre-commit tests: All passed
bq.  ant clean test-commit
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Prasanth_J
bq.  
bq.


                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Olga Natkovich (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-2167:
--------------------------------

    Fix Version/s:     (was: 0.10)
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277927#comment-13277927 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

All looks good except for HAVING. The reason SQL has a HAVING clause is that it doesn't have a way to express post-group filtering without using subselects. In Pig Latin, we do -- it's just a FILTER clause that follows your grouping clause. So the same effect can be achieved:

{code}
alias = CUBE rel BY (a,b,c);
filtered_alias = FILTER alias BY SUM(rel.d) > 10;
{code}

Note that this assumes the schema of alias is (group, rel) where "rel" is a complex tuple representing a matching row from the original relation (the same way the GROUP operator currently works).

D
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264642#comment-13264642 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

Prasanth, seems like you didn't update the RB?

Makes it hard to see what changed between v.1 and v.2
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240011#comment-13240011 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

Assigning to self as per apache guidelines, as I'd like to mentor this. Putting my application in right now.
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated PIG-2167:
----------------------------

    Attachment: PIG-2167.3.patch
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279941#comment-13279941 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

Naive Cube implementation looks good. +1, will commit.

Let's create subtickets for incremental improvements as you have outlined them in the GSoC proposal, and continue discussion in those. Might as well create a subticket for the naive one (otherwise I have nothing to close out :)).
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269090#comment-13269090 ] 

jiraposter@reviews.apache.org commented on PIG-2167:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/
-----------------------------------------------------------

(Updated 2012-05-06 00:13:16.285229)


Review request for pig and Dmitriy Ryaboy.


Changes
-------

Incorporated Dmitriy's and Julien's review comments. 

Unit tests: All passed

Pre-commit tests: All passed
ant clean test-commit


Summary
-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167


This addresses bug PIG-2167.
    https://issues.apache.org/jira/browse/PIG-2167


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1334534 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1334534 

Diff: https://reviews.apache.org/r/4670/diff


Testing
-------

Unit tests: All passed

Pre-commit tests: All passed
ant clean test-commit


Thanks,

Prasanth_J


                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy reassigned PIG-2167:
--------------------------------------

    Assignee: Dmitriy V. Ryaboy
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277644#comment-13277644 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Hello everyone

I will be working with Dmitriy on this issue for GSoC 2012 which is about to start next week (May 21). Before getting started with the project I would like to hear from the community about the syntax for CUBE (and related) operators. As a part of this project I also have plans for implementing hierarchy based grouping, partial grouping and monotonicity based pruning using HAVING clause. 

Following are the list of tasks that require modifications to the grammar file and for each of them I have included SQL/Oracle syntax, the proposed syntax for pig and the corresponding output groups. I would like to know feedback for each of the suggested syntax and finalize the syntax for all of them.

1) *CUBE* operation (all combinations of grouping)
_SQL/Oracle syntax_
{noformat}GROUP BY CUBE (a,b,c);{noformat}

_Current syntax in Pig_
{noformat}alias = CUBE rel BY (a,b,c);{noformat}

_Output groups:_
{(a,b,c), (a,b), (b,c), (a,c), (a), (b), (c), ()}

2) *ROLLUP* operation (hierarchy based grouping)
_SQL/Oracle syntax_
{noformat}GROUP BY ROLLUP (a,b,c);{noformat}

_Proposed syntax for Pig:_
{noformat}alias = CUBE rel BY ROLLUP(a,b,c);{noformat}

_Output groups:_
{(a,b,c), (a,b), (a), ()}

3) *GROUPING SETS* operation (partial grouping)
_SQL/Oracle syntax_
{noformat}GROUP BY GROUPING SETS ((a,b),(b),());{noformat}

_Proposed syntax for Pig:_
{noformat}alias = CUBE rel BY GROUPING SETS((a,b),(b),());{noformat}

_Output groups:_
{(a,b), (b), ()}

4) *HAVING* clause (for pruning the groups that does not satisfy the specified condition)
_SQL/Oracle syntax:_
{noformat}GROUP BY CUBE (x,y)  
HAVING SUM(z)>100{noformat}

_Proposed syntax for Pig:_
{noformat}alias = CUBE rel BY (a,b,c) HAVING SUM(d)>100;{noformat}

_Output groups:_
Only the groups that satisfy the specified condition appears in the output

The primary reasons for having a separate CUBE operator are
* Ease of implementation without modification to existing GROUP BY operator implementation
* Ability to have a separate Physical Operator (POCube)
* Cube specific optimizations can be applied to the physical operator
* Some operations applicable to GROUP operator are not applicable for CUBE
** Constant expression evaluation
** Duplicate column projection

Please let me know your feedback/suggestions for the operator syntax. 

*References:*
* Oracle documentation for CUBE http://docs.oracle.com/cd/B19306_01/server.102/b14223/aggreg.htm#i1012749
* SQL server documentation for CUBE http://msdn.microsoft.com/en-us/library/ms177673.aspx
* Simple CUBE/ROLLUP operations in Oracle and SQL 
http://sqlfiddle.com/#!3/3bba2/11
http://sqlfiddle.com/#!4/32c81/26

                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248147#comment-13248147 ] 

Prasanth J commented on PIG-2167:
---------------------------------

The URL didn't redirect correctly. Pasting it again 
[http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/pras_j/1]
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated PIG-2167:
----------------------------

    Attachment: PIG-2167.4.patch
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated PIG-2167:
----------------------------

    Attachment: PIG-2167.1.patch
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248145#comment-13248145 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Hello Everyone
I have added the proposal for cubing project at [GSoC Proposal|http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/pras_j/1]

Dmitriy has already reviewed it. Made it public for comments from everyone. 

Please let me know your comments/feedback, so that I can incorporate the changes before deadline. 


                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269091#comment-13269091 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Uploading patch with Julien's review comments incorporated. 

Modifications:
1) Testing uses mock.Storage()
2) Chaining exceptions
3) Refactored iterative visits to a separate private method in logical optimizers
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235214#comment-13235214 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Hello everyone

I am Prasanth Jayachandran, graduate student at The Ohio State University. I am working with Prof. Arnab Nandi for providing CUBE operator support in Pig. I am submitting the initial version of the CUBE operator implementation(naive version of cube materialization). As this is my first patch submission I am really excited about it and am hoping to continue my contribution for Apache Pig. Please review the attached patch and provide feedback from improvising it.

Following contents explains the design decision and some initial performance numbers (experiments performed on single node pseudo-distributed hadoop setup). 
*Pig syntax for Cubing*
CUBE rel BY (a,b,c);

*SQL/Oracle syntax for Cubing*
GROUP BY CUBE (a,b,c);

*CUBE operator internals*
The CUBE operator injects the logical plan for following operators
x = FOREACH rel GENERATE FLATTEN(CubeDimensions(a,b,c));
y = GROUP x by (a,b,c);

*What is the output schema of CUBE operator?*
{group: tuple(a,b,c), cube: bag{(dimensions::a,dimensions::b,dimensions::c)}}

*Why syntactically different from SQL/Oracle?*
- Easier to implement as it does not modify or break the existing GROUP BY operator implementation
- CUBE operator might require separate flags for the following
	- Switching between BUC and STAR cubing (future optimization)
	- HAVING clause for monotonic operations
	- ROLLUP/DRILLDOWN operations 
	- Hint the location of partially computed CUBE
	- user specified inputs (example: algebraic attribute for converting the holistic measure to partially algebraic measure can be specified by user)
- Some operations applicable in GROUP operator are not applicable for CUBE 
	- Constant expression evaluation 
	- Duplicate column projection
- Follows Pig language design principle of procedural simplicity

*Corner case handling*
Constant expressions can be provided in GROUP BY operator. Constant expressions support has been removed from CUBE BY operator grammar. If constant expressions are used with CUBE BY, FrontEndException is thrown.
Duplicate column projection is supported in GROUP BY. Duplicates columns will be eliminated while generating logical plan. Current implementation ignores duplicates dimensions in CUBE BY. This can also be modified to throw exception if user repeats a dimension more than once. 
If cube dimensions are a subset of columns in input schema then the remaining columns in the input schema will be pushed to the “cube” bag.
For example: 
inp = LOAD ‘/pig/data/input’ AS (a,b,c,d);
x = CUBE inp BY (a,b);
schema of x will be {group:tuple(a,b), cube:bag(dimensions::a,dimensions::b,c,d)}

*Performance*
*Apache Pig Test Environment*
OS: Ubuntu 11.04 running as guest OS in Virtual Box
CPU Cores: 2 (4 Threads)
Memory: 8GB
HDD: 100GB 
Mode: Single node pseudo-distributed mode setup running Hadoop-0.20.2.
Configuration: Default configurations of hadoop 
*SQL Server 2008 R2 Test Environment*
OS: Windows 7
CPU Cores: 4 (8 Threads)
Memory: 16GB
HDD: 500GB
!Pig-Cubing-Performance.png!

*Acknowledgements*
Professor Arnab Nandi, Department of Computer Science and Engineering, The Ohio State University, Columbus for guidance and assistance throughout the course of this initial implementation.
Chaitanya Solarpurikar, Graduate Student, Department of Computer Science and Engineering, The Ohio State University, Columbus for setting up SQL server test environment and running performance comparison experiments.
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>         Attachments: Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated PIG-2167:
----------------------------

    Attachment: Pig-Cubing-Performance.png
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>         Attachments: Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282163#comment-13282163 ] 

Russell Jurney commented on PIG-2167:
-------------------------------------

Playing with this now, applied it to 0.10. Thanks!
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Prasanth J
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2167:
----------------------------

    Labels: gsoc2012  (was: )
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264649#comment-13264649 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

Quite right, I was looking at a different RB. Sorry bout that. Reviewing :).
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282165#comment-13282165 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Russell.. Try using the patch in JIRA-2170. I uploaded a new one yesterday which fixes a exception with illustrate operator..
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Prasanth J
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248397#comment-13248397 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

Prasanth, could you post your initial patch (the one already attached to this jira) to review board?
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264670#comment-13264670 ] 

jiraposter@reviews.apache.org commented on PIG-2167:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/
-----------------------------------------------------------

(Updated 2012-04-30 04:05:03.466616)


Review request for pig and Dmitriy Ryaboy.


Changes
-------

Updated JIRA issue field to reflect just the issue number. 


Summary
-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167


This addresses bug PIG-2167.
    https://issues.apache.org/jira/browse/PIG-2167


Diffs
-----

  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1325115 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION 

Diff: https://reviews.apache.org/r/4670/diff


Testing
-------

Unit tests: All passed

Pre-commit tests: All passed
ant clean test-commit


Thanks,

Prasanth_J


                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268819#comment-13268819 ] 

jiraposter@reviews.apache.org commented on PIG-2167:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/#review7484
-----------------------------------------------------------


This review took a little time as I got interrupted in the middle. :)
Great job overall. 
Some comments are about refactoring a little part of the code your modifying.
Let me know what you think.


http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java
<https://reviews.apache.org/r/4670/#comment16557>

    iterate on entries() instead



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java
<https://reviews.apache.org/r/4670/#comment16561>

    All those for loops could be factored into a private visitAll(Collection<LogicalExpressionPlan>)



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java
<https://reviews.apache.org/r/4670/#comment16562>

    same comment.
    You can factor out the for loops in those methods.



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java
<https://reviews.apache.org/r/4670/#comment16563>

    chain all exceptions



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java
<https://reviews.apache.org/r/4670/#comment16564>

    same



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g
<https://reviews.apache.org/r/4670/#comment16566>

    do you know if there's a way to avoid duplication across .g files?



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java
<https://reviews.apache.org/r/4670/#comment16567>

    you can make this a javadoc comment



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java
<https://reviews.apache.org/r/4670/#comment16568>

    this is not affected to anything. Is it intended?



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java
<https://reviews.apache.org/r/4670/#comment16569>

    same comment. I guess it is intended, but it looks strange. If the constructor has side effect, maybe there should be another way



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java
<https://reviews.apache.org/r/4670/#comment16776>

    add the name of the duplicate in the error message



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g
<https://reviews.apache.org/r/4670/#comment16777>

    please add some comments to explain what you're doing here



http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java
<https://reviews.apache.org/r/4670/#comment16778>

    you van now use mock.Storage() for those tests.
    see PIG-2650



http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java
<https://reviews.apache.org/r/4670/#comment16779>

    Use the message part of the assert statement to ensure a good error message when it fails.
    
    Tuple t = it.next()
    assertTrue(expected+" contains "+t, expected.contains(t))


- Julien


On 2012-04-30 04:05:03, Prasanth_J wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4670/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-30 04:05:03)
bq.  
bq.  
bq.  Review request for pig and Dmitriy Ryaboy.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This is a review board for https://issues.apache.org/jira/browse/PIG-2167
bq.  
bq.  
bq.  This addresses bug PIG-2167.
bq.      https://issues.apache.org/jira/browse/PIG-2167
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/4670/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Unit tests: All passed
bq.  
bq.  Pre-commit tests: All passed
bq.  ant clean test-commit
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Prasanth_J
bq.  
bq.


                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282180#comment-13282180 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Oops. My bad. Its PIG-2710 actually. 
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Prasanth J
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269089#comment-13269089 ] 

jiraposter@reviews.apache.org commented on PIG-2167:
----------------------------------------------------



bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > This review took a little time as I got interrupted in the middle. :)
bq.  > Great job overall. 
bq.  > Some comments are about refactoring a little part of the code your modifying.
bq.  > Let me know what you think.

Thanks for the review julien. I made patch v3 based on Dmitriy's review comments but was not able to upload in RB because git diff's are compatible with RB. I have incorporated most of your review comments in this patch (v4).
Please check my comments below. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java, line 98
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101421#file101421line98>
bq.  >
bq.  >     iterate on entries() instead

The MultiMap class under org.apache.pig.impl.util package doesn't seem to support iteration using entrySet. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java, lines 100-103
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101421#file101421line100>
bq.  >
bq.  >     All those for loops could be factored into a private visitAll(Collection<LogicalExpressionPlan>)

Good catch. Added new private method and refactored all occurrences of for loop to use this method. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java, lines 137-140
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101422#file101422line137>
bq.  >
bq.  >     same comment.
bq.  >     You can factor out the for loops in those methods.

Added new private method. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java, line 78
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101423#file101423line78>
bq.  >
bq.  >     chain all exceptions

Modified. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java, line 397
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101429#file101429line397>
bq.  >
bq.  >     you can make this a javadoc comment

is javadoc generated for private methods as well? Since this is a private method I kept it as a simple comment. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java, line 608
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101429#file101429line608>
bq.  >
bq.  >     add the name of the duplicate in the error message

Added.


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java, line 48
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101438#file101438line48>
bq.  >
bq.  >     you van now use mock.Storage() for those tests.
bq.  >     see PIG-2650

Updated from patch v3 onwards. 


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java, line 139
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101438#file101438line139>
bq.  >
bq.  >     Use the message part of the assert statement to ensure a good error message when it fails.
bq.  >     
bq.  >     Tuple t = it.next()
bq.  >     assertTrue(expected+" contains "+t, expected.contains(t))

Updated.


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java, line 522
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101429#file101429line522>
bq.  >
bq.  >     same comment. I guess it is intended, but it looks strange. If the constructor has side effect, maybe there should be another way

This inserts a project expression to logical expression plan (lEplan in this case). I found this way in other files too. I could not find a better way to attach a project expression to a logical expression plan. Do you have any other thoughts/ways for doing this?


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g, line 469
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101430#file101430line469>
bq.  >
bq.  >     please add some comments to explain what you're doing here

Updated.


bq.  On 2012-05-05 00:09:48, Julien Le Dem wrote:
bq.  > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g, lines 259-273
bq.  > <https://reviews.apache.org/r/4670/diff/2/?file=101428#file101428line259>
bq.  >
bq.  >     do you know if there's a way to avoid duplication across .g files?

Since all these grammar files serves different purposes, I am not sure if it is possible to avoid duplicates. 


- Prasanth_J


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/#review7484
-----------------------------------------------------------


On 2012-04-30 04:05:03, Prasanth_J wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4670/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-30 04:05:03)
bq.  
bq.  
bq.  Review request for pig and Dmitriy Ryaboy.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This is a review board for https://issues.apache.org/jira/browse/PIG-2167
bq.  
bq.  
bq.  This addresses bug PIG-2167.
bq.      https://issues.apache.org/jira/browse/PIG-2167
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1325115 
bq.    http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/4670/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Unit tests: All passed
bq.  
bq.  Pre-commit tests: All passed
bq.  ant clean test-commit
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Prasanth_J
bq.  
bq.


                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240008#comment-13240008 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

I wouldn't put too much weight on timing local mode. You are likely spending most of the time in JobControl's sleep(5000). We'll see about posting our patch for that.

                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264644#comment-13264644 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Dmitriy,

I have updated v2 of the patch to the RB just after replying to your review comments.

Here is the link for the diff against the baseline
https://reviews.apache.org/r/4670/diff/2/

Here is the link for diff against Patch-1
https://reviews.apache.org/r/4670/diff/1-2/#index_header

Please let me know if I missed out something while updating. 
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated PIG-2167:
----------------------------

    Attachment: PIG-2167.2.patch
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy reassigned PIG-2167:
--------------------------------------

    Assignee: Prasanth J  (was: Dmitriy V. Ryaboy)
    
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Prasanth J
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066023#comment-13066023 ] 

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

I believe there is value to providing the naive solution and improving on it later, rather than trying to build the optimal plan from the get-go.

Initial (naive) implementation plan:

Add an optional "WITH CUBE" clause to the group operator.

In LogicalPlanBuilder, if "WITH CUBE" is present, insert operators equivalent to the following above the group operator:

{code}
relation = foreach relation generate
   FLATTEN(CubeDimensions(dim1, dim2, dim3))
     as (dim1, dim2, dim3),
   other_fields;
{code}

It may be desirable in some cases to group by a superset of dimensions one wants to cube on: group by dim1, dim2, dim3 with cube on (dim1, dim2). If we want to support that use case, we simply need to know to call the UDF on (dim1, dim2) and push dim3 into the other_fields list.

Note also that there's a bit of a problem if null values are legitimate values for the dimensions, as we use null to indicate "all". The UDF provided in PIG-2168 allows one to use custom strings instead of null for the "all" marker. We can optionally support this in the grammar, as well.

> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>             Fix For: 0.10
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2167) CUBE operation in Pig

Posted by "Prasanth J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248420#comment-13248420 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Posted in review board [https://reviews.apache.org/r/4670/]

Let me know in case if missed some details. 
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira