You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2009/10/06 00:32:31 UTC
[jira] Commented: (PIG-983) PERFORMANCE: multi-query optimization
on multiple group bys following a join or cogroup
[ https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762425#action_12762425 ]
Hadoop QA commented on PIG-983:
-------------------------------
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12421322/PIG-983.patch
against trunk revision 821101.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/console
This message is automatically generated.
> PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup
> ---------------------------------------------------------------------------------------
>
> Key: PIG-983
> URL: https://issues.apache.org/jira/browse/PIG-983
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Richard Ding
> Assignee: Richard Ding
> Attachments: PIG-983.patch
>
>
> The current multi-query optimizer works well with pig scripts like this one:
> {code}
> data = LOAD 'input' AS (a:chararray, b:int, c:int);
> A = GROUP data BY b;
> B = GROUP data BY c;
> C = FOREACH A GENERATE group, COUNT(data);
> D = FOREACH B GENERATE group, SUM(data.b);
> STORE C INTO 'output1';
> STORE D INTO 'output2';
> {code}
> In this case the original three Map-Reduce jobs are merged into one MR job by the optimizer.
> The current optimizer, however, won't reduce the number of MR jobs for the scripts in which multiple group bys follow a join or a cogroup, such as this one:
> {code}
> data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
> data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
> A = JOIN data1 BY a1, data2 BY a2;
> B = GROUP A BY data1::b1;
> C = GROUP B BY data2::c2;
> D = FOREACH B GENERATE group, COUNT(A);
> E = FOREACH C GENERATE group, SUM(A.data2::b2);
> STORE D INTO 'output1';
> STORE E INTO 'output2';
> {code}
> Three MR jobs are still needed to run this script.
> Multi-query optimizer should work with this kind of scripts by merging the group bys and reducing the overall MR jobs.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.