You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Brian Johnson (JIRA)" <ji...@apache.org> on 2015/02/02 18:45:35 UTC

[jira] [Commented] (PIG-273) Need to optimize the ways splits are handled, both in the top level plan and in nested plans.

    [ https://issues.apache.org/jira/browse/PIG-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301514#comment-14301514 ] 

Brian Johnson commented on PIG-273:
-----------------------------------

There is a problem because a split (explicit or generated) currently kills the ability to do any optimized joins.

> Need to optimize the ways splits are handled, both in the top level plan and in nested plans.
> ---------------------------------------------------------------------------------------------
>
>                 Key: PIG-273
>                 URL: https://issues.apache.org/jira/browse/PIG-273
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>             Fix For: 0.2.0
>
>
> Currently, in the new pipeline rework (see PIG-157), splits in the data flow are not handled efficiently.  
> In the top level plans splits cause all the output data to be written to hdfs and then reread by each leg of the split.  This forces both a read/write and a new map/reduce pass when it is not always necessary.  For example, consider:
> A = load 'myfile';
> split A into B  if $0 < 100, C if $0 >= 100;
> B1 = group B by $0;
> ...
> C1 = group B by $1;
> ...
> In this case A will be loaded, and then immediately stored again.  Then a plan will be executed that handles the B* part of the script, and then another executed that will handle the C* part of the script.
> In nested plans, each projection of the generate is computed separately, even if they share common steps in the plan.  For example:
> B = group A by $0;
> C= foreach B {
>     C1 = distinct $1;
>     C2 = filter C1 by $1 > 0;
>     generate group, COUNT(C1), COUNT(C2);
> }
> That will currently be executed with two nested plans, distinct->COUNT(C1) and distinct->filter->COUNT(C2).  The same distinct will be computed twice.  Ideally we would like to compute the distinct once and then split the output.
> I suspect that optimizing the inner plan is more important because there are more situations where this occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)