You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2008/12/30 20:33:44 UTC

[jira] Updated: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach

     [ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-580:
-------------------------------

    Patch Info: [Patch Available]
       Summary: PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach   (was: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach )

The changes are mostly in CombinerOptimizer and an additional UDF - Distinct.java. In the combiner optimizer if the inner plan of the foreach following a group has a distinct feeding into a UDF (algebraic or not), that plan is considered to be "combineable" - it is marked as being of type ExprType.DISTINCT. If there is atleast one inner plan in the foreach of type -  ExprType.DISTINCT or ExprType.ALGEBRAIC and no inner plan of type ExprType.NOT_ALGEBRAIC then the optimizer considers the foreach to be combineable. If an inner plan is of type ExprType.Distinct, then the leaf of the plan is a UDF whose input comes from a PODistinct with a Project[bag][*] in between the two. to take the tuples from the PODistinct and supply it as a bag to the UDF. This part of the plan is modified as explained in the code comment below:
{noformat}
                // A PODistinct in the plan will always have
                // a Project[bag](*) as its successor.
                // We will replace it with a POUserFunc with "Distinct" as 
                // the underlying UDF. 
                // In the map and combine, we will make this POUserFunc
                // the leaf of the plan by removing other operators which
                // are descendants up to the leaf.
                // In the reduce we will keep descendants intact. Further
                // down in fixProjectAndInputs we will change the inputs to
                // this POUserFunc in the combine and reduce plans to be
                // just projections of the column "i"
{noformat}

The idea is that the "distinct" portion of this plan is combined and then the combined result is fed to the UDF. So in the map and combine phases only partial distinct outputs are computed and the UDF which takes the distinct output as input is not called. Then in the reduce, the "combined" distinct output is fed as input to the UDF. 

All these changes are independent of the changes on other non-distinct agg inner plans which may have Algebraic UDFs - in those plans, the existing modifications will remain.



> PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach 
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-580
>                 URL: https://issues.apache.org/jira/browse/PIG-580
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>
> Currently Pig uses the combiner only when there is foreach following a group when the elements in the foreach generate have the following characteristics:
> 1) simple project of the "group" column
> 2) Algebraic UDF
> The above conditions exclude use of the combiner for distinct aggregates - the distinct operation itself is combinable (irrespective of whether it feeds to an algebraic or non algebraic udf). So if the following foreach should also be combinable:
> {code}
> ..
> b = group a by $0;
> c = foreach b generate { x = distinct a; generate group, COUNT(x), SUM(x.$1) }
> {code}
> The combiner optimizer should cause the distinct to be combined and the final combine output should feed the COUNT() and SUM() in the reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.