Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2011/01/06 23:38:46 UTC

[jira] Created: (PIG-1790) Need to document when combiner is used

Need to document when combiner is used
--------------------------------------

                 Key: PIG-1790
                 URL: https://issues.apache.org/jira/browse/PIG-1790
             Project: Pig
          Issue Type: Improvement
          Components: documentation
            Reporter: Olga Natkovich
            Assignee: Olga Natkovich
             Fix For: 0.9.0


I searched through the documentation but could not find a section that describes the cases in which the combiner is used. Since the combiner has such a significant impact on query performance, I think it is important to add this information. Also, with 0.9 we are expanding combiner usage, so having documentation would be useful for that as well.

Here are the JIRAs for combiner use slated for 0.9:

https://issues.apache.org/jira/browse/PIG-750
https://issues.apache.org/jira/browse/PIG-490
https://issues.apache.org/jira/browse/PIG-946
https://issues.apache.org/jira/browse/PIG-1735

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1790) Need to document when combiner is used

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1790:
--------------------------------

    Assignee: Corinne Chandel  (was: Olga Natkovich)

The following information should be added to the cookbook:

Make Sure Combiner is Used
 
Whenever possible, make sure that the combiner is used, as it frequently yields an order-of-magnitude improvement in performance. The combiner is generally used in the case of a non-nested foreach where all projections are either expressions on the group column or expressions on algebraic UDFs [LINK to definition].
 
Example:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)), COUNT(org.apache.pig.builtin.Distinct(A.name)), (MIN(A.gpa) + MAX(A.gpa))/2, group.age;
Explain C;
 
There are a number of things to note in this example:

* The group can be referred to as a whole or by accessing individual fields, as is the case in this example.
* The group and its elements can appear anywhere in the projection.
* A variety of expressions can be applied to algebraic functions, including:
    * A column transformation function, such as ABS, applied to an algebraic function (SUM).
    * An algebraic function (COUNT) applied to another algebraic function (Distinct), although only the inner function is computed using the combiner.
    * A mathematical expression applied to one or more algebraic functions.


You can check whether the combiner is used for your query by running explain on the foreach alias, as shown above. You should see a Combine Plan section in the MapReduce part of the plan:
 
.....
Combine Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
|   |
|   Project[bytearray][0] - scope-43
|
|---C: New For Each(false,false,false)[bag] - scope-28
    |   |
    |   Project[bytearray][0] - scope-29
    |   |
    |  POUserFunc(org.apache.pig.builtin.SUM$Intermediate)[tuple] - scope-30
    |   |
    |   |---Project[bag][1] - scope-31
    |   |
    |  POUserFunc(org.apache.pig.builtin.Distinct$Intermediate)[tuple] - scope-32
    |   |
    |   |---Project[bag][2] - scope-33
    |
    |---POCombinerPackage[tuple]{bytearray} - scope-36--------
.....
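
The SUM$Intermediate and Distinct$Intermediate entries in the combine plan above are the intermediate (combiner) steps of algebraic functions. To illustrate why such functions can be combined safely, here is a minimal sketch in plain Python. It is not Pig code and does not use Pig's Java Algebraic interface; every function name and all the sample data are hypothetical. It assumes a single group key whose values are spread over three mappers, and only mimics the initial / intermediate / final split.

```python
# Conceptual sketch only: plain Python, not Pig's Java Algebraic interface.
# All function names and sample data are hypothetical, for illustration.


def sum_initial(value):
    """Map side: turn one input value into a partial sum."""
    return value


def sum_intermediate(partials):
    """Combiner: merge partial sums into one partial sum."""
    return sum(partials)


def sum_final(partials):
    """Reduce side: merge whatever partial sums arrive."""
    return sum(partials)


def distinct_merge(partial_sets):
    """Combiner/reducer for distinct: union partial sets, dropping duplicates."""
    return set().union(*partial_sets)


def run_group(values_per_mapper, initial, intermediate, final, use_combiner):
    """Simulate one group key whose values are spread over several mappers.

    Returns the final result and the number of records that were shuffled.
    """
    shuffled = []
    for values in values_per_mapper:
        partials = [initial(v) for v in values]
        if use_combiner:
            # With the combiner, only one record leaves each mapper.
            partials = [intermediate(partials)]
        shuffled.extend(partials)
    return final(shuffled), len(shuffled)


# gpa values for one age group, split across three hypothetical mappers
gpas = [[3.0, 2.0], [4.0], [1.0, 3.0, 2.0]]
with_combiner = run_group(gpas, sum_initial, sum_intermediate, sum_final, True)
without_combiner = run_group(gpas, sum_initial, sum_intermediate, sum_final, False)

# names for the same group; a COUNT would be applied only after distinct is done
names = [["alice", "bob"], ["alice"], ["carol", "bob"]]
distinct_names, names_shuffled = run_group(
    names, lambda n: {n}, distinct_merge, distinct_merge, True
)
```

In this toy run both variants produce the same sum, but the combiner variant shuffles 3 records instead of 6. The distinct case works the same way: each mapper sends one de-duplicated set, and the outer count is taken only on the merged result.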
 
The combiner is also used with a nested foreach, as long as the only nested operation used is DISTINCT:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B {
    D = distinct (A.name);
    generate group, COUNT(D);
}
 
Finally, use of combiner is influenced by the surrounding environment of the GROUP/FOREACH statements. 
 
The combiner is generally not used if any operator comes between the GROUP and the FOREACH in the execution plan. Even if they are next to each other in your script, the optimizer might rearrange them, as is the case in the example below:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = filter C by group.age <30;
 
In this case, the filter will be pushed above the foreach, which will prevent the use of the combiner. Please note that this script can be made more efficient by performing the filtering before the group:
 
A = load 'studenttab10k' as (name, age, gpa);
B = filter A by age <30;
C = group B by age;
D = foreach C generate group, COUNT (B);
 
One exception to the above rule is limit. Starting with Pig 0.9, even if limit comes between GROUP and FOREACH, the combiner will still be used:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = limit C 20;
 
In this example the optimizer will push the limit above foreach which will not disable combiner.
 
Combiner is also not used in the case where multiple foreach statements are associated with the same group:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
.....
 
Depending on your use case, it might be more efficient to split your script into multiple scripts.




[jira] Resolved: (PIG-1790) Need to document when combiner is used

Posted by "Corinne Chandel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Corinne Chandel resolved PIG-1790.
----------------------------------

    Resolution: Fixed

Performance and Efficiency doc updated

New Combiner section added. 

Patch will be submitted under PIG-1772.
