Posted to dev@pig.apache.org by "Utkarsh Srivastava (JIRA)" <ji...@apache.org> on 2007/11/30 07:33:43 UTC

[jira] Issue Comment Edited: (PIG-7) Optimize execution of algebraic functions

    [ https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547012 ] 

utkarsh edited comment on PIG-7 at 11/29/07 10:32 PM:
-----------------------------------------------------------------

Unfortunately, this patch has problems. It can kick off the combiner even in situations where it is not applicable. Following is the test sequence I tried:

finishship-lm-corp-yahoo-com:~/Documents/workspace/Test Pig $ cat b
a       1
a       2
b       3
b       4
finishship-lm-corp-yahoo-com:~/Documents/workspace/Test Pig $ java -cp pig.jar org.apache.pig.Main -
Connecting to hadoop file system at: localhost:9000
Connecting to map-reduce job tracker at: localhost:9001
grunt> a = load 'file:b';
grunt> b = group a by $0;
grunt> c = foreach b generate group, a; 
grunt> dump c;


As you will see below, the combiner is kicked off when it shouldn't be, and then the job fails.




----- MapReduce Job -----
Input: [/tmp/temp-1447320079/tmp-1892534978:org.apache.pig.builtin.PigStorage()]
Map: [[*]]
Group: [GENERATE {[PROJECT $0],[*]}]
Combine: GENERATE {[PROJECT $0],[PROJECT $1]}
Reduce: GENERATE {[PROJECT $0],[PROJECT $1]}
Output: /tmp/temp-1447320079/tmp840894904:org.apache.pig.builtin.BinStorage
Split: null
Map parallelism: -1
Reduce parallelism: -1
Job jar size = 476828
Pig progress = 0%
Pig progress = 50%
Error message from task (map) tip_200711292202_0003_m_000000
Error message from task (reduce) tip_200711292202_0003_r_000000 java.io.IOException: Unexpected data while reading tuple from binary file
        at org.apache.pig.data.Tuple.readFields(Tuple.java:294)
        at org.apache.pig.data.DataBag.read(DataBag.java:251)
        at org.apache.pig.data.Tuple.readDatum(Tuple.java:322)
        at org.apache.pig.data.Tuple.read(Tuple.java:308)
        at org.apache.pig.data.Tuple.readFields(Tuple.java:295)
        at org.apache.pig.data.IndexedTuple.readFields(IndexedTuple.java:52)
        at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:210)
        at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.<init>(ReduceTask.java:160)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.<init>(ReduceTask.java:228)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:320)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
 java.io.IOException: Unexpected data while reading tuple from binary file

I think the problem is that ProjectSpec unconditionally returns true from amenableToCombiner(), while in the above example it is not amenable.
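
To make the point concrete, here is a stand-alone sketch of the intended check (the class and method names below are illustrative, not Pig's actual ProjectSpec code): after a GROUP, a GENERATE is only safe to run in the combiner when every output other than the group key is an algebraic aggregate; projecting the grouped bag itself, as "generate group, a" does above, is not.

import java.util.List;

public class CombinerApplicabilitySketch {

    // Minimal stand-in for one output expression of a GENERATE.
    interface OutputExpr {
        boolean isGroupKey();   // projects the group column
        boolean isAlgebraic();  // e.g. SUM/COUNT with an initial and a final stage
    }

    // Combinable only if every non-group output is an algebraic aggregate;
    // a raw projection of the grouped bag makes the answer false.
    static boolean amenableToCombiner(List<OutputExpr> outputs) {
        for (OutputExpr e : outputs) {
            if (!e.isGroupKey() && !e.isAlgebraic()) {
                return false;
            }
        }
        return true;
    }
}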


Another, much smaller problem is that in the visitSortDistinct() method, the sortSpec can be null (if the operator is carrying out a distinct), and that throws a NullPointerException (EvalSpecVisitor.java:62).
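
A minimal sketch of the guard that seems to be needed (the names below are assumptions for illustration, not the actual EvalSpecVisitor code):

public class SortDistinctVisitSketch {

    // Stand-in for the operator's spec; sortSpec is null when it is a DISTINCT.
    static class SortDistinct {
        Runnable sortSpec;
    }

    static void visitSortDistinct(SortDistinct sd) {
        if (sd.sortSpec == null) {
            return;             // DISTINCT: nothing to visit, avoids the NPE
        }
        sd.sortSpec.run();      // ORDER BY: visit the sort spec as before
    }
}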


Another problem (though not strictly required to be solved in the first version) is that the combiner is kicked off only in very restricted situations.
The condition in MapreducePlanCompiler.java is:

if (mro.toReduce == null && spec.amenableToCombiner() &&
                    spec instanceof GenerateSpec &&
                    mro.groupFuncs != null && mro.groupFuncs.size() == 1) {

But in most cases, users will follow up a GENERATE of SUM, AVG, etc. with a filter, another foreach, and so on. In those cases spec will be an instance of CompositeEvalSpec whose first spec is a GenerateSpec, and the combiner won't fire. It would be just as easy to replace the check with a more general condition:

spec instanceof GenerateSpec || (spec instanceof CompositeEvalSpec && ((CompositeEvalSpec)spec).getSpecs().get(0) instanceof GenerateSpec)
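
For illustration, a sketch of the widened check in context (same variables as in the fragment above; everything other than the GenerateSpec test is assumed unchanged):

boolean startsWithGenerate =
    spec instanceof GenerateSpec ||
    (spec instanceof CompositeEvalSpec &&
     ((CompositeEvalSpec) spec).getSpecs().get(0) instanceof GenerateSpec);

if (mro.toReduce == null && spec.amenableToCombiner() &&
        startsWithGenerate &&
        mro.groupFuncs != null && mro.groupFuncs.size() == 1) {
    // set up the combiner as before
}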




> Optimize execution of algebraic functions
> -----------------------------------------
>
>                 Key: PIG-7
>                 URL: https://issues.apache.org/jira/browse/PIG-7
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
>         Attachments: combiner.patch
>
>
> Algebraic functions are functions that can be computed incrementally, like count(X), SUM(X), etc. They can be computed efficiently by doing the first-level computation in the Hadoop combiner. This can give a significant (2-3x) speedup for many aggregation queries.
> Several users have asked us for this feature, so it is pretty high priority.
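
For reference, a stand-alone sketch of what "computed incrementally" means for SUM (illustrative only, not Pig's actual Algebraic interface): each map/combine invocation folds its chunk of values into a partial sum, and the reducer folds the partial sums, giving the same result as summing all the raw values at once.

import java.util.Arrays;
import java.util.List;

public class AlgebraicSumSketch {

    // Initial/intermediate stage: fold one chunk of raw values into a partial sum.
    static long partialSum(List<Long> chunk) {
        long sum = 0;
        for (long v : chunk) sum += v;
        return sum;
    }

    // Final stage: fold the partial sums produced by the combiners.
    static long finalSum(List<Long> partials) {
        return partialSum(partials);
    }

    public static void main(String[] args) {
        List<Long> chunk1 = Arrays.asList(1L, 2L);   // one map task's values
        List<Long> chunk2 = Arrays.asList(3L, 4L);   // another map task's values
        long total = finalSum(Arrays.asList(partialSum(chunk1), partialSum(chunk2)));
        System.out.println(total);                   // 10, same as 1+2+3+4 directly
    }
}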
