You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/07/22 20:55:32 UTC

[jira] Created: (PIG-331) Combiner needs to be used in the types branch

Combiner needs to be used in the types branch
---------------------------------------------

                 Key: PIG-331
                 URL: https://issues.apache.org/jira/browse/PIG-331
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: Alan Gates
             Fix For: types_branch


The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-331:
---------------------------

    Status: Patch Available  (was: Open)

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-331:
---------------------------

    Attachment:     (was: combiner.patch)

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-331:
---------------------------

    Priority: Critical  (was: Major)

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-331:
---------------------------

    Attachment: combiner.patch

A first pass at using the combiner.  This patch contains a couple of things:

1) Foreachs that have only simple projection and algebraic functions make use of the combiner.
2) Distincts were optimized to not carry the bags of data along.  There is no actual need for them to use the combiner because we're only passing they keys.  hadoop takes care of collecting the keys and not passing multiple instances from map to reduce.

Things this patch does not contain:

1) Foreachs that have a combination of algebraic and non-algebraic.  These should be do-able, with the non-algebraic functions just being replaced by simple projections of the appropriate fields in the combiner plan.

2) Foreachs that include inner plans.  This is in particular useful for inner plans that use distinct, as this is the way to emulate count(distinct x) from SQL, a fairly common operation.  Some inner plans (such as those containing filters) cannot be split.  This is a little more challenging because it requiring duplicating parts of the inner plan in the combiner and reducer and not duplicating other parts.

These two should be added later. 

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-331) Combiner needs to be used in the types branch

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617972#action_12617972 ] 

Santhosh Srinivasan commented on PIG-331:
-----------------------------------------

   - File: src/org/apache/pig/backend/local/executionengine/LocalJob.java

      * In the anonymous Iterator that is returned, I noticed that the public member 

variables are not initialized but they are used in the hasNext() method.

{code}
return new Iterator<Tuple>() {

+            Tuple   t;
+            boolean atEnd;
+            public boolean hasNext() {
+                if (atEnd)
+                    return false;
+                try {
+                    if (t == null)
{code}


   - File: 

src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java

      * Minor Comment: the log message indicates a leaf while the checks are being made for 

the roots.

{code}
 if (reduceRoots.size() != 1) {
+            log.warn("Expected reduce to have single leaf");
+            return;
+        }

{code}

      * Minor Comment: The condition to check for the existence of at least one algebraic 

and no instances of non-algebraic types is probably better written using && instead of ||

Method: +    private List<ExprType> algebraic(


{code}
//current code
if (!atLeastOneAlgebraic || !noNonAlgebraics) return null;

//more readable code?

if (atLeastOneAlgebraic && noNonAlgebraics) return null;

{code}


   - File: src/org/apache/pig/impl/plan/OperatorPlan.java

      * Major Comment:

The new replace(oldNode, newNode) method is similar to the private replaceNode(src, 

replacement, dst, multiMap) method added as part of 

https://issues.apache.org/jira/browse/PIG-290. The difference lies in the method signature 

and the access specifier.


{code}
public void replace(E oldNode, E newNode) throws PlanException {
//Pig-290 change below
private boolean replaceNode(E src, E replacement, E dst, MultiMap<E, E> multiMap) 
{code}


   - File: src/org/apache/pig/impl/util/Pair.java

      * Minor Comment:
Do we need equals() and hashCode() methods to make the class usable in containers?

   - File: 

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserCo

mparisonFunc.java

      * Major Comment:

In the clone() method, the ComparisonFunc func is not cloned


   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFu

nc.java

   * Major Comment:

In the clone method, the final two constructor arguments are in the wrong order. POUserFunc 

expects FuncSpec, EvalFunc while the clone method uses null, funcSpec.clone. This should not 

compile in the first place.

{code}
+        POUserFunc clone = new POUserFunc(new OperatorKey(mKey.scope, 
+            NodeIdGenerator.getGenerator().getNextNodeId(mKey.scope)),
+            requestedParallelism, null, funcSpec.clone());
{code}

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/BinaryEx

pressionOperator.java

   * Major Comment:

In the cloneHelper method, the operator's resultType is not set. In the corresponding 

cloneHelper method in UnaryExpressionOperator, the restultType is set. The resultType is 

also set in PhysicalOperator's cloneHelper method.

{code}
//in UnaryExpressionOperator.java's cloneHelper

+        resultType = op.getResultType();
{code}

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PODistin

ct.java

   * Minor Comment:

The clone method is missing in PODistinct. Is this due to the re-write of distinct in 

LocalRearrange?

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-331) Combiner needs to be used in the types branch

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617972#action_12617972 ] 

sms edited comment on PIG-331 at 7/29/08 2:55 PM:
------------------------------------------------------------------

   - File: src/org/apache/pig/backend/local/executionengine/LocalJob.java

      * In the anonymous Iterator that is returned, I noticed that the public member variables are not initialized but they are used in the hasNext() method.

{code}
return new Iterator<Tuple>() {

+            Tuple   t;
+            boolean atEnd;
+            public boolean hasNext() {
+                if (atEnd)
+                    return false;
+                try {
+                    if (t == null)
{code}


   - File: 

src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java

      * Minor Comment: 

The log message indicates a leaf while the checks are being made for the roots.

{code}
 if (reduceRoots.size() != 1) {
+            log.warn("Expected reduce to have single leaf");
+            return;
+        }

{code}

      * Minor Comment: 

The condition to check for the existence of at least one algebraic and no instances of non-algebraic types is probably better written using && instead of || Method: +    private List<ExprType> algebraic(


{code}
//current code
if (!atLeastOneAlgebraic || !noNonAlgebraics) return null;

//more readable code?

if (atLeastOneAlgebraic && noNonAlgebraics) return null;

{code}


   - File: src/org/apache/pig/impl/plan/OperatorPlan.java

      * Major Comment:

The new replace(oldNode, newNode) method is similar to the private replaceNode(src, replacement, dst, multiMap) method added as part of  https://issues.apache.org/jira/browse/PIG-290. The difference lies in the method signature and the access specifier.


{code}
public void replace(E oldNode, E newNode) throws PlanException {
//Pig-290 change below
private boolean replaceNode(E src, E replacement, E dst, MultiMap<E, E> multiMap) 
{code}


   - File: src/org/apache/pig/impl/util/Pair.java

      * Minor Comment:

Do we need equals() and hashCode() methods to make the class usable in containers?

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserComparisonFunc.java

      * Major Comment:

In the clone() method, the ComparisonFunc func is not cloned


   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java

   * Major Comment:

In the clone method, the final two constructor arguments are in the wrong order. POUserFunc expects FuncSpec, EvalFunc while the clone method uses null, funcSpec.clone. This should not compile in the first place.

{code}
+        POUserFunc clone = new POUserFunc(new OperatorKey(mKey.scope, 
+            NodeIdGenerator.getGenerator().getNextNodeId(mKey.scope)),
+            requestedParallelism, null, funcSpec.clone());
{code}

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/BinaryExpressionOperator.java

   * Major Comment:

In the cloneHelper method, the operator's resultType is not set. In the corresponding cloneHelper method in UnaryExpressionOperator, the restultType is set. The resultType is also set in PhysicalOperator's cloneHelper method.

{code}
//in UnaryExpressionOperator.java's cloneHelper

+        resultType = op.getResultType();
{code}

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PODistinct.java

   * Minor Comment:

The clone method is missing in PODistinct. Is this due to the re-write of distinct in LocalRearrange?

      was (Author: sms):
       - File: src/org/apache/pig/backend/local/executionengine/LocalJob.java

      * In the anonymous Iterator that is returned, I noticed that the public member 

variables are not initialized but they are used in the hasNext() method.

{code}
return new Iterator<Tuple>() {

+            Tuple   t;
+            boolean atEnd;
+            public boolean hasNext() {
+                if (atEnd)
+                    return false;
+                try {
+                    if (t == null)
{code}


   - File: 

src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java

      * Minor Comment: the log message indicates a leaf while the checks are being made for 

the roots.

{code}
 if (reduceRoots.size() != 1) {
+            log.warn("Expected reduce to have single leaf");
+            return;
+        }

{code}

      * Minor Comment: The condition to check for the existence of at least one algebraic 

and no instances of non-algebraic types is probably better written using && instead of ||

Method: +    private List<ExprType> algebraic(


{code}
//current code
if (!atLeastOneAlgebraic || !noNonAlgebraics) return null;

//more readable code?

if (atLeastOneAlgebraic && noNonAlgebraics) return null;

{code}


   - File: src/org/apache/pig/impl/plan/OperatorPlan.java

      * Major Comment:

The new replace(oldNode, newNode) method is similar to the private replaceNode(src, 

replacement, dst, multiMap) method added as part of 

https://issues.apache.org/jira/browse/PIG-290. The difference lies in the method signature 

and the access specifier.


{code}
public void replace(E oldNode, E newNode) throws PlanException {
//Pig-290 change below
private boolean replaceNode(E src, E replacement, E dst, MultiMap<E, E> multiMap) 
{code}


   - File: src/org/apache/pig/impl/util/Pair.java

      * Minor Comment:
Do we need equals() and hashCode() methods to make the class usable in containers?

   - File: 

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserCo

mparisonFunc.java

      * Major Comment:

In the clone() method, the ComparisonFunc func is not cloned


   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFu

nc.java

   * Major Comment:

In the clone method, the final two constructor arguments are in the wrong order. POUserFunc 

expects FuncSpec, EvalFunc while the clone method uses null, funcSpec.clone. This should not 

compile in the first place.

{code}
+        POUserFunc clone = new POUserFunc(new OperatorKey(mKey.scope, 
+            NodeIdGenerator.getGenerator().getNextNodeId(mKey.scope)),
+            requestedParallelism, null, funcSpec.clone());
{code}

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/BinaryEx

pressionOperator.java

   * Major Comment:

In the cloneHelper method, the operator's resultType is not set. In the corresponding 

cloneHelper method in UnaryExpressionOperator, the restultType is set. The resultType is 

also set in PhysicalOperator's cloneHelper method.

{code}
//in UnaryExpressionOperator.java's cloneHelper

+        resultType = op.getResultType();
{code}

   - File: 

src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PODistin

ct.java

   * Minor Comment:

The clone method is missing in PODistinct. Is this due to the re-write of distinct in 

LocalRearrange?
  
> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-331) Combiner needs to be used in the types branch

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616902#action_12616902 ] 

Pi Song commented on PIG-331:
-----------------------------

I agree with the patch.

Filter can be done as Filter Pusher and we should even get a better result.

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-331:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-331:
---------------------------

    Attachment: combiner.patch

I attached the wrong file previously.  That was an earlier version of the patch.  Here's the right version.

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: combiner.patch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (PIG-331) Combiner needs to be used in the types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-331:
------------------------------

    Assignee: Alan Gates

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: types_branch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-331) Combiner needs to be used in the types branch

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615955#action_12615955 ] 

Pi Song commented on PIG-331:
-----------------------------

This has been blocking me from implementing OLAP operators.

> Combiner needs to be used in the types branch
> ---------------------------------------------
>
>                 Key: PIG-331
>                 URL: https://issues.apache.org/jira/browse/PIG-331
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: types_branch
>
>
> The initial implementation in the types branch does not make use of the combiner.  For performance, it needs to make use of the combiner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.