Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/02/06 23:31:08 UTC

[jira] Created: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.

Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.
--------------------------------------------------------------------------------------------------------------------------

                 Key: PIG-97
                 URL: https://issues.apache.org/jira/browse/PIG-97
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.0.0
            Reporter: Alan Gates
            Assignee: Alan Gates


The following script will produce 0 output records, even when it should produce records:

a = load 'file1';
b = load 'file2';
c = cogroup a by $0, b by $0;
d = foreach c generate $0, COUNT($1), COUNT($2);
dump d;

In this case Pig chooses to use the combiner to be more efficient.  However, the following code in PigCombiner.java causes a problem:

for (int i = 0; i < inputCount; i++) {  // XXX: shouldn't we only do this if INNER flag is set?
    if (t.getBagField(1 + i).size() == 0) return;
}

A map task often runs on a machine where it has access to only one of the two files, so one of the bags is empty. The lines of code above then cause the combiner to bail out without pushing any tuples to the OutputCollector.
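
As a rough illustration of this failure, here is a minimal, self-contained Java sketch. It does not use Pig's classes: CombinerSketch, combine, and the bailOnEmptyBag flag are hypothetical names, and plain lists stand in for bags. It only shows how an early return on an empty bag discards the counts that a map-side combiner could otherwise have emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the combiner step; not the real PigCombiner.
class CombinerSketch {

    // Each inner list stands in for one bag of the cogrouped tuple (one per input).
    // Returns the emitted per-input counts, or an empty list if nothing is emitted.
    static List<long[]> combine(List<List<String>> bags, boolean bailOnEmptyBag) {
        List<long[]> emitted = new ArrayList<>();
        if (bailOnEmptyBag) {
            for (List<String> bag : bags) {
                // Mirrors the early return in the loop quoted above.
                if (bag.isEmpty()) return emitted;
            }
        }
        long[] counts = new long[bags.size()];
        for (int i = 0; i < bags.size(); i++) {
            counts[i] = bags.get(i).size();
        }
        emitted.add(counts);
        return emitted;
    }

    public static void main(String[] args) {
        // A map task that saw records from file1 only for this key.
        List<List<String>> bags = Arrays.asList(
                Arrays.asList("r1", "r2"),
                Collections.<String>emptyList());
        System.out.println(combine(bags, true).size());   // prints 0: partial counts are dropped
        System.out.println(combine(bags, false).size());  // prints 1: partial counts are kept
    }
}
```

With the check enabled, the two records from file1 never reach the reducer, which is consistent with the empty output reported here.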

The proposed short-term solution is to disable use of the combiner in cases where more than one file is grouped together.
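
The short-term workaround could be sketched roughly as follows. This is an assumption about the shape of such a check, not code from Pig: CombinerPolicy, useCombiner, and both parameters are hypothetical, and the actual change may be structured differently.

```java
// Hypothetical planner-side guard; names and parameters are illustrative only.
class CombinerPolicy {

    // groupedInputCount: number of inputs in the (co)group statement.
    // functionsAreCombinable: assumed flag meaning the generate functions
    // (such as COUNT) can be partially evaluated on the map side.
    static boolean useCombiner(int groupedInputCount, boolean functionsAreCombinable) {
        // Only a plain group over a single input keeps the combiner.
        return functionsAreCombinable && groupedInputCount == 1;
    }

    public static void main(String[] args) {
        System.out.println(useCombiner(1, true));  // prints true: single-input group
        System.out.println(useCombiner(2, true));  // prints false: cogroup of two inputs
    }
}
```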


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567828#action_12567828 ] 

Olga Natkovich commented on PIG-97:
-----------------------------------

+1

Antonio, could you also post your comments, if any? Thanks.

[jira] Commented: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.

Posted by "Antonio Magnaghi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567829#action_12567829 ] 

Antonio Magnaghi commented on PIG-97:
-------------------------------------

The patch looks good for the POVisitor pattern. I just have one comment/question:

It looks like the POPrinter currently does not override some of the visitX methods in the superclass (POVisitor), such as visitCogroup, visitSplit, and visitUnion. If the local physical query plan contains such operators, no info would be printed for them.
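
To make the concern concrete, here is a small Java sketch of a visitor whose base class has no-op defaults. It is loosely modeled on the description above and is not the real POVisitor or POPrinter: VisitorSketch, BaseVisitor, Printer, printPlan, and the method signatures are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified visitor; not Pig's actual classes or signatures.
class VisitorSketch {

    // Base visitor whose visit methods do nothing unless overridden.
    static class BaseVisitor {
        void visitLoad(String alias) {}
        void visitCogroup(String alias) {}
    }

    // Printer that overrides visitLoad only, so cogroup falls through to the no-op.
    static class Printer extends BaseVisitor {
        final List<String> lines = new ArrayList<>();

        @Override
        void visitLoad(String alias) {
            lines.add("Load: " + alias);
        }
    }

    // Walks a tiny plan (two loads feeding one cogroup) and returns what was printed.
    static List<String> printPlan() {
        Printer printer = new Printer();
        printer.visitLoad("a");
        printer.visitLoad("b");
        printer.visitCogroup("c");  // silently ignored: no override in Printer
        return printer.lines;
    }

    public static void main(String[] args) {
        System.out.println(printPlan());  // prints [Load: a, Load: b]
    }
}
```

The cogroup operator is visited but leaves no trace in the output, which is the gap this comment describes.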

[jira] Updated: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-97:
--------------------------

    Attachment: cogroupcombiner.patch

The attached patch turns off the combiner when cogroup is used.  It also restores the POVisitor to work the way it did before the front-end/back-end split was introduced (PIG-32).  I needed this to make explain work again, so I could see when the combiner was and wasn't being invoked.

Antonio, please take a look at my changes for the POVisitor and make sure it will work within the new split framework.

[jira] Updated: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-97:
--------------------------

    Patch Info: [Patch Available]

[jira] Resolved: (PIG-97) Jobs produce wrong results when a cogroup is in the script and the compiler chooses to use the combiner feature of hadoop.

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-97.
---------------------------

       Resolution: Fixed
    Fix Version/s: 0.1.0

Fix checked in as revision 620665.
