Posted to dev@pig.apache.org by "Prashant Kommireddi (Created) (JIRA)" <ji...@apache.org> on 2012/03/22 21:44:22 UTC

[jira] [Created] (PIG-2610) GC errors on using FILTER within nested FOREACH

GC errors on using FILTER within nested FOREACH
-----------------------------------------------

                 Key: PIG-2610
                 URL: https://issues.apache.org/jira/browse/PIG-2610
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.9.1
            Reporter: Prashant Kommireddi


A user has reported GC overhead errors when using FILTER within a nested FOREACH and aggregating the filtered field. Here is the sample Pig Latin script, provided by the user, that triggered the issue.

{code}
raw = LOAD 'input' using MyCustomLoader();

searches = FOREACH raw GENERATE
               day, searchType,
               FLATTEN(impBag) AS (adType, clickCount)
           ;

groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
counts = FOREACH groupedSearches{
               type1 = FILTER searches BY adType == 'type1';
               type2 = FILTER searches BY adType == 'type2';
               GENERATE
                   FLATTEN(group) AS (day, searchType),
                   COUNT(searches) AS numSearches,
                   SUM(searches.clickCount) AS clickCountPerSearchType,
                   SUM(type1.clickCount) AS type1ClickCount,
                   SUM(type2.clickCount) AS type2ClickCount;
       };
{code}

Pig should be able to handle this case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2610) GC errors on using FILTER within nested FOREACH

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238205#comment-13238205 ] 

Daniel Dai commented on PIG-2610:
---------------------------------

Yes, we should open a Jira for the new rule. For now, you can manually optimize the script by moving the filter before the group and projecting only the necessary columns before the group. The GC exception comes not from the bag but from POProject; my suspicion is that the Hadoop shuffle/sort uses too much memory, leaving none for Pig to work with.
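
A minimal sketch of that manual rewrite for one of the two ad types, assuming the relation and field names from the reported script. The aliases type1Searches, type1Slim, type1Grouped and type1Counts are hypothetical. The filter and the projection both run before the GROUP, so only type1 rows and the three columns that are actually needed reach the shuffle:

{code}
raw = LOAD 'input' USING MyCustomLoader();

searches = FOREACH raw GENERATE
               day, searchType,
               FLATTEN(impBag) AS (adType, clickCount);

-- Filter before the GROUP instead of inside the nested FOREACH.
type1Searches = FILTER searches BY adType == 'type1';

-- Keep only the columns the aggregate needs.
type1Slim = FOREACH type1Searches GENERATE day, searchType, clickCount;

type1Grouped = GROUP type1Slim BY (day, searchType) PARALLEL 50;

type1Counts = FOREACH type1Grouped GENERATE
                  FLATTEN(group) AS (day, searchType),
                  SUM(type1Slim.clickCount) AS type1ClickCount;
{code}

The same pattern would apply to type2. The per-type results would then have to be joined back on (day, searchType) to rebuild the original output row, which costs extra jobs.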
                

        

[jira] [Commented] (PIG-2610) GC errors on using FILTER within nested FOREACH

Posted by "Dmitriy V. Ryaboy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237368#comment-13237368 ] 

Dmitriy V. Ryaboy commented on PIG-2610:
----------------------------------------

OK, so the Jira I *meant* to ask you to open wasn't about a GC error (for that, just push the filter above the group), but about the fact that the optimizer could do this automatically, with a little bit of trickiness (the filters need to be turned into generates, and the counts into sums).
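
One possible shape of that rewrite, done by hand against the reported script. This is a sketch, not what the optimizer produces. It assumes clickCount can be cast to long. The alias flagged and the columns type1Clicks, type2Clicks and isType1 are hypothetical names, and type1SearchCount is not in the original script; it is included only to show a count turning into a sum:

{code}
raw = LOAD 'input' USING MyCustomLoader();

searches = FOREACH raw GENERATE
               day, searchType,
               FLATTEN(impBag) AS (adType, clickCount);

-- Each nested FILTER becomes a conditional column generated before the GROUP.
flagged = FOREACH searches GENERATE
              day, searchType,
              (long)clickCount AS clickCount,
              (adType == 'type1' ? (long)clickCount : 0L) AS type1Clicks,
              (adType == 'type2' ? (long)clickCount : 0L) AS type2Clicks,
              (adType == 'type1' ? 1L : 0L) AS isType1;

groupedSearches = GROUP flagged BY (day, searchType) PARALLEL 50;

-- No nested block is left: only COUNT and SUM over plain projections.
counts = FOREACH groupedSearches GENERATE
             FLATTEN(group) AS (day, searchType),
             COUNT(flagged) AS numSearches,
             SUM(flagged.clickCount) AS clickCountPerSearchType,
             SUM(flagged.type1Clicks) AS type1ClickCount,
             SUM(flagged.type2Clicks) AS type2ClickCount,
             -- stands in for a COUNT over a filtered bag such as COUNT(type1)
             SUM(flagged.isType1) AS type1SearchCount;
{code}

One difference to watch for: when a group has no type1 rows, the original SUM over an empty filtered bag should yield null, while this version yields 0. Since the final FOREACH uses only algebraic functions over simple projections, Pig should be able to apply the combiner here, though that is worth confirming with EXPLAIN.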
                

        

[jira] [Commented] (PIG-2610) GC errors on using FILTER within nested FOREACH

Posted by "Rohini (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236988#comment-13236988 ] 

Rohini commented on PIG-2610:
-----------------------------

Here is the stack trace:

2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

                

        

[jira] [Commented] (PIG-2610) GC errors on using FILTER within nested FOREACH

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236219#comment-13236219 ] 

Daniel Dai commented on PIG-2610:
---------------------------------

Can you post the exception stack trace?
                

        

[jira] [Commented] (PIG-2610) GC errors on using FILTER within nested FOREACH

Posted by "Prashant Kommireddi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248136#comment-13248136 ] 

Prashant Kommireddi commented on PIG-2610:
------------------------------------------

How is this case different from the following example (taken from the Pig Latin Basics page)?

{code}
A = LOAD 'data' AS (url:chararray,outlink:chararray);

DUMP A;
(www.ccc.com,www.hjk.com)
(www.ddd.com,www.xyz.org)
(www.aaa.com,www.cvn.org)
(www.www.com,www.kpt.net)
(www.www.com,www.xyz.org)
(www.ddd.com,www.xyz.org)

B = GROUP A BY url;

DUMP B;
(www.aaa.com,{(www.aaa.com,www.cvn.org)})
(www.ccc.com,{(www.ccc.com,www.hjk.com)})
(www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
(www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})


X = FOREACH B {
        FA= FILTER A BY outlink == 'www.xyz.org';
        PA = FA.outlink;
        DA = DISTINCT PA;
        GENERATE group, COUNT(DA);
}

DUMP X;
(www.aaa.com,0)
(www.ccc.com,0)
(www.ddd.com,1)
(www.www.com,1)

{code}
                