Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2013/08/01 06:17:49 UTC

[jira] [Updated] (PIG-3395) Large filter expression makes Pig hang

     [ https://issues.apache.org/jira/browse/PIG-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3395:
-------------------------------

    Attachment: PIG-3395-2.patch

Added new test cases to confirm that the filter is not pushed down when UDF, cast, or null expressions are mixed with or/and expressions.

ReviewBoard:
https://reviews.apache.org/r/13186/
                
> Large filter expression makes Pig hang
> --------------------------------------
>
>                 Key: PIG-3395
>                 URL: https://issues.apache.org/jira/browse/PIG-3395
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: PIG-3395-2.patch, PIG-3395.patch, thread_dump.txt
>
>
> Currently, partition filter push down is quite costly. For example, if you have many nested or/and expressions, Pig hangs:
> {code}
> base = load '<partitioned table>' using MyStorage();
> filt = filter base by
> (dateint == 20130719 and batchid == 'merged_1' and hour IN (19,20,21,22,23))
> or
> (dateint == 20130720 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8))
> or
> (dateint == 20130720 and batchid == 'merged_2' and hour == 7)
> or
> (dateint == 20130720 and batchid == 'merged_1' and hour IN (9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
> or
> (dateint == 20130721 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
> or
> (dateint == 20130722 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16));
> dump filt;
> {code}
> Note that the IN operator is converted to nested ORs by the Pig parser.
> Looking at the thread dump, I found that it creates almost 60 stack frames and strains the JVM. (I will attach the full stack trace.)
> {code}
> <repeated ...>
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:237)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:214)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:211)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:108)
> {code}
> Although the filter expression can be simplified, it seems possible to make PColFilterExtractor more efficient.
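
To illustrate the note above about IN being converted to nested ORs, here is a minimal Java sketch. It assumes a left-nested expansion, which may not match what the Pig parser actually produces. The classes Expr, Eq, and Or and the helpers expandIn, countOrs, depth, and render are hypothetical names for this example and are not Pig's logical expression classes or PColFilterExtractor.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal expression tree, for illustration only.
// These are NOT Pig's actual logical expression classes.
class InExpansionSketch {

    interface Expr {}

    // Leaf node: column == value
    static final class Eq implements Expr {
        final String column;
        final int value;
        Eq(String column, int value) { this.column = column; this.value = value; }
    }

    // Binary OR node
    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    // Expand `column IN (v1, v2, ..., vn)` into a chain of binary ORs.
    // Assumption: the chain is left-nested, i.e. ((c == v1 or c == v2) or c == v3).
    static Expr expandIn(String column, List<Integer> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("IN list must not be empty");
        }
        Expr result = new Eq(column, values.get(0));
        for (int v : values.subList(1, values.size())) {
            result = new Or(result, new Eq(column, v));
        }
        return result;
    }

    // Number of OR nodes in the tree.
    static int countOrs(Expr e) {
        if (e instanceof Or) {
            Or or = (Or) e;
            return 1 + countOrs(or.left) + countOrs(or.right);
        }
        return 0;
    }

    // Depth of the tree; a recursive visitor needs at least this many nested calls.
    static int depth(Expr e) {
        if (e instanceof Or) {
            Or or = (Or) e;
            return 1 + Math.max(depth(or.left), depth(or.right));
        }
        return 1;
    }

    // Render the tree back to a Pig-like filter expression.
    static String render(Expr e) {
        if (e instanceof Or) {
            Or or = (Or) e;
            return "(" + render(or.left) + " or " + render(or.right) + ")";
        }
        Eq eq = (Eq) e;
        return eq.column + " == " + eq.value;
    }

    public static void main(String[] args) {
        Expr small = expandIn("hour", Arrays.asList(19, 20, 21));
        System.out.println(render(small)); // ((hour == 19 or hour == 20) or hour == 21)

        // A 24-value list like `hour IN (0,1,...,23)` from the report.
        Integer[] hours = new Integer[24];
        for (int i = 0; i < hours.length; i++) {
            hours[i] = i;
        }
        Expr fullDay = expandIn("hour", Arrays.asList(hours));
        System.out.println("OR nodes: " + countOrs(fullDay) + ", depth: " + depth(fullDay));
    }
}
```

An IN list with n values always expands to n - 1 binary OR nodes. Under the left-nesting assumption, the tree is also n levels deep, so a recursive visitor walks a long chain for each of the larger IN lists in the filter above.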
