Posted to dev@pig.apache.org by "Yu Xu (JIRA)" <ji...@apache.org> on 2012/06/11 20:16:42 UTC

[jira] [Created] (PIG-2747) Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source

Yu Xu created PIG-2747:
--------------------------

             Summary: Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source
                 Key: PIG-2747
                 URL: https://issues.apache.org/jira/browse/PIG-2747
             Project: Pig
          Issue Type: Improvement
            Reporter: Yu Xu
            Priority: Minor


Consider the following example:

T = load ... ;
T1 = filter T by col == 'hello';
T2 = filter T by col == 'world';

Currently, the Pig optimizer does not combine the two predicates, so it cannot push them down to the data source (via LoadMetadata). The data source therefore cannot do any filtering, and a full table/file scan is required.

A more efficient workaround today is to rewrite the script by hand into the following equivalent one:

T = load ...;
T = filter T by col == 'hello' or col == 'world' ;
T1 = filter T by col == 'hello';
T2 = filter T by col == 'world';

This script enables Pig to push down the predicate (col == 'hello' or col == 'world') to the data source, which can then use available partitions/indexes for potentially much more efficient processing.

This JIRA requests that the Pig optimizer perform this type of optimization automatically.
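
To make the requested rewrite concrete, here is a minimal sketch in Python. It is a toy model, not Pig's optimizer or the LoadMetadata API: the names Eq, pull_up and scan are hypothetical, and the data source is assumed to be partitioned on the filtered column. It shows the predicates of all branches on one load being combined into a single disjunction, and that a branch without a filter blocks the pushdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eq:
    """Equality predicate `column == value` (hypothetical helper)."""
    column: str
    value: str


def pull_up(branch_predicates):
    """Collect the branch predicates that are OR-ed and pushed to the loader.

    Returns None when any branch is unfiltered (None), because that branch
    needs every row and nothing can be pushed down safely.
    """
    pulled = []
    for pred in branch_predicates:
        if pred is None:
            return None
        if pred not in pulled:
            pulled.append(pred)
    return pulled


def scan(partitions, pushed):
    """Toy data source: `partitions` maps a value of the column to its rows.

    With a pushed disjunction only the matching partitions are read;
    without one, every partition is read (a full scan).
    """
    if pushed is None:
        keys = list(partitions)
    else:
        wanted = {p.value for p in pushed}
        keys = [k for k in partitions if k in wanted]
    rows = [row for k in keys for row in partitions[k]]
    return rows, len(keys)


partitions = {
    "hello": [("hello", 1), ("hello", 2)],
    "world": [("world", 3)],
    "other": [("other", 4), ("other", 5)],
}

# Predicates of the two branches T1 and T2 from the example script.
branches = [Eq("col", "hello"), Eq("col", "world")]
pushed = pull_up(branches)
rows, partitions_read = scan(partitions, pushed)

# Each branch still applies its own filter to the reduced input.
t1 = [r for r in rows if r[0] == "hello"]
t2 = [r for r in rows if r[0] == "world"]
```

In this sketch the branch filters are kept after the pushdown, mirroring the hand-written workaround above, so the results stay correct even if the data source only filters approximately.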


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Commented] (PIG-2747) Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source

Posted by Daniel Dai <da...@hortonworks.com>.

Hi, Dmitriy,
Can you give the script you are thinking of?

On Sat, Jun 16, 2012 at 8:43 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I don't think a union is required for this to make sense.
>
> On Jun 11, 2012, at 11:58 AM, "Daniel Dai (JIRA)" <ji...@apache.org> wrote:
>
> >
> >    [ https://issues.apache.org/jira/browse/PIG-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292983#comment-13292983 ]
> >
> > Daniel Dai commented on PIG-2747:
> > ---------------------------------
> >
> > My understanding is there is a union after T1, T2, right?
> >
> > Yes, we currently merge only consecutive filters, into an "and" condition. We don't merge "or" conditions. So you want:
> >
> > filter cond1, filter cond2 -> union ==> filter cond1 or cond2

Re: [jira] [Commented] (PIG-2747) Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I don't think a union is required for this to make sense. 

On Jun 11, 2012, at 11:58 AM, "Daniel Dai (JIRA)" <ji...@apache.org> wrote:

> 
>    [ https://issues.apache.org/jira/browse/PIG-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292983#comment-13292983 ] 
> 
> Daniel Dai commented on PIG-2747:
> ---------------------------------
> 
> My understanding is there is a union after T1, T2, right?
> 
> Yes, we currently merge only consecutive filters, into an "and" condition. We don't merge "or" conditions. So you want:
> 
> filter cond1, filter cond2 -> union ==> filter cond1 or cond2

[jira] [Commented] (PIG-2747) Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292983#comment-13292983 ] 

Daniel Dai commented on PIG-2747:
---------------------------------

My understanding is there is a union after T1, T2, right?

Yes, we currently merge only consecutive filters, into an "and" condition. We don't merge "or" conditions. So you want:

filter cond1, filter cond2 -> union ==> filter cond1 or cond2
                

[jira] [Commented] (PIG-2747) Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source

Posted by "Yu Xu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293105#comment-13293105 ] 

Yu Xu commented on PIG-2747:
----------------------------

Yes, that's the use case. Thanks.
                