Posted to user@pig.apache.org by Jeff Yuan <qu...@gmail.com> on 2013/03/14 21:31:32 UTC

Loader partitioning on field

I am writing a loader for a storage format, which partitions by a
particular field in the record. So I would like to implement something
which can push down filters on the partitioned field so that the
record reader does not need to read files that are outside the
filtered range. In the interface "LoadMetadata", the
"getPartitionKeys" and "setPartitionFilter" functions seem to support
what I need (where Pig should pass the filtering expression on the
declared partition keys to "setPartitionFilter"), but I have a couple
of questions. I'm going to reference the following example, where
timestamp is the partition key.

a = load 'stored_data' using CustomLoader();
b = filter a by timestamp == CUSTOM_UDF(date, month);

1. Would partitioning work in this case where the partition key filter
includes a UDF?

2. Does the filter statement need to be directly after the load
statement? What I mean is, if I declare a variable c between a and b
which does some other operation on a, will Pig pass the filter
expression of b when loading a?

3. Can you point out roughly where this "setPartitionFilter" function
is called in Pig code during the load process? I couldn't seem to find
it through a search of the Pig source.

Thanks a lot!

Re: Loader partitioning on field

Posted by Jonathan Coveney <jc...@gmail.com>.
If it is being passed in anyway, you could make it a $PARAM that is set by
the launch script, and then it would be a constant in the script.
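
A minimal sketch of this suggestion, assuming the launch script passes the
scheduled time with Pig's -param option. The parameter name SCHEDULED_TIME,
the script name query.pig, the date value, and the chararray type of
timestamp are all assumptions made for illustration:

```pig
-- Launched as (hypothetical script name and value):
--   pig -param SCHEDULED_TIME=2013-03-14 query.pig
-- Parameter substitution runs before the script is parsed, so the
-- filter below compares timestamp against a plain string constant.
a = load 'stored_data' using CustomLoader();
b = filter a by timestamp == '$SCHEDULED_TIME';
```

Because the value is substituted as text, the filter condition contains a
constant rather than a UDF call.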


2013/3/14 Jeff Yuan <qu...@gmail.com>

> Well, I have a UDF called "SCHEDULED_TIME()" that returns the time
> at which a Pig query is scheduled to run in the system. The system
> passes this time to Pig when the job is launched. Since I partition
> files by a time field, a user could filter on the result of this
> UDF.

Re: Loader partitioning on field

Posted by Jeff Yuan <qu...@gmail.com>.
Well, I have a UDF called "SCHEDULED_TIME()" that returns the time
at which a Pig query is scheduled to run in the system. The system
passes this time to Pig when the job is launched. Since I partition
files by a time field, a user could filter on the result of this
UDF.



On Thu, Mar 14, 2013 at 3:15 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> No, it is not. But if it knew that, how would that filter be meaningful?
> What do you have in mind?

Re: Loader partitioning on field

Posted by Jonathan Coveney <jc...@gmail.com>.
No, it is not. But if it knew that, how would that filter be meaningful?
What do you have in mind?


2013/3/14 Jeff Yuan <qu...@gmail.com>

> Rohini, I see your point.
>
> One follow-up question: a UDF's result can be constant and
> independent of the tuples in each record, right? Is Pig able to
> determine that in this case and push such UDF results down to the
> load?
>
> Thanks,
> Jeff

Re: Loader partitioning on field

Posted by Jeff Yuan <qu...@gmail.com>.
Rohini, I see your point.

One follow-up question: a UDF's result can be constant and
independent of the tuples in each record, right? Is Pig able to
determine that in this case and push such UDF results down to the
load?

Thanks,
Jeff

On Thu, Mar 14, 2013 at 2:30 PM, Rohini Palaniswamy
<ro...@gmail.com> wrote:
> The filter pushdown to the LoadFunc happens on the front end, before
> the job launches, and the UDF has not been evaluated at that point.
> So you need to have constants in your filter condition.
>
> The logical plan is internal to Pig and will never be exposed. Refer to
> https://issues.apache.org/jira/browse/PIG-3199
>
> Regards,
> Rohini

Re: Loader partitioning on field

Posted by Rohini Palaniswamy <ro...@gmail.com>.
The filter pushdown to the LoadFunc happens on the front end, before
the job launches, and the UDF has not been evaluated at that point.
So you need to have constants in your filter condition.
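
As a rough illustration, assuming timestamp is a chararray partition key and
using a made-up date value: the first filter below compares against a UDF
call, which is not evaluated on the front end, while the second compares
against a constant, which is the form that can be handed to
"setPartitionFilter".

```pig
a = load 'stored_data' using CustomLoader();

-- UDF in the condition: not evaluated before job launch, so no pushdown.
b = filter a by timestamp == CUSTOM_UDF(date, month);

-- Constant in the condition: a candidate for partition filter pushdown.
c = filter a by timestamp == '2013-03-14';
```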

The logical plan is internal to Pig and will never be exposed. Refer to
https://issues.apache.org/jira/browse/PIG-3199

Regards,
Rohini


On Thu, Mar 14, 2013 at 2:00 PM, Jeff Yuan <qu...@gmail.com> wrote:

> Thanks! Regarding 1), where there is a UDF in the filter on a
> partition field: is the UDF not evaluated first, with the result then
> passed to the load function?
>
> A separate question: In a LoadFunc, is there a way to get a reference
> to the logical query plan?
>
> Thanks again.

Re: Loader partitioning on field

Posted by Jeff Yuan <qu...@gmail.com>.
Thanks! Regarding 1), where there is a UDF in the filter on a
partition field: is the UDF not evaluated first, with the result then
passed to the load function?

A separate question: In a LoadFunc, is there a way to get a reference
to the logical query plan?

Thanks again.

On Thu, Mar 14, 2013 at 1:51 PM, Rohini Palaniswamy
<ro...@gmail.com> wrote:
> Jeff,
>
> 1) It should not. If Pig does push the filter down, that is a bug in Pig.
>
> 2) I think it should be fine.
>
> 3) Look at PColFilterExtractor and PartitionFilterOptimizer
>
> Regards,
>
> Rohini

Re: Loader partitioning on field

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Jeff,

1) It should not. If Pig does push the filter down, that is a bug in Pig.

2) I think it should be fine.

3) Look at PColFilterExtractor and PartitionFilterOptimizer

Regards,

Rohini

