Posted to dev@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2010/06/03 01:18:09 UTC

algebraic optimization not invoked for filter following group?

It looks like right now, the combiner optimization does not kick in for a
script like this:

data = load 'foo' using PigStorage() as (a, b, c);
grouped = group data by a;
filtered = filter grouped by COUNT(data) < 1000;

Looking at the code in CombinerOptimizer, seems like the Filter bit is just
pseudo-coded in comments. Are there complications there other than what is
already noted, or is it just a matter of coding up the pseudo-code?
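
To make concrete what the combiner would buy for the script above, here is a
minimal sketch in Python (not Pig code). The function names are hypothetical;
they only mirror the initial / intermediate / final stages of an algebraic
function such as COUNT. The sketch also assumes that nothing downstream needs
the bag itself, so it returns only the group keys that pass the filter.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three stages of an algebraic COUNT.
def count_initial(record):
    return 1  # each record contributes one to the count

def count_intermediate(partials):
    return sum(partials)  # combiner: merge partial counts on the map side

def count_final(partials):
    return sum(partials)  # reducer: merge partial counts from all maps

def map_with_combiner(split):
    """Map side: emit one partial count per key instead of every record."""
    partial = defaultdict(list)
    for a, b, c in split:
        partial[a].append(count_initial((a, b, c)))
    return {key: count_intermediate(ones) for key, ones in partial.items()}

def reduce_and_filter(map_outputs, threshold=1000):
    """Reduce side: finish the count, then apply COUNT(data) < threshold."""
    merged = defaultdict(list)
    for output in map_outputs:
        for key, partial_count in output.items():
            merged[key].append(partial_count)
    totals = {key: count_final(parts) for key, parts in merged.items()}
    return sorted(key for key, total in totals.items() if total < threshold)

# Two input splits with made-up records of the form (a, b, c).
splits = [
    [("x", 1, 2), ("x", 3, 4), ("y", 5, 6)],
    [("x", 7, 8), ("z", 9, 0)],
]
kept = reduce_and_filter([map_with_combiner(s) for s in splits], threshold=3)
print(kept)
```

With the combiner in place, each map ships one number per key to the reducer
rather than the full bag of records.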

On that note -- assuming the optimization was implemented for Filter
following group, would it automagically start working for Splits, as well?

-D

Re: the last job in the mapreduce plan

Posted by Ashutosh Chauhan <as...@gmail.com>.
Gang,

What you are describing can never happen, because we create a new MR
operator only when we hit a blocking operator that needs to go into
the next MR operator. We don't create a new MR operator a priori, without
looking at the next physical operator in the pipeline. If you are seeing
this happen, I would consider it a bug.
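
As a rough illustration of that rule, here is a simplified sketch in Python.
It is not the actual MR compiler: the BLOCKING set and compile_to_jobs are
hypothetical, and the model ignores the temporary load/store that connects
consecutive jobs. It only shows that a new job is opened when a blocking
operator arrives and the current job already holds one, so a store on its
own never opens a job.

```python
# Illustrative set of blocking operators; not Pig's actual list.
BLOCKING = {"group", "join", "order", "distinct"}

def compile_to_jobs(physical_ops):
    """Assign physical operators to MR jobs, in pipeline order."""
    jobs = [[]]
    for op in physical_ops:
        current_has_blocking = any(o in BLOCKING for o in jobs[-1])
        # Open a new job only for a blocking operator that cannot fit
        # into the current job.
        if op in BLOCKING and current_has_blocking:
            jobs.append([])
        jobs[-1].append(op)
    return jobs

plan = ["load", "filter", "group", "foreach", "order", "store"]
for job in compile_to_jobs(plan):
    print(job)
```

In this model the final store always lands in a job that already exists, next
to whatever blocking operator caused that job to be created.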

Ashutosh

On Tue, Jun 15, 2010 at 09:26, Alan Gates <ga...@yahoo-inc.com> wrote:
> I've never seen a case where this happens.  Is this a theoretical question
> or are you seeing this issue?
>
> Alan.
>
> On Jun 15, 2010, at 8:49 AM, Gang Luo wrote:
>
>> Hi,
>> Is it possible the last MapReduce job in the MR plan only loads something
>> and stores it without any other processing in between? For example, when
>> visiting some physical operator, we need to end the current MR operator
>> after embedding the physical operator into it, and create a new MR
>> operator for later physical operators. Unfortunately, the following physical
>> operator is a store, the end of the entire query. In this case, the last MR
>> operator contains only a load and a store, without any meaningful work in between.
>> This idle MapReduce job will degrade the performance. Will this happen in
>> Pig?
>>
>> Thanks,
>> -Gang

Re: the last job in the mapreduce plan

Posted by Alan Gates <ga...@yahoo-inc.com>.
I've never seen a case where this happens.  Is this a theoretical  
question or are you seeing this issue?

Alan.

On Jun 15, 2010, at 8:49 AM, Gang Luo wrote:

> Hi,
> Is it possible the last MapReduce job in the MR plan only loads  
> something and stores it without any other processing in between? For  
> example, when visiting some physical operator, we need to end the
> current MR operator after embedding the physical operator into it,
> and create a new MR operator for later physical operators.
> Unfortunately, the following physical operator is a store, the end
> of the entire query. In this case, the last MR operator contains only
> a load and a store, without any meaningful work in between. This idle
> MapReduce job will degrade the performance. Will this happen in Pig?
>
> Thanks,
> -Gang


the last job in the mapreduce plan

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi,
Is it possible that the last MapReduce job in the MR plan only loads
something and stores it, without any other processing in between? For
example, when visiting some physical operator, we need to end the current
MR operator after embedding the physical operator into it, and create a
new MR operator for later physical operators. Unfortunately, the following
physical operator is a store, the end of the entire query. In this case,
the last MR operator contains only a load and a store, without any
meaningful work in between. This idle MapReduce job will degrade
performance. Will this happen in Pig?

Thanks,
-Gang


Re: algebraic optimization not invoked for filter following group?

Posted by Alan Gates <ga...@yahoo-inc.com>.
For simple cases at least, what's in the pseudo-code should work. I
hope that someday soon we can start using the new logical optimizer work
(in the experimental package) to build rules for the MR optimizer
(like this combiner stuff) as well, which should be much easier to
code. But it will be a while before we get there.

I don't think this will automatically make it work for split, because
I think the optimizer will see the split in the plan and choose not to
optimize.

Alan.

On Jun 2, 2010, at 4:18 PM, Dmitriy Ryaboy wrote:

> It looks like right now, the combiner optimization does not kick in for a
> script like this:
>
> data = load 'foo' using PigStorage() as (a, b, c);
> grouped = group data by a;
> filtered = filter grouped by COUNT(data) < 1000;
>
> Looking at the code in CombinerOptimizer, seems like the Filter bit is just
> pseudo-coded in comments. Are there complications there other than what is
> already noted, or is it just a matter of coding up the pseudo-code?
>
> On that note -- assuming the optimization was implemented for Filter
> following group, would it automagically start working for Splits, as well?
>
> -D